From fd55c177cc2d398d56e53979f11f1de82e1860ab Mon Sep 17 00:00:00 2001
From: Sungwoo Kim
Date: Mon, 12 Aug 2024 00:08:05 -0400
Subject: [PATCH] add papers

---
 ... of the Generalization Error Across Scales | 1 +
 ...h Neural Networks for Graph Classification | 1 +
 ...ethod for Solving Vehicle Routing Problems | 1 +
 ...urity Vulnerabilities of Transfer Learning | 1 +
 ...f the Number of Shots in Few-Shot Learning | 1 +
 ..., or what we can learn from a single image | 1 +
 ...gregated Memory For Reinforcement Learning | 1 +
 ...h momentum for over-parameterized learning | 4 ++++
 ...e Effects of Actions in Multiagent Systems | 1 +
 ...ibria of Linear-Quadratic Mean-Field Games | 1 +
 ... Fingerprints for Graph Attention Networks | 1 +
 ...uniform Discretization for Neural Networks | 1 +
 .../iclr/Adjustable Real-time Style Transfer | 1 +
 ...ies: Attacking Deep Reinforcement Learning | 1 +
 ...obust Representations with Smooth Encoders | 1 +
 .../Adversarially robust transfer learning | 1 +
 ...n Extended Semi-discrete Optimal transport | 1 +
 ... Nets that Respect the Triangle Inequality | 1 +
 ...mple of Zebrafish Swim Bout Classification | 1 +
 ...but Strong Baselines for Grammar Induction | 1 +
 ...imators of sequence-to-sequence functions? | 1 +
 ...Neural Connectivity in Video Architectures | 1 +
 ... Networks for Exploring the Chemical Space | 1 +
 ...ed Kernel-Wise Neural Network Quantization | 1 +
 .../iclr/Automated Relational Meta-learning | 1 +
 ...eration through setter-solver interactions | 1 +
 ... Visual Categories with Ranking Statistics | 1 +
 ...ck with Transferable Model-based Embedding | 1 +
 ... of Descent Paths in Shallow ReLU Networks | 1 +
 ...Loss Landscapes and Adversarial Robustness | 1 +
 ...etwork Training Under Resource Constraints | 1 +
 .../iclr/CAQL: Continuous Action Q-Learning | 1 +
 ... Invariants with Continuous Logic Networks | 1 +
 ...i-stage Multi-agent Reinforcement Learning | 1 +
 ...an gradient clipping mitigate label noise? | 1 +
 ...ial Perturbations via Randomized Smoothing | 1 +
 ... Expedited Deep Neural Network Compilation | 1 +
 ...emerge in a neural iterated learning model | 1 +
 ...putation Reallocation for Object Detection | 1 +
 ...nual Learning with Adaptive Weights (CLAW) | 1 +
 ...an Neural Networks for Non-Stationary Data | 1 +
 ...odular structure of deep generative models | 1 +
 data/2020/iclr/Curvature Graph Network | 1 +
 ...ackdoor Attacks against Federated Learning | 1 +
 ...intGoal Navigators from 2.5 Billion Frames | 3 +++
 ...ta-Independent Neural Pruning via Coresets | 1 +
 ...en Embeddings for Neural Sequence Modeling | 1 +
 ... with global and local adaptive dilations" | 1 +
 ... Flexible Inference, Planning, and Control | 1 +
 ...Processes via Proper Spectral Sub-gradient | 1 +
 ...cattering and Homotopy Dictionary Learning | 1 +
 .../Deep Semi-Supervised Anomaly Detection | 1 +
 ...entiable Scale-Invariant Sparsity Measures | 1 +
 ... with Differentiable Structure from Motion | 1 +
 ...ve Receptive Fields for Object Deformation | 2 ++
 data/2020/iclr/Depth-Adaptive Transformer | 1 +
 ...tecting Extrapolation with Local Ensembles | 1 +
 ... Class-Conditional Capsule Reconstructions | 1 +
 ...versarial Network-Unseen Sample Generation | 1 +
 .../iclr/Differentially Private Meta-Learning | 1 +
 ...ing Factors of Variations Using Few Labels | 1 +
 ...ing from Errors for Confidence Calibration | 1 +
 ...casting with Determinantal Point Processes | 1 +
 ...h Noisy Labels as Semi-supervised Learning | 1 +
 ...ime Lag Regression: Predicting What & When | 1 +
 ...upervised and Unsupervised Skill Discovery | 1 +
 ... for Large-scale Knowledge Graph Reasoning | 1 +
 ...ES-MAML: Simple Hessian-Free Meta Learning | 1 +
 data/2020/iclr/Editable Neural Networks | 1 +
 ... Stiefel Manifold via the Cayley Transform | 1 +
 ...serving Future Frame Prediction and Beyond | 1 +
 ...ial Attacks with a Distribution Classifier | 1 +
 .../iclr/Ensemble Distribution Distillation | 1 +
 ...dle Points Faster with Stochastic Momentum | 1 +
 ...Search Phase of Neural Architecture Search | 1 +
 ...cement Learning with Deep Covering Options | 1 +
 ... Model-based Planning with Policy Networks | 1 +
 ...resentations with Featurewise Sort Pooling | 1 +
 ...than free: Revisiting adversarial training | 1 +
 ...for Faster Real-time Semantic Segmentation | 1 +
 ...n Systems via Neural Interaction Detection | 1 +
 .../Federated Adversarial Domain Adaptation | 1 +
 ...r-Classes based on Graph spectral Measures | 1 +
 ...ssification with Distributional Signatures | 1 +
 ...sses of Deep Reinforcement Learning Agents | 1 +
 ...al Attack against Multiple Object Tracking | 1 +
 ...Should Know to Improve Batch Normalization | 1 +
 ... Variational to Deterministic Autoencoders | 1 +
 ...s. parametric equivalence of ReLU networks | 1 +
 ...xample Detection and Robust Classification | 1 +
 ...with Object-Centric Latent Representations | 1 +
 .../iclr/GLAD: Learning Sparse Graph Recovery | 1 +
 ...Gap-Aware Mitigation of Gradient Staleness | 1 +
 ...nds for deep convolutional neural networks | 1 +
 .../iclr/Generative Ratio Matching Networks | 1 +
 ...o the Convergence of Nonlinear TD Learning | 1 +
 .../Global Relational Models of Source Code | 1 +
 ...earning for semi-supervised classification | 1 +
 ...orizon Tasks via Visual Subgoal Generation | 1 +
 ...ition for Comparing Classifiers Adaptively | 1 +
 ...lows for Recovering Latent Representations | 1 +
 ...ization Under Extreme Overparameterization | 1 +
 .../iclr/Image-guided Neural Object Rendering | 1 +
 ...rning via Off-Policy Distribution Matching | 1 +
 ...sed Adversarial Training on Separable Data | 1 +
 ...ust Classification via an All-Layer Margin | 1 +
 ...Requires Revisiting Misclassified Examples | 1 +
 ...ndly Binarized Neural Network Architecture | 1 +
 ...on with Likelihood-based Generative Models | 1 +
 ...ued Neural Networks for Privacy Protection | 1 +
 ...ation for Encouraging Synergistic Behavior | 1 +
 ...istency between Neural Networks and Beyond | 1 +
 ...ge MOdeling for Lifelong Language Learning | 1 +
 data/2020/iclr/Language GANs Falling Short | 1 +
 ...Deep Learning: Training BERT in 76 minutes | 1 +
 ...extensive games with imperfect information | 1 +
 data/2020/iclr/Learned Step Size quantization | 1 +
 ...resentations for CounterFactual Regression | 1 +
 ...nchronization Policies for Distributed SGD | 1 +
 ...rning Execution through Neural Code fusion | 1 +
 ...rdination: An Event-Based Deep RL Approach | 1 +
 ...an Formulas through Reinforcement Learning | 1 +
 ...from Demonstrations with Negative Sampling | 1 +
 ...ace Partitions for Nearest Neighbor Search | 1 +
 ...ependent embedding and Hungarian attention | 1 +
 ...ime for Problems in Reinforcement Learning | 1 +
 .../Learning to Learn by Zeroth-Order Oracle | 1 +
 data/2020/iclr/Learning to Link | 2 ++
 ...epresent Programs with Property Signatures | 1 +
 ...ing to solve the credit assignment problem | 1 +
 ...etworks for Low-precision Integer Hardware | 1 +
 ...and Compositionality in Zero-Shot Learning | 1 +
 .../Logic and the 2-Simplicial Transformer | 1 +
 ...rce Knowledge-Grounded Dialogue Generation | 1 +
 ...t Training via Maximizing Certified Radius | 1 +
 ...trolling the Estimation Bias of Q-learning | 1 +
 ...: A Comprehensive Method on Realistic Data | 1 +
 ...ts for Learning to Learn from Few Examples | 1 +
 .../iclr/MetaPix: Few-Shot Video Retargeting | 1 +
 ... to Learn Efficient Sparse Representations | 1 +
 ...une Large-scale Pretrained Language Models | 1 +
 ...oiting Mixup to Defend Adversarial Attacks | 1 +
 ...ment Learning for Networked System Control | 1 +
 ...cative Interactions and Where to Find Them | 1 +
 ...ain Adaptation on Person Re-identification | 1 +
 ... for interpretable time series forecasting | 1 +
 .../iclr/NAS evaluation is frustratingly hard | 1 +
 ...nsembles for Deep Learning on Tabular Data | 1 +
 data/2020/iclr/Neural Stored-program Memory | 1 +
 ...Text Generation With Unlikelihood Training | 1 +
 data/2020/iclr/Novelty Detection Via Blurring | 1 +
 ...onal Overfitting in Reinforcement Learning | 1 +
 ... Generative Adversarial Imitation Learning | 1 +
 .../iclr/On Identifiability in Transformers | 1 +
 ...n Maximization for Representation Learning | 1 +
 ...lity\" of generative adversarial networks" | 1 +
 ...e of the Adaptive Learning Rate and Beyond | 1 +
 ...nt Learning for Neural Machine Translation | 1 +
 ...l Networks by Jacobian Spectrum Evaluation | 1 +
 ...ion even with a Pessimistic Initialisation | 1 +
 ...Option Discovery using Deep Skill Chaining | 1 +
 ...ning and Its Application to Age Estimation | 1 +
 .../Overlearning Reveals Sensitive Attributes | 3 +++
 ...d Physical Parameter Estimation from Video | 1 +
 ...shape the loss surfaces of neural networks | 1 +
 ...e languages: lottery tickets in RL and NLP | 1 +
 ...r Fast and Accurate Multi-sentence Scoring | 1 +
 ...l Policy Search for Reinforcement Learning | 1 +
 ... for Embedding-based Large-scale Retrieval | 1 +
 ...rvised Knowledge-Pretrained Language Model | 1 +
 ...ry Banks for Incremental Domain Adaptation | 1 +
 ...works under Regularization and Constraints | 1 +
 .../iclr/Pruned Graph Scattering Transforms | 1 +
 ... 3D Object Detection in Autonomous Driving | 1 +
 ...ints: a Geometric Study of Linear Networks | 1 +
 ... Sample Efficient for Infinite-Horizon MDP | 1 +
 ...-Performance Learned Lossy Representations | 1 +
 ...ng to New Environment Dynamics via Reading | 1 +
 ...tical Training For Collaborative Filtering | 1 +
 data/2020/iclr/Ranking Policy Gradient | 1 +
 ...ds Understanding the Effectiveness of MAML | 1 +
 ...ension Dataset Requiring Logical Reasoning | 1 +
 ...bution Matching and Augmentation Anchoring | 1 +
 ...iance Reduced Temporal Difference Learning | 1 +
 ...rent neural circuits for contour detection | 1 +
 ...ced active learning for image segmentation | 1 +
 ...ence Model for Natural Question Generation | 1 +
 ...bles of Information-Constrained Primitives | 1 +
 ... Model for Stochastic Multi-Object Systems | 1 +
 ...ss-Entropy Loss for Adversarial Robustness | 1 +
 ...ia Bias-Free Convolutional Neural Networks | 1 +
 ...the Generalization of Adversarial Training | 1 +
 .../Robust training with ensemble consensus | 1 +
 ...iant of Adam for Strongly Convex Functions | 0
 ...o Filter Noisy Labels with Self-Ensembling | 1 +
 ...Reinforcement Learning with Sparse Rewards | 1 +
 ...ning of Bayesian Quantized Neural Networks | 1 +
 ...on by Entropy Penalized Reparameterization | 1 +
 ...r Reasoning With a Symbolic Knowledge Base | 1 +
 ...ning with Additive Parameter Decomposition | 1 +
 ...Efficient Data Selection for Deep Learning | 1 +
 ...arative Discrimination for Text Generation | 1 +
 ...arning for Self-Supervised Monocular Depth | 1 +
 ... in Multi-Task Deep Reinforcement Learning | 1 +
 ...parse Deconvolution - A Geometric Approach | 1 +
 ...its Are All You Need for Black-Box Attacks | 1 +
 ...ry-Efficient Hard-label Adversarial Attack | 1 +
 ...ficient Distributed SGD with Slow Momentum | 1 +
 ...AUC Maximization with Deep Neural Networks | 1 +
 ...nerative Networks with Basis Decomposition | 1 +
 ...Large-Batch Training That Generalizes Well | 1 +
 ...raph Pooling via Conditional Random Fields | 1 +
 ... Dataset for Table-based Fact Verification | 1 +
 ...Incremental Learning Drives Generalization | 1 +
 ...hastic Evaluation on an Information Budget | 1 +
 ... of the Hessian of DNN throughout training | 1 +
 ... for Learning Disentangled Representations | 1 +
 ...treet! Model Extraction of BERT-based APIs | 1 +
 ...ur Headache of Training an MRF, Take AdVIL | 1 +
 ... Algorithms in Generative Adversarial Nets | 1 +
 ...erturbations of Deep Feature Distributions | 1 +
 ...d Attention with Hierarchical Accumulation | 1 +
 ...t by Cell-based Neural Architecture Search | 1 +
 ... in Non-autoregressive Machine Translation | 1 +
 ... Variational Mutual Information Estimators | 1 +
 ...n on Real Scans using Adversarial Training | 1 +
 ...ional Disentangled Representation Learning | 1 +
 ...zation for Discrete and Continuous Control | 1 +
 ...ks for Video-level Representation Learning | 1 +
 ... Generic Visual-Linguistic Representations | 1 +
 ...Solving Partially Observable Control Tasks | 1 +
 ...haracters Extracted from Real-World Videos | 2 ++
 ...ased Model for Stochastic Video Generation | 1 +
 ...a-Learning from Demonstrations and Rewards | 1 +
 ...lustering by Exploiting Unique Class Count | 1 +
 ...ural networks cannot learn: depth vs width | 1 +
 ...mmunication-Efficient Distributed Learning | 1 +
 ...entation for Training Deep Neural Networks | 0
 ...f Self-Expressive Deep Subspace Clustering | 1 +
 .../A Design Space Study for LISTA and Beyond | 1 +
 ...t Descent Exponentially Favors Flat Minima | 1 +
 ...ative Gaussian Mixture Model with Sparsity | 1 +
 ...nal Approach to Controlled Text Generation | 1 +
 ...nerative Image Models and Its Applications | 0
 ...u Need for High-Resolution Video Synthesis | 1 +
 ...ow Framework For Analyzing Network Pruning | 1 +
 ...o Robust Regression without Correspondence | 1 +
 ...oretic Perspective on Local Explainability | 1 +
 ...anguage Models Help Solve Downstream Tasks | 1 +
 ...alization Bounds for Graph Neural Networks | 1 +
 ...aptive Multi-Exit Neural Network Inference | 1 +
 ... Learning with Continuous-time Information | 1 +
 ...regation and its Relationship to Attention | 1 +
 ...g and Boosting Adversarial Transferability | 1 +
 ...er Layer for Few-Shot Image Classification | 1 +
 ... for Group Equivariant Convolution Kernels | 1 +
 ...of cold posteriors in deep neural networks | 1 +
 ...t framework to distill future trajectories | 1 +
 ...it bias in training linear neural networks | 1 +
 ...died Environments for Interactive Learning | 1 +
 ...iators via Constrained Structural Learning | 0
 ...g Unlabeled data by REgularizing Diversity | 1 +
 ...astic Gradient MCMC via Variance Reduction | 1 +
 ...epresentations with Graph Multiset Pooling | 1 +
 ...articipation in Non-IID Federated Learning | 1 +
 ...nments with Non-Stationary Markov Policies | 1 +
 ...-level uncertainty in deep neural networks | 1 +
 ...ning of Audio-Visual Video Representations | 1 +
 ...n Network for Efficient Action Recognition | 1 +
 ...ph Convolutional Networks into Deep Models | 1 +
 ...: Adaptive Text to Speech for Custom Voice | 1 +
 ...ntum Optimizers on Scale-invariant Weights | 1 +
 ...sivity via Spectral Reinforcement Learning | 1 +
 ...Methods for Min-Max Optimization and Games | 1 +
 .../2021/iclr/Adaptive Federated Optimization | 1 +
 ...k Generation for Hard-Exploration Problems | 1 +
 ... Generalized PageRank Graph Neural Network | 1 +
 ...Adaptive and Generative Zero-Shot Learning | 1 +
 ...and improved sampling for image generation | 2 ++
 .../iclr/Adversarially Guided Actor-Critic | 1 +
 ...tter: Illustration on Image Classification | 1 +
 .../iclr/Aligning AI With Shared Human Values | 1 +
 ...ransformers for Image Recognition at Scale | 1 +
 ...ng Approach for Real-World Image Denoising | 0
 ... Neural Networks in a Spectral Perspective | 1 +
 ... Hidden Representations and Task Semantics | 1 +
 ...g Sparse Embeddings for Large Vocabularies | 1 +
 ...n Questions with Multi-Hop Dense Retrieval | 1 +
 ...regressive Models via Ordered Autoencoding | 1 +
 ...trastive Learning for Dense Text Retrieval | 1 +
 ...larity Through Differentiable Weight Masks | 1 +
 ...formed by Gradient Boosted Decision Trees? | 0
 ...etter given the same number of parameters? | 1 +
 ...e Generalization in Reinforcement Learning | 1 +
 ...chastic Method using Deep Denoising Priors | 1 +
 ...l Constellation Nets for Few-Shot Learning | 1 +
 .../Auction Learning as a Two-Player Game | 1 +
 ... Networks for Complex Dynamics Forecasting | 1 +
 ...etric Surrogates for Semantic Segmentation | 1 +
 ...hedule by Bayesian Optimization on the Fly | 1 +
 ...Offline Policy Evaluation and Optimization | 1 +
 .../2021/iclr/Autoregressive Entity Retrieval | 1 +
 ...liary Learning by Implicit Differentiation | 1 +
 ...osition: the Good, the Bad and the neutral | 1 +
 ...ion for Bilinear Games and Normal Matrices | 1 +
 ...eting Attention in Protein Language Models | 1 +
 ...epresentation Change for Few-shot Learning | 0
 ...ining Quantization by Block Reconstruction | 1 +
 ...BREEDS: Benchmarks for Subpopulation Shift | 1 +
 ...ixed-Precision Neural Network Quantization | 1 +
 ...thesis Through Learning-Guided Exploration | 1 +
 .../Bag of Tricks for Adversarial Training | 1 +
 ...raints and Rewards with Meta-Gradient D4PG | 1 +
 ...ement Learning Through Continuation Method | 1 +
 ...263lya-Gamma Augmented Gaussian Processes" | 1 +
 ...havioral Cloning from Noisy Demonstrations | 0
 ...sk bound and superiority to kernel methods | 1 +
 ...ning by Reducing Representational Collapse | 1 +
 ...l Representations for Image Classification | 1 +
 ...omplex Multiplications with 1 n Parameters | 1 +
 ...et: Binary Neural Network for Point Clouds | 1 +
 ...ence for Non-Autoregressive Text-to-Speech | 0
 ...ation for Efficient Reinforcement Learning | 1 +
 ...dient Boosting Meets Graph Neural Networks | 1 +
 ...-Shot Recognition and Novel-View Synthesis | 1 +
 ... SGD with Gradient Subspace Identification | 1 +
 ...ent Non-Convex Stochastic Gradient Descent | 1 +
 ...-Aware Cumulative Accessibility Estimation | 1 +
 ...Achieve Goals via Recursive Classification | 1 +
 ...nsupervised Visual Representation Learning | 1 +
 ...tion Regularization for Continual Learning | 1 +
 ...ural Network Training via Cyclic Precision | 1 +
 ...orization Network for Video Classification | 1 +
 ...dential and Private Collaborative Learning | 1 +
 ...libration of Neural Networks using Splines | 1 +
 .../Calibration tests beyond classification | 1 +
 .../Can a Fruit Fly Learn Word Embeddings? | 1 +
 .../Capturing Label Characteristics in VAEs | 1 +
 ...izing Flows via Continuous Transformations | 1 +
 ...for Causal Structure and Transfer Learning | 1 +
 ... Adversarial Networks for Image Generation | 1 +
 ...obustness with Compositional Architectures | 0
 ...m and Coordination via Game Decompositions | 1 +
 ...he performance gap in unnormalized ResNets | 1 +
 ...g with Heaviside Continuous Approximations | 1 +
 ...A Pipeline Toolkit for Medical Time Series | 1 +
 ...Continual)? Generalized Zero-Shot Learning | 1 +
 ...e Discrimination and Feature Decorrelation | 1 +
 ...ed Joint Mixup with Supermodular Diversity | 1 +
 ...als for Evaluating Dialogue State Trackers | 1 +
 ...ed Approach for Controlled Text Generation | 1 +
 ...ntation for Natural Language Understanding | 1 +
 ...g Interdependence in Graph Neural Networks | 1 +
 data/2021/iclr/Colorization Transformer | 1 +
 ...ata Augmentation Can Harm Your Calibration | 1 +
 ... Models out-performs Graph Neural Networks | 1 +
 ...chine Learning for Network Flow Estimation | 1 +
 ... Reinforcement Learning: Intention Sharing | 0
 ...works for Faster Multi-Platform Deployment | 1 +
 ...uery Answering with Neural Link Predictors | 1 +
 ...Convolutional and Fully-Connected Networks | 1 +
 .../Concept Learners for Few-Shot Learning | 1 +
 ...ive Modeling via Learning the Latent Space | 1 +
 ...rastive Learning of Visual Representations | 1 +
 ... in NLP Using Fewer Parameters & Less Data | 1 +
 ...sentation with Hamiltonian Neural Networks | 1 +
 ...onservative Safety Critics for Exploration | 1 +
 ...emplating Real-World Object Classification | 1 +
 ... Efficient Sample-Dependent Dropout Module | 1 +
 ...ion Networks for Online Continual Learning | 1 +
 ...nual learning in recurrent neural networks | 1 +
 ...er Estimation without Minimax Optimization | 1 +
 ...r Generalization in Reinforcement Learning | 1 +
 ...arning is a Time Reversal Adversarial Game | 1 +
 ...ent Learning via Embedded Self Predictions | 1 +
 ...turbations for Conditional Text Generation | 1 +
 ...astive Learning with Hard Negative Samples | 1 +
 .../Contrastive Syn-to-Real Generalization | 1 +
 ...ons for Model-based Reinforcement Learning | 1 +
 ... Optimal Transport and Convex Optimization | 1 +
 ...egularization behind Neural Reconstruction | 1 +
 ...t via Distributionally Robust Optimisation | 1 +
 ...l Roles of Graphs in Graph Neural Networks | 1 +
 ...ience replay for multi-agent communication | 1 +
 .../iclr/Counterfactual Generative Networks | 1 +
 ...ecture for learning long time dependencies | 1 +
 data/2021/iclr/Creative Sketch Generation | 1 +
 ... for Weakly-Supervised Action Localization | 1 +
 ... better segmentation with weak supervision | 1 +
 ...of Performance Collapse Without Indicators | 1 +
 ...hod for optimization with hard constraints | 1 +
 ...ntial Dynamic Programming Neural Optimizer | 1 +
 ...ditional Redundancy Adversarial Estimation | 1 +
 ...al Energy-Based GAN for Domain Translation | 1 +
 ...cy Multi-Agent Decomposed Policy Gradients | 1 +
 ...eration with Music via Curriculum Learning | 1 +
 ...rning with Self-Predictive Representations | 1 +
 ...ataset Condensation with Gradient Matching | 1 +
 ...: Ownership Resolution in Machine Learning | 1 +
 ...Meta-Learning from Kernel Ridge-Regression | 1 +
 ...DeLighT: Deep and Light-weight Transformer | 0
 ...-Enhanced Bert with Disentangled Attention | 1 +
 ...pt-based Explanations with Causal Analysis | 1 +
 ...ntralized Attribution of Generative Models | 1 +
 ...ti-Task Learning: a Random Matrix Approach | 0
 ...nstructing the Regularization of BatchNorm | 0
 ...sentations via Invertible Generative Flows | 1 +
 ...ing Non-autoregressive Machine Translation | 1 +
 ...hallow for ReLU Networks in Kernel Regimes | 1 +
 .../Deep Learning meets Projective Clustering | 3 +++
 ...Networks and the Multiple Manifold Problem | 1 +
 ...inting by Conferrable Adversarial Examples | 1 +
 ...rnel and Laplace Kernel Have the Same RKHS | 1 +
 ...Defenses against General Poisoning Attacks | 1 +
 ...Data Based on Order-Identity Decomposition | 0
 ...rom data via risk-seeking policy gradients | 1 +
 ...ing By Solving Derived Non-Parametric MDPs | 1 +
 ...ansformers for End-to-End Object Detection | 1 +
 ...n-Aware Training for Graph Neural Networks | 1 +
 .../iclr/Denoising Diffusion Implicit Models | 1 +
 ...rning via Model-Based Offline Optimization | 1 +
 ...-Graph Networks into Negotiation Dialogues | 1 +
 ...satile Diffusion Model for Audio Synthesis | 1 +
 .../Differentiable Segmentation of Sequences | 1 +
 ...ion Layers for Deep Reinforcement Learning | 1 +
 ... Needs Better Features (or Much More Data) | 1 +
 .../Directed Acyclic Graph Neural Networks | 1 +
 ...adient Descent with Moderate Learning Rate | 1 +
 ...Symbolic Expressions in Informal Documents | 1 +
 ...trategic Behavior via Reward Randomization | 1 +
 ...ssive Orderings with Variational Inference | 1 +
 ... set of policies for the worst case reward | 1 +
 ...rning for Forecasting Multiple Time Series | 1 +
 ...ntangled Recurrent Wasserstein Autoencoder | 1 +
 ...cal Networks for Few-Shot Concept Learning | 1 +
 ...arisation of Deep Networks for Fine-Tuning | 1 +
 ...Reader to Retriever for Question Answering | 1 +
 ...tine-resilient Stochastic Gradient Descent | 1 +
 ...in and Applications to Generative Modeling | 1 +
 ...eneration using a Gaussian Process Trigger | 1 +
 ...3D Shape Reconstruction from 2D Image GANs | 1 +
 ... Representations Vary with Width and Depth | 1 +
 ...mbedding Perturbation for Private Learning | 1 +
 ... network robustness to common corruptions? | 1 +
 .../iclr/Domain Generalization with MixStyle | 1 +
 ...arning with Mutual Information Constraints | 1 +
 ...rNAS: Dirichlet Neural Architecture Search | 1 +
 ...epresentation for Noise-Robust Exploration | 1 +
 ...e Streaming ASR with Full-context Modeling | 1 +
 ...ization in Deep Neural Network Compilation | 1 +
 .../iclr/Dynamic Tensor Rematerialization | 1 +
 ...d Regenerate Images for Continual Learning | 1 +
 ...ks: Double Descent and How to Eliminate it | 1 +
 ... Optimization with Blended Search Strategy | 1 +
 ...tract Reasoning with Dual-Contrast Network | 1 +
 ...m Features: Improved Bounds and Algorithms | 0
 ... Efficient Vote Attack on Capsule Networks | 1 +
 ...Against Patch Attacks on Image Classifiers | 1 +
 ...Cascaded Inference with Expanded Admission | 0
 ...th Modular Networks and Task-Driven Priors | 1 +
 ... Estimation for Unsupervised Stabilization | 1 +
 .../iclr/Efficient Generalized Spherical CNNs | 1 +
 ...ble Interaction in Spiking-neuron Networks | 1 +
 ...ed MDPs with Application to Constrained RL | 1 +
 ... Learning using Actor-Learner Distillation | 1 +
 ...tural Gradients for Reinforcement Learning | 1 +
 .../iclr/EigenGame: PCA as a Nash Equilibrium | 1 +
 ... Rules In Multi-Agent Driving Environments | 1 +
 ...Symbols through Binding in External Memory | 1 +
 ...Entity Problem in Named Entity Recognition | 1 +
 ...imization? A Sample Complexity Perspective | 1 +
 .../iclr/End-to-End Egospheric Spatial Memory | 1 +
 .../End-to-end Adversarial Text-to-Speech | 1 +
 ... guarantees within neural network policies | 1 +
 ... Image Editing via Latent Space Navigation | 1 +
 ...nt descent algorithms and wide flat minima | 1 +
 ...stants of monotone deep equilibrium models | 1 +
 ...ctive Uncertainty in Deep Object Detectors | 1 +
 ... of samples with Smooth Unique Information | 1 +
 ...enerative Models through Manifold Topology | 1 +
 ...s vs Cross-Entropy in Classification Tasks | 2 ++
 ...valuation of Similarity-based Explanations | 1 +
 ...or Explanation through Robustness Analysis | 1 +
 ...Evolving Reinforcement Learning Algorithms | 1 +
 ...han State-of-the-Art Feature Visualization | 0
 .../Explainable Deep One-Class Classification | 1 +
 ...r Forecasting on Temporal Knowledge Graphs | 1 +
 ...Decisions by Interpretable Policy Learning | 1 +
 ...fficacy of Counterfactually Augmented Data | 1 +
 ...Feature Spaces for Representation Learning | 1 +
 ...mplicit Priors in the Infinite-Width Limit | 1 +
 ...iant and Equivariant Graph Neural Networks | 1 +
 ...asks from Zero-Order Trajectory Optimizers | 1 +
 ...e Memorization via Scale of Initialization | 1 +
 ...etric Learning and Behavior Regularization | 0
 ...edge in Structured, Dynamical Environments | 0
 .../Fair Mixup: Fairness via Interpolation | 1 +
 ...rBatch: Batch Selection for Model Fairness | 1 +
 ...iasing Method for Pretrained Text Encoders | 1 +
 ...s on Singular Values of Convolution Layers | 0
 ...arning Of Recurrent Independent Mechanisms | 1 +
 ...ections for Local Robustness Certification | 1 +
 ...nd Massively Parallel Incomplete Verifiers | 1 +
 ...tic subgradient method under interpolation | 1 +
 ...and High-Quality End-to-End Text to Speech | 1 +
 ...eddings for Preserving Euclidean Distances | 1 +
 ... Ensemble Applicable to Federated Learning | 1 +
 ...IID Features via Local Batch Normalization | 1 +
 ...up under Mean Augmented Federated Learning | 1 +
 ...d Learning Based on Dynamic Regularization | 1 +
 ...A New Perspective and Practical Algorithms | 1 +
 ...ter-Client Consistency & Disjoint Learning | 1 +
 ...n Optimization with Deep Kernel Surrogates | 1 +
 ... via Learning the Representation, Provably | 1 +
 .../Fidelity-based Deep Adiabatic Scheduling | 1 +
 ...ction for Crosslingual Embedding Alignment | 1 +
 ...ative Network for Text-to-Speech Synthesis | 1 +
 ...Fooling a Complete Neural Network Verifier | 1 +
 ...tionality implies generalization, provably | 1 +
 ... Parametric Partial Differential Equations | 1 +
 ...ew-shot Learning: Distribution Calibration | 1 +
 ...ith Convolutional Variational Autoencoders | 1 +
 ... to Learning Sparse Representations Online | 1 +
 ...GAN \"Steerability\" without optimization" | 1 +
 ...r Blind Denoising with Single Noisy Images | 1 +
 .../iclr/GANs Can Play Lottery Tickets Too | 1 +
 ...itional Computation and Automatic Sharding | 1 +
 ...isotropic convolutions on geometric graphs | 1 +
 .../Generalization bounds via distillation | 1 +
 ...ata-driven models of primary visual cortex | 1 +
 .../2021/iclr/Generalized Energy Based Models | 1 +
 data/2021/iclr/Generalized Multimodal ELBO | 0
 ...Generalized Variational Continual Learning | 1 +
 ...uter Programs using Optimized Obfuscations | 1 +
 ...ape and Appearance across Multiple Domains | 1 +
 ...n-and-Language Navigation with Bayes' Rule | 1 +
 .../2021/iclr/Generative Scene Graph Networks | 1 +
 ...ve Time-series Modeling with Fourier Flows | 0
 ...y Evolution in Deep Reinforcement Learning | 1 +
 ... Algorithms for Neural Architecture Search | 1 +
 ...e Instance-reweighted Adversarial Training | 1 +
 ...ethod for Explaining Uncertainty Estimates | 1 +
 ...r Neural Networks in the Mean Field Regime | 1 +
 ...r neural networks in the mean-field regime | 1 +
 ...the flow: Adaptive control for Neural ODEs | 1 +
 ...ed Pre-Training for Table Semantic Parsing | 1 +
 ... Typically Occurs at the Edge of Stability | 1 +
 ...t Projection Memory for Continual Learning | 1 +
 ...imization in Massively Multilingual Models | 1 +
 .../Graph Coarsening with Neural Networks | 1 +
 ...tion with Low-rank Learnable Local Filters | 1 +
 data/2021/iclr/Graph Edit Networks | 1 +
 ...mation Bottleneck for Subgraph Recognition | 1 +
 ...ls: A Meta-Algorithm for Scalable Learning | 1 +
 data/2021/iclr/Graph-Based Continual Learning | 1 +
 ...aining Code Representations with Data Flow | 1 +
 ...nite-time Analysis and Improved Complexity | 1 +
 .../Grounded Language Learning Fast and Slow | 1 +
 ...mously-Acquired Skills via Goal Generation | 1 +
 ...nd Events Through Dynamic Visual Reasoning | 1 +
 ...p Equivariant Conditional Neural Processes | 1 +
 ...quivariant Generative Adversarial Networks | 1 +
 ...iant Stand-Alone Self-Attention For Vision | 1 +
 ...ks by Structured Continuous Sparsification | 1 +
 ...Aware Neural Architecture Search Benchmark | 1 +
 ...ory Forecasting with Hallucinative Intents | 1 +
 ...sarial scenarios and generalization bounds | 1 +
 ...derated Learning for Heterogeneous Clients | 1 +
 ...Deep Learning with Adaptive Regularization | 1 +
 ...sive Modeling for Neural Video Compression | 1 +
 ... Learning by Discovering Intrinsic Options | 1 +
 .../iclr/High-Capacity Expert Binary Networks | 1 +
 .../iclr/Hopfield Networks is All You Need | 1 +
 ...p Transformer for Spatiotemporal Reasoning | 1 +
 .../iclr/How Benign is Benign Overfitting ? | 1 +
 ...p Help With Robustness and Generalization? | 1 +
 ...Is Sufficient to Learn Deep ReLU Networks? | 1 +
 ... From Feedforward to Graph Neural Networks | 1 +
 ...aph Attention Design with Self-Supervision | 1 +
 ... No-Press Diplomacy via Equilibrium Search | 1 +
 ...ject and Agent Dynamics with Hypernetworks | 1 +
 ... Towards A Single Model for Multiple Tasks | 1 +
 data/2021/iclr/Hyperbolic Neural Networks++ | 1 +
 ...er Discrete Flows for Lossless Compression | 1 +
 ...-Level Pretext Tasks for Few-Shot Learning | 0
 ...aluating Generalization in Theorem Proving | 1 +
 ...ayer Reordering for Transformer Structures | 1 +
 ...w of Hamiltonian Systems via Meta-Learning | 1 +
 ...le time scales and long-range dependencies | 1 +
 ...ng Deep Reinforcement Learning from Pixels | 1 +
 ...hics and Interpretable 3D Neural Rendering | 1 +
 ... Representation Learning in Linear Bandits | 1 +
 ...nd Three-Layer Networks in Polynomial Time | 1 +
 .../iclr/Implicit Gradient Regularization | 1 +
 data/2021/iclr/Implicit Normalizing Flows | 1 +
 ...Data-Efficient Deep Reinforcement Learning | 1 +
 ...: Towards Accurate and Efficient Detectors | 0
 ...ssive Modeling with Distribution Smoothing | 1 +
 ...p-Norm Distance Metrics Using Half Spaces" | 1 +
 ...ss via Channel-wise Activation Suppressing | 1 +
 ... Spherical Sliced Fused Gromov Wasserstein | 1 +
 ...nce in Contrastive Representation Learning | 1 +
 ...ing VAEs' Robustness to Adversarial Attack | 1 +
 ...r via Disentangled Representation Learning | 1 +
 ...ion Framework for Semi-Supervised Learning | 1 +
 .../In Search of Lost Domain Generalization | 1 +
 ...rmation for Out-of-Distribution Robustness | 1 +
 ...ynamics Models for Improved Generalization | 1 +
 ...vector quantization in deep embedded space | 1 +
 .../iclr/Individually Fair Gradient Boosting | 1 +
 data/2021/iclr/Individually Fair Rankings | 1 +
 ...mporal Networks via Causal Anonymous Walks | 1 +
 ...mation for Generative Adversarial Networks | 1 +
 ...nce Functions in Deep Learning Are Fragile | 1 +
 ... from An Information Theoretic Perspective | 1 +
 .../Information Laundering for Model Privacy | 1 +
 ...Regularization of Factorized Neural Layers | 1 +
 ...ntics into Unsupervised Domain Translation | 1 +
 ...arning Useful Heuristics for Data Labeling | 1 +
 ...lity Using Self-explaining Neural Networks | 1 +
 ...ptimisation with Weisfeiler-Lehman Kernels | 0
 ...s for NLP With Differentiable Edge Masking | 1 +
 ...lation Representation from Word Embeddings | 0
 ...oosting Dropout from a Game-Theoretic View | 1 +
 ...udio-Visual Separation of On-Screen Sounds | 1 +
 ...cit learning ability that regularizes DNNs | 1 +
 ...ling for Learning on 3D Protein Structures | 1 +
 ...ttention Better Than Matrix Decomposition? | 1 +
 ...Knowledge Distillation: An Empirical Study | 1 +
 ...mark for High-level Mathematical Reasoning | 1 +
 ...Network for Generalized Zero-shot Learning | 1 +
 ...d Equivariant Graph Convolutional Networks | 1 +
 ...al Embedding Space: Clusters and Manifolds | 0
 ...learning for emergent systematicity in VQA | 1 +
 ...me Solving via Single Policy Best Response | 1 +
 ...ble, Locally Block Allocated Latent Memory | 1 +
 ...e Distillation as Semiparametric Inference | 1 +
 ...softmax regression representation learning | 1 +
 ...earnable Frontend for Audio Classification | 1 +
 ... long-range Interactions without Attention | 1 +
 ... of Source Code from Structure and Context | 1 +
 ...oblem in Neurobiology and Machine Learning | 1 +
 ...Simulation for Deep Reinforcement Learning | 1 +
 ...-Modulated Generative Adversarial Networks | 1 +
 ...mptotics for deep Gaussian neural networks | 1 +
 .../2021/iclr/Latent Convergent Cross Mapping | 1 +
 ...kill Planning for Exploration and Transfer | 1 +
 ...e Sparsity for the Magnitude-based Pruning | 1 +
 ...le Embedding sizes for Recommender Systems | 1 +
 ...planations for Sequential Decision-Making" | 0
 ...earning A Minimax Optimizer: A Pilot Study | 1 +
 ...ith Global Reference for Image Compression | 1 +
 ...ciative Inference Using Fast Weight Memory | 1 +
 ...ns Using Low-rank Adaptive Label Smoothing | 1 +
 ...or Control with Dynamics Cycle-Consistency | 1 +
 ...atures in Instrumental Variable Regression | 1 +
 ... via Coarse-to-Fine Expanding and Sampling | 0
 ...ed Models by Diffusion Recovery Likelihood | 1 +
 ...l Representations via Interactive Gameplay | 1 +
 ...ic Representations of Topological Features | 1 +
 ...ifferentiable Fluid Models that Generalize | 1 +
 ...nforcement Learning without Reconstruction | 1 +
 ... with Region Proposal Interaction Networks | 1 +
 ...h-Based Representations of Man-Made Shapes | 1 +
 ... Mesh-Based Simulation with Graph Networks | 1 +
 ...ctured Sparse Neural Networks From Scratch | 1 +
 ...ctions for Ordinary Differential Equations | 1 +
 ...mics for Molecular Conformation Generation | 1 +
 ...earning Parametrised Graph Shift Operators | 1 +
 ...mantic Graphs for Video-grounded Dialogues | 1 +
 ...stractions for Hidden-Parameter Block MDPs | 0
 ... Decentralized Neural Barrier Certificates | 1 +
 ...Edits via Incremental Tree Transformations | 1 +
 ...Subgoal Representations with Slow Dynamics | 1 +
 ...osition with Ordered Memory Policy Network | 1 +
 ...ns with Generative Neuro-Symbolic Modeling | 1 +
 ...p Policy Gradients using Residual Variance | 1 +
 ...Learning What To Do by Simulating the Past | 1 +
 ...ng Problems using Variational Autoencoders | 0
 ...ng a Latent Simplex in Input Sparsity Time | 1 +
 ...ed mathematical computations from examples | 1 +
 ...ntations for Deep One-Class Classification | 1 +
 ...rom sparse data with graph neural networks | 1 +
 ...earning explanations that are hard to vary | 1 +
 ...ion with Weakly Supervised Disentanglement | 1 +
 ...tructure with Geometric Vector Perceptrons | 1 +
 ...iding dataset biases without modeling them | 1 +
 ...turbation sets for robust machine learning | 1 +
 ...arning the Pareto Front with Hypernetworks | 2 ++
 ...Augmented Models via Targeted Perturbation | 1 +
 ...D Shapes with Generative Cellular Automata | 1 +
 ...ke Decisions via Submodular Regularization | 0
 ...ach Goals via Iterated Supervised Learning | 1 +
 ...mple Data For Compositional Generalization | 1 +
 ...ues as a Hypergraph on the Action Vertices | 1 +
 ...lobal Contexts in Experience Replay Buffer | 1 +
 ...
Set Waypoints for Audio-Visual Navigation | 1 + ...h separate excitatory and inhibitory units | 1 + ...o: Adversarially Motivated Intrinsic Goals | 1 + ...endent Label Noise: A Progressive Approach | 1 + ...ndent Label Noise: A Sample Sieve Approach | 1 + ...based Support Estimation in Sublinear Time | 1 + ...elong Learning of Compositional Structures | 1 + .../LiftPool: Bidirectional ConvNet Pooling | 1 + ...ecentralized Optimization with Compression | 1 + ...e in Constrained Saddle-point Optimization | 2 ++ ...tivity in Multitask and Continual Learning | 1 + ...nt Ascent with Finite Timescale Separation | 0 ...s for Rank-Constrained Convex Optimization | 1 + ...ee Weight Sharing for Network Width Search | 1 + ...ce of Winning Tickets in Lifelong Learning | 1 + ...a : A Benchmark for Efficient Transformers | 1 + .../Long-tail learning via logit adjustment | 1 + ...Routing Diverse Distribution-Aware Experts | 1 + ...n via Convergence-Simulation Driven Search | 1 + ...tructured Convolutional Models via Lifting | 1 + ...Social Media Users from Facial Recognition | 1 + ...everse accurate integrator for Neural ODEs | 1 + ...ampling for Multi-objective Drug Discovery | 1 + ...-Level Relationships for Few-Shot Learning | 0 ...ated Data Augmentation in the Latent Space | 1 + ...work for Efficient Neural Network Training | 1 + ...ale Organization of Neural Language Models | 1 + ...ing via Self-supervised Skip-tree Training | 1 + ...g Massive Multitask Language Understanding | 1 + .../Memory Optimization for Deep Networks | 1 + data/2021/iclr/Meta Back-Translation | 1 + ...aussian VAE for Unsupervised Meta-Learning | 0 ... Task Distributions in Humans and Machines | 1 + .../Meta-Learning with Neural Tangent Kernels | 1 + ...-learning Symmetries by Reparameterization | 1 + ...Meta-learning with negative learning rates | 1 + ... Normalize Few-Shot Batches Across Domains | 1 + ... 
Experts for Unsupervised Image Clustering | 1 + ...rence in Sequential Latent-Variable Models | 1 + ...ind the Pad - CNNs Can Develop Blind Spots | 1 + .../Minimum Width for Universal Approximation | 1 + ...lgorithm that directly controls perplexity | 1 + ...istillation of Large-scale Language Models | 1 + ...ed-Features Vectors and Subspace Splitting | 0 ...pervised Learning with Momentum Prototypes | 1 + ...onvolutions for Visual Counting and Beyond | 1 + ...oup Performance Gap with Data Augmentation | 1 + data/2021/iclr/Model-Based Offline Planning | 1 + ... with Self-Supervised Functional Distances | 1 + ...odel properties and which model to choose? | 1 + ...er in Distributionally Robust Optimization | 1 + ...ramework for Task-oriented Dialogue System | 1 + ...cule Optimization by Explainable Evolution | 0 .../iclr/Monotonic Kronecker-Factored Lattice | 1 + ...rning with Language Action Value Estimates | 1 + ...ild Convolutional Neural Network Ensembles | 1 + ...ual Information Maximization-based Binning | 1 + ...GD for Heterogeneous Hierarchical Networks | 1 + ...rks by Pruning A Randomly Weighted Network | 1 + ...tworks for Irregularly Sampled Time Series | 1 + ...hastic process identifies causes of cancer | 1 + ...sentation Learning in LSTM Language Models | 1 + ...ion answering over text, tables and images | 1 + data/2021/iclr/Multiplicative Filter Networks | 1 + ...Matching for Out-of-Distribution Detection | 1 + ...ecasting via Conditioned Normalizing Flows | 1 + ...Mutual Information State Intrinsic Control | 1 + ...hology in Graph-Based Incompatible Control | 1 + ...Architecture Search for Speech Recognition | 1 + .../iclr/NBDT: Neural-Backed Decision Tree | 1 + ...Search for End-to-end Learning and Control | 1 + ...ive Features for Robust 3D Pose Estimation | 1 + .../iclr/Nearest Neighbor Machine Translation | 1 + data/2021/iclr/Negative Data Augmentation | 1 + ...F: Effective Deep Modeling of Tabular Data | 1 + ...tters: A Case Study on Retraining 
Variants | 1 + ... Sufficient Statistics for Implicit Models | 1 + ...ours: A Theoretically Inspired Perspective | 1 + ...ackdoor Triggers from Deep Neural Networks | 1 + .../iclr/Neural Delay Differential Equations | 1 + ...t Continuous-Time Prediction and Filtering | 1 + ...orial Problems in Structured Output Spaces | 1 + ...onservation Laws in Deep Learning Dynamics | 1 + ...ual G-Invariances from Single Environments | 1 + data/2021/iclr/Neural ODE Processes | 1 + .../Neural Pruning via Growing Regularization | 1 + .../Neural Spatio-Temporal Point Processes | 1 + ...nthesis of Binaural Speech From Mono Audio | 0 data/2021/iclr/Neural Thompson Sampling | 1 + .../Neural Topic Model via Optimal Transport | 1 + ...al: improved quantized and sparse training | 1 + .../Neural networks with late-phase weights | 1 + ...nd generation for RNA secondary structures | 1 + data/2021/iclr/Neurally Augmented ALISTA | 1 + ...ted Mean Estimation and Variance Reduction | 1 + ...or Making Better Mistakes in Deep Networks | 1 + ...and stable training of energy-based models | 1 + ...el noise helps combat inherent label noise | 1 + ...of Image Backgrounds in Object Recognition | 1 + ...-policy Evaluation: Primal and Dual Bounds | 1 + .../Nonseparable Symplectic Neural Networks | 1 + ...ccelerating Offline Reinforcement Learning | 1 + ...ining for Transfer with Domain Classifiers | 1 + ...a Normalized Maximum Likelihood Estimation | 1 + ...Consistency-Based Semi-Supervised Learning | 1 + ...g and Mitigating Bias in Graph Connections | 1 + ...Adaptation in Model-Agnostic Meta-Learning | 1 + ...eural Networks versus Graph-Augmented MLPs | 1 + ...Retrieval, and Sparse Matrix Factorization | 2 ++ ...Universal Representations Across Languages | 1 + data/2021/iclr/On Position Embeddings in BERT | 1 + ...d Image Representations for GAN Evaluation | 0 ...In Active Learning: How and When to Fix It | 1 + ...al Networks and its Practical Implications | 1 + ...entions in Adaptive Human-AI 
Collaboration | 1 + ...s: Approximation and Optimization Analysis | 1 + ... the Dynamics of Training Attention Models | 1 + ...bal Convergence in Multi-Loss Optimization | 1 + ...ularization in Stochastic Gradient Descent | 1 + ...ptions, Explanations, and Strong Baselines | 1 + ...g: Global Convergence with Implicit Layers | 1 + ...gled Representations in Realistic Settings | 1 + ... Rotation Equivariant Point Cloud Networks | 2 ++ ...ouble Descent Peak in Ridgeless Regression | 1 + ...n and memorization in deep neural networks | 1 + ...networks and Restricted Boltzmann Machines | 1 + ...in model-based deep reinforcement learning | 1 + ...ithic Task Formulations in Neural Networks | 1 + ...fication based on Self-supervised Learning | 1 + ...en Question Answering over Tables and Text | 1 + ...Neural Networks to Spiking Neural Networks | 1 + ...Descent under Neural Tangent Kernel Regime | 1 + ...Regularization can Mitigate Double Descent | 1 + ... Generalized Linear Function Approximation | 1 + ... Evolutionary Graph Reinforcement Learning | 1 + ...olutional Layers with the Cayley Transform | 1 + ...Profit: Instance-Adaptive Data Compression | 1 + ... worst-case generalisation: friend or foe? | 1 + ...ctions for Deep Neural Network Classifiers | 1 + ...frame Reconstruction from Raw Point Clouds | 1 + .../PDE-Driven Spatiotemporal Disentanglement | 1 + ...ng: Principled masking of correlated spans | 1 + ...poral Convolution on Point Cloud Sequences | 1 + ...sformers for Video Representation Learning | 1 + .../2021/iclr/Parameter-Based Value Functions | 1 + ...havioral Priors for Reinforcement Learning | 1 + .../iclr/Partitioned Learned Bloom Filters | 1 + ...ness: Defense Against Unseen Threat Models | 1 + ...arning with First Order Model Optimization | 1 + ... order reduction with guaranteed stability | 1 + ...xed Reward Shaping for Goal-Directed Tasks | 1 + ... 
from Pixels using Inverse Dynamics Models | 1 + ...tion Benchmark with Differentiable Physics | 1 + ...points for Keypoint Based Object Detection | 0 ... Hard-label Black-box Adversarial Examples | 0 ...lo Tree Search Applied to Molecular Design | 1 + ...rrent Learning with a Sparse Approximation | 1 + ...nsformers for Concept-centric Common Sense | 1 + ...ccuracy When Adding New Unobserved Classes | 1 + ...ing Inductive Biases of Pre-Trained Models | 1 + ...fectiousness for Proactive Contact Tracing | 1 + ...sation over directed actions by grid cells | 1 + .../Primal Wasserstein Imitation Learning | 1 + ...stem Side Channels Using Generative Models | 0 data/2021/iclr/Private Post-GAN Boosting | 1 + ...stic Numeric Convolutional Neural Networks | 1 + .../iclr/Probing BERT in Hyperbolic Spaces | 1 + ... more fat from a network at initialization | 1 + ... Conditional Sampling of Normalizing Flows | 1 + ...toencoder via Invertible Mutual Dependence | 1 + ... Theft using an Ensemble of Diverse Models | 1 + ...e Learning of Unsupervised Representations | 1 + ...sentation Learning for Relation Extraction | 1 + ... Learning with Combinatorial Latent States | 1 + ...ion of adversarial examples with detection | 1 + ...able Convergence under K\305\201 Geometry" | 1 + ...itialization: Why Are We Missing the Mark? | 1 + ...ng Pseudo Labels for Semantic Segmentation | 1 + ...LEX: Duplex Dueling Multi-Agent Q-Learning | 1 + ...uantifying Differences in Reward Functions | 1 + ...-GAP: Recursive Gradient Attack on Privacy | 1 + ...prop converges with proper hyper-parameter | 1 + ...ic Rules for Reasoning on Knowledge Graphs | 1 + ...rning Roles to Decompose Multi-Agent Tasks | 1 + data/2021/iclr/Random Feature Attention | 1 + .../iclr/Randomized Automatic Differentiation | 1 + ... Q-Learning: Learning Fast Without a Model | 1 + ...ion in Procedurally-Generated Environments | 1 + ...-Through Gumbel-Softmax Gradient Estimator | 1 + ... 
Learning to Generate Graphs from Datasets | 1 + .../Rapid Task-Solving in Novel Environments | 1 + .../iclr/Recurrent Independent Mechanisms | 1 + ...erative Models with Binary Neural Networks | 1 + ...ive Models via Discriminator Gradient Flow | 1 + ...- An Empirical Study on Continuous Control | 1 + ...Regularized Inverse Reinforcement Learning | 1 + .../Reinforcement Learning with Random Delays | 1 + ...Framework for Multimodal Generative Models | 1 + ...xplanations Reduce Catastrophic Forgetting | 23 +++++++++++++++++++ ...ntributions Using Out-of-Distribution Data | 1 + ...Offline Model-based Reinforcement Learning | 1 + ...th Deep Autoencoding Predictive Components | 1 + ...n Learning via Invariant Causal Mechanisms | 1 + ...tion accuracy of clinical factors from EEG | 1 + ...l Programs with Blended Abstract Semantics | 1 + ...for Robust Out-of-domain Few-Shot Learning | 1 + ...: Neural ODEs and Their Numerical Solution | 1 + ...ifelong Learning with Skill-Space Planning | 1 + ...chitecture Selection in Differentiable NAS | 1 + .../iclr/Rethinking Attention with Performers | 1 + ...ng Coupling in Pre-trained Language Models | 1 + ...sitional Encoding in Language Pre-training | 1 + ...tion: A Bias-Variance Tradeoff Perspective | 1 + ...ibution Methods for Model Interpretability | 1 + ...tion for Code Summarization via Hybrid GNN | 1 + ...tation Learning for Reinforcement Learning | 1 + ...namic Convolution via Matrix Decomposition | 1 + .../Revisiting Few-sample BERT Fine-tuning | 1 + ... for Persistent Long-Term Video Prediction | 1 + ...ing: an Alternative to End-to-end Training | 1 + ...es by Minimizing the Maximal Expected Loss | 1 + ...Analysis of Nonlinear Feedforward Networks | 1 + ...Risk-Averse Offline Reinforcement Learning | 1 + ...re Bayesian Networks in Nearly-Linear Time | 1 + ... 
mitigated by properly learned smoothening | 1 + .../iclr/Robust Pruning at Initialization | 1 + ...bservations with Learned Optimal Adversary | 1 + ...sentation Learning via Random Convolutions | 1 + ...Hindering the memorization of noisy labels | 1 + ...Accurate and Fast Neural Network Inference | 0 ...D: Sign Agnostic Learning with Derivatives | 1 + ...ntation in Conversational Semantic Parsing | 0 ...Networks toward Greedy Block-wise Learning | 0 ...sed Distillation For Visual Representation | 1 + ...orcement Learning in Unstable Environments | 1 + ...e Orthogonal Learned and Random Embeddings | 1 + ...work for Self-Supervised Outlier Detection | 1 + ...erring When Diagnosing Poor Generalization | 1 + ...ntation Strategy for Better Regularization | 1 + ...ient Automated Deep Reinforcement Learning | 1 + ...le Bayesian Inverse Reinforcement Learning | 1 + ...Nonsymmetric Determinantal Point Processes | 1 + ...lable Transfer Learning with Expert Models | 1 + ...ing Gradients for Neural Model Explanation | 1 + ...caling the Convex Barrier with Active Sets | 0 ... 
through Stochastic Differential Equations | 1 + ...tion Can Magnify Disparities Across Groups | 1 + ...causal impact of class selectivity in DNNs | 1 + ...arning of Compressed Video Representations | 1 + ...rvised Policy Adaptation during Deployment | 1 + ...stness for the Low-label, High-data Regime | 1 + ...sed Learning from a Multi-view Perspective | 1 + ...n Learning with Relative Predictive Coding | 1 + ...arning with Object-centric Representations | 1 + ...t Transfer Across Extreme Task Differences | 1 + ...emantic Re-tuning with Contrastive Tension | 0 .../Semi-supervised Keypoint Localization | 1 + ...variance for Enforcing Individual Fairness | 1 + ...aration and Concentration in Deep Networks | 1 + ...f Sequences by Low-Rank Tensor Projections | 1 + ...taneous Optimization of Speed and Accuracy | 1 + ...tructure as Conditional Density Estimation | 1 + ...erstanding Discriminative Features in CNNs | 1 + ...e-Texture Debiased Neural Network Training | 1 + data/2021/iclr/Shapley Explanation Networks | 1 + ...hapley explainability on the data manifold | 1 + ...ific Capacity for Multilingual Translation | 1 + ...ith Gradient-dominated Objective Functions | 1 + ...n for Efficiently Improving Generalization | 1 + ...gsignature transforms, on both CPU and GPU | 1 + ...Goes a Long Way: ADRL for DNN Quantization | 1 + .../iclr/Simple Spectral Graph Convolution | 1 + .../iclr/Single-Photon Image Classification | 3 +++ ...tic Provably Finds Globally Optimal Policy | 1 + ... 
RNN with Strict Upper Computational Limit | 1 + .../iclr/Sliced Kernelized Stein Discrepancy | 1 + ...ement Learning Problems via Task Reduction | 1 + .../iclr/Sparse Quantized Spectral Clustering | 1 + ...ions in probabilistic matrix factorization | 1 + ...ers for Improved Generative Image Modeling | 1 + .../Spatially Structured Recurrent Modules | 1 + ...Spatio-Temporal Graph Scattering Transform | 1 + .../iclr/Stabilized Medical Image Attacks | 1 + ...tistical inference for individual fairness | 1 + ...g Long-Run Dynamics of Energy-Based Models | 1 + ...lation between Augmented Natural Languages | 1 + ...for Pre-trained Language Model Fine-tuning | 1 + ...cks for video-text representation learning | 1 + ...Aware Actor-Critic for 3D Molecular Design | 1 + ...alisation with group invariant predictions | 1 + ...tes on the Fly Helps Language Pre-Training | 0 .../iclr/Taming GANs with Lookahead-Minmax | 1 + ... Networks via Flipping Limited Weight Bits | 1 + .../iclr/Task-Agnostic Morphology Evolution | 1 + ...eaching Temporal Logics to Neural Networks | 1 + data/2021/iclr/Teaching with Commentaries | 1 + ...ally-Extended \316\265-Greedy Exploration" | 0 ...st-Time Adaptation by Entropy Minimization | 1 + ...Generation by Learning from Demonstrations | 1 + ...ine Learners are Good Offline Generalizers | 0 ...imism in Fixed-Dataset Policy Optimization | 1 + ...nsion of Images and Its Impact on Learning | 1 + .../iclr/The Recurrent Neural Tangent Kernel | 1 + .../The Risks of Invariant Risk Minimization | 1 + ...ce of Adaptive Polyak's Heavy-ball Methods | 1 + ...arning Through Spatial Variable Embeddings | 1 + ...ches in Deep Convolutional Kernels Methods | 1 + ...of integration in text classification RNNs | 1 + ...LU networks on orthogonally separable data | 0 ... 
role of Disentanglement in Generalisation | 0 ...ining with Deep Networks on Unlabeled Data | 1 + ...unds on estimation error for meta-learning | 1 + .../iclr/Tilted Empirical Risk Minimization | 1 + ...rvised Bayesian Recovery of Corrupted Data | 1 + ...e Segmentation Using Discrete Morse Theory | 1 + ...for High-fidelity Few-shot Image Synthesis | 1 + .../Towards Impartial Multi-task Learning | 1 + ...n Natural Data with Temporal Sparse Coding | 1 + ...ix Factorization: Greedy Low-Rank Learning | 1 + ...ust Neural Networks via Close-loop Control | 1 + ...gainst Natural Language Word Substitutions | 1 + ...s in Data Augmentation: An Empirical Study | 1 + ...xpressive Power of Random Features in CNNs | 1 + ...ugmentations via Contrastive Discriminator | 1 + ...ependent subnetworks for robust prediction | 1 + ...zation Noise for Extreme Model Compression | 1 + ...n using Equivariant Continuous Convolution | 1 + ...models are unsupervised structure learners | 1 + ...eralisation in Deep Reinforcement Learning | 1 + ...cting Linear Terms in Deep Neural Networks | 1 + .../iclr/Trusted Multi-View Classification | 1 + ...ssion for efficient recommendation systems | 1 + ...RL via Policy Decoupling with Transformers | 0 ...acher for Semi-Supervised Object Detection | 1 + ...ation with Finite-State Probabilistic RNNs | 1 + ...on in Autoregressive Structured Prediction | 1 + ...age Classifiers using Conformal Prediction | 1 + ...rtainty in Gradient Boosting via Ensembles | 1 + ...e Learning for Optimal Bayesian Classifier | 0 ...ization in Generative Adversarial Networks | 1 + ...er Fusion in Sequence-to-Sequence Learning | 1 + ...l Choice in Non-Autoregressive Translation | 1 + ...sm and sparsity on neural network training | 1 + ...odes of out-of-distribution generalization | 1 + ... 
of importance weighting for deep learning | 1 + ...A Nasty Teacher That CANNOT teach students | 1 + ...n by Pixel-to-Segment Contrastive Learning | 1 + ...ural networks via nonlinear control theory | 1 + ...amples: Making Personal Data Unexploitable | 1 + ...visual Synthesis via Exemplar Autoencoders | 1 + ...iscovery of 3D Physical Objects from Video | 1 + ...t-Space Interpolation in Generative Models | 1 + ...earning using Local Spatial Predictability | 1 + ...e Series with Temporal Neighborhood Coding | 1 + ...of Optimal Representations During Training | 1 + ...lyze and leverage compositionality in GANs | 1 + ...-RED2: Video Adaptive Redundancy Reduction | 1 + ...ional Autoencoders and Energy-based Models | 1 + ...ng Causal Effects of Continuous Treatments | 1 + ...sformer Network for Object Goal Navigation | 1 + ...eck for Effective Low-Resource Fine-Tuning | 1 + .../Variational Intrinsic Control Revisited | 1 + ...Localisation and Dense 3D Mapping in 6 DoF | 1 + ...er Networks and Polynomial-time Algorithms | 1 + ...e Models and Can Outperform Them on Images | 1 + ...s for Unsupervised Representation Learning | 1 + ...hanism for Online RL with Unknown Dynamics | 1 + ...mperceptible Warping-based Backdoor Attack | 1 + ...d: Online contextualized few-shot learning | 1 + .../Wasserstein Embedding for Graph Learning | 1 + .../iclr/Wasserstein-2 Generative Networks | 1 + ...cial Perception and Human-AI Collaboration | 1 + ...timating Gradients for Waveform Generation | 1 + ...ual Representation from Human Interactions | 1 + ...Discrimination Good for Transfer Learning? | 1 + ... Actor-Critic Methods? A Large-Scale Study | 1 + ...Not Be Contrastive in Contrastive Learning | 1 + ...ine RL with Linear Function Approximation? | 2 ++ ...dy of inductive biases in seq2seq learners | 2 ++ data/2021/iclr/When Do Curricula Work? | 1 + ...ng f-Divergence is Robust with Label Noise | 1 + ...econditioning help or hurt generalization? 
| 1 + ...ample-Efficient than Fully-Connected Nets? | 1 + ...ng sampling bias with stochastic gradients | 1 + ...nt via Semi-Markov Afterstate Actor-Critic | 1 + ...Scale Data Poisoning via Gradient Matching | 1 + ...erence with Ultra-Low-Precision Arithmetic | 1 + ...ce with Online Learning from User Feedback | 1 + ...l Supervision for Semantic Image Synthesis | 1 + .../Zero-Cost Proxies for Lightweight NAS | 1 + ...t Synthesis with Group-Supervised Learning | 1 + ...stem identification and visuomotor control | 1 + ...gy for Contrastive Representation Learning | 1 + ... Modelling with Missing not at Random Data | 1 + ...bit Optimizers via Block-wise Quantization | 1 + ...Pathways and Imaging Phenotypes of Disease | 0 ...rson Mixing Methods and Their Applications | 0 ... Representative Variable Selection Methods | 1 + ...ent Paradigm for 3D Point Cloud Completion | 1 + ...ional Approach to Clustering Survival Data | 1 + ...ine-Grained Analysis on Distribution Shift | 1 + ...e-Tuning Approach to Belief State Modeling | 1 + ... 
Representation for Reinforcement Learning | 1 + ...-Selection for Stochastic Gradient Descent | 1 + ...d for Computational Learning and Inversion | 1 + ...ss Framework for Randomly Initialized CNNs | 1 + ...ning Instabilities of Deep Learning Models | 0 ...nel Perspective of Infinite Tree Ensembles | 1 + ...l Networks Go Beyond Weisfeiler-Lehman?\"" | 1 + ...Deep RELU Network Under Noisy Observations | 1 + ...m to Build E(N)-Equivariant Steerable CNNs | 0 ...rvative Bandits and Reinforcement Learning | 1 + ...tion in Model-Based Reinforcement Learning | 1 + ...ribution Detection in Deep Neural Networks | 1 + ...Normalizing Flow Toward Energy-Based Model | 1 + ...m Inputs and Advantage over Fixed Features | 1 + .../A Theory of Tournament Representations | 1 + ...Generative Ability of Adversarial Training | 1 + ...ustness Framework for Adversarial Training | 1 + ...s Architecture-Independent Model Distances | 1 + ...mal transport: analysis and implementation | 1 + ...he randomized singular value decomposition | 1 + ...mplicit networks via over-parameterization | 1 + ...rence Applied To Pyramidal Bayesian Models | 1 + ...n Using Adversarial Extreme Value Analysis | 1 + ...-Explicit Matching and Implicit Similarity | 1 + ... Axial Shifted MLP Architecture for Vision | 1 + ...by Pairing GNNs with Neural Wave Functions | 1 + ...ng with Parallel Differentiable Simulation | 1 + ...th Alleviated Forgetting in Local Training | 1 + ...ith Stable Subgoal Representation Learning | 1 + ...n a Large-Scale Imperfect-Information Game | 1 + ...ased towards high entropy optimal policies | 1 + ...Neighbour Discovery in the Structure Space | 1 + ...stance-adaptive Data Augmentation Policies | 1 + ...-Supervised Learning and Domain Adaptation | 1 + ...o Adapt in Transfer Reinforcement Learning | 1 + ...twork for 3D Shape Representation Learning | 1 + ... 
Retriever-Ranker for Dense Text Retrieval | 1 + ...l Robustness Through the Lens of Causality | 1 + data/2022/iclr/Adversarial Support Alignment | 1 + ...ng of Backdoors via Implicit Hypergradient | 1 + .../Adversarially Robust Conformal Prediction | 1 + ...dictions against Adversarial Perturbations | 1 + ...sed Proof Cost Network to Aid Game Solving | 1 + ...iation for Stochastic Bilevel Optimization | 1 + ...lanning and Synthesizable Molecular Design | 1 + ...to Federated Learning with Class Imbalance | 1 + ...Molecular Geometry Generation from Scratch | 1 + ...tive on Model-Based Reinforcement Learning | 1 + ...xt Learning as Implicit Bayesian Inference | 1 + ...arning with Instance-Dependent Label Noise | 1 + ...retic View On Pruning Deep Neural Networks | 1 + ...ayer-Peeled Perspective on Neural Collapse | 1 + ...Variance in Diffusion Probabilistic Models | 1 + ... Landscape of Noise-Contrastive Estimation | 1 + ...Ornstein-Uhlenbeck variational autoencoder | 1 + ...ndom Feature Regression in High Dimensions | 1 + ...ar Data with Internal Contrastive Learning | 1 + ...aly Detection with Association Discrepancy | 1 + ...onfidence Bonuses For Scalable Exploration | 1 + ...r Domain Analysis: From Theory to Practice | 1 + ...ense Prediction with Confidence Adaptivity | 1 + ...Convolutional Models: a Kernel Perspective | 1 + ...ing Generalization of SGD via Disagreement | 1 + ...on that Works on CNN, RNN, and Transformer | 0 ...ally-invariant Classification in OOD Tasks | 0 ...ased adversarial black-box methods is easy | 1 + ...Interpretability with Concept Transformers | 1 + ...ightweight, Noise-Robust, and Transferable | 1 + .../Augmented Sliced Wasserstein Distances | 1 + ...ning to Route Transferable Representations | 1 + ...aling Vision Transformers without Training | 1 + ...omated Self-Supervised Learning for Graphs | 1 + ...mize Problems with Strong Ranking Property | 1 + ...ntric Abstractions for High-Level Planning | 1 + ...ement Learning: Formalism and 
Benchmarking | 1 + .../2022/iclr/Autoregressive Diffusion Models | 1 + ...lows for Predictive Uncertainty Estimation | 1 + ...Search, Retrieval, and Similarity Learning | 1 + .../2022/iclr/BAM: Bayes with Adaptive Memory | 1 + ...for Fast and High-Quality Speech Synthesis | 1 + ...T: BERT Pre-Training of Image Transformers | 1 + ... Improving Real-time Predictions in Future | 1 + ...efense via Decoupling the Training Process | 1 + ...tacks to Pre-trained NLP Foundation Models | 1 + ...gation Boosts Self-supervised Distillation | 1 + ...ack, and Self-Reinforcing User Preferences | 0 .../Bayesian Framework for Gradient Leakage | 1 + ...r Learning to Optimize: What, Why, and How | 1 + .../Bayesian Neural Network Priors Revisited | 1 + ...marking the Spectrum of Agent Capabilities | 1 + ...visory Signals by Observing Learning Paths | 1 + ...Adversarial Examples for Black-box Domains | 1 + ...orks for Multi-goal Reinforcement Learning | 0 .../BiBERT: Accurate Fully Binarized BERT | 1 + ...r Phase Retrieval of Meromorphic Functions | 1 + .../Boosted Curriculum Reinforcement Learning | 1 + ...moothing with Variance Reduced Classifiers | 1 + ...ied Robustness of L-infinity Distance Nets | 1 + data/2022/iclr/Bootstrapped Meta-Learning | 1 + ...mantic Segmentation with Regional Contrast | 1 + .../iclr/Bregman Gradient Policy Optimization | 1 + ...Marketing via Recurrent Intensity Modeling | 1 + ... 
Problems with Inscrutable Representations | 1 + ...ive Approach to Exploring Many-to-one Maps | 1 + ...ng on Heterogeneous Datasets via Bucketing | 1 + ...urriculum for Learning Goal-Reaching Tasks | 1 + ...entiable Data Augmentation for EEG Signals | 1 + ...sformer for Unsupervised Domain Adaptation | 1 + ...ous Kernel Convolution For Sequential Data | 1 + ...te Research Transparency and Comparability | 1 + ...rcement Learning against Poisoning Attacks | 1 + ...tionary Distribution Correction Estimation | 1 + ...ment Learning through Functional Smoothing | 1 + ...Classifier Suffice For Action Recognition? | 1 + ...early Classified Under All Possible Views? | 1 + ...Locality in Non-parametric Language Models | 1 + ...lization in textual reinforcement learning | 1 + ...extual Bandits with Targeted Interventions | 1 + ...rium Models via Interval Bound Propagation | 1 + ...trastive Learning via Augmentation Overlap | 1 + ...rs via Gradient-based Subword Tokenization | 1 + ...ion-Aware Molecule Representation Learning | 1 + ...ive GAN for Conditional Waveform Synthesis | 1 + .../iclr/Churn Reduction via Distillation | 1 + ... 
Inverse Task for Dynamic Scene Deblurring | 1 +
 ...e Awareness by Generating Images of Floods | 1 +
 ...ng Generative Models in Zero-shot Learning | 1 +
 ...ontrastive BERT for Reinforcement Learning | 1 +
 .../iclr/CoMPS: Continual Meta Policy Search | 1 +
 ...epresentations for Time Series Forecasting | 1 +
 ...ng an Extensible Relational Representation | 1 +
 ...ime Series for Accelerated Active Learning | 0
 ...s with Incomplete or Missing Neighborhoods | 1 +
 ...g Class-conditional GANs with Limited Data | 1 +
 ...easoning of Objects and Events from Videos | 1 +
 ...ritic Methods for Homogeneous Markov Games | 1 +
 ...ng Differences that Affect Decision Making | 1 +
 ...-Neuron Relaxation Guided Branch-and-Bound | 1 +
 ...ention: Disentangling Search and Retrieval | 1 +
 ...ining for End-to-End Deep AUC Maximization | 1 +
 ...ngle Source Cross-Domain Few-Shot Learning | 1 +
 ...ersarial Learning for Large-Batch Training | 1 +
 ...nditional Contrastive Learning with Kernel | 1 +
 ... by Conditioning Variational Auto-Encoders | 1 +
 ...itional Object-Centric Learning from Video | 1 +
 ...sequence Networks with Learned Activations | 0
 ...iable Model of Whole-Brain Neural Activity | 1 +
 ...Consistent Counterfactuals for Deep Models | 1 +
 ...mical System Identification and Prediction | 1 +
 ...icy Optimization via Bayesian World Models | 1 +
 ...ing Linear-chain CRFs to Regular Languages | 1 +
 ...hogonal Convolutions in an Explicit Manner | 1 +
 ... Transfer using Generalized Policy Updates | 1 +
 ... Manipulations with Differentiable Physics | 1 +
 ...text-Aware Sparse Deep Coordination Graphs | 1 +
 ...ation for Generative Commonsense Reasoning | 1 +
 ...ntinual Learning with Filter Atom Swapping | 1 +
 ...rning with Recursive Gradient Optimization | 1 +
 ...ormalization for Online Continual Learning | 1 +
 ...Learning with Forward Mode Differentiation | 1 +
 ...s via Reward-Switching Policy Optimization | 1 +
 ...Parallel Data for Unsupervised Translation | 0
 ...tering via Generative Adversarial Networks | 1 +
 ...ling Directions Orthogonal to a Classifier | 1 +
 ...ipschitz Constant improves Polynomial Nets | 1 +
 data/2022/iclr/Convergent Graph Solvers | 1 +
 ...nt and Efficient Deep Q Learning Algorithm | 0
 ...presentation with a Split MLP Architecture | 1 +
 ... Modules Through a Shared Global Workspace | 1 +
 ...ctual Plans under Distributional Ambiguity | 1 +
 ...raining Sets via Weak Indirect Supervision | 1 +
 ...itical Points in Quantum Generative Models | 1 +
 ...n Imitation Learning via Optimal Transport | 1 +
 ...eighted Language-Invariant Representations | 1 +
 ...earning for Zero-Shot Generalization in RL | 1 +
 ...g to Search in Bottom-Up Program Synthesis | 1 +
 ...ansformer Hinging on Cross-scale Attention | 1 +
 ... for Open-Set Single Domain Generalization | 1 +
 ... Human Demonstrations for Offline Learning | 1 +
 ...toencoder for Periodic Material Generation | 1 +
 ...o uncover learning principles in the brain | 1 +
 ...namic Scale Networks for Multi-View Stereo | 1 +
 ...MLP-like Architecture for Dense Prediction | 1 +
 ...losed-form ODEs from Observed Trajectories | 1 +
 ...c Anchor Boxes are Better Queries for DETR | 1 +
 ...entation in Offline Reinforcement Learning | 1 +
 ...ased Explanation for Graph Neural Networks | 1 +
 ...rning for Periodic Time Series Forecasting | 1 +
 ...aneous Explanations via Concept Traversals | 1 +
 ...IVA: Dataset Derivative of a Learning Task | 1 +
 ...ering Layer for Neural Network Compression | 1 +
 ... Learning Requires Explicit Regularization | 1 +
 ...nition with Optimal Transport Distillation | 1 +
 ...ing Won't Save You From Facial Recognition | 1 +
 ...ion for Architecting Hardware Accelerators | 1 +
 ... Grammar Learning for Molecular Generation | 1 +
 ...ol with a Deep Stochastic Koopman Operator | 0
 ...ity in MARL via Trust-Region Decomposition | 1 +
 ... Multi-Agent Kernel Approximation Approach | 1 +
 ...clarative nets that are equilibrium models | 1 +
 ...tive Biases of Hamiltonian Neural Networks | 1 +
 ...aptation for Cross-Domain Object Detection | 1 +
 .../iclr/Deep Attentive Variational Inference | 1 +
 data/2022/iclr/Deep AutoAugment | 1 +
 ...he All-Round Blessings of Dynamic Sparsity | 1 +
 ...haping the Kernel with Tailored Rectifiers | 1 +
 .../2022/iclr/Deep Point Cloud Reconstruction | 1 +
 ...eep ReLU Networks Preserve Expected Length | 1 +
 ...ruptions Through Adversarial Augmentations | 1 +
 ...sis for Evaluation of Data Representations | 1 +
 ...ith Supplementary Imperfect Demonstrations | 1 +
 ...ization Models and Implicit Regularization | 1 +
 ...ty in Automatic Speech Recognition Systems | 1 +
 ...or Conditional Score-based Data Generation | 1 +
 ...r: Tiny Transformer with Shared Dictionary | 1 +
 ...Deformable Object Manipulations with Tools | 1 +
 data/2022/iclr/Differentiable DAG Sampling | 1 +
 ...ximization for Set Representation Learning | 1 +
 ... Scene Reconstructions from a Single Image | 1 +
 ...d Language Models Better Few-shot Learners | 1 +
 ...Scaffolding Tree for Molecule Optimization | 1 +
 ...lly Private Fine-tuning of Language Models | 1 +
 ...ents Estimation with Polylogarithmic Space | 1 +
 ...th Fast Maximum Likelihood Sampling Scheme | 1 +
 ...overy for State Covering and Goal Reaching | 1 +
 ...riant Rationales for Graph Neural Networks | 1 +
 ...iscovering Latent Concepts Learned in BERT | 1 +
 ... Scarce Data with Physics-encoded Learning | 1 +
 ...ning the Representation Bottleneck of DNNS | 1 +
 ...ased Active Learning for Domain Adaptation | 1 +
 ...s Strengthen Vision Transformer Robustness | 1 +
 ...criminative Similarity for Data Clustering | 1 +
 ...sis with Partial Information Decomposition | 1 +
 ...lets for X2I Translation with Limited Data | 1 +
 ...stribution Compression in Near-Linear Time | 1 +
 ...nforcement Learning with Monotonic Splines | 1 +
 ...Principal Components via Geodesic Descents | 1 +
 ...t Models with Parametric Likelihood Ratios | 1 +
 ...s from Periodically Shifting Distributions | 1 +
 .../Dive Deeper Into Integral Pose Regression | 1 +
 ...e-aware Federated Self-Supervised Learning | 1 +
 ...rated Learning via Submodular Maximization | 0
 ...s Image Recognition Performance in AlexNet | 1 +
 ...al Coordinates on the Latent Space of GANs | 1 +
 ...ision? A User Study, Baseline, And Dataset | 1 +
 ...We Need Anisotropic Graph Neural Networks? | 1 +
 ...works transfer invariances across classes? | 1 +
 ...thing on graphs with tabular node features | 1 +
 ...n Adversarial Training: A Game Perspective | 1 +
 ...tematic Errors with Cross-Modal Embeddings | 1 +
 ...ne Learning Using Second-Order Information | 1 +
 ... Stimuli Induced Patterns in M EEG Signals | 1 +
 ...or Doubly Efficient Reinforcement Learning | 1 +
 data/2022/iclr/Dual Lottery Ticket Hypothesis | 1 +
 ...Normalization improves Vision Transformers | 1 +
 ...are Comparison of Learned Reward Functions | 1 +
 ...tion Neural Networks in Contextual Bandits | 1 +
 ...ion Transformers via Token Reorganizations | 0
 ...raining via Extreme Activation Compression | 1 +
 ...catastrophic forgetting in neural networks | 0
 ...cation by Scheduled Grow-and-Prune Methods | 1 +
 ...ch for Combinatorial Optimization Problems | 1 +
 ...-Width Neural Networks that Learn Features | 0
 ...g Policy via Human-AI Copilot Optimization | 1 +
 ...l Discovery without Acyclicity Constraints | 1 +
 ...n Transformers for Representation Learning | 1 +
 ...n for Improved Training of Neural Networks | 1 +
 ...ng for On-Demand and In-Situ Customization | 1 +
 ...mers via Adaptive Fourier Neural Operators | 0
 ...l Prediction with General Function Classes | 1 +
 ...ong Sequences with Structured State Spaces | 1 +
 ...en playing games is better than optimizing | 1 +
 ...c Objectives with Skewed Hessian Spectrums | 1 +
 ... Manipulations with Einstein-like Notation | 1 +
 ...from SGD with Truncated Heavy-tailed Noise | 1 +
 ...arning and explicit probabilistic modeling | 1 +
 .../2022/iclr/Emergent Communication at Scale | 1 +
 ...ation Objectives with Adaptive Tree Search | 1 +
 ...rsity for Fixed-to-Fixed Model Compression | 1 +
 ...ing of Probabilistic Hierarchies on Graphs | 1 +
 ... to Valuation Problems in Machine Learning | 1 +
 ...spired Molecular Conformation Optimization | 1 +
 ...g Cross-lingual Transfer by Manifold Mixup | 1 +
 ...ntQA: Entity Linking as Question Answering | 1 +
 ...ntropy Model for Learned Image Compression | 1 +
 ...nt Predictive Coding for Visual Navigation | 1 +
 ... Graph Mechanics Networks with Constraints | 1 +
 ...ncouraging Equivariance in Representations | 1 +
 .../Equivariant Subgraph Aggregation Networks | 1 +
 ... Neural Network based Molecular Potentials | 1 +
 ...ng for More Powerful Graph Neural Networks | 1 +
 ...ined nonconvex-nonconcave minimax problems | 1 +
 ...with Orthogonal Projected Gradient Descent | 1 +
 ...entanglement of Structured Representations | 1 +
 ...nal Distortion in Neural Language Modeling | 1 +
 ...lanner Amortization for Continuous Control | 1 +
 ...roblems, Pitfalls, and Practical Solutions | 1 +
 data/2022/iclr/Evidential Turing Processes | 1 +
 ...based Selection for Reinforcement Learning | 1 +
 ...e Multi-Task Scaling for Transfer Learning | 1 +
 ...ble GNN-Based Models over Knowledge Graphs | 1 +
 ...earning Interpretable Temporal Logic Rules | 1 +
 ... based on Directional Feature Interactions | 1 +
 ...ctivation Value for Partial-Label Learning | 1 +
 ...oring Memorization in Adversarial Training | 1 +
 ...ompression for pre-trained language models | 1 +
 ...ing the Limits of Large Scale Pre-training | 1 +
 ...d Language Models via Metropolis--Hastings | 1 +
 ...mation Properties of Graph Neural Networks | 1 +
 ...Contextual Complexity and Unpredictability | 1 +
 ...ILDS Benchmark for Unsupervised Adaptation | 1 +
 ...ly Multiplication for Network Quantization | 1 +
 ...tic descriptions, and Conceptual Relations | 1 +
 ...ed Interactive Language-Image Pre-Training | 1 +
 ...tructions in Language with Modular Methods | 1 +
 ...Transformer Advanced by Fully Pre-training | 1 +
 data/2022/iclr/Fair Normalizing Flows | 1 +
 ...Fairness Calibration for Face Verification | 1 +
 ...airness Guarantees under Demographic Shift | 1 +
 ...periments on Conditional Language Modeling | 0
 data/2022/iclr/Fast AdvProp | 1 +
 .../Fast Differentiable Matrix Square Root | 1 +
 ...for Model Interpretability and Compression | 1 +
 data/2022/iclr/Fast Model Editing at Scale | 1 +
 .../Fast Regression for Structured Inputs | 1 +
 ...gical clustering with Wasserstein distance | 1 +
 ...stSHAP: Real-Time Shapley Value Estimation | 1 +
 data/2022/iclr/Feature Kernel Distillation | 1 +
 ...ntation for Federated Image Classification | 1 +
 ...l Communication Cost in Federated Learning | 1 +
 ...Communication-Efficient Federated Learning | 1 +
 ...ata with Class-conditional-sharing Clients | 1 +
 ...Backdoor Attacks on Visual Object Tracking | 1 +
 ...arning via Dirichlet Tessellation Ensemble | 0
 ...Series Imputation by Graph Neural Networks | 1 +
 ...g of Counterfactual Physics in Pixel Space | 1 +
 ...rially Robust Features via Metameric Tasks | 1 +
 ...ter in each of your Deep Generative Models | 1 +
 ...tures and Underperform Out-of-Distribution | 1 +
 ...le Physics: A Yarn-level Model for Fabrics | 1 +
 ...ned Language Models are Zero-Shot Learners | 1 +
 ...Reinforcement Learning with Average Reward | 0
 ...ography: Train the images, not the network | 1 +
 ...volutions With Differentiable Kernel Sizes | 1 +
 ...d: Group Distributional Robustness Follows | 1 +
 .../Fooling Explanations in Text Classifiers | 1 +
 ...itous Forgetting in Connectionist Networks | 1 +
 ...r Invariant and Equivariant Network Design | 1 +
 ... Embedding Learning with Provable Benefits | 1 +
 ...vel Perspective to Optimize Recommendation | 1 +
 ...ing Any GNN with Local Structure Awareness | 1 +
 ...al Training for Simulation-Based Inference | 1 +
 ... Min-Imax Optimization via Anderson Mixing | 1 +
 ...ricks for Subgraph Representation Learning | 1 +
 ...ter? Revisiting GNN for Question Answering | 1 +
 ... Modeling based on Global Contexts via GNN | 1 +
 ... End-to-End Task-Oriented Dialogue Systems | 0
 ... Graph Neural Diffusion with A Source Term | 1 +
 .../Gaussian Mixture Convolution Networks | 1 +
 ... for Experimental Design in Drug Discovery | 1 +
 ...ement Learning through Logical Composition | 1 +
 ...on Through the Lens of Leave-One-Out Error | 1 +
 ...Through the Lens of Adversarial Robustness | 1 +
 ...for Offline Hindsight Information Matching | 1 +
 ...ized Demographic Parity for Group Fairness | 1 +
 data/2022/iclr/Generalized Kernel Thinning | 1 +
 ...ws in Hidden Convex-Concave Games and GANs | 1 +
 ...et covariance models for texture synthesis | 1 +
 ...lizing Few-Shot NAS with Gradient Matching | 1 +
 ...e Implicit Generative Adversarial Networks | 1 +
 ...ative Modeling with Optimal Transport Maps | 1 +
 ...urce for Multiview Representation Learning | 1 +
 ...ated Exploration in Reinforcement Learning | 1 +
 .../Generative Principal Component Analysis | 1 +
 .../iclr/Generative Pseudo-Inverse Memory | 1 +
 ...odel for Molecular Conformation Generation | 1 +
 ...s for Protein Interface Contact Prediction | 1 +
 ...s improve E(3) Equivariant Message Passing | 1 +
 ...entation with Implicit Displacement Fields | 1 +
 ...A Heavy-Neck Paradigm for Object Detection | 1 +
 ...ix Learning in Trainable Embedding Indexes | 1 +
 ... Policy Gradient in Markov Potential Games | 1 +
 ...d Planning via Hindsight Experience Replay | 1 +
 ...Neural Networks using Gradient Information | 1 +
 ...rmance Inference with Theoretical Insights | 1 +
 ...tance Learning for Incomplete Observations | 1 +
 ...mization by Back-propagating through Model | 1 +
 ...radient Matching for Domain Generalization | 1 +
 ...Step Denoiser for convergent Plug-and-Play | 1 +
 ...fies genomic loci regulating transcription | 1 +
 ...ia Neighborhood Wasserstein Reconstruction | 1 +
 ...aph Condensation for Graph Neural Networks | 1 +
 ...arch for the Traveling Salesperson Problem | 1 +
 ... Structural and Positional Representations | 1 +
 ... Anomaly Detection of Multiple Time Series | 1 +
 ...regularly Sampled Multivariate Time Series | 1 +
 .../iclr/Graph-Relational Domain Adaptation | 1 +
 ...arest Neighbor Search in Hyperbolic Spaces | 0
 ...ching Old MLPs New Tricks Via Distillation | 1 +
 ...s for Class-Imbalanced Node Classification | 0
 ...Testing of Networks: Algorithms and Theory | 1 +
 ...: Graph REASoning Enhanced Language Models | 1 +
 ...up equivariant neural posterior estimation | 1 +
 ...e Parallelism for Large-scale DNN Training | 1 +
 ...-Training and Prompting of Language Models | 1 +
 ...verse Gradients for Physical Deep Learning | 1 +
 ...hifts on Graphs: An Invariance Perspective | 1 +
 ...ncoder For Irregularly Sampled Time Series | 1 +
 ...nerative Models with Closed-Form Solutions | 1 +
 ...ace Models For Changing Dynamics Scenarios | 1 +
 ...hot Imitation with Skill Transition Models | 1 +
 ...emory for Few-shot Learning Across Domains | 1 +
 ...Nonconvex Algorithms with AdaGrad Stepsize | 1 +
 ...ounds with Fast Rates for Minimax Problems | 1 +
 ...Relabeling for Meta-Reinforcement Learning | 1 +
 ...aging Past Traversals to Aid 3D Perception | 1 +
 ...rievers for improved open-ended generation | 1 +
 ...ree Compatible Training in Image Retrieval | 0
 ...ow Attentive are Graph Attention Networks? | 1 +
 ...ntly Assessing Machine Learning API Shifts | 1 +
 .../iclr/How Do Vision Transformers Work? | 1 +
 ... with Self-supervised Contrastive Learning | 1 +
 ...Memory for Error in Low-Precision Training | 1 +
 ...an CLIP Benefit Vision-and-Language Tasks? | 1 +
 ... Pre-Training Perform with Streaming Data? | 1 +
 ...eep networks: a loss landscape perspective | 1 +
 ...Consistency: Logit Anchoring on Clean Data | 1 +
 ...s? A Zeroth-Order Optimization Perspective | 1 +
 ...r MAML to Excel in Few-Shot Classification | 1 +
 ... missing data in supervised deep learning? | 1 +
 ...g? A one-hidden-layer theoretical analysis | 1 +
 ...ls for Non-stationary Time Series Analysis | 1 +
 ... Learning via Hybrid Action Representation | 1 +
 ...Learning with Heterogeneous Communications | 1 +
 ...rence at the Discrete-Continuous Interface | 1 +
 data/2022/iclr/Hybrid Random Features | 1 +
 ...ion Method for Deep Reinforcement Learning | 1 +
 ...ter Tuning with Renyi Differential Privacy | 1 +
 ...nctional Relationships in 3D Indoor Scenes | 1 +
 ...U: Efficient GCN Training via Lazy Updates | 1 +
 ... Approach to Out-of-Distribution Detection | 1 +
 .../iclr/Illiterate DALL-E Learns to Compose | 1 +
 ...ge BERT Pre-training with Online Tokenizer | 0
 data/2022/iclr/Imbedding Deep Neural Networks | 1 +
 ...itation Learning by Reinforcement Learning | 1 +
 ...ervations under Transition Model Disparity | 1 +
 ...ersarial Training for Deep Neural Networks | 0
 ...tion in Underparameterized Neural Networks | 1 +
 ...covery of Subspaces of Unknown Codimension | 1 +
 ...ic l2 robustness on CIFAR-10 and CIFAR-100 | 1 +
 ... Recognition via Privacy-Agnostic Clusters | 1 +
 ...tion with Annealed and Energy-Based Bounds | 1 +
 ...ve Translation Models Without Distillation | 1 +
 ...ample Weights for Imbalance Classification | 1 +
 ...oals for Following Temporal Specifications | 1 +
 ...l Extraction with Calibrated Proof of Work | 1 +
 ...egative Detection for Contrastive Learning | 1 +
 ...odels for End-to-End Rigid Protein Docking | 1 +
 ...ediction Using Analogy Subgraph Embeddings | 1 +
 ...AN: Towards Infinite-Pixel Image Synthesis | 1 +
 ...ct Analysis of (Quantized) Neural Networks | 1 +
 ... to Graph Active Learning with Soft Labels | 1 +
 ...rough Empowerment in Visual Model-based RL | 1 +
 ...ne Memory Selection for Continual Learning | 1 +
 ...atless Compression of Stochastic Gradients | 1 +
 ...tour Stochastic Gradient Langevin Dynamics | 1 +
 ...d Diversity Denoising and Artefact Removal | 1 +
 ...ing for Out-of-Distribution Generalization | 1 +
 ...ng Non-Stationary and Reactionary Policies | 1 +
 ...sing Subgroup Gaps in Deep Metric Learning | 1 +
 ... in RL? A Case Study in Continuous Control | 1 +
 ...ily a Necessity for Graph Neural Networks? | 1 +
 ...compatible with Interpolating Classifiers? | 1 +
 ...f Play for Automatic Curriculum Generation | 1 +
 ...o to Tango: Mixup for Deep Metric Learning | 1 +
 ...rative and Byzantine Decentralized Teaming | 1 +
 ... for Antibody Sequence-Structure Co-design | 1 +
 ...ues: a measure of joint feature importance | 1 +
 data/2022/iclr/KL Guided Domain Adaptation | 0
 ...l Control Policies Through Robot-Awareness | 1 +
 ...ction Relations for Reinforcement Learning | 1 +
 data/2022/iclr/Knowledge Infused Decoding | 1 +
 ...moval in Sampling-based Bayesian Inference | 1 +
 .../L0-Sparse Canonical Correlation Analysis | 0
 ...uage Learning Based on Prompt Tuning of T5 | 1 +
 ...eration Selection for Multi-Agent Learning | 1 +
 ...ure in Neural Rough Differential Equations | 1 +
 .../Label Encoding for Regression Networks | 1 +
 ...and Protection in Two-party Split Learning | 1 +
 ...emantic Segmentation with Diffusion Models | 1 +
 ...ssion with weighted low-rank factorization | 1 +
 ...Language modeling via stochastic processes | 1 +
 ...aluation based on semantic representations | 1 +
 .../Language-driven Semantic Segmentation | 1 +
 ... Be Strong Differentially Private Learners | 1 +
 ...ogeneity: Convergence and Balancing Effect | 1 +
 ...ation Learning on Graphs via Bootstrapping | 1 +
 ...Animate Images via Latent Space Navigation | 1 +
 ...rs for Joint Multi-Agent Motion Prediction | 1 +
 ...gorithm for Training Graph Neural Networks | 1 +
 ...ugh Adversarial Invertible Transformations | 1 +
 ...input via mixed and anisotropic smoothness | 1 +
 .../iclr/Learned Simulators for Turbulence | 0
 ...hirality with Invariance to Bond Rotations | 1 +
 ...orcement Learning without External Rewards | 1 +
 ...on by Masked Multimodal Cluster Prediction | 1 +
 ...oment Restrictions by Importance Weighting | 1 +
 ... Environment Fields via Implicit Functions | 1 +
 ...gression with Power-Law Priors and Targets | 1 +
 ...ning Curves for SGD on Structured Features | 1 +
 ...Encoder using Natural Evolution Strategies | 1 +
 ...rative Models: A Contrastive Learning View | 1 +
 ...Models at Scale via Composite Optimization | 1 +
 ...Networks via Structure-Regularized Pruning | 0
 ...Bin Packing on Packing Configuration Trees | 1 +
 ... by Differentiating Through Sample Quality | 1 +
 ...hod based on Complementary Learning System | 1 +
 ...arning Features with Parameter-Free Layers | 1 +
 ...ve Meta-learner of Behavioral Similarities | 1 +
 ...ield Games and Approximate Nash Equilibria | 1 +
 ...nal Networks on the Stochastic Block Model | 1 +
 ...ith Differentiable Nondeterministic Stacks | 1 +
 ...bution via Randomized Return Decomposition | 1 +
 ...Multimodal VAEs through Mutual Supervision | 1 +
 ...ntextual Bandits through Perturbed Rewards | 1 +
 ...t-Oriented Dynamics for Planning from Text | 1 +
 .../Learning Optimal Conformal Classifiers | 1 +
 ...nted Set Representations for Meta-Learning | 1 +
 ... One-Shot, Any-Sparsity, And No Retraining | 1 +
 ... Fisher Kernel with Low-rank Approximation | 1 +
 ...ving Two-stage Stochastic Integer Programs | 1 +
 ...ns via Retracing in Reinforcement Learning | 1 +
 ...g Strides in Convolutional Neural Networks | 1 +
 ...earning Super-Features for Image Retrieval | 1 +
 ...Reward Networks for Reinforcement Learning | 1 +
 ...atent Processes from General Temporal Data | 1 +
 .../iclr/Learning Towards The Largest Margins | 1 +
 ...Object Localization with Policy Adaptation | 1 +
 ...ions from Undirected State-only Experience | 1 +
 ...Architectures by Propagating Network Codes | 1 +
 ...n End-to-End with Cross-Modal Transformers | 1 +
 ...kly-supervised Contrastive Representations | 1 +
 ...nline adaptation in Reinforcement Learning | 1 +
 .../Learning by Directional Gradient Descent | 1 +
 ...ks: Self-knowledge transfer and forgetting | 1 +
 .../iclr/Learning meta-features for AutoML | 1 +
 ...more skills through optimistic exploration | 1 +
 ... Observations with Finite Element Networks | 1 +
 ...e Part Segmentation with Gradient Matching | 1 +
 .../Learning to Complete Code with Sketches | 1 +
 ...earning to Dequantise with Truncated Flows | 1 +
 ...gmentation of Ultra-High Resolution Images | 1 +
 ...Molecular Scaffolds with Structural Motifs | 1 +
 ...lize across Domains on Single Test Samples | 1 +
 ...be Guided in the Architect-Builder Problem | 1 +
 ...to Map for Active Semantic Goal Navigation | 1 +
 ...ng Memory Networks for Traffic Forecasting | 1 +
 ...e Learning rate with Graph Neural Networks | 1 +
 ... with hierarchical latent mixture policies | 1 +
 ...A Study Using Real-World Human Annotations | 1 +
 .../Learning-Augmented $k$-means Clustering | 1 +
 ...it Tests for Unsupervised Code Translation | 1 +
 ...to predict out-of-distribution performance | 1 +
 ...Bridge using Forward-Backward SDEs Theory" | 1 +
 ... and Natural Languages via Corpus Transfer | 1 +
 ...z-constrained Unsupervised Skill Discovery | 1 +
 ...w-Rank Adaptation of Large Language Models | 1 +
 ...r Generalization in Reinforcement Learning | 1 +
 ...ng Expressive Memory for Sequence Modeling | 1 +
 ...iences For Class task Incremental Learning | 1 +
 ...ss Compression with Probabilistic Circuits | 1 +
 ...t as Entropy Constrained Optimal Transport | 1 +
 ... Distance: An Integer Programming Approach | 1 +
 ...oisy Contrastive Learner in Classification | 1 +
 ...el with Neural Transport Latent Space MCMC | 1 +
 ...ical Performance via Hierarchical Modeling | 1 +
 ... Multi-Task Multitrack Music Transcription | 1 +
 ...ative Network Manifolds Without Retraining | 1 +
 ... Neural Scaling Law and Minimax Optimality | 1 +
 ...fficient exploration in novel environments | 1 +
 ...guage Models to Grounded Conceptual Spaces | 1 +
 ... adaptation under generalized target shift | 1 +
 ...oved Data-Augmented Reinforcement Learning | 1 +
 ...e Diversity in Deep Reinforcement Learning | 1 +
 ... (Provably) Solves Some Robust RL Problems | 1 +
 ...aximum n-times Coverage for Vaccine Design | 1 +
 ...ack-box Testing of Visual Reasoning Models | 1 +
 ...esentations via Quantized Reversed Probing | 1 +
 data/2022/iclr/Memorizing Transformers | 1 +
 ...ory Augmented Optimizers for Deep Learning | 1 +
 ...th Data Compression for Continual Learning | 1 +
 ...nsformers through entity mention attention | 1 +
 .../iclr/Message Passing Neural PDE Solvers | 1 +
 ...over Novel Classes given Very Limited Data | 1 +
 ...for Energy Based Deterministic Uncertainty | 1 +
 ... Learning by Watching Video Demonstrations | 1 +
 ...ith Fewer Tasks through Task Interpolation | 1 +
 ...ng Universal Controllers with Transformers | 1 +
 ...Distribution Shifts and Training Conflicts | 1 +
 ...tation for Generative Adversarial Networks | 1 +
 ...fling: Tight Convergence Bounds and Beyond | 1 +
 ...esn't Imply Distribution Learning for GANs | 1 +
 ...zation with Smooth Algorithmic Adversaries | 1 +
 .../iclr/Mirror Descent Policy Optimization | 1 +
 .../iclr/Missingness Bias in Model Debugging | 1 +
 .../MoReL: Multi-omics Relational Learning | 1 +
 ...se, and Mobile-friendly Vision Transformer | 1 +
 ...pretability for Multiple Instance Learning | 1 +
 ...o: A Growing Brain That Learns Continually | 1 +
 ...Reinforcement Learning with Regularization | 1 +
 ...el-augmented Prioritized Experience Replay | 1 +
 ...-label Classification using Box Embeddings | 1 +
 ...nforcement Learning via Neural Composition | 1 +
 ...Features for Monocular 3D Object Detection | 1 +
 .../Monotonic Differentiable Sorting Networks | 1 +
 .../iclr/Multi-Agent MDP Homomorphic Networks | 1 +
 ...ng: Teaching RL Policies to Act with Style | 1 +
 ...-Mode Deep Matrix and Tensor Factorization | 1 +
 ...ol for Strategic Exploration in Text Games | 1 +
 data/2022/iclr/Multi-Task Processes | 1 +
 ...e Optimization by Learning Space Partition | 1 +
 .../iclr/Multimeasurement Generative Models | 1 +
 ... with Approximate Implicit Differentiation | 1 +
 ...ning Enables Zero-Shot Task Generalization | 1 +
 ... NAS Evaluation is (Now) Surprisingly Easy | 1 +
 ...ural Architecture Search at Initialization | 1 +
 ...ction of Automated Machine Learning Models | 1 +
 ... Gradient Conflict aware Supernet Training | 1 +
 ...tive Model for Interpretable Deep Learning | 1 +
 ...guage Descriptions of Deep Visual Features | 1 +
 ...ainty for Exponential Family Distributions | 1 +
 ...or Linear Mixture MDPs with Plug-in Solver | 1 +
 ...raging Variance Information with Pessimism | 1 +
 ...etwork Augmentation for Tiny Deep Learning | 1 +
 ...Noise via Parameter Attack During Training | 0
 .../iclr/NeuPL: Neural Population Learning | 1 +
 ...ximity to and Dynamics on the Central Path | 1 +
 ...eep Representation and Shallow Exploration | 1 +
 .../2022/iclr/Neural Deep Equilibrium Solvers | 1 +
 .../Neural Link Prediction with Walk Pooling | 1 +
 ...stic Optimization for Continuous-Time Data | 1 +
 ...or Logical Reasoning over Knowledge Graphs | 1 +
 ...Space Invariance in Combinatorial Problems | 1 +
 ...n Hausdorff distance of Tropical Zonotopes | 1 +
 ...rnel Learners: The Silent Alignment Effect | 1 +
 .../iclr/Neural Parameter Allocation Search | 1 +
 ...ying more attention to the context dataset | 1 +
 .../iclr/Neural Program Synthesis with Query | 1 +
 ...l Inference with Node-Specific Information | 1 +
 ...ast and Accurate Numerical Optimal Control | 1 +
 .../Neural Spectral Marked Point Processes | 1 +
 ...Neural Stochastic Dual Dynamic Programming | 1 +
 ...ediction for Inductive Node Classification | 1 +
 .../iclr/Neural Variational Dropout Processes | 1 +
 ...ime: consistency guarantees and algorithms | 1 +
 ...tation Change in Online Continual Learning | 1 +
 ...: Overlapping Features of Training Methods | 1 +
 ...Rate for Training Large Transformer Models | 1 +
 ...rvised Multi-scale Neighborhood Prediction | 1 +
 ... Representations of Large Knowledge Graphs | 1 +
 data/2022/iclr/Noisy Feature Mixup | 1 +
 ... Approximations for Initial Value Problems | 1 +
 ...le Transfer with Self-Parallel Supervision | 1 +
 ...rification and Applicability Authorization | 1 +
 ...CA Using Volume-Preserving Transformations | 1 +
 ...age Embeddings for Cross-Lingual Alignment | 1 +
 ...for Scene Decomposition and Representation | 1 +
 ...jects via Discriminative Weight Generation | 1 +
 data/2022/iclr/Objects in Semantic Topology | 1 +
 ...Pessimism, Optimization and Generalization | 1 +
 ...orcement Learning with Implicit Q-Learning | 1 +
 ... Learning with Value-based Episodic Memory | 1 +
 .../iclr/Omni-Dimensional Dynamic Convolution | 1 +
 ...nfiguration for time series classification | 1 +
 ...ederated Learning for Image Classification | 0
 ...rs in Imitation and Reinforcement Learning | 1 +
 ...ive Optimization with Gradient Compression | 1 +
 ...uation Metrics for Graph Generative Models | 1 +
 ...ial Transferability of Vision Transformers | 1 +
 ...n Incorporating Inductive Biases into VAEs | 1 +
 ...esentations in Deep Reinforcement Learning | 1 +
 ...Missing Labels in Semi-Supervised Learning | 1 +
 .../On Predicting Generalization using GANs | 1 +
 ...y in Cell-based Neural Architecture Search | 1 +
 ...bust Prefix-Tuning for Text Classification | 1 +
 ...etworks with global convergence guarantees | 1 +
 ... Robustness for Ensemble Models and Beyond | 1 +
 ...tention and Dynamic Depth-wise Convolution | 1 +
 ...t Training with Interval Bound Propagation | 1 +
 ...GD and AdaGrad for Stochastic Optimization | 1 +
 ...tarts Algorithm for Reinforcement Learning | 1 +
 ...the Existence of Universal Lottery Tickets | 1 +
 ...ormation-Theoretic Bounds and Implications | 1 +
 ...alibration in Membership Inference Attacks | 1 +
 ... Bias Reduction in Few-Shot Classification | 1 +
 ... Learning and Learnability of Quasimetrics | 1 +
 .../On the Limitations of Multimodal VAEs | 1 +
 ...Memorization Power of ReLU Neural Networks | 1 +
 ...zing Individual Neurons in Language Models | 1 +
 ...imation with Probabilistic Neural Networks | 1 +
 ...le of Neural Collapse in Transfer Learning | 1 +
 ... Functions in Energy-Based Sequence Models | 1 +
 ...of recurrent encoder-decoder architectures | 1 +
 ... estimation for Regression and Forecasting | 1 +
 ...tistical learning and perceptual distances | 1 +
 ...on heterogeneity in emergent communication | 1 +
 ...icy Model Errors in Reinforcement Learning | 1 +
 ...ng Incremental Skills for a Changing World | 1 +
 ...d Hoc Teamwork under Partial Observability | 0
 data/2022/iclr/Online Adversarial Attacks | 1 +
 ... Task Configuration with Anytime Inference | 1 +
 ...ion for Rehearsal-based Continual Learning | 1 +
 .../Online Facility Location with Predictions | 1 +
 ...a-Learning with Hypergradient Distillation | 1 +
 ...finding the Optimal Policy for Linear MDPs | 1 +
 ...n Pretraining With Gene Ontology Embedding | 1 +
 ...Good Closed-Set Classifier is All You Need | 1 +
 .../iclr/Open-World Semi-Supervised Learning | 1 +
 ...Vision and Language Knowledge Distillation | 1 +
 ... Ultra-low-latency Spiking Neural Networks | 1 +
 ...ptimal Representations for Covariate Shift | 1 +
 .../Optimal Transport for Causal Discovery | 1 +
 ...led Recognition with Learnable Cost Matrix | 1 +
 ...eralization of Three layer Neural Networks | 0
 ...n inspired Multi-Branch Equilibrium Models | 1 +
 data/2022/iclr/Optimizer Amalgamation | 1 +
 ... Networks with Gradient Lexicase Selection | 1 +
 ...d Value Mapping for Reinforcement Learning | 1 +
 ... of Nuisance-Induced Spurious Correlations | 1 +
 ...pectral Bias of Neural Value Approximation | 1 +
 ... from Language Models with Diverse Prompts | 1 +
 .../PAC Prediction Sets Under Covariate Shift | 1 +
 .../iclr/PAC-Bayes Information Bottleneck | 1 +
 ...gs and Adversarial Reconstruction Learning | 1 +
 ...phatic Temporal Difference Learning Method | 1 +
 ...imation of universal graph representations | 1 +
 ...ction Intervals from Three Neural Networks | 1 +
 ...licy Learning with Adaptive Decision Trees | 1 +
 ...f Attention GANs for Synthetic Time Series | 1 +
 ...ith a Multi-Grid Solver for Long Sequences | 1 +
 data/2022/iclr/Pareto Policy Adaptation | 1 +
 ...Model-based Offline Reinforcement Learning | 1 +
 ...Multi-Objective Combinatorial Optimization | 1 +
 ...twork for Non-rigid Point Set Registration | 1 +
 ...for mean field neural network optimization | 1 +
 ... Robust Against Adversarial Perturbations? | 1 +
 ...iliary Proposal for MCMC in Discrete Space | 1 +
 ...A Stochastic Control Approach For Sampling | 1 +
 ...al Network, and How to Find It Efficiently | 1 +
 ...chitecture for Structured Inputs & Outputs | 1 +
 ... Faster Distributed Nonconvex Optimization | 1 +
 .../Permutation-Based SGD: Is Random Optimal? | 1 +
 ...inty-Driven Offline Reinforcement Learning | 1 +
 ...nforcement Learning under Partial Coverage | 1 +
 .../iclr/Phase Collapse in Neural Networks | 1 +
 ...le Descent in Finite-Width Neural Networks | 1 +
 ... Disambiguation for Partial Label Learning | 1 +
 ...works with Pipelined Feature Communication | 1 +
 ...ge Modeling Framework for Object Detection | 1 +
 ... Sparse training for Neural Network Models | 1 +
 ...ochastic Environments with a Learned Model | 1 +
 ...'n' Seek: Can You Find the Winning Ticket? | 1 +
 ...r Efficient Token Mixing in Long Sequences | 1 +
 ...oning and Backdooring Contrastive Learning | 1 +
 .../Policy Gradients Incorporating the Future | 1 +
 ...for Provably Robust Reinforcement Learning | 1 +
 ...Policy improvement by planning with Gumbel | 1 +
 ...rspective of Classification Loss Functions | 1 +
 ...earning And Using Hierarchical Affordances | 1 +
 ...for Detecting Unknown Spurious Correlation | 1 +
 ...s for Two-Class and Multi-Attack Scenarios | 1 +
 ...rocess Via Tractable Dependent Predictions | 1 +
 ...tegration via Separable Bijective Networks | 1 +
 ...ular Graph Representation with 3D Geometry | 1 +
 ...Mesh-reduced Space with Temporal Attention | 1 +
 ...in Continual Learning: A Comparative Study | 1 +
 ...rial Mixture of Training Signal Generators | 1 +
 ... Models with Data-Dependent Adaptive Prior | 1 +
 .../iclr/Privacy Implications of Shuffling | 1 +
 .../Probabilistic Implicit Scene Completion | 1 +
 ...planning with self-supervised world models | 1 +
 ...tic Reinforcement Learning without Oracles | 1 +
 ...tion for Fast Sampling of Diffusion Models | 1 +
 ...Deep Unsupervised RGB-D Saliency Detection | 1 +
 ...g for Theorem Proving with Language Models | 1 +
 ...ve on identifiable representation learning | 1 +
 ...hts at Initialization using Meta-Gradients | 1 +
 ...e Authoring via Learned Inverse Kinematics | 1 +
 ...n mechanisms for few shot image generation | 1 +
 ...Prototypical Contrastive Predictive Coding | 0
 ...ltiway Domains via Representation Learning | 1 +
 ...arning-based Algorithm For Sparse Recovery | 1 +
 ...stractors using Multistep Inverse Dynamics | 0
 .../iclr/Provably Robust Adversarial Examples | 1 +
 ...s for mean-field two-player zero-sum games | 1 +
 ...pothesis for Convolutional Neural Networks | 0
 ... Methods for Diffusion Models on Manifolds | 1 +
 ... for Semi-Supervised Keypoint Localization | 1 +
 ...Range Time Series Modeling and Forecasting | 1 +
 ...tremely Low-bit Post-Training Quantization | 1 +
 ...Quadtree Attention for Vision Transformers | 1 +
 ... Units via Topological Entropy Calculation | 1 +
 ...cks Against Black-Box Deep Learning Models | 1 +
 ...dding on Hyper-Relational Knowledge Graphs | 1 +
 ...Objects for Long-Range Distance Estimation | 1 +
 ...nforced and Recurrent Relational Reasoning | 1 +
 ...ring for Cross-Domain Parameter Estimation | 1 +
 ...y random features with no performance loss | 1 +
 .../iclr/Real-Time Neural Voice Camouflage | 1 +
 .../iclr/Recursive Disentanglement Network | 1 +
 ...Learning: Are Gradient Subspaces Low-Rank? | 1 +
 ...a Better Accuracy vs. Robustness Trade-off | 0
 ...to-Local Attention for Vision Transformers | 1 +
 ...ders for Isometric Representation Learning | 0
 ...ce of Discrete Markovian Context Evolution | 1 +
 ...te Representation Model: Method and Theory | 0
 ... using Guidance from Offline Demonstration | 1 +
 ...ransformer for Visual Relational Reasoning | 1 +
 ...presentations of the hippocampal formation | 1 +
 ...Relational Learning with Variational Bayes | 1 +
 ... Modeling Relations between Data and Tasks | 1 +
 .../iclr/Relational Surrogate Loss Learning | 1 +
 ...p Inference Attacks without Losing Utility | 1 +
 ...rial Distillation with Unreliable Teachers | 1 +
 ...for Online and Offline RL in Low-rank MDPs | 1 +
 .../iclr/Representation-Agnostic Shape Fields | 1 +
 ...inuity for Unsupervised Continual Learning | 1 +
 ...beddings with Mixtures of Topic Embeddings | 1 +
 ...Biases via Influence-based Data Relabeling | 1 +
 ... Can Drive Divergence of SGD with Momentum | 1 +
 ...ative Models Using Scalable Fingerprinting | 1 +
 ...ility from a Data Distribution Perspective | 1 +
 ...Estimation for Positive-Unlabeled Learning | 1 +
 ... Learning and Its Connection to Offline RL | 1 +
 ...int Cloud: A Simple Residual MLP Framework | 1 +
 ...raining for Better Downstream Transferring | 1 +
 ...sentation as a Token-Level Bipartite Graph | 1 +
 ...erceptible Adversarial Image Perturbations | 1 +
 ...ies Forecasting against Distribution Shift | 1 +
 ...ith Lottery Regulated Grouped Convolutions | 1 +
 ...Offline Model Based Reinforcement Learning | 1 +
 ...hing in BERT from the Perspective of Graph | 1 +
 ...e models for Out-of-distribution detection | 0
 ...in Preference-based Reinforcement Learning | 1 +
 ...in Federated Learning with Modified Models | 1 +
 ...tributions Improve Adversarial Robustness? | 1 +
 ... Data Privacy Against Adversarial Learning | 0
 ...ble SDE Learning: A Functional Perspective | 1 +
 ...dient Homogenization in Multitask Learning | 1 +
 ...al for Offline RL via Supervised Learning? | 1 +
 ...ing with Stochastic Differential Equations | 1 +
 .../iclr/SGD Can Converge to Local Maxima | 0
 ... bi-level optimization and implicit models | 1 +
 ...lations by Second-Order Structured Pruning | 1 +
 ...sentation Learning for Speech Pre-Training | 1 +
 ...ization via Diagonal Hessian Approximation | 1 +
 ...ta-Features for Neural Architecture Search | 1 +
 ...nt Preference-based Reinforcement Learning | 1 +
 ...ing with Differentiable Symbolic Execution | 1 +
 ...scover spurious features in Deep Learning? | 1 +
 ...cement Learning via Uncertainty Estimation | 1 +
 ...radient Algorithm for Zero-Sum Markov Game | 1 +
 ...y of Losses for Learning with Noisy Labels | 1 +
 ...edistribution for Efficient Face Detection | 1 +
 .../Sampling with Mirrored Stein Operators | 1 +
 ...yperparameters by Implicit Differentiation | 1 +
 ...Nonsymmetric Determinantal Point Processes | 1 +
 ...om Pretraining and Finetuning Transformers | 0
 ...tures of Neural Network Gaussian Processes | 1 +
 ...caling Laws for Neural Machine Translation | 1 +
 ...e Learning using Random Feature Corruption | 1 +
 ...nd Rotationally Equivariant Spherical CNNs | 1 +
 ...ing future trajectories of multiple agents | 0
 ... with Critically-Damped Langevin Diffusion | 1 +
 ...ctive Ensembles for Consistent Predictions | 1 +
 data/2022/iclr/Self-Joint Supervised Learning | 1 +
 ...d Electroencephalographic Seizure Analysis | 1 +
 ...Supervised Inference in State-Space Models | 1 +
 ...ed Feature Selection with Correlated Gates | 1 +
 ...versarial Training for Improved Robustness | 1 +
 ...arning is More Robust to Dataset Imbalance | 1 +
 ...tein divergence and applications on graphs | 0
 ... Learning: Theory and Optimization Methods | 1 +
 ...adient Alignment for Multilingual Learning | 1 +
 ...Optimal Approximators of Korobov Functions | 1 +
 ...nforcement Learning or Behavioral Cloning? | 1 +
 ... End-task Aware Training as an Alternative | 1 +
 ...fle Private Stochastic Convex Optimization | 1 +
 .../Signing the Supermask: Keep, Hide, Invert | 1 +
 ...ge Model Pretraining with Weak Supervision | 1 +
 ...D Molecular Property Prediction and Beyond | 0
 ...l sketch representation in continuous time | 1 +
 .../Skill-based Meta-Reinforcement Learning | 1 +
 ...Imaging with Score-Based Generative Models | 1 +
 .../Sound Adversarial Audio-Visual Navigation | 1 +
 ...ir with Minimality and Locality Guarantees | 1 +
 ...nt Shift via Bottom-Up Feature Restoration | 1 +
 .../iclr/Space-Time Graph Neural Networks | 1 +
 ...
Tree-based Graph Generation for Molecules | 1 + .../Sparse Attention with Learning to Hash | 1 + ...arse Communication via Mixed Distributions | 1 + ...d Object Detection with Learnable Sparsity | 1 + ...eneralization from More Efficient Training | 1 + ...driven Policy for Antiviral Drug Discovery | 1 + ... is All You Need for Deep Face Recognition | 1 + ...al Message Passing for 3D Molecular Graphs | 1 + ...ast and accurate recurrent neural networks | 1 + ...ccuracy with Spurious Attribute Estimation | 1 + ...mension Dependence of Langevin Monte Carlo | 1 + ...ation for Discrete Representation Learning | 1 + ... Operators for Equivariant Neural Networks | 1 + ...zation for Generative Adversarial Networks | 1 + ...Denoising Autoencoders for Text Generation | 1 + ...l network for learning Hamiltonian systems | 1 + ...aining is Not Necessary for Generalization | 1 + .../iclr/Strength of Minibatch Noise in SGD | 1 + ...ogeneous Multi-Task Reinforcement Learning | 1 + ...nd Applications of Aligned StyleGAN Models | 1 + ...erator for High-resolution Image Synthesis | 1 + ...rs for Few-Shot Class Incremental Learning | 1 + ...Model For Learning Fine-Grained Embeddings | 1 + ...stive Language-Image Pre-training Paradigm | 1 + ...rogeneous disease-related imaging patterns | 1 + ...mization Improves Sharpness-Aware Training | 1 + ...ed Search Spaces of Tabular NAS Benchmarks | 1 + ...g for Cross-Domain Few-Shot Classification | 1 + ...: Towards Interpretability and Scalability | 1 + ...eneration from Pre-trained Language Models | 1 + ...al Network for Time Series Signal Analysis | 1 + ...ional Networks for Time-Series Forecasting | 1 + ...raining via Learning a Neural SQL Executor | 1 + ...ptive Convolutions for Video Understanding | 1 + ...p Output with learned Multi-Agent Sampling | 1 + ...herence from dynamic point cloud sequences | 1 + ...al Imitation Learning with Suboptimal Data | 1 + ...Gradient Projection for Continual Learning | 1 + ...ing Trilemma with Denoising 
Diffusion GANs | 1 + ...ivated Transformer with Stochastic Experts | 1 + ...tation for Sequence to Sequence Generation | 1 + ...um Bipartite Matching in Few-Shot Learning | 1 + ...ed Generalization Bounds for Meta Learning | 1 + .../iclr/Task-Induced Representation Learning | 1 + ...rning and Few-Shot Sequence Classification | 1 + ...g Neural Network via Gradient Re-weighting | 1 + ...r Systematic Suboptimality in Human Models | 1 + ...een Contrastive Learning and Meta-Learning | 1 + ... Extreme Points of the Dual Convex Program | 1 + ...ty of Encoders in Variational Autoencoders | 1 + ...: Mapping and Mitigating Misaligned Models | 1 + data/2022/iclr/The Efficiency Misnomer | 1 + ...lution of Uncertainty of Learning in Games | 1 + ...cy Optimization in Infinite-Horizon POMDPs | 1 + ...xact Characterization of Optimal Solutions | 1 + ...ing: Rethinking Pretraining Example Design | 1 + ...try of Unsupervised Reinforcement Learning | 1 + ...BERT Reproductions for Robustness Analysis | 1 + ...formers Improves Systematic Generalization | 1 + ...sparate Impact of Semi-Supervised Learning | 1 + ...inear Mode Connectivity of Neural Networks | 1 + ...ns for the OOD Generalization of RL Agents | 1 + ...pectral Bias of Polynomial Neural Networks | 1 + ...ynamics in High-dimensional Kernel Methods | 1 + ...Uncanny Similarity of Recurrence and Depth | 1 + ...he Most Naive Baseline for Sparse Training | 1 + ...roximation Bounds for ReLU Neural Networks | 1 + ...cation and Cooperation with Theory of Mind | 1 + ...d Graph Generation without Exchangeability | 1 + ...ration and multiclass-to-binary reductions | 1 + data/2022/iclr/Topological Experience Replay | 1 + .../iclr/Topological Graph Neural Networks | 1 + .../Topologically Regularized Data Embeddings | 1 + ...t Optimization and Hysteresis Quantization | 1 + ...types in a Nearest Neighbor-friendly Space | 0 ...Histology Images with Contrastive Learning | 1 + ...d Representation Disentanglement Framework | 1 + ...nual 
Knowledge Learning of Language Models | 1 + ...rks: A GNTK-based Optimization Perspective | 1 + ...ement Learning: Lower Bound and Optimality | 1 + ...ich Bounds on the Rate-Distortion Function | 1 + ...of Neural Networks Learned by Transduction | 1 + ...ion Approximation in Zero-Sum Markov Games | 1 + ...ated Learning Using Knowledge Distillation | 1 + ...aph Neural Networks for Atomic Simulations | 1 + ...ation via Decomposing Excess Risk Dynamics | 1 + ...he Data Dependency of Mixup-style Training | 1 + ...Against Evasion Attack on Categorical Data | 1 + ...w of Parameter-Efficient Transfer Learning | 1 + ... and detecting harmful distribution shifts | 1 + ... Biases Enables Input Length Extrapolation | 1 + ...e Reconstruction via Bi-level Optimization | 1 + ...fold Identification and Variance Reduction | 1 + ...ia Distribution Matching for Complex Tasks | 1 + ...ow-rank phenomenon: beyond linear networks | 1 + ...ing through self- and mutual-distillations | 1 + ...ture Spaces via Model-Based Regularization | 1 + ...arial Attack based on Integrated Gradients | 1 + ...-Control Policy for Efficient Agent Design | 1 + ...larly Spaced Events and Their Participants | 1 + .../iclr/Transformer-based Transform Coding | 1 + .../Transformers Can Do Bayesian Inference | 1 + ...merging Property of Assembling Weak Models | 1 + ...Counting with Predictions in Graph Streams | 1 + ...h a Topological Prior for Trojan Detection | 1 + ...model differences (on ImageNet and beyond) | 1 + ...tion in Multi-Agent Reinforcement Learning | 1 + ... 
for Improved Generalization or Efficiency | 1 + ...ing for Out-of-Distribution Generalization | 1 + ...se in Contrastive Self-supervised Learning | 1 + ...ain Randomization for Sim-to-real Transfer | 1 + ...trinsic Robustness Using Label Uncertainty | 1 + ...upervision: An Identifiability Perspective | 1 + ...ection Attack by Promoting Unnoticeability | 1 + ...meterization in Recursive Value Estimation | 1 + ...ng Capacity Loss in Reinforcement Learning | 1 + ...d dictionary learning for pattern recovery | 1 + ...ng and bottlenecks on graphs via curvature | 1 + ...Attention for Efficient Speech Recognition | 1 + ...riance Collapse of SVGD in High Dimensions | 1 + ...t Spatial-Temporal Representation Learning | 1 + .../Unified Visual Transformer Compression | 1 + ...nce with Black-box Optimization and Beyond | 1 + ... Constraints is Possible with Transformers | 1 + .../2022/iclr/Universalizing Weak Supervision | 1 + ...-Learning via The Adaptation Learning Rate | 0 ...LM for Sparse Semi-Blind Source Separation | 1 + ...rvised Discovery of Object Radiance Fields | 1 + ...ensor Product Representations on the Torus | 1 + ...nd Partial Differential Equation in a Loop | 1 + ...tion by Distilling Feature Correspondences | 1 + ...r Induction with Shared Structure Modeling | 1 + ...easure the Severity of Depressive Symptoms | 1 + ...ation Error: ELBO and Exponential Families | 1 + ...ls for Manipulating 3D ARTiculated Objects | 1 + ...al networks in the overparametrized regime | 1 + ...egularization for Self-Supervised Learning | 1 + ...ou Don't Know by Virtual Outlier Synthesis | 1 + ...te Abstractions for Long-Horizon Reasoning | 1 + ...eighted Model-Based Reinforcement Learning | 1 + ...enerative Modeling of Feature Incompletion | 1 + .../iclr/Variational Neural Cellular Automata | 1 + ... Routing with Nested Subjective Timescales | 1 + ...ensional data: landscape and implicit bias | 1 + ...nal methods for simulation-based inference | 1 + ... 
oracle guiding for reinforcement learning | 1 + ...antized Image Modeling with Improved VQGAN | 1 + ...ve Fully Transformer-based Object Detector | 1 + ...AN: Training GANs with Vision Transformers | 1 + ...pulators Need to Also See from Their Hands | 1 + .../iclr/Visual Correspondence Hallucination | 1 + ...Generalize Strongly Within the Same Domain | 1 + ...epresentation Learning over Latent Domains | 0 ...g sensor and recurrent neural computations | 1 + ...enerative Model of Parametric CAD Sketches | 1 + ...mporal Classification Loss with Wild Cards | 1 + ...y Supervised Monocular 3D Object Detection | 1 + ...n by Generalization in Federated Learning? | 1 + ...ches Zero Loss? --A Mathematical Framework | 1 + ...s? Augment Difficult but Not too Different | 0 ...Tree Search for Combinatorial Optimization | 1 + ...arge Number of Players Sample-Efficiently? | 1 + ... Pre-training or Strong Data Augmentations | 1 + data/2022/iclr/When should agents explore? | 1 + ...Why, and Which Pretrained GANs Are Useful? | 1 + ...Study from the Parameter-Space Perspective | 1 + ...Partner in Positive and Unlabeled Learning | 0 ...l and Efficient Evasion Attacks in Deep RL | 1 + ...allel Use of Labels and Features on Graphs | 1 + ...Needed to Produce a Primate Ventral Stream | 1 + ...pproach To Faster and More Accurate Models | 1 + ...on for long-horizon dexterous manipulation | 1 + ...ency in Deep Learning with A Minimax Model | 1 + ...ature Attribution in Trajectory Prediction | 1 + ...n Framework for Hypergraph Neural Networks | 1 + ...l Directional Boundary by Vector Transform | 1 + ...gative-free symmetric contrastive learning | 1 + ...Supervised Learning for MRI Reconstruction | 1 + ...for Federated Learning with Local Sparsity | 1 + ...cosFormer: Rethinking Softmax In Attention | 1 + ...iFlood: A Stable and Effective Regularizer | 1 + ... 
dynamics with applications to neural data | 1 + ...mark for formal Olympiad-level mathematics | 1 + ...achine Translation Via Code-Switch Decoder | 1 + ...r Single Multi-Labeled Text Classification | 1 + ...ified Framework for Soft Threshold Pruning | 1 + ...eural Networks for Universal Approximation | 1 + ...llation for Multi-View 3D Object Detection | 1 + ...n for Faster and Better Visual Pretraining | 1 + ...for Geometry-Sequence Modeling in Proteins | 1 + ...hanced Explainer for Graph Neural Networks | 1 + .../Delving into Semantic Scale Imbalance | 1 + ...nd Rectifying Vision Models using Language | 1 + ...f-Distribution Robustness via Disagreement | 1 + ... Tensors for Memory-Efficient DNN Training | 1 + ...l Affordance for Dual-gripper Manipulation | 1 + ...afe Exploration with Weakest Preconditions | 1 + ...All You Need for Oriented Object Detection | 1 + ... Examples via Augmenting Content and Style | 1 + ...t Data-Free Learning from Black-Box Models | 1 + ...ostic Representation for Disease Diagnosis | 1 + ...ge-Graphs for Differentiable Rule Learning | 1 + ...E(3)-Invariant Denoising Distance Matching | 1 + ...ng convex conjugates for optimal transport | 1 + ... 
Dense Contrastive Representation Learning | 1 + ...ly Detection in Industry Vision: Graphcore | 1 + ...ning for Low-rank General-sum Markov Games | 0 ...-Sample Matching for Domain Generalization | 1 + ...eature Extractor for Few-shot Segmentation | 1 + ...Improves Adaptation to Distribution Shifts | 1 + ...sk Prompting for Dense Scene Understanding | 1 + ...asses by Extrapolating from a Single Image | 1 + .../Trainability Preserving Neural Pruning | 1 + 2209 files changed, 2144 insertions(+) create mode 100644 data/2020/iclr/A Constructive Prediction of the Generalization Error Across Scales create mode 100644 data/2020/iclr/A Fair Comparison of Graph Neural Networks for Graph Classification create mode 100644 data/2020/iclr/A Learning-based Iterative Method for Solving Vehicle Routing Problems create mode 100644 data/2020/iclr/A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning create mode 100644 data/2020/iclr/A Theoretical Analysis of the Number of Shots in Few-Shot Learning create mode 100644 data/2020/iclr/A critical analysis of self-supervision, or what we can learn from a single image create mode 100644 data/2020/iclr/AMRL: Aggregated Memory For Reinforcement Learning create mode 100644 data/2020/iclr/Accelerating SGD with momentum for over-parameterized learning create mode 100644 data/2020/iclr/Action Semantics Network: Considering the Effects of Actions in Multiagent Systems create mode 100644 data/2020/iclr/Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games create mode 100644 data/2020/iclr/Adaptive Structural Fingerprints for Graph Attention Networks create mode 100644 data/2020/iclr/Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks create mode 100644 data/2020/iclr/Adjustable Real-time Style Transfer create mode 100644 data/2020/iclr/Adversarial Policies: Attacking Deep Reinforcement Learning create mode 100644 
data/2020/iclr/Adversarially Robust Representations with Smooth Encoders create mode 100644 data/2020/iclr/Adversarially robust transfer learning create mode 100644 data/2020/iclr/Ae-OT: a New Generative Model based on Extended Semi-discrete Optimal transport create mode 100644 data/2020/iclr/An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality create mode 100644 data/2020/iclr/Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification create mode 100644 data/2020/iclr/Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction create mode 100644 data/2020/iclr/Are Transformers universal approximators of sequence-to-sequence functions? create mode 100644 data/2020/iclr/AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures create mode 100644 data/2020/iclr/Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space create mode 100644 data/2020/iclr/AutoQ: Automated Kernel-Wise Neural Network Quantization create mode 100644 data/2020/iclr/Automated Relational Meta-learning create mode 100644 data/2020/iclr/Automated curriculum generation through setter-solver interactions create mode 100644 data/2020/iclr/Automatically Discovering and Learning New Visual Categories with Ranking Statistics create mode 100644 data/2020/iclr/Black-Box Adversarial Attack with Transferable Model-based Embedding create mode 100644 data/2020/iclr/Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks create mode 100644 data/2020/iclr/Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness create mode 100644 data/2020/iclr/Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints create mode 100644 data/2020/iclr/CAQL: Continuous Action Q-Learning create mode 100644 data/2020/iclr/CLN2INV: Learning Loop Invariants with Continuous 
Logic Networks create mode 100644 data/2020/iclr/CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning create mode 100644 data/2020/iclr/Can gradient clipping mitigate label noise? create mode 100644 data/2020/iclr/Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing create mode 100644 data/2020/iclr/Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation create mode 100644 data/2020/iclr/Compositional languages emerge in a neural iterated learning model create mode 100644 data/2020/iclr/Computation Reallocation for Object Detection create mode 100644 data/2020/iclr/Continual Learning with Adaptive Weights (CLAW) create mode 100644 data/2020/iclr/Continual Learning with Bayesian Neural Networks for Non-Stationary Data create mode 100644 data/2020/iclr/Counterfactuals uncover the modular structure of deep generative models create mode 100644 data/2020/iclr/Curvature Graph Network create mode 100644 data/2020/iclr/DBA: Distributed Backdoor Attacks against Federated Learning create mode 100644 data/2020/iclr/DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames create mode 100644 data/2020/iclr/Data-Independent Neural Pruning via Coresets create mode 100644 data/2020/iclr/DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling create mode 100644 "data/2020/iclr/Deep 3D Pan via local adaptive \"t-shaped\" convolutions with global and local adaptive dilations" create mode 100644 data/2020/iclr/Deep Imitative Models for Flexible Inference, Planning, and Control create mode 100644 data/2020/iclr/Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient create mode 100644 data/2020/iclr/Deep Network Classification by Scattering and Homotopy Dictionary Learning create mode 100644 data/2020/iclr/Deep Semi-Supervised Anomaly Detection create mode 100644 data/2020/iclr/DeepHoyer: Learning Sparser Neural Network with 
Differentiable Scale-Invariant Sparsity Measures create mode 100644 data/2020/iclr/DeepV2D: Video to Depth with Differentiable Structure from Motion create mode 100644 data/2020/iclr/Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation create mode 100644 data/2020/iclr/Depth-Adaptive Transformer create mode 100644 data/2020/iclr/Detecting Extrapolation with Local Ensembles create mode 100644 data/2020/iclr/Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions create mode 100644 data/2020/iclr/Difference-Seeking Generative Adversarial Network-Unseen Sample Generation create mode 100644 data/2020/iclr/Differentially Private Meta-Learning create mode 100644 data/2020/iclr/Disentangling Factors of Variations Using Few Labels create mode 100644 data/2020/iclr/Distance-Based Learning from Errors for Confidence Calibration create mode 100644 data/2020/iclr/Diverse Trajectory Forecasting with Determinantal Point Processes create mode 100644 data/2020/iclr/DivideMix: Learning with Noisy Labels as Semi-supervised Learning create mode 100644 data/2020/iclr/Dynamic Time Lag Regression: Predicting What & When create mode 100644 data/2020/iclr/Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery create mode 100644 data/2020/iclr/Dynamically Pruned Message Passing Networks for Large-scale Knowledge Graph Reasoning create mode 100644 data/2020/iclr/ES-MAML: Simple Hessian-Free Meta Learning create mode 100644 data/2020/iclr/Editable Neural Networks create mode 100644 data/2020/iclr/Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform create mode 100644 data/2020/iclr/Efficient and Information-Preserving Future Frame Prediction and Beyond create mode 100644 data/2020/iclr/Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier create mode 100644 data/2020/iclr/Ensemble Distribution Distillation create mode 100644 
data/2020/iclr/Escaping Saddle Points Faster with Stochastic Momentum create mode 100644 data/2020/iclr/Evaluating The Search Phase of Neural Architecture Search create mode 100644 data/2020/iclr/Exploration in Reinforcement Learning with Deep Covering Options create mode 100644 data/2020/iclr/Exploring Model-based Planning with Policy Networks create mode 100644 data/2020/iclr/FSPool: Learning Set Representations with Featurewise Sort Pooling create mode 100644 data/2020/iclr/Fast is better than free: Revisiting adversarial training create mode 100644 data/2020/iclr/FasterSeg: Searching for Faster Real-time Semantic Segmentation create mode 100644 data/2020/iclr/Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection create mode 100644 data/2020/iclr/Federated Adversarial Domain Adaptation create mode 100644 data/2020/iclr/Few-Shot Learning on graphs via super-Classes based on Graph spectral Measures create mode 100644 data/2020/iclr/Few-shot Text Classification with Distributional Signatures create mode 100644 data/2020/iclr/Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents create mode 100644 data/2020/iclr/Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking create mode 100644 data/2020/iclr/Four Things Everyone Should Know to Improve Batch Normalization create mode 100644 data/2020/iclr/From Variational to Deterministic Autoencoders create mode 100644 data/2020/iclr/Functional vs. 
parametric equivalence of ReLU networks create mode 100644 data/2020/iclr/GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification create mode 100644 data/2020/iclr/GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations create mode 100644 data/2020/iclr/GLAD: Learning Sparse Graph Recovery create mode 100644 data/2020/iclr/Gap-Aware Mitigation of Gradient Staleness create mode 100644 data/2020/iclr/Generalization bounds for deep convolutional neural networks create mode 100644 data/2020/iclr/Generative Ratio Matching Networks create mode 100644 data/2020/iclr/Geometric Insights into the Convergence of Nonlinear TD Learning create mode 100644 data/2020/iclr/Global Relational Models of Source Code create mode 100644 data/2020/iclr/Graph inference learning for semi-supervised classification create mode 100644 data/2020/iclr/Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation create mode 100644 data/2020/iclr/I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively create mode 100644 data/2020/iclr/Identifying through Flows for Recovering Latent Representations create mode 100644 data/2020/iclr/Identity Crisis: Memorization and Generalization Under Extreme Overparameterization create mode 100644 data/2020/iclr/Image-guided Neural Object Rendering create mode 100644 data/2020/iclr/Imitation Learning via Off-Policy Distribution Matching create mode 100644 data/2020/iclr/Implicit Bias of Gradient Descent based Adversarial Training on Separable Data create mode 100644 data/2020/iclr/Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin create mode 100644 data/2020/iclr/Improving Adversarial Robustness Requires Revisiting Misclassified Examples create mode 100644 data/2020/iclr/In Search for a SAT-friendly Binarized Neural Network Architecture create mode 100644 
data/2020/iclr/Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models create mode 100644 data/2020/iclr/Interpretable Complex-Valued Neural Networks for Privacy Protection create mode 100644 data/2020/iclr/Intrinsic Motivation for Encouraging Synergistic Behavior create mode 100644 data/2020/iclr/Knowledge Consistency between Neural Networks and Beyond create mode 100644 data/2020/iclr/LAMOL: LAnguage MOdeling for Lifelong Language Learning create mode 100644 data/2020/iclr/Language GANs Falling Short create mode 100644 data/2020/iclr/Large Batch Optimization for Deep Learning: Training BERT in 76 minutes create mode 100644 data/2020/iclr/Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information create mode 100644 data/2020/iclr/Learned Step Size quantization create mode 100644 data/2020/iclr/Learning Disentangled Representations for CounterFactual Regression create mode 100644 data/2020/iclr/Learning Efficient Parameter Server Synchronization Policies for Distributed SGD create mode 100644 data/2020/iclr/Learning Execution through Neural Code fusion create mode 100644 data/2020/iclr/Learning Expensive Coordination: An Event-Based Deep RL Approach create mode 100644 data/2020/iclr/Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning create mode 100644 data/2020/iclr/Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling create mode 100644 data/2020/iclr/Learning Space Partitions for Nearest Neighbor Search create mode 100644 data/2020/iclr/Learning deep graph matching with channel-independent embedding and Hungarian attention create mode 100644 data/2020/iclr/Learning the Arrow of Time for Problems in Reinforcement Learning create mode 100644 data/2020/iclr/Learning to Learn by Zeroth-Order Oracle create mode 100644 data/2020/iclr/Learning to Link create mode 100644 data/2020/iclr/Learning to Represent Programs with 
Property Signatures create mode 100644 data/2020/iclr/Learning to solve the credit assignment problem create mode 100644 data/2020/iclr/Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware create mode 100644 data/2020/iclr/Locality and Compositionality in Zero-Shot Learning create mode 100644 data/2020/iclr/Logic and the 2-Simplicial Transformer create mode 100644 data/2020/iclr/Low-Resource Knowledge-Grounded Dialogue Generation create mode 100644 data/2020/iclr/MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius create mode 100644 data/2020/iclr/Maxmin Q-learning: Controlling the Estimation Bias of Q-learning create mode 100644 data/2020/iclr/Measuring Compositional Generalization: A Comprehensive Method on Realistic Data create mode 100644 data/2020/iclr/Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples create mode 100644 data/2020/iclr/MetaPix: Few-Shot Video Retargeting create mode 100644 data/2020/iclr/Minimizing FLOPs to Learn Efficient Sparse Representations create mode 100644 data/2020/iclr/Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models create mode 100644 data/2020/iclr/Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks create mode 100644 data/2020/iclr/Multi-agent Reinforcement Learning for Networked System Control create mode 100644 data/2020/iclr/Multiplicative Interactions and Where to Find Them create mode 100644 data/2020/iclr/Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification create mode 100644 data/2020/iclr/N-BEATS: Neural basis expansion analysis for interpretable time series forecasting create mode 100644 data/2020/iclr/NAS evaluation is frustratingly hard create mode 100644 data/2020/iclr/Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data create mode 100644 data/2020/iclr/Neural Stored-program Memory create mode 100644 
data/2020/iclr/Neural Text Generation With Unlikelihood Training
 create mode 100644 data/2020/iclr/Novelty Detection Via Blurring
 create mode 100644 data/2020/iclr/Observational Overfitting in Reinforcement Learning
 create mode 100644 data/2020/iclr/On Computation and Generalization of Generative Adversarial Imitation Learning
 create mode 100644 data/2020/iclr/On Identifiability in Transformers
 create mode 100644 data/2020/iclr/On Mutual Information Maximization for Representation Learning
 create mode 100644 "data/2020/iclr/On the \"steerability\" of generative adversarial networks"
 create mode 100644 data/2020/iclr/On the Variance of the Adaptive Learning Rate and Beyond
 create mode 100644 data/2020/iclr/On the Weaknesses of Reinforcement Learning for Neural Machine Translation
 create mode 100644 data/2020/iclr/One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation
 create mode 100644 data/2020/iclr/Optimistic Exploration even with a Pessimistic Initialisation
 create mode 100644 data/2020/iclr/Option Discovery using Deep Skill Chaining
 create mode 100644 data/2020/iclr/Order Learning and Its Application to Age Estimation
 create mode 100644 data/2020/iclr/Overlearning Reveals Sensitive Attributes
 create mode 100644 data/2020/iclr/Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video
 create mode 100644 data/2020/iclr/Piecewise linear activations substantially shape the loss surfaces of neural networks
 create mode 100644 data/2020/iclr/Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP
 create mode 100644 data/2020/iclr/Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
 create mode 100644 data/2020/iclr/Population-Guided Parallel Policy Search for Reinforcement Learning
 create mode 100644 data/2020/iclr/Pre-training Tasks for Embedding-based Large-scale Retrieval
 create mode 100644 data/2020/iclr/Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model
 create mode 100644 data/2020/iclr/Progressive Memory Banks for Incremental Domain Adaptation
 create mode 100644 data/2020/iclr/ProxSGD: Training Structured Neural Networks under Regularization and Constraints
 create mode 100644 data/2020/iclr/Pruned Graph Scattering Transforms
 create mode 100644 data/2020/iclr/Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving
 create mode 100644 data/2020/iclr/Pure and Spurious Critical Points: a Geometric Study of Linear Networks
 create mode 100644 data/2020/iclr/Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
 create mode 100644 data/2020/iclr/Quantifying the Cost of Reliable Photo Authentication via High-Performance Learned Lossy Representations
 create mode 100644 data/2020/iclr/RTFM: Generalising to New Environment Dynamics via Reading
 create mode 100644 data/2020/iclr/RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering
 create mode 100644 data/2020/iclr/Ranking Policy Gradient
 create mode 100644 data/2020/iclr/Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML
 create mode 100644 data/2020/iclr/ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning
 create mode 100644 data/2020/iclr/ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring
 create mode 100644 data/2020/iclr/Reanalysis of Variance Reduced Temporal Difference Learning
 create mode 100644 data/2020/iclr/Recurrent neural circuits for contour detection
 create mode 100644 data/2020/iclr/Reinforced active learning for image segmentation
 create mode 100644 data/2020/iclr/Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation
 create mode 100644 data/2020/iclr/Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives
 create mode 100644 data/2020/iclr/Relational State-Space Model for Stochastic Multi-Object Systems
 create mode 100644 data/2020/iclr/Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness
 create mode 100644 data/2020/iclr/Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks
 create mode 100644 data/2020/iclr/Robust Local Features for Improving the Generalization of Adversarial Training
 create mode 100644 data/2020/iclr/Robust training with ensemble consensus
 create mode 100644 data/2020/iclr/SAdam: A Variant of Adam for Strongly Convex Functions
 create mode 100644 data/2020/iclr/SELF: Learning to Filter Noisy Labels with Self-Ensembling
 create mode 100644 data/2020/iclr/SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards
 create mode 100644 data/2020/iclr/Sampling-Free Learning of Bayesian Quantized Neural Networks
 create mode 100644 data/2020/iclr/Scalable Model Compression by Entropy Penalized Reparameterization
 create mode 100644 data/2020/iclr/Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base
 create mode 100644 data/2020/iclr/Scalable and Order-robust Continual Learning with Additive Parameter Decomposition
 create mode 100644 data/2020/iclr/Selection via Proxy: Efficient Data Selection for Deep Learning
 create mode 100644 data/2020/iclr/Self-Adversarial Learning with Comparative Discrimination for Text Generation
 create mode 100644 data/2020/iclr/Semantically-Guided Representation Learning for Self-Supervised Monocular Depth
 create mode 100644 data/2020/iclr/Sharing Knowledge in Multi-Task Deep Reinforcement Learning
 create mode 100644 data/2020/iclr/Short and Sparse Deconvolution - A Geometric Approach
 create mode 100644 data/2020/iclr/Sign Bits Are All You Need for Black-Box Attacks
 create mode 100644 data/2020/iclr/Sign-OPT: A Query-Efficient Hard-label Adversarial Attack
 create mode 100644 data/2020/iclr/SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
 create mode 100644 data/2020/iclr/Stochastic AUC Maximization with Deep Neural Networks
 create mode 100644 data/2020/iclr/Stochastic Conditional Generative Networks with Basis Decomposition
 create mode 100644 data/2020/iclr/Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well
 create mode 100644 data/2020/iclr/StructPool: Structured Graph Pooling via Conditional Random Fields
 create mode 100644 data/2020/iclr/TabFact: A Large-scale Dataset for Table-based Fact Verification
 create mode 100644 data/2020/iclr/The Implicit Bias of Depth: How Incremental Learning Drives Generalization
 create mode 100644 data/2020/iclr/The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget
 create mode 100644 data/2020/iclr/The asymptotic spectrum of the Hessian of DNN throughout training
 create mode 100644 data/2020/iclr/Theory and Evaluation Metrics for Learning Disentangled Representations
 create mode 100644 data/2020/iclr/Thieves on Sesame Street! Model Extraction of BERT-based APIs
 create mode 100644 data/2020/iclr/To Relieve Your Headache of Training an MRF, Take AdVIL
 create mode 100644 data/2020/iclr/Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets
 create mode 100644 data/2020/iclr/Transferable Perturbations of Deep Feature Distributions
 create mode 100644 data/2020/iclr/Tree-Structured Attention with Hierarchical Accumulation
 create mode 100644 data/2020/iclr/Understanding Architectures Learnt by Cell-based Neural Architecture Search
 create mode 100644 data/2020/iclr/Understanding Knowledge Distillation in Non-autoregressive Machine Translation
 create mode 100644 data/2020/iclr/Understanding the Limitations of Variational Mutual Information Estimators
 create mode 100644 data/2020/iclr/Unpaired Point Cloud Completion on Real Scans using Adversarial Training
 create mode 100644 data/2020/iclr/Unsupervised Model Selection for Variational Disentangled Representation Learning
 create mode 100644 data/2020/iclr/V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control
 create mode 100644 data/2020/iclr/V4D: 4D Convolutional Neural Networks for Video-level Representation Learning
 create mode 100644 data/2020/iclr/VL-BERT: Pre-training of Generic Visual-Linguistic Representations
 create mode 100644 data/2020/iclr/Variational Recurrent Models for Solving Partially Observable Control Tasks
 create mode 100644 data/2020/iclr/Vid2Game: Controllable Characters Extracted from Real-World Videos
 create mode 100644 data/2020/iclr/VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation
 create mode 100644 data/2020/iclr/Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards
 create mode 100644 data/2020/iclr/Weakly Supervised Clustering by Exploiting Unique Class Count
 create mode 100644 data/2020/iclr/What graph neural networks cannot learn: depth vs width
 create mode 100644 data/2021/iclr/A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning
 create mode 100644 data/2021/iclr/A Block Minifloat Representation for Training Deep Neural Networks
 create mode 100644 data/2021/iclr/A Critique of Self-Expressive Deep Subspace Clustering
 create mode 100644 data/2021/iclr/A Design Space Study for LISTA and Beyond
 create mode 100644 data/2021/iclr/A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima
 create mode 100644 data/2021/iclr/A Discriminative Gaussian Mixture Model with Sparsity
 create mode 100644 data/2021/iclr/A Distributional Approach to Controlled Text Generation
 create mode 100644 data/2021/iclr/A Geometric Analysis of Deep Generative Image Models and Its Applications
 create mode 100644 data/2021/iclr/A Good Image Generator Is What You Need for High-Resolution Video Synthesis
 create mode 100644 data/2021/iclr/A Gradient Flow Framework For Analyzing Network Pruning
 create mode 100644 data/2021/iclr/A Hypergradient Approach to Robust Regression without Correspondence
 create mode 100644 data/2021/iclr/A Learning Theoretic Perspective on Local Explainability
 create mode 100644 data/2021/iclr/A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
 create mode 100644 data/2021/iclr/A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks
 create mode 100644 data/2021/iclr/A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference
 create mode 100644 data/2021/iclr/A Temporal Kernel Approach for Deep Learning with Continuous-time Information
 create mode 100644 data/2021/iclr/A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention
 create mode 100644 data/2021/iclr/A Unified Approach to Interpreting and Boosting Adversarial Transferability
 create mode 100644 data/2021/iclr/A Universal Representation Transformer Layer for Few-Shot Image Classification
 create mode 100644 data/2021/iclr/A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels
 create mode 100644 data/2021/iclr/A statistical theory of cold posteriors in deep neural networks
 create mode 100644 data/2021/iclr/A teacher-student framework to distill future trajectories
 create mode 100644 data/2021/iclr/A unifying view on implicit bias in training linear neural networks
 create mode 100644 data/2021/iclr/ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
 create mode 100644 data/2021/iclr/ANOCE: Analysis of Causal Effects with Multiple Mediators via Constrained Structural Learning
 create mode 100644 data/2021/iclr/ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity
 create mode 100644 data/2021/iclr/Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction
 create mode 100644 data/2021/iclr/Accurate Learning of Graph Representations with Graph Multiset Pooling
 create mode 100644 data/2021/iclr/Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning
 create mode 100644 data/2021/iclr/Acting in Delayed Environments with Non-Stationary Markov Policies
 create mode 100644 data/2021/iclr/Activation-level uncertainty in deep neural networks
 create mode 100644 data/2021/iclr/Active Contrastive Learning of Audio-Visual Video Representations
 create mode 100644 data/2021/iclr/AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition
 create mode 100644 data/2021/iclr/AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models
 create mode 100644 data/2021/iclr/AdaSpeech: Adaptive Text to Speech for Custom Voice
 create mode 100644 data/2021/iclr/AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights
 create mode 100644 data/2021/iclr/Adapting to Reward Progressivity via Spectral Reinforcement Learning
 create mode 100644 data/2021/iclr/Adaptive Extra-Gradient Methods for Min-Max Optimization and Games
 create mode 100644 data/2021/iclr/Adaptive Federated Optimization
 create mode 100644 data/2021/iclr/Adaptive Procedural Task Generation for Hard-Exploration Problems
 create mode 100644 data/2021/iclr/Adaptive Universal Generalized PageRank Graph Neural Network
 create mode 100644 data/2021/iclr/Adaptive and Generative Zero-Shot Learning
 create mode 100644 data/2021/iclr/Adversarial score matching and improved sampling for image generation
 create mode 100644 data/2021/iclr/Adversarially Guided Actor-Critic
 create mode 100644 data/2021/iclr/Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification
 create mode 100644 data/2021/iclr/Aligning AI With Shared Human Values
 create mode 100644 data/2021/iclr/An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
 create mode 100644 data/2021/iclr/An Unsupervised Deep Learning Approach for Real-World Image Denoising
 create mode 100644 data/2021/iclr/Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective
 create mode 100644 data/2021/iclr/Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics
 create mode 100644 data/2021/iclr/Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies
 create mode 100644 data/2021/iclr/Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval
 create mode 100644 data/2021/iclr/Anytime Sampling for Autoregressive Models via Ordered Autoencoding
 create mode 100644 data/2021/iclr/Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
 create mode 100644 data/2021/iclr/Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks
 create mode 100644 data/2021/iclr/Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees?
 create mode 100644 data/2021/iclr/Are wider nets better given the same number of parameters?
 create mode 100644 data/2021/iclr/Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning
 create mode 100644 data/2021/iclr/Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors
 create mode 100644 data/2021/iclr/Attentional Constellation Nets for Few-Shot Learning
 create mode 100644 data/2021/iclr/Auction Learning as a Two-Player Game
 create mode 100644 data/2021/iclr/Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting
 create mode 100644 data/2021/iclr/Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation
 create mode 100644 data/2021/iclr/AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly
 create mode 100644 data/2021/iclr/Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization
 create mode 100644 data/2021/iclr/Autoregressive Entity Retrieval
 create mode 100644 data/2021/iclr/Auxiliary Learning by Implicit Differentiation
 create mode 100644 data/2021/iclr/Auxiliary Task Update Decomposition: the Good, the Bad and the neutral
 create mode 100644 data/2021/iclr/Average-case Acceleration for Bilinear Games and Normal Matrices
 create mode 100644 data/2021/iclr/BERTology Meets Biology: Interpreting Attention in Protein Language Models
 create mode 100644 data/2021/iclr/BOIL: Towards Representation Change for Few-shot Learning
 create mode 100644 data/2021/iclr/BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction
 create mode 100644 data/2021/iclr/BREEDS: Benchmarks for Subpopulation Shift
 create mode 100644 data/2021/iclr/BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization
 create mode 100644 data/2021/iclr/BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration
 create mode 100644 data/2021/iclr/Bag of Tricks for Adversarial Training
 create mode 100644 data/2021/iclr/Balancing Constraints and Rewards with Meta-Gradient D4PG
 create mode 100644 data/2021/iclr/Batch Reinforcement Learning Through Continuation Method
 create mode 100644 "data/2021/iclr/Bayesian Few-Shot Classification with One-vs-Each P\303\263lya-Gamma Augmented Gaussian Processes"
 create mode 100644 data/2021/iclr/Behavioral Cloning from Noisy Demonstrations
 create mode 100644 data/2021/iclr/Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods
 create mode 100644 data/2021/iclr/Better Fine-Tuning by Reducing Representational Collapse
 create mode 100644 data/2021/iclr/Beyond Categorical Label Representations for Image Classification
 create mode 100644 data/2021/iclr/Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1 n Parameters
 create mode 100644 data/2021/iclr/BiPointNet: Binary Neural Network for Point Clouds
 create mode 100644 data/2021/iclr/Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech
 create mode 100644 data/2021/iclr/Blending MPC & Value Function Approximation for Efficient Reinforcement Learning
 create mode 100644 data/2021/iclr/Boost then Convolve: Gradient Boosting Meets Graph Neural Networks
 create mode 100644 data/2021/iclr/Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis
 create mode 100644 data/2021/iclr/Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification
 create mode 100644 data/2021/iclr/Byzantine-Resilient Non-Convex Stochastic Gradient Descent
 create mode 100644 data/2021/iclr/C-Learning: Horizon-Aware Cumulative Accessibility Estimation
 create mode 100644 data/2021/iclr/C-Learning: Learning to Achieve Goals via Recursive Classification
 create mode 100644 data/2021/iclr/CO2: Consistent Contrast for Unsupervised Visual Representation Learning
 create mode 100644 data/2021/iclr/CPR: Classifier-Projection Regularization for Continual Learning
 create mode 100644 data/2021/iclr/CPT: Efficient Deep Neural Network Training via Cyclic Precision
 create mode 100644 data/2021/iclr/CT-Net: Channel Tensorization Network for Video Classification
 create mode 100644 data/2021/iclr/CaPC Learning: Confidential and Private Collaborative Learning
 create mode 100644 data/2021/iclr/Calibration of Neural Networks using Splines
 create mode 100644 data/2021/iclr/Calibration tests beyond classification
 create mode 100644 data/2021/iclr/Can a Fruit Fly Learn Word Embeddings?
 create mode 100644 data/2021/iclr/Capturing Label Characteristics in VAEs
 create mode 100644 data/2021/iclr/Categorical Normalizing Flows via Continuous Transformations
 create mode 100644 data/2021/iclr/CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning
 create mode 100644 data/2021/iclr/CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation
 create mode 100644 data/2021/iclr/Certify or Predict: Boosting Certified Robustness with Compositional Architectures
 create mode 100644 data/2021/iclr/Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions
 create mode 100644 data/2021/iclr/Characterizing signal propagation to close the performance gap in unnormalized ResNets
 create mode 100644 data/2021/iclr/ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations
 create mode 100644 data/2021/iclr/Clairvoyance: A Pipeline Toolkit for Medical Time Series
 create mode 100644 data/2021/iclr/Class Normalization for (Continual)? Generalized Zero-Shot Learning
 create mode 100644 data/2021/iclr/Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation
 create mode 100644 data/2021/iclr/Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity
 create mode 100644 data/2021/iclr/CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers
 create mode 100644 data/2021/iclr/CoCon: A Self-Supervised Approach for Controlled Text Generation
 create mode 100644 data/2021/iclr/CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding
 create mode 100644 data/2021/iclr/Collective Robustness Certificates: Exploiting Interdependence in Graph Neural Networks
 create mode 100644 data/2021/iclr/Colorization Transformer
 create mode 100644 data/2021/iclr/Combining Ensembles and Data Augmentation Can Harm Your Calibration
 create mode 100644 data/2021/iclr/Combining Label Propagation and Simple Models out-performs Graph Neural Networks
 create mode 100644 data/2021/iclr/Combining Physics and Machine Learning for Network Flow Estimation
 create mode 100644 data/2021/iclr/Communication in Multi-Agent Reinforcement Learning: Intention Sharing
 create mode 100644 data/2021/iclr/CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment
 create mode 100644 data/2021/iclr/Complex Query Answering with Neural Link Predictors
 create mode 100644 data/2021/iclr/Computational Separation Between Convolutional and Fully-Connected Networks
 create mode 100644 data/2021/iclr/Concept Learners for Few-Shot Learning
 create mode 100644 data/2021/iclr/Conditional Generative Modeling via Learning the Latent Space
 create mode 100644 data/2021/iclr/Conditional Negative Sampling for Contrastive Learning of Visual Representations
 create mode 100644 data/2021/iclr/Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data
 create mode 100644 data/2021/iclr/Conformation-Guided Molecular Representation with Hamiltonian Neural Networks
 create mode 100644 data/2021/iclr/Conservative Safety Critics for Exploration
 create mode 100644 data/2021/iclr/Contemplating Real-World Object Classification
 create mode 100644 data/2021/iclr/Contextual Dropout: An Efficient Sample-Dependent Dropout Module
 create mode 100644 data/2021/iclr/Contextual Transformation Networks for Online Continual Learning
 create mode 100644 data/2021/iclr/Continual learning in recurrent neural networks
 create mode 100644 data/2021/iclr/Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization
 create mode 100644 data/2021/iclr/Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning
 create mode 100644 data/2021/iclr/Contrastive Divergence Learning is a Time Reversal Adversarial Game
 create mode 100644 data/2021/iclr/Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions
 create mode 100644 data/2021/iclr/Contrastive Learning with Adversarial Perturbations for Conditional Text Generation
 create mode 100644 data/2021/iclr/Contrastive Learning with Hard Negative Samples
 create mode 100644 data/2021/iclr/Contrastive Syn-to-Real Generalization
 create mode 100644 data/2021/iclr/Control-Aware Representations for Model-based Reinforcement Learning
 create mode 100644 data/2021/iclr/Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization
 create mode 100644 data/2021/iclr/Convex Regularization behind Neural Reconstruction
 create mode 100644 data/2021/iclr/Coping with Label Shift via Distributionally Robust Optimisation
 create mode 100644 data/2021/iclr/CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks
 create mode 100644 data/2021/iclr/Correcting experience replay for multi-agent communication
 create mode 100644 data/2021/iclr/Counterfactual Generative Networks
 create mode 100644 data/2021/iclr/Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies
 create mode 100644 data/2021/iclr/Creative Sketch Generation
 create mode 100644 data/2021/iclr/Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization
 create mode 100644 data/2021/iclr/Cut out the annotator, keep the cutout: better segmentation with weak supervision
 create mode 100644 data/2021/iclr/DARTS-: Robustly Stepping out of Performance Collapse Without Indicators
 create mode 100644 data/2021/iclr/DC3: A learning method for optimization with hard constraints
 create mode 100644 data/2021/iclr/DDPNOpt: Differential Dynamic Programming Neural Optimizer
 create mode 100644 data/2021/iclr/DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation
 create mode 100644 data/2021/iclr/DINO: A Conditional Energy-Based GAN for Domain Translation
 create mode 100644 data/2021/iclr/DOP: Off-Policy Multi-Agent Decomposed Policy Gradients
 create mode 100644 data/2021/iclr/Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning
 create mode 100644 data/2021/iclr/Data-Efficient Reinforcement Learning with Self-Predictive Representations
 create mode 100644 data/2021/iclr/Dataset Condensation with Gradient Matching
 create mode 100644 data/2021/iclr/Dataset Inference: Ownership Resolution in Machine Learning
 create mode 100644 data/2021/iclr/Dataset Meta-Learning from Kernel Ridge-Regression
 create mode 100644 data/2021/iclr/DeLighT: Deep and Light-weight Transformer
 create mode 100644 data/2021/iclr/Deberta: decoding-Enhanced Bert with Disentangled Attention
 create mode 100644 data/2021/iclr/Debiasing Concept-based Explanations with Causal Analysis
 create mode 100644 data/2021/iclr/Decentralized Attribution of Generative Models
 create mode 100644 data/2021/iclr/Deciphering and Optimizing Multi-Task Learning: a Random Matrix Approach
 create mode 100644 data/2021/iclr/Deconstructing the Regularization of BatchNorm
 create mode 100644 data/2021/iclr/Decoupling Global and Local Representations via Invertible Generative Flows
 create mode 100644 data/2021/iclr/Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation
 create mode 100644 data/2021/iclr/Deep Equals Shallow for ReLU Networks in Kernel Regimes
 create mode 100644 data/2021/iclr/Deep Learning meets Projective Clustering
 create mode 100644 data/2021/iclr/Deep Networks and the Multiple Manifold Problem
 create mode 100644 data/2021/iclr/Deep Neural Network Fingerprinting by Conferrable Adversarial Examples
 create mode 100644 data/2021/iclr/Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS
 create mode 100644 data/2021/iclr/Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks
 create mode 100644 data/2021/iclr/Deep Repulsive Clustering of Ordered Data Based on Order-Identity Decomposition
 create mode 100644 data/2021/iclr/Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients
 create mode 100644 data/2021/iclr/DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs
 create mode 100644 data/2021/iclr/Deformable DETR: Deformable Transformers for End-to-End Object Detection
 create mode 100644 data/2021/iclr/Degree-Quant: Quantization-Aware Training for Graph Neural Networks
 create mode 100644 data/2021/iclr/Denoising Diffusion Implicit Models
 create mode 100644 data/2021/iclr/Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization
 create mode 100644 data/2021/iclr/DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues
 create mode 100644 data/2021/iclr/DiffWave: A Versatile Diffusion Model for Audio Synthesis
 create mode 100644 data/2021/iclr/Differentiable Segmentation of Sequences
 create mode 100644 data/2021/iclr/Differentiable Trust Region Layers for Deep Reinforcement Learning
 create mode 100644 data/2021/iclr/Differentially Private Learning Needs Better Features (or Much More Data)
 create mode 100644 data/2021/iclr/Directed Acyclic Graph Neural Networks
 create mode 100644 data/2021/iclr/Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate
 create mode 100644 data/2021/iclr/Disambiguating Symbolic Expressions in Informal Documents
 create mode 100644 data/2021/iclr/Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization
 create mode 100644 data/2021/iclr/Discovering Non-monotonic Autoregressive Orderings with Variational Inference
 create mode 100644 data/2021/iclr/Discovering a set of policies for the worst case reward
 create mode 100644 data/2021/iclr/Discrete Graph Structure Learning for Forecasting Multiple Time Series
 create mode 100644 data/2021/iclr/Disentangled Recurrent Wasserstein Autoencoder
 create mode 100644 data/2021/iclr/Disentangling 3D Prototypical Networks for Few-Shot Concept Learning
 create mode 100644 data/2021/iclr/Distance-Based Regularisation of Deep Networks for Fine-Tuning
 create mode 100644 data/2021/iclr/Distilling Knowledge from Reader to Retriever for Question Answering
 create mode 100644 data/2021/iclr/Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent
 create mode 100644 data/2021/iclr/Distributional Sliced-Wasserstein and Applications to Generative Modeling
 create mode 100644 data/2021/iclr/Diverse Video Generation using a Gaussian Process Trigger
 create mode 100644 data/2021/iclr/Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs
 create mode 100644 data/2021/iclr/Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth
 create mode 100644 data/2021/iclr/Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning
 create mode 100644 data/2021/iclr/Does enhanced shape bias improve neural network robustness to common corruptions?
 create mode 100644 data/2021/iclr/Domain Generalization with MixStyle
 create mode 100644 data/2021/iclr/Domain-Robust Visual Imitation Learning with Mutual Information Constraints
 create mode 100644 data/2021/iclr/DrNAS: Dirichlet Neural Architecture Search
 create mode 100644 data/2021/iclr/Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration
 create mode 100644 data/2021/iclr/Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
 create mode 100644 data/2021/iclr/DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation
 create mode 100644 data/2021/iclr/Dynamic Tensor Rematerialization
 create mode 100644 data/2021/iclr/EEC: Learning to Encode and Regenerate Images for Continual Learning
 create mode 100644 data/2021/iclr/Early Stopping in Deep Networks: Double Descent and How to Eliminate it
 create mode 100644 data/2021/iclr/Economic Hyperparameter Optimization with Blended Search Strategy
 create mode 100644 data/2021/iclr/Effective Abstract Reasoning with Dual-Contrast Network
 create mode 100644 data/2021/iclr/Effective Distributed Learning with Random Features: Improved Bounds and Algorithms
 create mode 100644 data/2021/iclr/Effective and Efficient Vote Attack on Capsule Networks
 create mode 100644 data/2021/iclr/Efficient Certified Defenses Against Patch Attacks on Image Classifiers
 create mode 100644 data/2021/iclr/Efficient Conformal Prediction via Cascaded Inference with Expanded Admission
 create mode 100644 data/2021/iclr/Efficient Continual Learning with Modular Networks and Task-Driven Priors
 create mode 100644 data/2021/iclr/Efficient Empowerment Estimation for Unsupervised Stabilization
 create mode 100644 data/2021/iclr/Efficient Generalized Spherical CNNs
 create mode 100644 data/2021/iclr/Efficient Inference of Flexible Interaction in Spiking-neuron Networks
 create mode 100644 data/2021/iclr/Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL
 create mode 100644 data/2021/iclr/Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
 create mode 100644 data/2021/iclr/Efficient Wasserstein Natural Gradients for Reinforcement Learning
 create mode 100644 data/2021/iclr/EigenGame: PCA as a Nash Equilibrium
 create mode 100644 data/2021/iclr/Emergent Road Rules In Multi-Agent Driving Environments
 create mode 100644 data/2021/iclr/Emergent Symbols through Binding in External Memory
 create mode 100644 data/2021/iclr/Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition
 create mode 100644 data/2021/iclr/Empirical or Invariant Risk Minimization? A Sample Complexity Perspective
 create mode 100644 data/2021/iclr/End-to-End Egospheric Spatial Memory
 create mode 100644 data/2021/iclr/End-to-end Adversarial Text-to-Speech
 create mode 100644 data/2021/iclr/Enforcing robust control guarantees within neural network policies
 create mode 100644 data/2021/iclr/Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation
 create mode 100644 data/2021/iclr/Entropic gradient descent algorithms and wide flat minima
 create mode 100644 data/2021/iclr/Estimating Lipschitz constants of monotone deep equilibrium models
 create mode 100644 data/2021/iclr/Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors
 create mode 100644 data/2021/iclr/Estimating informativeness of samples with Smooth Unique Information
 create mode 100644 data/2021/iclr/Evaluating the Disentanglement of Deep Generative Models through Manifold Topology
 create mode 100644 data/2021/iclr/Evaluation of Neural Architectures trained with square Loss vs Cross-Entropy in Classification Tasks
 create mode 100644 data/2021/iclr/Evaluation of Similarity-based Explanations
 create mode 100644 data/2021/iclr/Evaluations and Methods for Explanation through Robustness Analysis
 create mode 100644 data/2021/iclr/Evolving Reinforcement Learning Algorithms
 create mode 100644 data/2021/iclr/Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization
 create mode 100644 data/2021/iclr/Explainable Deep One-Class Classification
 create mode 100644 data/2021/iclr/Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs
 create mode 100644 data/2021/iclr/Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning
 create mode 100644 data/2021/iclr/Explaining the Efficacy of Counterfactually Augmented Data
 create mode 100644 data/2021/iclr/Exploring Balanced Feature Spaces for Representation Learning
 create mode 100644 data/2021/iclr/Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit
 create mode 100644 data/2021/iclr/Expressive Power of Invariant and Equivariant Graph Neural Networks
 create mode 100644 data/2021/iclr/Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers
 create mode 100644 data/2021/iclr/Extreme Memorization via Scale of Initialization
 create mode 100644 data/2021/iclr/FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization
 create mode 100644 data/2021/iclr/Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments
 create mode 100644 data/2021/iclr/Fair Mixup: Fairness via Interpolation
 create mode 100644 data/2021/iclr/FairBatch: Batch Selection for Model Fairness
 create mode 100644 data/2021/iclr/FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders
 create mode 100644 data/2021/iclr/Fantastic Four: Differentiable and Efficient Bounds on Singular Values of Convolution Layers
 create mode 100644 data/2021/iclr/Fast And Slow Learning Of Recurrent Independent Mechanisms
 create mode 100644 data/2021/iclr/Fast Geometric Projections for Local Robustness Certification
 create mode 100644 data/2021/iclr/Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers
 create mode 100644 data/2021/iclr/Fast convergence of stochastic subgradient method under interpolation
 create mode 100644 data/2021/iclr/FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
 create mode 100644 data/2021/iclr/Faster Binary Embeddings for Preserving Euclidean Distances
 create mode 100644 data/2021/iclr/FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning
 create mode 100644 data/2021/iclr/FedBN: Federated Learning on Non-IID Features via Local Batch Normalization
 create mode 100644 data/2021/iclr/FedMix: Approximation of Mixup under Mean Augmented Federated Learning
 create mode 100644 data/2021/iclr/Federated Learning Based on Dynamic Regularization
 create mode 100644 data/2021/iclr/Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms
 create mode 100644 data/2021/iclr/Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning
 create mode 100644 data/2021/iclr/Few-Shot Bayesian Optimization with Deep Kernel Surrogates
 create mode 100644 data/2021/iclr/Few-Shot Learning via Learning the Representation, Provably
 create mode 100644 data/2021/iclr/Fidelity-based Deep Adiabatic Scheduling
 create mode 100644 data/2021/iclr/Filtered Inner Product Projection for Crosslingual Embedding Alignment
 create mode 100644 data/2021/iclr/Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
 create mode 100644 data/2021/iclr/Fooling a Complete Neural Network Verifier
 create mode 100644 data/2021/iclr/For self-supervised learning, Rationality implies generalization, provably
 create mode 100644 data/2021/iclr/Fourier Neural Operator for Parametric Partial Differential Equations
 create mode 100644 data/2021/iclr/Free Lunch for Few-shot Learning: Distribution Calibration
 create mode 100644 data/2021/iclr/Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders
 create mode 100644 data/2021/iclr/Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online
 create mode 100644 "data/2021/iclr/GAN \"Steerability\" without optimization"
 create mode 100644 data/2021/iclr/GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images
 create mode 100644 data/2021/iclr/GANs Can Play Lottery Tickets Too
 create mode 100644 data/2021/iclr/GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
 create mode 100644 data/2021/iclr/Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs
 create mode 100644 data/2021/iclr/Generalization bounds via distillation
 create mode 100644 data/2021/iclr/Generalization in data-driven models of primary visual cortex
 create mode 100644 data/2021/iclr/Generalized Energy Based Models
 create mode 100644 data/2021/iclr/Generalized Multimodal ELBO
 create mode 100644 data/2021/iclr/Generalized Variational Continual Learning
 create mode 100644 data/2021/iclr/Generating Adversarial Computer Programs using Optimized Obfuscations
 create mode 100644 data/2021/iclr/Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains
 create mode 100644 data/2021/iclr/Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule
 create mode 100644 data/2021/iclr/Generative Scene Graph Networks
 create mode 100644 data/2021/iclr/Generative Time-series Modeling with Fourier Flows
 create mode 100644 data/2021/iclr/Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning
 create mode 100644 data/2021/iclr/Geometry-Aware Gradient Algorithms for Neural Architecture Search
 create mode 100644 
data/2021/iclr/Geometry-aware Instance-reweighted Adversarial Training create mode 100644 data/2021/iclr/Getting a CLUE: A Method for Explaining Uncertainty Estimates create mode 100644 data/2021/iclr/Global Convergence of Three-layer Neural Networks in the Mean Field Regime create mode 100644 data/2021/iclr/Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime create mode 100644 data/2021/iclr/Go with the flow: Adaptive control for Neural ODEs create mode 100644 data/2021/iclr/GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing create mode 100644 data/2021/iclr/Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability create mode 100644 data/2021/iclr/Gradient Projection Memory for Continual Learning create mode 100644 data/2021/iclr/Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models create mode 100644 data/2021/iclr/Graph Coarsening with Neural Networks create mode 100644 data/2021/iclr/Graph Convolution with Low-rank Learnable Local Filters create mode 100644 data/2021/iclr/Graph Edit Networks create mode 100644 data/2021/iclr/Graph Information Bottleneck for Subgraph Recognition create mode 100644 data/2021/iclr/Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning create mode 100644 data/2021/iclr/Graph-Based Continual Learning create mode 100644 data/2021/iclr/GraphCodeBERT: Pre-training Code Representations with Data Flow create mode 100644 data/2021/iclr/Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity create mode 100644 data/2021/iclr/Grounded Language Learning Fast and Slow create mode 100644 data/2021/iclr/Grounding Language to Autonomously-Acquired Skills via Goal Generation create mode 100644 data/2021/iclr/Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning create mode 100644 data/2021/iclr/Group Equivariant Conditional 
Neural Processes create mode 100644 data/2021/iclr/Group Equivariant Generative Adversarial Networks create mode 100644 data/2021/iclr/Group Equivariant Stand-Alone Self-Attention For Vision create mode 100644 data/2021/iclr/Growing Efficient Deep Networks by Structured Continuous Sparsification create mode 100644 data/2021/iclr/HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark create mode 100644 data/2021/iclr/HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents create mode 100644 data/2021/iclr/Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds create mode 100644 data/2021/iclr/HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients create mode 100644 data/2021/iclr/Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization create mode 100644 data/2021/iclr/Hierarchical Autoregressive Modeling for Neural Video Compression create mode 100644 data/2021/iclr/Hierarchical Reinforcement Learning by Discovering Intrinsic Options create mode 100644 data/2021/iclr/High-Capacity Expert Binary Networks create mode 100644 data/2021/iclr/Hopfield Networks is All You Need create mode 100644 data/2021/iclr/Hopper: Multi-hop Transformer for Spatiotemporal Reasoning create mode 100644 data/2021/iclr/How Benign is Benign Overfitting ? create mode 100644 data/2021/iclr/How Does Mixup Help With Robustness and Generalization? create mode 100644 data/2021/iclr/How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? 
create mode 100644 data/2021/iclr/How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks create mode 100644 data/2021/iclr/How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision create mode 100644 data/2021/iclr/Human-Level Performance in No-Press Diplomacy via Equilibrium Search create mode 100644 data/2021/iclr/HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks create mode 100644 data/2021/iclr/HyperGrid Transformers: Towards A Single Model for Multiple Tasks create mode 100644 data/2021/iclr/Hyperbolic Neural Networks++ create mode 100644 data/2021/iclr/IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression create mode 100644 data/2021/iclr/IEPT: Instance-Level and Episode-Level Pretext Tasks for Few-Shot Learning create mode 100644 data/2021/iclr/INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving create mode 100644 data/2021/iclr/IOT: Instance-wise Layer Reordering for Transformer Structures create mode 100644 data/2021/iclr/Identifying Physical Law of Hamiltonian Systems via Meta-Learning create mode 100644 data/2021/iclr/Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies create mode 100644 data/2021/iclr/Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels create mode 100644 data/2021/iclr/Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering create mode 100644 data/2021/iclr/Impact of Representation Learning in Linear Bandits create mode 100644 data/2021/iclr/Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time create mode 100644 data/2021/iclr/Implicit Gradient Regularization create mode 100644 data/2021/iclr/Implicit Normalizing Flows create mode 100644 data/2021/iclr/Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement 
Learning create mode 100644 data/2021/iclr/Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors create mode 100644 data/2021/iclr/Improved Autoregressive Modeling with Distribution Smoothing create mode 100644 "data/2021/iclr/Improved Estimation of Concentration Under \342\204\223p-Norm Distance Metrics Using Half Spaces" create mode 100644 data/2021/iclr/Improving Adversarial Robustness via Channel-wise Activation Suppressing create mode 100644 data/2021/iclr/Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein create mode 100644 data/2021/iclr/Improving Transformation Invariance in Contrastive Representation Learning create mode 100644 data/2021/iclr/Improving VAEs' Robustness to Adversarial Attack create mode 100644 data/2021/iclr/Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning create mode 100644 data/2021/iclr/In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning create mode 100644 data/2021/iclr/In Search of Lost Domain Generalization create mode 100644 data/2021/iclr/In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness create mode 100644 data/2021/iclr/Incorporating Symmetry into Deep Dynamics Models for Improved Generalization create mode 100644 data/2021/iclr/Incremental few-shot learning via vector quantization in deep embedded space create mode 100644 data/2021/iclr/Individually Fair Gradient Boosting create mode 100644 data/2021/iclr/Individually Fair Rankings create mode 100644 data/2021/iclr/Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks create mode 100644 data/2021/iclr/Influence Estimation for Generative Adversarial Networks create mode 100644 data/2021/iclr/Influence Functions in Deep Learning Are Fragile create mode 100644 data/2021/iclr/InfoBERT: Improving Robustness of 
Language Models from An Information Theoretic Perspective create mode 100644 data/2021/iclr/Information Laundering for Model Privacy create mode 100644 data/2021/iclr/Initialization and Regularization of Factorized Neural Layers create mode 100644 data/2021/iclr/Integrating Categorical Semantics into Unsupervised Domain Translation create mode 100644 data/2021/iclr/Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling create mode 100644 data/2021/iclr/Interpretable Models for Granger Causality Using Self-explaining Neural Networks create mode 100644 data/2021/iclr/Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels create mode 100644 data/2021/iclr/Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking create mode 100644 data/2021/iclr/Interpreting Knowledge Graph Relation Representation from Word Embeddings create mode 100644 data/2021/iclr/Interpreting and Boosting Dropout from a Game-Theoretic View create mode 100644 data/2021/iclr/Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds create mode 100644 data/2021/iclr/Intraclass clustering: an implicit learning ability that regularizes DNNs create mode 100644 data/2021/iclr/Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures create mode 100644 data/2021/iclr/Is Attention Better Than Matrix Decomposition? 
create mode 100644 data/2021/iclr/Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study create mode 100644 data/2021/iclr/IsarStep: a Benchmark for High-level Mathematical Reasoning create mode 100644 data/2021/iclr/Isometric Propagation Network for Generalized Zero-shot Learning create mode 100644 data/2021/iclr/Isometric Transformation Invariant and Equivariant Graph Convolutional Networks create mode 100644 data/2021/iclr/Isotropy in the Contextual Embedding Space: Clusters and Manifolds create mode 100644 data/2021/iclr/Iterated learning for emergent systematicity in VQA create mode 100644 data/2021/iclr/Iterative Empirical Game Solving via Single Policy Best Response create mode 100644 data/2021/iclr/Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory create mode 100644 data/2021/iclr/Knowledge Distillation as Semiparametric Inference create mode 100644 data/2021/iclr/Knowledge distillation via softmax regression representation learning create mode 100644 data/2021/iclr/LEAF: A Learnable Frontend for Audio Classification create mode 100644 data/2021/iclr/LambdaNetworks: Modeling long-range Interactions without Attention create mode 100644 data/2021/iclr/Language-Agnostic Representation Learning of Source Code from Structure and Context create mode 100644 data/2021/iclr/Large Associative Memory Problem in Neurobiology and Machine Learning create mode 100644 data/2021/iclr/Large Batch Simulation for Deep Reinforcement Learning create mode 100644 data/2021/iclr/Large Scale Image Completion via Co-Modulated Generative Adversarial Networks create mode 100644 data/2021/iclr/Large-width functional asymptotics for deep Gaussian neural networks create mode 100644 data/2021/iclr/Latent Convergent Cross Mapping create mode 100644 data/2021/iclr/Latent Skill Planning for Exploration and Transfer create mode 100644 data/2021/iclr/Layer-adaptive Sparsity for the Magnitude-based Pruning create 
mode 100644 data/2021/iclr/Learnable Embedding sizes for Recommender Systems create mode 100644 "data/2021/iclr/Learning \"What-if\" Explanations for Sequential Decision-Making" create mode 100644 data/2021/iclr/Learning A Minimax Optimizer: A Pilot Study create mode 100644 data/2021/iclr/Learning Accurate Entropy Model with Global Reference for Image Compression create mode 100644 data/2021/iclr/Learning Associative Inference Using Fast Weight Memory create mode 100644 data/2021/iclr/Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing create mode 100644 data/2021/iclr/Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency create mode 100644 data/2021/iclr/Learning Deep Features in Instrumental Variable Regression create mode 100644 data/2021/iclr/Learning Energy-Based Generative Models via Coarse-to-Fine Expanding and Sampling create mode 100644 data/2021/iclr/Learning Energy-Based Models by Diffusion Recovery Likelihood create mode 100644 data/2021/iclr/Learning Generalizable Visual Representations via Interactive Gameplay create mode 100644 data/2021/iclr/Learning Hyperbolic Representations of Topological Features create mode 100644 data/2021/iclr/Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize create mode 100644 data/2021/iclr/Learning Invariant Representations for Reinforcement Learning without Reconstruction create mode 100644 data/2021/iclr/Learning Long-term Visual Dynamics with Region Proposal Interaction Networks create mode 100644 data/2021/iclr/Learning Manifold Patch-Based Representations of Man-Made Shapes create mode 100644 data/2021/iclr/Learning Mesh-Based Simulation with Graph Networks create mode 100644 data/2021/iclr/Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch create mode 100644 data/2021/iclr/Learning Neural Event Functions for Ordinary Differential Equations create mode 100644 
data/2021/iclr/Learning Neural Generative Dynamics for Molecular Conformation Generation create mode 100644 data/2021/iclr/Learning Parametrised Graph Shift Operators create mode 100644 data/2021/iclr/Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues create mode 100644 data/2021/iclr/Learning Robust State Abstractions for Hidden-Parameter Block MDPs create mode 100644 data/2021/iclr/Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates create mode 100644 data/2021/iclr/Learning Structural Edits via Incremental Tree Transformations create mode 100644 data/2021/iclr/Learning Subgoal Representations with Slow Dynamics create mode 100644 data/2021/iclr/Learning Task Decomposition with Ordered Memory Policy Network create mode 100644 data/2021/iclr/Learning Task-General Representations with Generative Neuro-Symbolic Modeling create mode 100644 data/2021/iclr/Learning Value Functions in Deep Policy Gradients using Residual Variance create mode 100644 data/2021/iclr/Learning What To Do by Simulating the Past create mode 100644 data/2021/iclr/Learning a Latent Search Space for Routing Problems using Variational Autoencoders create mode 100644 data/2021/iclr/Learning a Latent Simplex in Input Sparsity Time create mode 100644 data/2021/iclr/Learning advanced mathematical computations from examples create mode 100644 data/2021/iclr/Learning and Evaluating Representations for Deep One-Class Classification create mode 100644 data/2021/iclr/Learning continuous-time PDEs from sparse data with graph neural networks create mode 100644 data/2021/iclr/Learning explanations that are hard to vary create mode 100644 data/2021/iclr/Learning from Demonstration with Weakly Supervised Disentanglement create mode 100644 data/2021/iclr/Learning from Protein Structure with Geometric Vector Perceptrons create mode 100644 data/2021/iclr/Learning from others' mistakes: Avoiding dataset biases without modeling them create mode 100644 
data/2021/iclr/Learning perturbation sets for robust machine learning create mode 100644 data/2021/iclr/Learning the Pareto Front with Hypernetworks create mode 100644 data/2021/iclr/Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation create mode 100644 data/2021/iclr/Learning to Generate 3D Shapes with Generative Cellular Automata create mode 100644 data/2021/iclr/Learning to Make Decisions via Submodular Regularization create mode 100644 data/2021/iclr/Learning to Reach Goals via Iterated Supervised Learning create mode 100644 data/2021/iclr/Learning to Recombine and Resample Data For Compositional Generalization create mode 100644 data/2021/iclr/Learning to Represent Action Values as a Hypergraph on the Action Vertices create mode 100644 data/2021/iclr/Learning to Sample with Local and Global Contexts in Experience Replay Buffer create mode 100644 data/2021/iclr/Learning to Set Waypoints for Audio-Visual Navigation create mode 100644 data/2021/iclr/Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units create mode 100644 data/2021/iclr/Learning with AMIGo: Adversarially Motivated Intrinsic Goals create mode 100644 data/2021/iclr/Learning with Feature-Dependent Label Noise: A Progressive Approach create mode 100644 data/2021/iclr/Learning with Instance-Dependent Label Noise: A Sample Sieve Approach create mode 100644 data/2021/iclr/Learning-based Support Estimation in Sublinear Time create mode 100644 data/2021/iclr/Lifelong Learning of Compositional Structures create mode 100644 data/2021/iclr/LiftPool: Bidirectional ConvNet Pooling create mode 100644 data/2021/iclr/Linear Convergent Decentralized Optimization with Compression create mode 100644 data/2021/iclr/Linear Last-iterate Convergence in Constrained Saddle-point Optimization create mode 100644 data/2021/iclr/Linear Mode Connectivity in Multitask and Continual Learning create mode 100644 data/2021/iclr/Local Convergence Analysis of Gradient 
Descent Ascent with Finite Timescale Separation create mode 100644 data/2021/iclr/Local Search Algorithms for Rank-Constrained Convex Optimization create mode 100644 data/2021/iclr/Locally Free Weight Sharing for Network Width Search create mode 100644 data/2021/iclr/Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning create mode 100644 data/2021/iclr/Long Range Arena : A Benchmark for Efficient Transformers create mode 100644 data/2021/iclr/Long-tail learning via logit adjustment create mode 100644 data/2021/iclr/Long-tailed Recognition by Routing Diverse Distribution-Aware Experts create mode 100644 data/2021/iclr/Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search create mode 100644 data/2021/iclr/Lossless Compression of Structured Convolutional Models via Lifting create mode 100644 data/2021/iclr/LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition create mode 100644 data/2021/iclr/MALI: A memory efficient and reverse accurate integrator for Neural ODEs create mode 100644 data/2021/iclr/MARS: Markov Molecular Sampling for Multi-objective Drug Discovery create mode 100644 data/2021/iclr/MELR: Meta-Learning via Modeling Episode-Level Relationships for Few-Shot Learning create mode 100644 data/2021/iclr/MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space create mode 100644 data/2021/iclr/MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training create mode 100644 data/2021/iclr/Mapping the Timescale Organization of Neural Language Models create mode 100644 data/2021/iclr/Mathematical Reasoning via Self-supervised Skip-tree Training create mode 100644 data/2021/iclr/Measuring Massive Multitask Language Understanding create mode 100644 data/2021/iclr/Memory Optimization for Deep Networks create mode 100644 data/2021/iclr/Meta Back-Translation create mode 100644 data/2021/iclr/Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised 
Meta-Learning create mode 100644 data/2021/iclr/Meta-Learning of Structured Task Distributions in Humans and Machines create mode 100644 data/2021/iclr/Meta-Learning with Neural Tangent Kernels create mode 100644 data/2021/iclr/Meta-learning Symmetries by Reparameterization create mode 100644 data/2021/iclr/Meta-learning with negative learning rates create mode 100644 data/2021/iclr/MetaNorm: Learning to Normalize Few-Shot Batches Across Domains create mode 100644 data/2021/iclr/MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering create mode 100644 data/2021/iclr/Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models create mode 100644 data/2021/iclr/Mind the Pad - CNNs Can Develop Blind Spots create mode 100644 data/2021/iclr/Minimum Width for Universal Approximation create mode 100644 data/2021/iclr/Mirostat: a Neural Text decoding Algorithm that directly controls perplexity create mode 100644 data/2021/iclr/MixKD: Towards Efficient Distillation of Large-scale Language Models create mode 100644 data/2021/iclr/Mixed-Features Vectors and Subspace Splitting create mode 100644 data/2021/iclr/MoPro: Webly Supervised Learning with Momentum Prototypes create mode 100644 data/2021/iclr/MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond create mode 100644 data/2021/iclr/Model Patching: Closing the Subgroup Performance Gap with Data Augmentation create mode 100644 data/2021/iclr/Model-Based Offline Planning create mode 100644 data/2021/iclr/Model-Based Visual Planning with Self-Supervised Functional Distances create mode 100644 data/2021/iclr/Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? 
create mode 100644 data/2021/iclr/Modeling the Second Player in Distributionally Robust Optimization create mode 100644 data/2021/iclr/Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System create mode 100644 data/2021/iclr/Molecule Optimization by Explainable Evolution create mode 100644 data/2021/iclr/Monotonic Kronecker-Factored Lattice create mode 100644 data/2021/iclr/Monte-Carlo Planning and Learning with Language Action Value Estimates create mode 100644 data/2021/iclr/More or Less: When and How to Build Convolutional Neural Network Ensembles create mode 100644 data/2021/iclr/Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning create mode 100644 data/2021/iclr/Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks create mode 100644 data/2021/iclr/Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network create mode 100644 data/2021/iclr/Multi-Time Attention Networks for Irregularly Sampled Time Series create mode 100644 data/2021/iclr/Multi-resolution modeling of a discrete stochastic process identifies causes of cancer create mode 100644 data/2021/iclr/Multi-timescale Representation Learning in LSTM Language Models create mode 100644 data/2021/iclr/MultiModalQA: complex question answering over text, tables and images create mode 100644 data/2021/iclr/Multiplicative Filter Networks create mode 100644 data/2021/iclr/Multiscale Score Matching for Out-of-Distribution Detection create mode 100644 data/2021/iclr/Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows create mode 100644 data/2021/iclr/Mutual Information State Intrinsic Control create mode 100644 data/2021/iclr/My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control create mode 100644 data/2021/iclr/NAS-Bench-ASR: Reproducible Neural 
Architecture Search for Speech Recognition create mode 100644 data/2021/iclr/NBDT: Neural-Backed Decision Tree create mode 100644 data/2021/iclr/NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control create mode 100644 data/2021/iclr/NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation create mode 100644 data/2021/iclr/Nearest Neighbor Machine Translation create mode 100644 data/2021/iclr/Negative Data Augmentation create mode 100644 data/2021/iclr/Net-DNF: Effective Deep Modeling of Tabular Data create mode 100644 data/2021/iclr/Network Pruning That Matters: A Case Study on Retraining Variants create mode 100644 data/2021/iclr/Neural Approximate Sufficient Statistics for Implicit Models create mode 100644 data/2021/iclr/Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective create mode 100644 data/2021/iclr/Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks create mode 100644 data/2021/iclr/Neural Delay Differential Equations create mode 100644 data/2021/iclr/Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering create mode 100644 data/2021/iclr/Neural Learning of One-of-Many Solutions for Combinatorial Problems in Structured Output Spaces create mode 100644 data/2021/iclr/Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics create mode 100644 data/2021/iclr/Neural Networks for Learning Counterfactual G-Invariances from Single Environments create mode 100644 data/2021/iclr/Neural ODE Processes create mode 100644 data/2021/iclr/Neural Pruning via Growing Regularization create mode 100644 data/2021/iclr/Neural Spatio-Temporal Point Processes create mode 100644 data/2021/iclr/Neural Synthesis of Binaural Speech From Mono Audio create mode 100644 data/2021/iclr/Neural Thompson Sampling create mode 100644 data/2021/iclr/Neural Topic Model via Optimal Transport 
 create mode 100644 data/2021/iclr/Neural gradients are near-lognormal: improved quantized and sparse training
 create mode 100644 data/2021/iclr/Neural networks with late-phase weights
 create mode 100644 data/2021/iclr/Neural representation and generation for RNA secondary structures
 create mode 100644 data/2021/iclr/Neurally Augmented ALISTA
 create mode 100644 data/2021/iclr/New Bounds For Distributed Mean Estimation and Variance Reduction
 create mode 100644 data/2021/iclr/No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks
 create mode 100644 data/2021/iclr/No MCMC for me: Amortized sampling for fast and stable training of energy-based models
 create mode 100644 data/2021/iclr/Noise against noise: stochastic label noise helps combat inherent label noise
 create mode 100644 data/2021/iclr/Noise or Signal: The Role of Image Backgrounds in Object Recognition
 create mode 100644 data/2021/iclr/Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
 create mode 100644 data/2021/iclr/Nonseparable Symplectic Neural Networks
 create mode 100644 data/2021/iclr/OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning
 create mode 100644 data/2021/iclr/Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers
 create mode 100644 data/2021/iclr/Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation
 create mode 100644 data/2021/iclr/On Data-Augmentation and Consistency-Based Semi-Supervised Learning
 create mode 100644 data/2021/iclr/On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections
 create mode 100644 data/2021/iclr/On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning
 create mode 100644 data/2021/iclr/On Graph Neural Networks versus Graph-Augmented MLPs
 create mode 100644 data/2021/iclr/On InstaHide, Phase Retrieval, and Sparse Matrix Factorization
 create mode 100644 data/2021/iclr/On Learning Universal Representations Across Languages
 create mode 100644 data/2021/iclr/On Position Embeddings in BERT
 create mode 100644 data/2021/iclr/On Self-Supervised Image Representations for GAN Evaluation
 create mode 100644 data/2021/iclr/On Statistical Bias In Active Learning: How and When to Fix It
 create mode 100644 data/2021/iclr/On the Bottleneck of Graph Neural Networks and its Practical Implications
 create mode 100644 data/2021/iclr/On the Critical Role of Conventions in Adaptive Human-AI Collaboration
 create mode 100644 data/2021/iclr/On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis
 create mode 100644 data/2021/iclr/On the Dynamics of Training Attention Models
 create mode 100644 data/2021/iclr/On the Impossibility of Global Convergence in Multi-Loss Optimization
 create mode 100644 data/2021/iclr/On the Origin of Implicit Regularization in Stochastic Gradient Descent
 create mode 100644 data/2021/iclr/On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
 create mode 100644 data/2021/iclr/On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers
 create mode 100644 data/2021/iclr/On the Transfer of Disentangled Representations in Realistic Settings
 create mode 100644 data/2021/iclr/On the Universality of Rotation Equivariant Point Cloud Networks
 create mode 100644 data/2021/iclr/On the Universality of the Double Descent Peak in Ridgeless Regression
 create mode 100644 data/2021/iclr/On the geometry of generalization and memorization in deep neural networks
 create mode 100644 data/2021/iclr/On the mapping between Hopfield networks and Restricted Boltzmann Machines
 create mode 100644 data/2021/iclr/On the role of planning in model-based deep reinforcement learning
 create mode 100644 data/2021/iclr/One Network Fits All? Modular versus Monolithic Task Formulations in Neural Networks
 create mode 100644 data/2021/iclr/Online Adversarial Purification based on Self-supervised Learning
 create mode 100644 data/2021/iclr/Open Question Answering over Tables and Text
 create mode 100644 data/2021/iclr/Optimal Conversion of Conventional Artificial Neural Networks to Spiking Neural Networks
 create mode 100644 data/2021/iclr/Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime
 create mode 100644 data/2021/iclr/Optimal Regularization can Mitigate Double Descent
 create mode 100644 data/2021/iclr/Optimism in Reinforcement Learning with Generalized Linear Function Approximation
 create mode 100644 data/2021/iclr/Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning
 create mode 100644 data/2021/iclr/Orthogonalizing Convolutional Layers with the Cayley Transform
 create mode 100644 data/2021/iclr/Overfitting for Fun and Profit: Instance-Adaptive Data Compression
 create mode 100644 data/2021/iclr/Overparameterisation and worst-case generalisation: friend or foe?
 create mode 100644 data/2021/iclr/PAC Confidence Predictions for Deep Neural Network Classifiers
 create mode 100644 data/2021/iclr/PC2WF: 3D Wireframe Reconstruction from Raw Point Clouds
 create mode 100644 data/2021/iclr/PDE-Driven Spatiotemporal Disentanglement
 create mode 100644 data/2021/iclr/PMI-Masking: Principled masking of correlated spans
 create mode 100644 data/2021/iclr/PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences
 create mode 100644 data/2021/iclr/Parameter Efficient Multimodal Transformers for Video Representation Learning
 create mode 100644 data/2021/iclr/Parameter-Based Value Functions
 create mode 100644 data/2021/iclr/Parrot: Data-Driven Behavioral Priors for Reinforcement Learning
 create mode 100644 data/2021/iclr/Partitioned Learned Bloom Filters
 create mode 100644 data/2021/iclr/Perceptual Adversarial Robustness: Defense Against Unseen Threat Models
 create mode 100644 data/2021/iclr/Personalized Federated Learning with First Order Model Optimization
 create mode 100644 data/2021/iclr/Physics-aware, probabilistic model order reduction with guaranteed stability
 create mode 100644 data/2021/iclr/Plan-Based Relaxed Reward Shaping for Goal-Directed Tasks
 create mode 100644 data/2021/iclr/Planning from Pixels using Inverse Dynamics Models
 create mode 100644 data/2021/iclr/PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics
 create mode 100644 data/2021/iclr/PolarNet: Learning to Optimize Polar Keypoints for Keypoint Based Object Detection
 create mode 100644 data/2021/iclr/Policy-Driven Attack: Learning to Query for Hard-label Black-box Adversarial Examples
 create mode 100644 data/2021/iclr/Practical Massively Parallel Monte-Carlo Tree Search Applied to Molecular Design
 create mode 100644 data/2021/iclr/Practical Real Time Recurrent Learning with a Sparse Approximation
 create mode 100644 data/2021/iclr/Pre-training Text-to-Text Transformers for Concept-centric Common Sense
 create mode 100644 data/2021/iclr/Predicting Classification Accuracy When Adding New Unobserved Classes
 create mode 100644 data/2021/iclr/Predicting Inductive Biases of Pre-Trained Models
 create mode 100644 data/2021/iclr/Predicting Infectiousness for Proactive Contact Tracing
 create mode 100644 data/2021/iclr/Prediction and generalisation over directed actions by grid cells
 create mode 100644 data/2021/iclr/Primal Wasserstein Imitation Learning
 create mode 100644 data/2021/iclr/Private Image Reconstruction from System Side Channels Using Generative Models
 create mode 100644 data/2021/iclr/Private Post-GAN Boosting
 create mode 100644 data/2021/iclr/Probabilistic Numeric Convolutional Neural Networks
 create mode 100644 data/2021/iclr/Probing BERT in Hyperbolic Spaces
 create mode 100644 data/2021/iclr/Progressive Skeletonization: Trimming more fat from a network at initialization
 create mode 100644 data/2021/iclr/Projected Latent Markov Chain Monte Carlo: Conditional Sampling of Normalizing Flows
 create mode 100644 data/2021/iclr/Property Controllable Variational Autoencoder via Invertible Mutual Dependence
 create mode 100644 data/2021/iclr/Protecting DNNs from Theft using an Ensemble of Diverse Models
 create mode 100644 data/2021/iclr/Prototypical Contrastive Learning of Unsupervised Representations
 create mode 100644 data/2021/iclr/Prototypical Representation Learning for Relation Extraction
 create mode 100644 data/2021/iclr/Provable Rich Observation Reinforcement Learning with Combinatorial Latent States
 create mode 100644 data/2021/iclr/Provably robust classification of adversarial examples with detection
 create mode 100644 "data/2021/iclr/Proximal Gradient Descent-Ascent: Variable Convergence under K\305\201 Geometry"
 create mode 100644 data/2021/iclr/Pruning Neural Networks at Initialization: Why Are We Missing the Mark?
 create mode 100644 data/2021/iclr/PseudoSeg: Designing Pseudo Labels for Semantic Segmentation
 create mode 100644 data/2021/iclr/QPLEX: Duplex Dueling Multi-Agent Q-Learning
 create mode 100644 data/2021/iclr/Quantifying Differences in Reward Functions
 create mode 100644 data/2021/iclr/R-GAP: Recursive Gradient Attack on Privacy
 create mode 100644 data/2021/iclr/RMSprop converges with proper hyper-parameter
 create mode 100644 data/2021/iclr/RNNLogic: Learning Logic Rules for Reasoning on Knowledge Graphs
 create mode 100644 data/2021/iclr/RODE: Learning Roles to Decompose Multi-Agent Tasks
 create mode 100644 data/2021/iclr/Random Feature Attention
 create mode 100644 data/2021/iclr/Randomized Automatic Differentiation
 create mode 100644 data/2021/iclr/Randomized Ensembled Double Q-Learning: Learning Fast Without a Model
 create mode 100644 data/2021/iclr/Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments
 create mode 100644 data/2021/iclr/Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator
 create mode 100644 data/2021/iclr/Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets
 create mode 100644 data/2021/iclr/Rapid Task-Solving in Novel Environments
 create mode 100644 data/2021/iclr/Recurrent Independent Mechanisms
 create mode 100644 data/2021/iclr/Reducing the Computational Cost of Deep Generative Models with Binary Neural Networks
 create mode 100644 data/2021/iclr/Refining Deep Generative Models via Discriminator Gradient Flow
 create mode 100644 data/2021/iclr/Regularization Matters in Policy Optimization - An Empirical Study on Continuous Control
 create mode 100644 data/2021/iclr/Regularized Inverse Reinforcement Learning
 create mode 100644 data/2021/iclr/Reinforcement Learning with Random Delays
 create mode 100644 data/2021/iclr/Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models
 create mode 100644 data/2021/iclr/Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting
 create mode 100644 data/2021/iclr/Removing Undesirable Feature Contributions Using Out-of-Distribution Data
 create mode 100644 data/2021/iclr/Representation Balancing Offline Model-based Reinforcement Learning
 create mode 100644 data/2021/iclr/Representation Learning for Sequence Data with Deep Autoencoding Predictive Components
 create mode 100644 data/2021/iclr/Representation Learning via Invariant Causal Mechanisms
 create mode 100644 data/2021/iclr/Representation learning for improved interpretability and classification accuracy of clinical factors from EEG
 create mode 100644 data/2021/iclr/Representing Partial Programs with Blended Abstract Semantics
 create mode 100644 data/2021/iclr/Repurposing Pretrained Models for Robust Out-of-domain Few-Shot Learning
 create mode 100644 data/2021/iclr/ResNet After All: Neural ODEs and Their Numerical Solution
 create mode 100644 data/2021/iclr/Reset-Free Lifelong Learning with Skill-Space Planning
 create mode 100644 data/2021/iclr/Rethinking Architecture Selection in Differentiable NAS
 create mode 100644 data/2021/iclr/Rethinking Attention with Performers
 create mode 100644 data/2021/iclr/Rethinking Embedding Coupling in Pre-trained Language Models
 create mode 100644 data/2021/iclr/Rethinking Positional Encoding in Language Pre-training
 create mode 100644 data/2021/iclr/Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective
 create mode 100644 data/2021/iclr/Rethinking the Role of Gradient-based Attribution Methods for Model Interpretability
 create mode 100644 data/2021/iclr/Retrieval-Augmented Generation for Code Summarization via Hybrid GNN
 create mode 100644 data/2021/iclr/Return-Based Contrastive Representation Learning for Reinforcement Learning
 create mode 100644 data/2021/iclr/Revisiting Dynamic Convolution via Matrix Decomposition
 create mode 100644 data/2021/iclr/Revisiting Few-sample BERT Fine-tuning
 create mode 100644 data/2021/iclr/Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction
 create mode 100644 data/2021/iclr/Revisiting Locally Supervised Learning: an Alternative to End-to-end Training
 create mode 100644 data/2021/iclr/Reweighting Augmented Samples by Minimizing the Maximal Expected Loss
 create mode 100644 data/2021/iclr/Ringing ReLUs: Harmonic Distortion Analysis of Nonlinear Feedforward Networks
 create mode 100644 data/2021/iclr/Risk-Averse Offline Reinforcement Learning
 create mode 100644 data/2021/iclr/Robust Learning of Fixed-Structure Bayesian Networks in Nearly-Linear Time
 create mode 100644 data/2021/iclr/Robust Overfitting may be mitigated by properly learned smoothening
 create mode 100644 data/2021/iclr/Robust Pruning at Initialization
 create mode 100644 data/2021/iclr/Robust Reinforcement Learning on State Observations with Learned Optimal Adversary
 create mode 100644 data/2021/iclr/Robust and Generalizable Visual Representation Learning via Random Convolutions
 create mode 100644 data/2021/iclr/Robust early-learning: Hindering the memorization of noisy labels
 create mode 100644 data/2021/iclr/SAFENet: A Secure, Accurate and Fast Neural Network Inference
 create mode 100644 data/2021/iclr/SALD: Sign Agnostic Learning with Derivatives
 create mode 100644 data/2021/iclr/SCoRe: Pre-Training for Context Representation in Conversational Semantic Parsing
 create mode 100644 data/2021/iclr/SEDONA: Search for Decoupled Neural Networks toward Greedy Block-wise Learning
 create mode 100644 data/2021/iclr/SEED: Self-supervised Distillation For Visual Representation
 create mode 100644 data/2021/iclr/SMiRL: Surprise Minimizing Reinforcement Learning in Unstable Environments
 create mode 100644 data/2021/iclr/SOLAR: Sparse Orthogonal Learned and Random Embeddings
 create mode 100644 data/2021/iclr/SSD: A Unified Framework for Self-Supervised Outlier Detection
 create mode 100644 data/2021/iclr/Saliency is a Possible Red Herring When Diagnosing Poor Generalization
 create mode 100644 data/2021/iclr/SaliencyMix: A Saliency Guided Data Augmentation Strategy for Better Regularization
 create mode 100644 data/2021/iclr/Sample-Efficient Automated Deep Reinforcement Learning
 create mode 100644 data/2021/iclr/Scalable Bayesian Inverse Reinforcement Learning
 create mode 100644 data/2021/iclr/Scalable Learning and MAP Inference for Nonsymmetric Determinantal Point Processes
 create mode 100644 data/2021/iclr/Scalable Transfer Learning with Expert Models
 create mode 100644 data/2021/iclr/Scaling Symbolic Methods using Gradients for Neural Model Explanation
 create mode 100644 data/2021/iclr/Scaling the Convex Barrier with Active Sets
 create mode 100644 data/2021/iclr/Score-Based Generative Modeling through Stochastic Differential Equations
 create mode 100644 data/2021/iclr/Selective Classification Can Magnify Disparities Across Groups
 create mode 100644 data/2021/iclr/Selectivity considered harmful: evaluating the causal impact of class selectivity in DNNs
 create mode 100644 data/2021/iclr/Self-Supervised Learning of Compressed Video Representations
 create mode 100644 data/2021/iclr/Self-Supervised Policy Adaptation during Deployment
 create mode 100644 data/2021/iclr/Self-supervised Adversarial Robustness for the Low-label, High-data Regime
 create mode 100644 data/2021/iclr/Self-supervised Learning from a Multi-view Perspective
 create mode 100644 data/2021/iclr/Self-supervised Representation Learning with Relative Predictive Coding
 create mode 100644 data/2021/iclr/Self-supervised Visual Reinforcement Learning with Object-centric Representations
 create mode 100644 data/2021/iclr/Self-training For Few-shot Transfer Across Extreme Task Differences
 create mode 100644 data/2021/iclr/Semantic Re-tuning with Contrastive Tension
 create mode 100644 data/2021/iclr/Semi-supervised Keypoint Localization
 create mode 100644 data/2021/iclr/SenSeI: Sensitive Set Invariance for Enforcing Individual Fairness
 create mode 100644 data/2021/iclr/Separation and Concentration in Deep Networks
 create mode 100644 data/2021/iclr/Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections
 create mode 100644 data/2021/iclr/Sequential Density Ratio Estimation for Simultaneous Optimization of Speed and Accuracy
 create mode 100644 data/2021/iclr/Set Prediction without Imposing Structure as Conditional Density Estimation
 create mode 100644 data/2021/iclr/Shape or Texture: Understanding Discriminative Features in CNNs
 create mode 100644 data/2021/iclr/Shape-Texture Debiased Neural Network Training
 create mode 100644 data/2021/iclr/Shapley Explanation Networks
 create mode 100644 data/2021/iclr/Shapley explainability on the data manifold
 create mode 100644 data/2021/iclr/Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation
 create mode 100644 data/2021/iclr/Sharper Generalization Bounds for Learning with Gradient-dominated Objective Functions
 create mode 100644 data/2021/iclr/Sharpness-aware Minimization for Efficiently Improving Generalization
 create mode 100644 data/2021/iclr/Signatory: differentiable computations of the signature and logsignature transforms, on both CPU and GPU
 create mode 100644 data/2021/iclr/Simple Augmentation Goes a Long Way: ADRL for DNN Quantization
 create mode 100644 data/2021/iclr/Simple Spectral Graph Convolution
 create mode 100644 data/2021/iclr/Single-Photon Image Classification
 create mode 100644 data/2021/iclr/Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy
 create mode 100644 data/2021/iclr/SkipW: Resource Adaptable RNN with Strict Upper Computational Limit
 create mode 100644 data/2021/iclr/Sliced Kernelized Stein Discrepancy
 create mode 100644 data/2021/iclr/Solving Compositional Reinforcement Learning Problems via Task Reduction
 create mode 100644 data/2021/iclr/Sparse Quantized Spectral Clustering
 create mode 100644 data/2021/iclr/Sparse encoding for more-interpretable feature-selecting representations in probabilistic matrix factorization
 create mode 100644 data/2021/iclr/Spatial Dependency Networks: Neural Layers for Improved Generative Image Modeling
 create mode 100644 data/2021/iclr/Spatially Structured Recurrent Modules
 create mode 100644 data/2021/iclr/Spatio-Temporal Graph Scattering Transform
 create mode 100644 data/2021/iclr/Stabilized Medical Image Attacks
 create mode 100644 data/2021/iclr/Statistical inference for individual fairness
 create mode 100644 data/2021/iclr/Stochastic Security: Adversarial Defense Using Long-Run Dynamics of Energy-Based Models
 create mode 100644 data/2021/iclr/Structured Prediction as Translation between Augmented Natural Languages
 create mode 100644 data/2021/iclr/Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning
 create mode 100644 data/2021/iclr/Support-set bottlenecks for video-text representation learning
 create mode 100644 data/2021/iclr/Symmetry-Aware Actor-Critic for 3D Molecular Design
 create mode 100644 data/2021/iclr/Systematic generalisation with group invariant predictions
 create mode 100644 data/2021/iclr/Taking Notes on the Fly Helps Language Pre-Training
 create mode 100644 data/2021/iclr/Taming GANs with Lookahead-Minmax
 create mode 100644 data/2021/iclr/Targeted Attack against Deep Neural Networks via Flipping Limited Weight Bits
 create mode 100644 data/2021/iclr/Task-Agnostic Morphology Evolution
 create mode 100644 data/2021/iclr/Teaching Temporal Logics to Neural Networks
 create mode 100644 data/2021/iclr/Teaching with Commentaries
 create mode 100644 "data/2021/iclr/Temporally-Extended \316\265-Greedy Exploration"
 create mode 100644 data/2021/iclr/Tent: Fully Test-Time Adaptation by Entropy Minimization
 create mode 100644 data/2021/iclr/Text Generation by Learning from Demonstrations
 create mode 100644 data/2021/iclr/The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers
 create mode 100644 data/2021/iclr/The Importance of Pessimism in Fixed-Dataset Policy Optimization
 create mode 100644 data/2021/iclr/The Intrinsic Dimension of Images and Its Impact on Learning
 create mode 100644 data/2021/iclr/The Recurrent Neural Tangent Kernel
 create mode 100644 data/2021/iclr/The Risks of Invariant Risk Minimization
 create mode 100644 data/2021/iclr/The Role of Momentum Parameters in the Optimal Convergence of Adaptive Polyak's Heavy-ball Methods
 create mode 100644 data/2021/iclr/The Traveling Observer Model: Multi-task Learning Through Spatial Variable Embeddings
 create mode 100644 data/2021/iclr/The Unreasonable Effectiveness of Patches in Deep Convolutional Kernels Methods
 create mode 100644 data/2021/iclr/The geometry of integration in text classification RNNs
 create mode 100644 data/2021/iclr/The inductive bias of ReLU networks on orthogonally separable data
 create mode 100644 data/2021/iclr/The role of Disentanglement in Generalisation
 create mode 100644 data/2021/iclr/Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data
 create mode 100644 data/2021/iclr/Theoretical bounds on estimation error for meta-learning
 create mode 100644 data/2021/iclr/Tilted Empirical Risk Minimization
 create mode 100644 data/2021/iclr/Tomographic Auto-Encoder: Unsupervised Bayesian Recovery of Corrupted Data
 create mode 100644 data/2021/iclr/Topology-Aware Segmentation Using Discrete Morse Theory
 create mode 100644 data/2021/iclr/Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis
 create mode 100644 data/2021/iclr/Towards Impartial Multi-task Learning
 create mode 100644 data/2021/iclr/Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding
 create mode 100644 data/2021/iclr/Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning
 create mode 100644 data/2021/iclr/Towards Robust Neural Networks via Close-loop Control
 create mode 100644 data/2021/iclr/Towards Robustness Against Natural Language Word Substitutions
 create mode 100644 data/2021/iclr/Tradeoffs in Data Augmentation: An Empirical Study
 create mode 100644 data/2021/iclr/Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs
 create mode 100644 data/2021/iclr/Training GANs with Stronger Augmentations via Contrastive Discriminator
 create mode 100644 data/2021/iclr/Training independent subnetworks for robust prediction
 create mode 100644 data/2021/iclr/Training with Quantization Noise for Extreme Model Compression
 create mode 100644 data/2021/iclr/Trajectory Prediction using Equivariant Continuous Convolution
 create mode 100644 data/2021/iclr/Transformer protein language models are unsupervised structure learners
 create mode 100644 data/2021/iclr/Transient Non-stationarity and Generalisation in Deep Reinforcement Learning
 create mode 100644 data/2021/iclr/TropEx: An Algorithm for Extracting Linear Terms in Deep Neural Networks
 create mode 100644 data/2021/iclr/Trusted Multi-View Classification
 create mode 100644 data/2021/iclr/UMEC: Unified model and embedding compression for efficient recommendation systems
 create mode 100644 data/2021/iclr/UPDeT: Universal Multi-agent RL via Policy Decoupling with Transformers
 create mode 100644 data/2021/iclr/Unbiased Teacher for Semi-Supervised Object Detection
 create mode 100644 data/2021/iclr/Uncertainty Estimation and Calibration with Finite-State Probabilistic RNNs
 create mode 100644 data/2021/iclr/Uncertainty Estimation in Autoregressive Structured Prediction
 create mode 100644 data/2021/iclr/Uncertainty Sets for Image Classifiers using Conformal Prediction
 create mode 100644 data/2021/iclr/Uncertainty in Gradient Boosting via Ensembles
 create mode 100644 data/2021/iclr/Uncertainty-aware Active Learning for Optimal Bayesian Classifier
 create mode 100644 data/2021/iclr/Understanding Over-parameterization in Generative Adversarial Networks
 create mode 100644 data/2021/iclr/Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning
 create mode 100644 data/2021/iclr/Understanding and Improving Lexical Choice in Non-Autoregressive Translation
 create mode 100644 data/2021/iclr/Understanding the effects of data parallelism and sparsity on neural network training
 create mode 100644 data/2021/iclr/Understanding the failure modes of out-of-distribution generalization
 create mode 100644 data/2021/iclr/Understanding the role of importance weighting for deep learning
 create mode 100644 data/2021/iclr/Undistillable: Making A Nasty Teacher That CANNOT teach students
 create mode 100644 data/2021/iclr/Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning
 create mode 100644 data/2021/iclr/Universal approximation power of deep residual neural networks via nonlinear control theory
 create mode 100644 data/2021/iclr/Unlearnable Examples: Making Personal Data Unexploitable
 create mode 100644 data/2021/iclr/Unsupervised Audiovisual Synthesis via Exemplar Autoencoders
 create mode 100644 data/2021/iclr/Unsupervised Discovery of 3D Physical Objects from Video
 create mode 100644 data/2021/iclr/Unsupervised Meta-Learning through Latent-Space Interpolation in Generative Models
 create mode 100644 data/2021/iclr/Unsupervised Object Keypoint Learning using Local Spatial Predictability
 create mode 100644 data/2021/iclr/Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding
 create mode 100644 data/2021/iclr/Usable Information and Evolution of Optimal Representations During Training
 create mode 100644 data/2021/iclr/Using latent space regression to analyze and leverage compositionality in GANs
 create mode 100644 data/2021/iclr/VA-RED2: Video Adaptive Redundancy Reduction
 create mode 100644 data/2021/iclr/VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models
 create mode 100644 data/2021/iclr/VCNet and Functional Targeted Regularization For Learning Causal Effects of Continuous Treatments
 create mode 100644 data/2021/iclr/VTNet: Visual Transformer Network for Object Goal Navigation
 create mode 100644 data/2021/iclr/Variational Information Bottleneck for Effective Low-Resource Fine-Tuning
 create mode 100644 data/2021/iclr/Variational Intrinsic Control Revisited
 create mode 100644 data/2021/iclr/Variational State-Space Models for Localisation and Dense 3D Mapping in 6 DoF
 create mode 100644 data/2021/iclr/Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms
 create mode 100644 data/2021/iclr/Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images
 create mode 100644 data/2021/iclr/Viewmaker Networks: Learning Views for Unsupervised Representation Learning
 create mode 100644 data/2021/iclr/Vulnerability-Aware Poisoning Mechanism for Online RL with Unknown Dynamics
 create mode 100644 data/2021/iclr/WaNet - Imperceptible Warping-based Backdoor Attack
 create mode 100644 data/2021/iclr/Wandering within a world: Online contextualized few-shot learning
 create mode 100644 data/2021/iclr/Wasserstein Embedding for Graph Learning
 create mode 100644 data/2021/iclr/Wasserstein-2 Generative Networks
 create mode 100644 data/2021/iclr/Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration
 create mode 100644 data/2021/iclr/WaveGrad: Estimating Gradients for Waveform Generation
 create mode 100644 data/2021/iclr/What Can You Learn From Your Muscles? Learning Visual Representation from Human Interactions
 create mode 100644 data/2021/iclr/What Makes Instance Discrimination Good for Transfer Learning?
 create mode 100644 data/2021/iclr/What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study
 create mode 100644 data/2021/iclr/What Should Not Be Contrastive in Contrastive Learning
 create mode 100644 data/2021/iclr/What are the Statistical Limits of Offline RL with Linear Function Approximation?
 create mode 100644 data/2021/iclr/What they do when in doubt: a study of inductive biases in seq2seq learners
 create mode 100644 data/2021/iclr/When Do Curricula Work?
 create mode 100644 data/2021/iclr/When Optimizing f-Divergence is Robust with Label Noise
 create mode 100644 data/2021/iclr/When does preconditioning help or hurt generalization?
 create mode 100644 data/2021/iclr/Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?
 create mode 100644 data/2021/iclr/Why resampling outperforms reweighting for correcting sampling bias with stochastic gradients
 create mode 100644 data/2021/iclr/Winning the L2RPN Challenge: Power Grid Management via Semi-Markov Afterstate Actor-Critic
 create mode 100644 data/2021/iclr/Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching
 create mode 100644 data/2021/iclr/WrapNet: Neural Net Inference with Ultra-Low-Precision Arithmetic
 create mode 100644 data/2021/iclr/X2T: Training an X-to-Text Typing Interface with Online Learning from User Feedback
 create mode 100644 data/2021/iclr/You Only Need Adversarial Supervision for Semantic Image Synthesis
 create mode 100644 data/2021/iclr/Zero-Cost Proxies for Lightweight NAS
 create mode 100644 data/2021/iclr/Zero-shot Synthesis with Group-Supervised Learning
 create mode 100644 data/2021/iclr/gradSim: Differentiable simulation for system identification and visuomotor control
 create mode 100644 data/2021/iclr/i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning
 create mode 100644 data/2021/iclr/not-MIWAE: Deep Generative Modelling with Missing not at Random Data
 create mode 100644 data/2022/iclr/8-bit Optimizers via Block-wise Quantization
 create mode 100644 data/2022/iclr/A Biologically Interpretable Graph Convolutional Network to Link Genetic Risk Pathways and Imaging Phenotypes of Disease
 create mode 100644 data/2022/iclr/A Class of Short-term Recurrence Anderson Mixing Methods and Their Applications
 create mode 100644 data/2022/iclr/A Comparison of Hamming Errors of Representative Variable Selection Methods
 create mode 100644 data/2022/iclr/A Conditional Point Diffusion-Refinement Paradigm for 3D Point Cloud Completion
 create mode 100644 data/2022/iclr/A Deep Variational Approach to Clustering Survival Data
 create mode 100644 data/2022/iclr/A Fine-Grained Analysis on Distribution Shift
 create mode 100644 data/2022/iclr/A Fine-Tuning Approach to Belief State Modeling
 create mode 100644 data/2022/iclr/A First-Occupancy Representation for Reinforcement Learning
 create mode 100644 data/2022/iclr/A General Analysis of Example-Selection for Stochastic Gradient Descent
 create mode 100644 data/2022/iclr/A Generalized Weighted Optimization Method for Computational Learning and Inversion
 create mode 100644 data/2022/iclr/A Johnson-Lindenstrauss Framework for Randomly Initialized CNNs
 create mode 100644 data/2022/iclr/A Loss Curvature Perspective on Training Instabilities of Deep Learning Models
 create mode 100644 data/2022/iclr/A Neural Tangent Kernel Perspective of Infinite Tree Ensembles
 create mode 100644 "data/2022/iclr/A New Perspective on \"How Graph Neural Networks Go Beyond Weisfeiler-Lehman?\""
 create mode 100644 data/2022/iclr/A Non-Parametric Regression Viewpoint : Generalization of Overparametrized Deep RELU Network Under Noisy Observations
 create mode 100644 data/2022/iclr/A Program to Build E(N)-Equivariant Steerable CNNs
 create mode 100644 data/2022/iclr/A Reduction-Based Framework for Conservative Bandits and Reinforcement Learning
 create mode 100644 data/2022/iclr/A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning
 create mode 100644 data/2022/iclr/A Statistical Framework for Efficient Out of Distribution Detection in Deep Neural Networks
 create mode 100644 data/2022/iclr/A Tale of Two Flows: Cooperative Learning of Langevin Flow and Normalizing Flow Toward Energy-Based Model
 create mode 100644 data/2022/iclr/A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features
 create mode 100644 data/2022/iclr/A Theory of Tournament Representations
 create mode 100644 data/2022/iclr/A Unified Contrastive Energy-based Model for Understanding the Generative Ability of Adversarial Training
 create mode 100644 data/2022/iclr/A Unified Wasserstein Distributional Robustness Framework for Adversarial Training
 create mode 100644 data/2022/iclr/A Zest of LIME: Towards Architecture-Independent Model Distances
 create mode 100644 data/2022/iclr/A fast and accurate splitting method for optimal transport: analysis and implementation
 create mode 100644 data/2022/iclr/A generalization of the randomized singular value decomposition
 create mode 100644 data/2022/iclr/A global convergence theory for deep ReLU implicit networks via over-parameterization
 create mode 100644 data/2022/iclr/ADAVI: Automatic Dual Amortized Variational Inference Applied To Pyramidal Bayesian Models
 create mode 100644 data/2022/iclr/AEVA: Black-box Backdoor Detection Using Adversarial Extreme Value Analysis
 create mode 100644 data/2022/iclr/ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity
 create mode 100644 data/2022/iclr/AS-MLP: An Axial Shifted MLP Architecture for Vision
 create mode 100644 data/2022/iclr/Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions
 create mode 100644 data/2022/iclr/Accelerated Policy Learning with Parallel Differentiable Simulation
 create mode 100644 data/2022/iclr/Acceleration of Federated Learning with Alleviated Forgetting in Local Training
 create mode 100644 data/2022/iclr/Active Hierarchical Exploration with Stable Subgoal Representation Learning
 create mode 100644 data/2022/iclr/Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game
 create mode 100644 data/2022/iclr/Actor-critic is implicitly biased towards high entropy optimal policies
 create mode 100644 data/2022/iclr/Ada-NETS: Face Clustering via Adaptive Neighbour Discovery in the Structure Space
 create mode 100644 data/2022/iclr/AdaAug: Learning Class- and Instance-adaptive Data Augmentation Policies
 create mode 100644 data/2022/iclr/AdaMatch: A Unified Approach to Semi-Supervised Learning and Domain Adaptation
 create mode 100644 data/2022/iclr/AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning
 create mode 100644 data/2022/iclr/Adaptive Wavelet Transformer Network for 3D Shape Representation Learning
 create mode 100644 data/2022/iclr/Adversarial Retriever-Ranker for Dense Text Retrieval
 create mode 100644 data/2022/iclr/Adversarial Robustness Through the Lens of Causality
 create mode 100644 data/2022/iclr/Adversarial Support Alignment
 create mode 100644 data/2022/iclr/Adversarial Unlearning of Backdoors via Implicit Hypergradient
 create mode 100644 data/2022/iclr/Adversarially Robust Conformal Prediction
 create mode 100644 data/2022/iclr/Almost Tight L0-norm Certified Robustness of Top-k Predictions against Adversarial Perturbations
 create mode 100644 data/2022/iclr/AlphaZero-based Proof Cost Network to Aid Game Solving
 create mode 100644 data/2022/iclr/Amortized Implicit Differentiation for Stochastic Bilevel Optimization
 create mode 100644 data/2022/iclr/Amortized Tree Generation for Bottom-up Synthesis Planning and Synthesizable Molecular Design
 create mode 100644 data/2022/iclr/An Agnostic Approach to Federated Learning with Class Imbalance
 create mode 100644 data/2022/iclr/An Autoregressive Flow Model for 3D Molecular Geometry Generation from Scratch
 create mode 100644 data/2022/iclr/An Experimental Design Perspective on Model-Based Reinforcement Learning
 create mode 100644 data/2022/iclr/An Explanation of In-context Learning as Implicit Bayesian Inference
 create mode 100644 data/2022/iclr/An Information Fusion Approach to Learning with Instance-Dependent Label Noise
 create mode 100644 data/2022/iclr/An Operator Theoretic View On Pruning Deep Neural Networks
 create mode 100644 data/2022/iclr/An Unconstrained Layer-Peeled Perspective on Neural Collapse
 create mode 100644 data/2022/iclr/Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models
 create mode 100644 data/2022/iclr/Analyzing and Improving the Optimization Landscape of Noise-Contrastive Estimation
 create mode 100644 data/2022/iclr/Ancestral protein sequence reconstruction using a tree-structured Ornstein-Uhlenbeck variational autoencoder
 create mode 100644 data/2022/iclr/Anisotropic Random Feature Regression in High Dimensions
 create mode 100644 data/2022/iclr/Anomaly Detection for Tabular Data with Internal Contrastive Learning
 create mode 100644 data/2022/iclr/Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy
 create mode 100644 data/2022/iclr/Anti-Concentrated Confidence Bonuses For Scalable Exploration
 create mode 100644 data/2022/iclr/Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
 create mode 100644 data/2022/iclr/Anytime Dense Prediction with Confidence Adaptivity
 create mode 100644 data/2022/iclr/Approximation and Learning with Deep Convolutional Models: a Kernel Perspective
 create mode 100644 data/2022/iclr/Assessing Generalization of SGD via Disagreement
 create mode 100644 data/2022/iclr/Associated Learning: an Alternative to End-to-End Backpropagation that Works on CNN, RNN, and Transformer
 create mode 100644 data/2022/iclr/Asymmetry Learning for Counterfactually-invariant Classification in OOD Tasks
 create mode 100644 data/2022/iclr/Attacking deep networks with surrogate-based adversarial black-box methods is easy
 create mode 100644 data/2022/iclr/Attention-based Interpretability with Concept Transformers
 create mode 100644 data/2022/iclr/Audio Lottery: Speech Recognition Made Ultra-Lightweight, Noise-Robust, and Transferable
 create mode 100644 data/2022/iclr/Augmented Sliced Wasserstein Distances
 create mode 100644 data/2022/iclr/Auto-Transfer: Learning to Route Transferable Representations
 create mode 100644 data/2022/iclr/Auto-scaling Vision Transformers without Training
 create mode 100644 data/2022/iclr/Automated Self-Supervised Learning for Graphs
 create mode 100644 data/2022/iclr/Automatic Loss Function Search for Predict-Then-Optimize Problems with Strong Ranking Property
 create mode 100644 data/2022/iclr/Autonomous Learning of Object-Centric Abstractions for High-Level Planning
 create mode 100644 data/2022/iclr/Autonomous Reinforcement Learning: Formalism and Benchmarking
 create mode 100644 data/2022/iclr/Autoregressive Diffusion Models
 create mode 100644 data/2022/iclr/Autoregressive Quantile Flows for Predictive Uncertainty Estimation
 create mode 100644 data/2022/iclr/Axiomatic Explanations for Visual Search, Retrieval, and Similarity Learning
 create mode 100644 data/2022/iclr/BAM: Bayes with Adaptive Memory
 create mode 100644 data/2022/iclr/BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis
 create mode 100644 data/2022/iclr/BEiT: BERT Pre-Training of Image Transformers
 create mode 100644 data/2022/iclr/Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future
 create mode 100644 data/2022/iclr/Backdoor Defense via Decoupling the Training Process
 create mode 100644 data/2022/iclr/BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models
 create mode 100644 data/2022/iclr/Bag of Instances Aggregation Boosts Self-supervised Distillation
 create mode 100644 data/2022/iclr/Bandit Learning with Joint Effect of Incentivized Sampling, Delayed Sampling Feedback, and Self-Reinforcing User Preferences
 create mode 100644 data/2022/iclr/Bayesian Framework for Gradient Leakage
 create mode 100644 data/2022/iclr/Bayesian Modeling and Uncertainty Quantification for Learning to Optimize: What, Why, and How
 create mode 100644 data/2022/iclr/Bayesian Neural Network Priors
Revisited create mode 100644 data/2022/iclr/Benchmarking the Spectrum of Agent Capabilities create mode 100644 data/2022/iclr/Better Supervisory Signals by Observing Learning Paths create mode 100644 data/2022/iclr/Beyond ImageNet Attack: Towards Crafting Adversarial Examples for Black-box Domains create mode 100644 data/2022/iclr/Bi-linear Value Networks for Multi-goal Reinforcement Learning create mode 100644 data/2022/iclr/BiBERT: Accurate Fully Binarized BERT create mode 100644 data/2022/iclr/Blaschke Product Neural Networks (BPNN): A Physics-Infused Neural Network for Phase Retrieval of Meromorphic Functions create mode 100644 data/2022/iclr/Boosted Curriculum Reinforcement Learning create mode 100644 data/2022/iclr/Boosting Randomized Smoothing with Variance Reduced Classifiers create mode 100644 data/2022/iclr/Boosting the Certified Robustness of L-infinity Distance Nets create mode 100644 data/2022/iclr/Bootstrapped Meta-Learning create mode 100644 data/2022/iclr/Bootstrapping Semantic Segmentation with Regional Contrast create mode 100644 data/2022/iclr/Bregman Gradient Policy Optimization create mode 100644 data/2022/iclr/Bridging Recommendation and Marketing via Recurrent Intensity Modeling create mode 100644 data/2022/iclr/Bridging the Gap: Providing Post-Hoc Symbolic Explanations for Sequential Decision-Making Problems with Inscrutable Representations create mode 100644 data/2022/iclr/Bundle Networks: Fiber Bundles, Local Trivializations, and a Generative Approach to Exploring Many-to-one Maps create mode 100644 data/2022/iclr/Byzantine-Robust Learning on Heterogeneous Datasets via Bucketing create mode 100644 data/2022/iclr/C-Planning: An Automatic Curriculum for Learning Goal-Reaching Tasks create mode 100644 data/2022/iclr/CADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals create mode 100644 data/2022/iclr/CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation create mode 100644 data/2022/iclr/CKConv: 
Continuous Kernel Convolution For Sequential Data create mode 100644 data/2022/iclr/CLEVA-Compass: A Continual Learning Evaluation Assessment Compass to Promote Research Transparency and Comparability create mode 100644 data/2022/iclr/COPA: Certifying Robust Policies for Offline Reinforcement Learning against Poisoning Attacks create mode 100644 data/2022/iclr/COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation create mode 100644 data/2022/iclr/CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothing create mode 100644 data/2022/iclr/Can an Image Classifier Suffice For Action Recognition? create mode 100644 data/2022/iclr/Capacity of Group-invariant Linear Readouts from Equivariant Representations: How Many Objects can be Linearly Classified Under All Possible Views? create mode 100644 data/2022/iclr/Capturing Structural Locality in Non-parametric Language Models create mode 100644 data/2022/iclr/Case-based reasoning for better generalization in textual reinforcement learning create mode 100644 data/2022/iclr/Causal Contextual Bandits with Targeted Interventions create mode 100644 data/2022/iclr/Certified Robustness for Deep Equilibrium Models via Interval Bound Propagation create mode 100644 data/2022/iclr/Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap create mode 100644 data/2022/iclr/Charformer: Fast Character Transformers via Gradient-based Subword Tokenization create mode 100644 data/2022/iclr/Chemical-Reaction-Aware Molecule Representation Learning create mode 100644 data/2022/iclr/Chunked Autoregressive GAN for Conditional Waveform Synthesis create mode 100644 data/2022/iclr/Churn Reduction via Distillation create mode 100644 data/2022/iclr/Clean Images are Hard to Reblur: Exploiting the Ill-Posed Inverse Task for Dynamic Scene Deblurring create mode 100644 data/2022/iclr/ClimateGAN: Raising Climate Change Awareness by 
Generating Images of Floods create mode 100644 data/2022/iclr/Closed-form Sample Probing for Learning Generative Models in Zero-shot Learning create mode 100644 data/2022/iclr/CoBERL: Contrastive BERT for Reinforcement Learning create mode 100644 data/2022/iclr/CoMPS: Continual Meta Policy Search create mode 100644 data/2022/iclr/CoST: Contrastive Learning of Disentangled Seasonal-Trend Representations for Time Series Forecasting create mode 100644 data/2022/iclr/CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation create mode 100644 data/2022/iclr/Coherence-based Label Propagation over Time Series for Accelerated Active Learning create mode 100644 data/2022/iclr/Cold Brew: Distilling Graph Node Representations with Incomplete or Missing Neighborhoods create mode 100644 data/2022/iclr/Collapse by Conditioning: Training Class-conditional GANs with Limited Data create mode 100644 data/2022/iclr/ComPhy: Compositional Physical Reasoning of Objects and Events from Videos create mode 100644 data/2022/iclr/Communication-Efficient Actor-Critic Methods for Homogeneous Markov Games create mode 100644 data/2022/iclr/Comparing Distributions by Measuring Differences that Affect Decision Making create mode 100644 data/2022/iclr/Complete Verification via Multi-Neuron Relaxation Guided Branch-and-Bound create mode 100644 data/2022/iclr/Compositional Attention: Disentangling Search and Retrieval create mode 100644 data/2022/iclr/Compositional Training for End-to-End Deep AUC Maximization create mode 100644 data/2022/iclr/ConFeSS: A Framework for Single Source Cross-Domain Few-Shot Learning create mode 100644 data/2022/iclr/Concurrent Adversarial Learning for Large-Batch Training create mode 100644 data/2022/iclr/Conditional Contrastive Learning with Kernel create mode 100644 data/2022/iclr/Conditional Image Generation by Conditioning Variational Auto-Encoders create mode 100644 data/2022/iclr/Conditional Object-Centric Learning from Video create mode 
100644 data/2022/iclr/Conditioning Sequence-to-sequence Networks with Learned Activations create mode 100644 data/2022/iclr/Connectome-constrained Latent Variable Model of Whole-Brain Neural Activity create mode 100644 data/2022/iclr/Consistent Counterfactuals for Deep Models create mode 100644 data/2022/iclr/Constrained Physical-Statistics Models for Dynamical System Identification and Prediction create mode 100644 data/2022/iclr/Constrained Policy Optimization via Bayesian World Models create mode 100644 data/2022/iclr/Constraining Linear-chain CRFs to Regular Languages create mode 100644 data/2022/iclr/Constructing Orthogonal Convolutions in an Explicit Manner create mode 100644 data/2022/iclr/Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates create mode 100644 data/2022/iclr/Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics create mode 100644 data/2022/iclr/Context-Aware Sparse Deep Coordination Graphs create mode 100644 data/2022/iclr/Contextualized Scene Imagination for Generative Commonsense Reasoning create mode 100644 data/2022/iclr/Continual Learning with Filter Atom Swapping create mode 100644 data/2022/iclr/Continual Learning with Recursive Gradient Optimization create mode 100644 data/2022/iclr/Continual Normalization: Rethinking Batch Normalization for Online Continual Learning create mode 100644 data/2022/iclr/Continuous-Time Meta-Learning with Forward Mode Differentiation create mode 100644 data/2022/iclr/Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization create mode 100644 data/2022/iclr/Contrastive Clustering to Mine Pseudo Parallel Data for Unsupervised Translation create mode 100644 data/2022/iclr/Contrastive Fine-grained Class Clustering via Generative Adversarial Networks create mode 100644 data/2022/iclr/Controlling Directions Orthogonal to a Classifier create mode 100644 data/2022/iclr/Controlling the Complexity and Lipschitz Constant improves 
Polynomial Nets create mode 100644 data/2022/iclr/Convergent Graph Solvers create mode 100644 data/2022/iclr/Convergent and Efficient Deep Q Learning Algorithm create mode 100644 data/2022/iclr/CoordX: Accelerating Implicit Neural Representation with a Split MLP Architecture create mode 100644 data/2022/iclr/Coordination Among Neural Modules Through a Shared Global Workspace create mode 100644 data/2022/iclr/Counterfactual Plans under Distributional Ambiguity create mode 100644 data/2022/iclr/Creating Training Sets via Weak Indirect Supervision create mode 100644 data/2022/iclr/Critical Points in Quantum Generative Models create mode 100644 data/2022/iclr/Cross-Domain Imitation Learning via Optimal Transport create mode 100644 data/2022/iclr/Cross-Lingual Transfer with Class-Weighted Language-Invariant Representations create mode 100644 data/2022/iclr/Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL create mode 100644 data/2022/iclr/CrossBeam: Learning to Search in Bottom-Up Program Synthesis create mode 100644 data/2022/iclr/CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention create mode 100644 data/2022/iclr/CrossMatch: Cross-Classifier Consistency Regularization for Open-Set Single Domain Generalization create mode 100644 data/2022/iclr/CrowdPlay: Crowdsourcing Human Demonstrations for Offline Learning create mode 100644 data/2022/iclr/Crystal Diffusion Variational Autoencoder for Periodic Material Generation create mode 100644 data/2022/iclr/Curriculum learning as a tool to uncover learning principles in the brain create mode 100644 data/2022/iclr/Curvature-Guided Dynamic Scale Networks for Multi-View Stereo create mode 100644 data/2022/iclr/CycleMLP: A MLP-like Architecture for Dense Prediction create mode 100644 data/2022/iclr/D-CODE: Discovering Closed-form ODEs from Observed Trajectories create mode 100644 data/2022/iclr/DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR create mode 100644 
data/2022/iclr/DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning create mode 100644 data/2022/iclr/DEGREE: Decomposition Based Explanation for Graph Neural Networks create mode 100644 data/2022/iclr/DEPTS: Deep Expansion Learning for Periodic Time Series Forecasting create mode 100644 data/2022/iclr/DISSECT: Disentangled Simultaneous Explanations via Concept Traversals create mode 100644 data/2022/iclr/DIVA: Dataset Derivative of a Learning Task create mode 100644 data/2022/iclr/DKM: Differentiable k-Means Clustering Layer for Neural Network Compression create mode 100644 data/2022/iclr/DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization create mode 100644 data/2022/iclr/Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation create mode 100644 data/2022/iclr/Data Poisoning Won't Save You From Facial Recognition create mode 100644 data/2022/iclr/Data-Driven Offline Optimization for Architecting Hardware Accelerators create mode 100644 data/2022/iclr/Data-Efficient Graph Grammar Learning for Molecular Generation create mode 100644 data/2022/iclr/DeSKO: Stability-Assured Robust Control with a Deep Stochastic Koopman Operator create mode 100644 data/2022/iclr/Dealing with Non-Stationarity in MARL via Trust-Region Decomposition create mode 100644 data/2022/iclr/Decentralized Learning for Overparameterized Problems: A Multi-Agent Kernel Approximation Approach create mode 100644 data/2022/iclr/Declarative nets that are equilibrium models create mode 100644 data/2022/iclr/Deconstructing the Inductive Biases of Hamiltonian Neural Networks create mode 100644 data/2022/iclr/Decoupled Adaptation for Cross-Domain Object Detection create mode 100644 data/2022/iclr/Deep Attentive Variational Inference create mode 100644 data/2022/iclr/Deep AutoAugment create mode 100644 data/2022/iclr/Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity 
create mode 100644 data/2022/iclr/Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers create mode 100644 data/2022/iclr/Deep Point Cloud Reconstruction create mode 100644 data/2022/iclr/Deep ReLU Networks Preserve Expected Length create mode 100644 data/2022/iclr/Defending Against Image Corruptions Through Adversarial Augmentations create mode 100644 data/2022/iclr/Delaunay Component Analysis for Evaluation of Data Representations create mode 100644 data/2022/iclr/DemoDICE: Offline Imitation Learning with Supplementary Imperfect Demonstrations create mode 100644 data/2022/iclr/Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization create mode 100644 data/2022/iclr/Demystifying Limited Adversarial Transferability in Automatic Speech Recognition Systems create mode 100644 data/2022/iclr/Denoising Likelihood Score Matching for Conditional Score-based Data Generation create mode 100644 data/2022/iclr/DictFormer: Tiny Transformer with Shared Dictionary create mode 100644 data/2022/iclr/DiffSkill: Skill Abstraction from Differentiable Physics for Deformable Object Manipulations with Tools create mode 100644 data/2022/iclr/Differentiable DAG Sampling create mode 100644 data/2022/iclr/Differentiable Expectation-Maximization for Set Representation Learning create mode 100644 data/2022/iclr/Differentiable Gradient Sampling for Learning Implicit 3D Scene Reconstructions from a Single Image create mode 100644 data/2022/iclr/Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners create mode 100644 data/2022/iclr/Differentiable Scaffolding Tree for Molecule Optimization create mode 100644 data/2022/iclr/Differentially Private Fine-tuning of Language Models create mode 100644 data/2022/iclr/Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space create mode 100644 data/2022/iclr/Diffusion-Based Voice Conversion with Fast Maximum 
Likelihood Sampling Scheme create mode 100644 data/2022/iclr/Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching create mode 100644 data/2022/iclr/Discovering Invariant Rationales for Graph Neural Networks create mode 100644 data/2022/iclr/Discovering Latent Concepts Learned in BERT create mode 100644 data/2022/iclr/Discovering Nonlinear PDEs from Scarce Data with Physics-encoded Learning create mode 100644 data/2022/iclr/Discovering and Explaining the Representation Bottleneck of DNNS create mode 100644 data/2022/iclr/Discrepancy-Based Active Learning for Domain Adaptation create mode 100644 data/2022/iclr/Discrete Representations Strengthen Vision Transformer Robustness create mode 100644 data/2022/iclr/Discriminative Similarity for Data Clustering create mode 100644 data/2022/iclr/Disentanglement Analysis with Partial Information Decomposition create mode 100644 data/2022/iclr/Distilling GANs with Style-Mixed Triplets for X2I Translation with Limited Data create mode 100644 data/2022/iclr/Distribution Compression in Near-Linear Time create mode 100644 data/2022/iclr/Distributional Reinforcement Learning with Monotonic Splines create mode 100644 data/2022/iclr/Distributionally Robust Fair Principal Components via Geodesic Descents create mode 100644 data/2022/iclr/Distributionally Robust Models with Parametric Likelihood Ratios create mode 100644 data/2022/iclr/Diurnal or Nocturnal? 
Federated Learning of Multi-branch Networks from Periodically Shifting Distributions create mode 100644 data/2022/iclr/Dive Deeper Into Integral Pose Regression create mode 100644 data/2022/iclr/Divergence-aware Federated Self-Supervised Learning create mode 100644 data/2022/iclr/Diverse Client Selection for Federated Learning via Submodular Maximization create mode 100644 data/2022/iclr/Divisive Feature Normalization Improves Image Recognition Performance in AlexNet create mode 100644 data/2022/iclr/Do Not Escape From the Manifold: Discovering the Local Coordinates on the Latent Space of GANs create mode 100644 data/2022/iclr/Do Users Benefit From Interpretable Vision? A User Study, Baseline, And Dataset create mode 100644 data/2022/iclr/Do We Need Anisotropic Graph Neural Networks? create mode 100644 data/2022/iclr/Do deep networks transfer invariances across classes? create mode 100644 data/2022/iclr/Does your graph need a confidence boost? Convergent boosted smoothing on graphs with tabular node features create mode 100644 data/2022/iclr/Domain Adversarial Training: A Game Perspective create mode 100644 data/2022/iclr/Domino: Discovering Systematic Errors with Cross-Modal Embeddings create mode 100644 data/2022/iclr/Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information create mode 100644 data/2022/iclr/DriPP: Driven Point Processes to Model Stimuli Induced Patterns in M EEG Signals create mode 100644 data/2022/iclr/Dropout Q-Functions for Doubly Efficient Reinforcement Learning create mode 100644 data/2022/iclr/Dual Lottery Ticket Hypothesis create mode 100644 data/2022/iclr/Dynamic Token Normalization improves Vision Transformers create mode 100644 data/2022/iclr/Dynamics-Aware Comparison of Learned Reward Functions create mode 100644 data/2022/iclr/EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits create mode 100644 data/2022/iclr/EViT: Expediting Vision Transformers via Token Reorganizations create mode 
100644 data/2022/iclr/EXACT: Scalable Graph Neural Networks Training via Extreme Activation Compression create mode 100644 data/2022/iclr/Effect of scale on catastrophic forgetting in neural networks create mode 100644 data/2022/iclr/Effective Model Sparsification by Scheduled Grow-and-Prune Methods create mode 100644 data/2022/iclr/Efficient Active Search for Combinatorial Optimization Problems create mode 100644 data/2022/iclr/Efficient Computation of Deep Nonlinear Infinite-Width Neural Networks that Learn Features create mode 100644 data/2022/iclr/Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization create mode 100644 data/2022/iclr/Efficient Neural Causal Discovery without Acyclicity Constraints create mode 100644 data/2022/iclr/Efficient Self-supervised Vision Transformers for Representation Learning create mode 100644 data/2022/iclr/Efficient Sharpness-aware Minimization for Improved Training of Neural Networks create mode 100644 data/2022/iclr/Efficient Split-Mix Federated Learning for On-Demand and In-Situ Customization create mode 100644 data/2022/iclr/Efficient Token Mixing for Transformers via Adaptive Fourier Neural Operators create mode 100644 data/2022/iclr/Efficient and Differentiable Conformal Prediction with General Function Classes create mode 100644 data/2022/iclr/Efficiently Modeling Long Sequences with Structured State Spaces create mode 100644 data/2022/iclr/EigenGame Unloaded: When playing games is better than optimizing create mode 100644 data/2022/iclr/Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums create mode 100644 data/2022/iclr/Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation create mode 100644 data/2022/iclr/Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise create mode 100644 data/2022/iclr/Embedded-model flows: Combining the inductive biases of model-free deep learning and explicit probabilistic modeling 
create mode 100644 data/2022/iclr/Emergent Communication at Scale create mode 100644 data/2022/iclr/Enabling Arbitrary Translation Objectives with Adaptive Tree Search create mode 100644 data/2022/iclr/Encoding Weights of Irregular Sparsity for Fixed-to-Fixed Model Compression create mode 100644 data/2022/iclr/End-to-End Learning of Probabilistic Hierarchies on Graphs create mode 100644 data/2022/iclr/Energy-Based Learning for Cooperative Games, with Applications to Valuation Problems in Machine Learning create mode 100644 data/2022/iclr/Energy-Inspired Molecular Conformation Optimization create mode 100644 data/2022/iclr/Enhancing Cross-lingual Transfer by Manifold Mixup create mode 100644 data/2022/iclr/EntQA: Entity Linking as Question Answering create mode 100644 data/2022/iclr/Entroformer: A Transformer-based Entropy Model for Learned Image Compression create mode 100644 data/2022/iclr/Environment Predictive Coding for Visual Navigation create mode 100644 data/2022/iclr/Equivariant Graph Mechanics Networks with Constraints create mode 100644 data/2022/iclr/Equivariant Self-Supervised Learning: Encouraging Equivariance in Representations create mode 100644 data/2022/iclr/Equivariant Subgraph Aggregation Networks create mode 100644 data/2022/iclr/Equivariant Transformers for Neural Network based Molecular Potentials create mode 100644 data/2022/iclr/Equivariant and Stable Positional Encoding for More Powerful Graph Neural Networks create mode 100644 data/2022/iclr/Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems create mode 100644 data/2022/iclr/Evading Adversarial Example Detection Defenses with Orthogonal Projected Gradient Descent create mode 100644 data/2022/iclr/Evaluating Disentanglement of Structured Representations create mode 100644 data/2022/iclr/Evaluating Distributional Distortion in Neural Language Modeling create mode 100644 data/2022/iclr/Evaluating Model-Based Planning and Planner Amortization for 
Continuous Control create mode 100644 data/2022/iclr/Evaluation Metrics for Graph Generative Models: Problems, Pitfalls, and Practical Solutions create mode 100644 data/2022/iclr/Evidential Turing Processes create mode 100644 data/2022/iclr/Evolutionary Diversity Optimization with Clustering-based Selection for Reinforcement Learning create mode 100644 data/2022/iclr/ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning create mode 100644 data/2022/iclr/Explainable GNN-Based Models over Knowledge Graphs create mode 100644 data/2022/iclr/Explaining Point Processes by Learning Interpretable Temporal Logic Rules create mode 100644 data/2022/iclr/Explanations of Black-Box Models based on Directional Feature Interactions create mode 100644 data/2022/iclr/Exploiting Class Activation Value for Partial-Label Learning create mode 100644 data/2022/iclr/Exploring Memorization in Adversarial Training create mode 100644 data/2022/iclr/Exploring extreme parameter compression for pre-trained language models create mode 100644 data/2022/iclr/Exploring the Limits of Large Scale Pre-training create mode 100644 data/2022/iclr/Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis--Hastings create mode 100644 data/2022/iclr/Expressiveness and Approximation Properties of Graph Neural Networks create mode 100644 data/2022/iclr/Expressivity of Emergent Languages is a Trade-off between Contextual Complexity and Unpredictability create mode 100644 data/2022/iclr/Extending the WILDS Benchmark for Unsupervised Adaptation create mode 100644 data/2022/iclr/F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization create mode 100644 data/2022/iclr/FALCON: Fast Visual Concept Learning by Integrating Images, Linguistic descriptions, and Conceptual Relations create mode 100644 data/2022/iclr/FILIP: Fine-grained Interactive Language-Image Pre-Training create mode 100644 data/2022/iclr/FILM: Following Instructions in Language with Modular Methods 
create mode 100644 data/2022/iclr/FP-DETR: Detection Transformer Advanced by Fully Pre-training create mode 100644 data/2022/iclr/Fair Normalizing Flows create mode 100644 data/2022/iclr/FairCal: Fairness Calibration for Face Verification create mode 100644 data/2022/iclr/Fairness Guarantees under Demographic Shift create mode 100644 data/2022/iclr/Fairness in Representation for Multilingual NLP: Insights from Controlled Experiments on Conditional Language Modeling create mode 100644 data/2022/iclr/Fast AdvProp create mode 100644 data/2022/iclr/Fast Differentiable Matrix Square Root create mode 100644 data/2022/iclr/Fast Generic Interaction Detection for Model Interpretability and Compression create mode 100644 data/2022/iclr/Fast Model Editing at Scale create mode 100644 data/2022/iclr/Fast Regression for Structured Inputs create mode 100644 data/2022/iclr/Fast topological clustering with Wasserstein distance create mode 100644 data/2022/iclr/FastSHAP: Real-Time Shapley Value Estimation create mode 100644 data/2022/iclr/Feature Kernel Distillation create mode 100644 data/2022/iclr/FedBABU: Toward Enhanced Representation for Federated Image Classification create mode 100644 data/2022/iclr/FedChain: Chained Algorithms for Near-optimal Communication Cost in Federated Learning create mode 100644 data/2022/iclr/FedPara: Low-rank Hadamard Product for Communication-Efficient Federated Learning create mode 100644 data/2022/iclr/Federated Learning from Only Unlabeled Data with Class-conditional-sharing Clients create mode 100644 data/2022/iclr/Few-Shot Backdoor Attacks on Visual Object Tracking create mode 100644 data/2022/iclr/Few-shot Learning via Dirichlet Tessellation Ensemble create mode 100644 data/2022/iclr/Filling the G_ap_s: Multivariate Time Series Imputation by Graph Neural Networks create mode 100644 data/2022/iclr/Filtered-CoPhy: Unsupervised Learning of Counterfactual Physics in Pixel Space create mode 100644 data/2022/iclr/Finding Biological Plausibility for 
Adversarially Robust Features via Metameric Tasks create mode 100644 data/2022/iclr/Finding an Unsupervised Image Segmenter in each of your Deep Generative Models create mode 100644 data/2022/iclr/Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution create mode 100644 data/2022/iclr/Fine-grained Differentiable Physics: A Yarn-level Model for Fabrics create mode 100644 data/2022/iclr/Finetuned Language Models are Zero-Shot Learners create mode 100644 data/2022/iclr/Finite-Time Convergence and Sample Complexity of Multi-Agent Actor-Critic Reinforcement Learning with Average Reward create mode 100644 data/2022/iclr/Fixed Neural Network Steganography: Train the images, not the network create mode 100644 data/2022/iclr/FlexConv: Continuous Kernel Convolutions With Differentiable Kernel Sizes create mode 100644 data/2022/iclr/Focus on the Common Good: Group Distributional Robustness Follows create mode 100644 data/2022/iclr/Fooling Explanations in Text Classifiers create mode 100644 data/2022/iclr/Fortuitous Forgetting in Connectionist Networks create mode 100644 data/2022/iclr/Frame Averaging for Invariant and Equivariant Network Design create mode 100644 data/2022/iclr/Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits create mode 100644 data/2022/iclr/From Intervention to Domain Transportation: A Novel Perspective to Optimize Recommendation create mode 100644 data/2022/iclr/From Stars to Subgraphs: Uplifting Any GNN with Local Structure Awareness create mode 100644 data/2022/iclr/GATSBI: Generative Adversarial Training for Simulation-Based Inference create mode 100644 data/2022/iclr/GDA-AM: On the Effectiveness of Solving Min-Imax Optimization via Anderson Mixing create mode 100644 data/2022/iclr/GLASS: GNN with Labeling Tricks for Subgraph Representation Learning create mode 100644 data/2022/iclr/GNN is a Counter? 
Revisiting GNN for Question Answering create mode 100644 data/2022/iclr/GNN-LM: Language Modeling based on Global Contexts via GNN create mode 100644 data/2022/iclr/GPT-Critic: Offline Reinforcement Learning for End-to-End Task-Oriented Dialogue Systems create mode 100644 data/2022/iclr/GRAND++: Graph Neural Diffusion with A Source Term create mode 100644 data/2022/iclr/Gaussian Mixture Convolution Networks create mode 100644 data/2022/iclr/GeneDisco: A Benchmark for Experimental Design in Drug Discovery create mode 100644 data/2022/iclr/Generalisation in Lifelong Reinforcement Learning through Logical Composition create mode 100644 data/2022/iclr/Generalization Through the Lens of Leave-One-Out Error create mode 100644 data/2022/iclr/Generalization of Neural Combinatorial Solvers Through the Lens of Adversarial Robustness create mode 100644 data/2022/iclr/Generalized Decision Transformer for Offline Hindsight Information Matching create mode 100644 data/2022/iclr/Generalized Demographic Parity for Group Fairness create mode 100644 data/2022/iclr/Generalized Kernel Thinning create mode 100644 data/2022/iclr/Generalized Natural Gradient Flows in Hidden Convex-Concave Games and GANs create mode 100644 data/2022/iclr/Generalized rectifier wavelet covariance models for texture synthesis create mode 100644 data/2022/iclr/Generalizing Few-Shot NAS with Gradient Matching create mode 100644 data/2022/iclr/Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks create mode 100644 data/2022/iclr/Generative Modeling with Optimal Transport Maps create mode 100644 data/2022/iclr/Generative Models as a Data Source for Multiview Representation Learning create mode 100644 data/2022/iclr/Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning create mode 100644 data/2022/iclr/Generative Principal Component Analysis create mode 100644 data/2022/iclr/Generative Pseudo-Inverse Memory create mode 100644 data/2022/iclr/GeoDiff: A 
Geometric Diffusion Model for Molecular Conformation Generation create mode 100644 data/2022/iclr/Geometric Transformers for Protein Interface Contact Prediction create mode 100644 data/2022/iclr/Geometric and Physical Quantities improve E(3) Equivariant Message Passing create mode 100644 data/2022/iclr/Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields create mode 100644 data/2022/iclr/GiraffeDet: A Heavy-Neck Paradigm for Object Detection create mode 100644 data/2022/iclr/Givens Coordinate Descent Methods for Rotation Matrix Learning in Trainable Embedding Indexes create mode 100644 data/2022/iclr/Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games create mode 100644 data/2022/iclr/Goal-Directed Planning via Hindsight Experience Replay create mode 100644 data/2022/iclr/GradMax: Growing Neural Networks using Gradient Information create mode 100644 data/2022/iclr/GradSign: Model Performance Inference with Theoretical Insights create mode 100644 data/2022/iclr/Gradient Importance Learning for Incomplete Observations create mode 100644 data/2022/iclr/Gradient Information Matters in Policy Optimization by Back-propagating through Model create mode 100644 data/2022/iclr/Gradient Matching for Domain Generalization create mode 100644 data/2022/iclr/Gradient Step Denoiser for convergent Plug-and-Play create mode 100644 data/2022/iclr/Granger causal inference on DAGs identifies genomic loci regulating transcription create mode 100644 data/2022/iclr/Graph Auto-Encoder via Neighborhood Wasserstein Reconstruction create mode 100644 data/2022/iclr/Graph Condensation for Graph Neural Networks create mode 100644 data/2022/iclr/Graph Neural Network Guided Local Search for the Traveling Salesperson Problem create mode 100644 data/2022/iclr/Graph Neural Networks with Learnable Structural and Positional Representations create mode 100644 data/2022/iclr/Graph-Augmented Normalizing Flows for Anomaly Detection of Multiple Time 
Series create mode 100644 data/2022/iclr/Graph-Guided Network for Irregularly Sampled Multivariate Time Series create mode 100644 data/2022/iclr/Graph-Relational Domain Adaptation create mode 100644 data/2022/iclr/Graph-based Nearest Neighbor Search in Hyperbolic Spaces create mode 100644 data/2022/iclr/Graph-less Neural Networks: Teaching Old MLPs New Tricks Via Distillation create mode 100644 data/2022/iclr/GraphENS: Neighbor-Aware Ego Network Synthesis for Class-Imbalanced Node Classification create mode 100644 data/2022/iclr/Graphon based Clustering and Testing of Networks: Algorithms and Theory create mode 100644 data/2022/iclr/GreaseLM: Graph REASoning Enhanced Language Models create mode 100644 data/2022/iclr/Group equivariant neural posterior estimation create mode 100644 data/2022/iclr/Group-based Interleaved Pipeline Parallelism for Large-scale DNN Training create mode 100644 data/2022/iclr/HTLM: Hyper-Text Pre-Training and Prompting of Language Models create mode 100644 data/2022/iclr/Half-Inverse Gradients for Physical Deep Learning create mode 100644 data/2022/iclr/Handling Distribution Shifts on Graphs: An Invariance Perspective create mode 100644 data/2022/iclr/Heteroscedastic Temporal Variational Autoencoder For Irregularly Sampled Time Series create mode 100644 data/2022/iclr/Hidden Convexity of Wasserstein GANs: Interpretable Generative Models with Closed-Form Solutions create mode 100644 data/2022/iclr/Hidden Parameter Recurrent State Space Models For Changing Dynamics Scenarios create mode 100644 data/2022/iclr/Hierarchical Few-Shot Imitation with Skill Transition Models create mode 100644 data/2022/iclr/Hierarchical Variational Memory for Few-shot Learning Across Domains create mode 100644 data/2022/iclr/High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize create mode 100644 data/2022/iclr/High Probability Generalization Bounds with Fast Rates for Minimax Problems create mode 100644 data/2022/iclr/Hindsight 
Foresight Relabeling for Meta-Reinforcement Learning create mode 100644 data/2022/iclr/Hindsight is 20 20: Leveraging Past Traversals to Aid 3D Perception create mode 100644 data/2022/iclr/Hindsight: Posterior-guided training of retrievers for improved open-ended generation create mode 100644 data/2022/iclr/Hot-Refresh Model Upgrades with Regression-Free Compatible Training in Image Retrieval create mode 100644 data/2022/iclr/How Attentive are Graph Attention Networks? create mode 100644 data/2022/iclr/How Did the Model Change? Efficiently Assessing Machine Learning API Shifts create mode 100644 data/2022/iclr/How Do Vision Transformers Work? create mode 100644 data/2022/iclr/How Does SimSiam Avoid Collapse Without Negative Samples? A Unified Understanding with Self-supervised Contrastive Learning create mode 100644 data/2022/iclr/How Low Can We Go: Trading Memory for Error in Low-Precision Training create mode 100644 data/2022/iclr/How Much Can CLIP Benefit Vision-and-Language Tasks? create mode 100644 data/2022/iclr/How Well Does Self-Supervised Pre-Training Perform with Streaming Data? create mode 100644 data/2022/iclr/How many degrees of freedom do we need to train deep networks: a loss landscape perspective create mode 100644 data/2022/iclr/How to Inject Backdoors with Better Consistency: Logit Anchoring on Clean Data create mode 100644 data/2022/iclr/How to Robustify Black-Box ML Models? A Zeroth-Order Optimization Perspective create mode 100644 data/2022/iclr/How to Train Your MAML to Excel in Few-Shot Classification create mode 100644 data/2022/iclr/How to deal with missing data in supervised deep learning? create mode 100644 data/2022/iclr/How unlabeled data improve generalization in self-training? 
A one-hidden-layer theoretical analysis create mode 100644 data/2022/iclr/Huber Additive Models for Non-stationary Time Series Analysis create mode 100644 data/2022/iclr/HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation create mode 100644 data/2022/iclr/Hybrid Local SGD for Federated Learning with Heterogeneous Communications create mode 100644 data/2022/iclr/Hybrid Memoised Wake-Sleep: Approximate Inference at the Discrete-Continuous Interface create mode 100644 data/2022/iclr/Hybrid Random Features create mode 100644 data/2022/iclr/HyperDQN: A Randomized Exploration Method for Deep Reinforcement Learning create mode 100644 data/2022/iclr/Hyperparameter Tuning with Renyi Differential Privacy create mode 100644 data/2022/iclr/IFR-Explore: Learning Inter-object Functional Relationships in 3D Indoor Scenes create mode 100644 data/2022/iclr/IGLU: Efficient GCN Training via Lazy Updates create mode 100644 data/2022/iclr/Igeood: An Information Geometry Approach to Out-of-Distribution Detection create mode 100644 data/2022/iclr/Illiterate DALL-E Learns to Compose create mode 100644 data/2022/iclr/Image BERT Pre-training with Online Tokenizer create mode 100644 data/2022/iclr/Imbedding Deep Neural Networks create mode 100644 data/2022/iclr/Imitation Learning by Reinforcement Learning create mode 100644 data/2022/iclr/Imitation Learning from Observations under Transition Model Disparity create mode 100644 data/2022/iclr/Implicit Bias of Adversarial Training for Deep Neural Networks create mode 100644 data/2022/iclr/Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks create mode 100644 data/2022/iclr/Implicit Bias of Projected Subgradient Method Gives Provable Robust Recovery of Subspaces of Unknown Codimension create mode 100644 data/2022/iclr/Improved deterministic l2 robustness on CIFAR-10 and CIFAR-100 create mode 100644 data/2022/iclr/Improving Federated Learning Face Recognition via 
Privacy-Agnostic Clusters create mode 100644 data/2022/iclr/Improving Mutual Information Estimation with Annealed and Energy-Based Bounds create mode 100644 data/2022/iclr/Improving Non-Autoregressive Translation Models Without Distillation create mode 100644 data/2022/iclr/Improving the Accuracy of Learning Example Weights for Imbalance Classification create mode 100644 data/2022/iclr/In a Nutshell, the Human Asked for This: Latent Goals for Following Temporal Specifications create mode 100644 data/2022/iclr/Increasing the Cost of Model Extraction with Calibrated Proof of Work create mode 100644 data/2022/iclr/Incremental False Negative Detection for Contrastive Learning create mode 100644 data/2022/iclr/Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking create mode 100644 data/2022/iclr/Inductive Relation Prediction Using Analogy Subgraph Embeddings create mode 100644 data/2022/iclr/InfinityGAN: Towards Infinite-Pixel Image Synthesis create mode 100644 data/2022/iclr/Information Bottleneck: Exact Analysis of (Quantized) Neural Networks create mode 100644 data/2022/iclr/Information Gain Propagation: a New Way to Graph Active Learning with Soft Labels create mode 100644 data/2022/iclr/Information Prioritization through Empowerment in Visual Model-based RL create mode 100644 data/2022/iclr/Information-theoretic Online Memory Selection for Continual Learning create mode 100644 data/2022/iclr/IntSGD: Adaptive Floatless Compression of Stochastic Gradients create mode 100644 data/2022/iclr/Interacting Contour Stochastic Gradient Langevin Dynamics create mode 100644 data/2022/iclr/Interpretable Unsupervised Diversity Denoising and Artefact Removal create mode 100644 data/2022/iclr/Invariant Causal Representation Learning for Out-of-Distribution Generalization create mode 100644 data/2022/iclr/Inverse Online Learning: Understanding Non-Stationary and Reactionary Policies create mode 100644 data/2022/iclr/Is Fairness Only Metric Deep? 
Evaluating and Addressing Subgroup Gaps in Deep Metric Learning create mode 100644 data/2022/iclr/Is High Variance Unavoidable in RL? A Case Study in Continuous Control create mode 100644 data/2022/iclr/Is Homophily a Necessity for Graph Neural Networks? create mode 100644 data/2022/iclr/Is Importance Weighting Incompatible with Interpolating Classifiers? create mode 100644 data/2022/iclr/It Takes Four to Tango: Multiagent Self Play for Automatic Curriculum Generation create mode 100644 data/2022/iclr/It Takes Two to Tango: Mixup for Deep Metric Learning create mode 100644 data/2022/iclr/Iterated Reasoning with Mutual Information in Cooperative and Byzantine Decentralized Teaming create mode 100644 data/2022/iclr/Iterative Refinement Graph Neural Network for Antibody Sequence-Structure Co-design create mode 100644 data/2022/iclr/Joint Shapley values: a measure of joint feature importance create mode 100644 data/2022/iclr/KL Guided Domain Adaptation create mode 100644 data/2022/iclr/Know Thyself: Transferable Visual Control Policies Through Robot-Awareness create mode 100644 data/2022/iclr/Know Your Action Set: Learning Action Relations for Reinforcement Learning create mode 100644 data/2022/iclr/Knowledge Infused Decoding create mode 100644 data/2022/iclr/Knowledge Removal in Sampling-based Bayesian Inference create mode 100644 data/2022/iclr/L0-Sparse Canonical Correlation Analysis create mode 100644 data/2022/iclr/LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5 create mode 100644 data/2022/iclr/LIGS: Learnable Intrinsic-Reward Generation Selection for Multi-Agent Learning create mode 100644 data/2022/iclr/LORD: Lower-Dimensional Embedding of Log-Signature in Neural Rough Differential Equations create mode 100644 data/2022/iclr/Label Encoding for Regression Networks create mode 100644 data/2022/iclr/Label Leakage and Protection in Two-party Split Learning create mode 100644 data/2022/iclr/Label-Efficient Semantic 
Segmentation with Diffusion Models create mode 100644 data/2022/iclr/Language model compression with weighted low-rank factorization create mode 100644 data/2022/iclr/Language modeling via stochastic processes create mode 100644 data/2022/iclr/Language-biased image classification: evaluation based on semantic representations create mode 100644 data/2022/iclr/Language-driven Semantic Segmentation create mode 100644 data/2022/iclr/Large Language Models Can Be Strong Differentially Private Learners create mode 100644 data/2022/iclr/Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect create mode 100644 data/2022/iclr/Large-Scale Representation Learning on Graphs via Bootstrapping create mode 100644 data/2022/iclr/Latent Image Animator: Learning to Animate Images via Latent Space Navigation create mode 100644 data/2022/iclr/Latent Variable Sequential Set Transformers for Joint Multi-Agent Motion Prediction create mode 100644 data/2022/iclr/Learn Locally, Correct Globally: A Distributed Algorithm for Training Graph Neural Networks create mode 100644 data/2022/iclr/Learnability Lock: Authorized Learnability Control Through Adversarial Invertible Transformations create mode 100644 data/2022/iclr/Learnability of convolutional neural networks for infinite dimensional input via mixed and anisotropic smoothness create mode 100644 data/2022/iclr/Learned Simulators for Turbulence create mode 100644 data/2022/iclr/Learning 3D Representations of Molecular Chirality with Invariance to Bond Rotations create mode 100644 data/2022/iclr/Learning Altruistic Behaviours in Reinforcement Learning without External Rewards create mode 100644 data/2022/iclr/Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction create mode 100644 data/2022/iclr/Learning Causal Models from Conditional Moment Restrictions by Importance Weighting create mode 100644 data/2022/iclr/Learning Continuous Environment Fields via Implicit Functions create mode 100644 
data/2022/iclr/Learning Curves for Gaussian Process Regression with Power-Law Priors and Targets create mode 100644 data/2022/iclr/Learning Curves for SGD on Structured Features create mode 100644 data/2022/iclr/Learning Discrete Structured Variational Auto-Encoder using Natural Evolution Strategies create mode 100644 data/2022/iclr/Learning Disentangled Representation by Exploiting Pretrained Generative Models: A Contrastive Learning View create mode 100644 data/2022/iclr/Learning Distributionally Robust Models at Scale via Composite Optimization create mode 100644 data/2022/iclr/Learning Efficient Image Super-Resolution Networks via Structure-Regularized Pruning create mode 100644 data/2022/iclr/Learning Efficient Online 3D Bin Packing on Packing Configuration Trees create mode 100644 data/2022/iclr/Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality create mode 100644 data/2022/iclr/Learning Fast, Learning Slow: A General Continual Learning Method based on Complementary Learning System create mode 100644 data/2022/iclr/Learning Features with Parameter-Free Layers create mode 100644 data/2022/iclr/Learning Generalizable Representations for Reinforcement Learning via Adaptive Meta-learner of Behavioral Similarities create mode 100644 data/2022/iclr/Learning Graphon Mean Field Games and Approximate Nash Equilibria create mode 100644 data/2022/iclr/Learning Guarantees for Graph Convolutional Networks on the Stochastic Block Model create mode 100644 data/2022/iclr/Learning Hierarchical Structures with Differentiable Nondeterministic Stacks create mode 100644 data/2022/iclr/Learning Long-Term Reward Redistribution via Randomized Return Decomposition create mode 100644 data/2022/iclr/Learning Multimodal VAEs through Mutual Supervision create mode 100644 data/2022/iclr/Learning Neural Contextual Bandits through Perturbed Rewards create mode 100644 data/2022/iclr/Learning Object-Oriented Dynamics for Planning from Text create mode 100644 
data/2022/iclr/Learning Optimal Conformal Classifiers create mode 100644 data/2022/iclr/Learning Prototype-oriented Set Representations for Meta-Learning create mode 100644 data/2022/iclr/Learning Pruning-Friendly Networks via Frank-Wolfe: One-Shot, Any-Sparsity, And No Retraining create mode 100644 data/2022/iclr/Learning Representation from Neural Fisher Kernel with Low-rank Approximation create mode 100644 data/2022/iclr/Learning Scenario Representation for Solving Two-stage Stochastic Integer Programs create mode 100644 data/2022/iclr/Learning State Representations via Retracing in Reinforcement Learning create mode 100644 data/2022/iclr/Learning Strides in Convolutional Neural Networks create mode 100644 data/2022/iclr/Learning Super-Features for Image Retrieval create mode 100644 data/2022/iclr/Learning Synthetic Environments and Reward Networks for Reinforcement Learning create mode 100644 data/2022/iclr/Learning Temporally Causal Latent Processes from General Temporal Data create mode 100644 data/2022/iclr/Learning Towards The Largest Margins create mode 100644 data/2022/iclr/Learning Transferable Reward for Query Object Localization with Policy Adaptation create mode 100644 data/2022/iclr/Learning Value Functions from Undirected State-only Experience create mode 100644 data/2022/iclr/Learning Versatile Neural Architectures by Propagating Network Codes create mode 100644 data/2022/iclr/Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers create mode 100644 data/2022/iclr/Learning Weakly-supervised Contrastive Representations create mode 100644 data/2022/iclr/Learning a subspace of policies for online adaptation in Reinforcement Learning create mode 100644 data/2022/iclr/Learning by Directional Gradient Descent create mode 100644 data/2022/iclr/Learning curves for continual learning in neural networks: Self-knowledge transfer and forgetting create mode 100644 data/2022/iclr/Learning meta-features for AutoML create mode 
100644 data/2022/iclr/Learning more skills through optimistic exploration create mode 100644 data/2022/iclr/Learning the Dynamics of Physical Systems from Sparse Observations with Finite Element Networks create mode 100644 data/2022/iclr/Learning to Annotate Part Segmentation with Gradient Matching create mode 100644 data/2022/iclr/Learning to Complete Code with Sketches create mode 100644 data/2022/iclr/Learning to Dequantise with Truncated Flows create mode 100644 data/2022/iclr/Learning to Downsample for Segmentation of Ultra-High Resolution Images create mode 100644 data/2022/iclr/Learning to Extend Molecular Scaffolds with Structural Motifs create mode 100644 data/2022/iclr/Learning to Generalize across Domains on Single Test Samples create mode 100644 data/2022/iclr/Learning to Guide and to be Guided in the Architect-Builder Problem create mode 100644 data/2022/iclr/Learning to Map for Active Semantic Goal Navigation create mode 100644 data/2022/iclr/Learning to Remember Patterns: Pattern Matching Memory Networks for Traffic Forecasting create mode 100644 data/2022/iclr/Learning to Schedule Learning rate with Graph Neural Networks create mode 100644 data/2022/iclr/Learning transferable motor skills with hierarchical latent mixture policies create mode 100644 data/2022/iclr/Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations create mode 100644 data/2022/iclr/Learning-Augmented $k$-means Clustering create mode 100644 data/2022/iclr/Leveraging Automated Unit Tests for Unsupervised Code Translation create mode 100644 data/2022/iclr/Leveraging unlabeled data to predict out-of-distribution performance create mode 100644 "data/2022/iclr/Likelihood Training of Schr\303\266dinger Bridge using Forward-Backward SDEs Theory" create mode 100644 data/2022/iclr/Linking Emergent and Natural Languages via Corpus Transfer create mode 100644 data/2022/iclr/Lipschitz-constrained Unsupervised Skill Discovery create mode 100644 data/2022/iclr/LoRA: 
Low-Rank Adaptation of Large Language Models create mode 100644 data/2022/iclr/Local Feature Swapping for Generalization in Reinforcement Learning create mode 100644 data/2022/iclr/Long Expressive Memory for Sequence Modeling create mode 100644 data/2022/iclr/Looking Back on Learned Experiences For Class task Incremental Learning create mode 100644 data/2022/iclr/Lossless Compression with Probabilistic Circuits create mode 100644 data/2022/iclr/Lossy Compression with Distribution Shift as Entropy Constrained Optimal Transport create mode 100644 data/2022/iclr/Low-Budget Active Learning via Wasserstein Distance: An Integer Programming Approach create mode 100644 data/2022/iclr/MAML is a Noisy Contrastive Learner in Classification create mode 100644 data/2022/iclr/MCMC Should Mix: Learning Energy-Based Model with Neural Transport Latent Space MCMC create mode 100644 data/2022/iclr/MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling create mode 100644 data/2022/iclr/MT3: Multi-Task Multitrack Music Transcription create mode 100644 data/2022/iclr/MaGNET: Uniform Sampling from Deep Generative Network Manifolds Without Retraining create mode 100644 data/2022/iclr/Machine Learning For Elliptic PDEs: Fast Rate Generalization Bound, Neural Scaling Law and Minimax Optimality create mode 100644 data/2022/iclr/Map Induction: Compositional spatial submap learning for efficient exploration in novel environments create mode 100644 data/2022/iclr/Mapping Language Models to Grounded Conceptual Spaces create mode 100644 data/2022/iclr/Mapping conditional distributions for domain adaptation under generalized target shift create mode 100644 data/2022/iclr/Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning create mode 100644 data/2022/iclr/Maximizing Ensemble Diversity in Deep Reinforcement Learning create mode 100644 data/2022/iclr/Maximum Entropy RL (Provably) Solves Some Robust RL Problems create mode 100644 
data/2022/iclr/Maximum n-times Coverage for Vaccine Design create mode 100644 data/2022/iclr/Measuring CLEVRness: Black-box Testing of Visual Reasoning Models create mode 100644 data/2022/iclr/Measuring the Interpretability of Unsupervised Representations via Quantized Reversed Probing create mode 100644 data/2022/iclr/Memorizing Transformers create mode 100644 data/2022/iclr/Memory Augmented Optimizers for Deep Learning create mode 100644 data/2022/iclr/Memory Replay with Data Compression for Continual Learning create mode 100644 data/2022/iclr/Mention Memory: incorporating textual knowledge into Transformers through entity mention attention create mode 100644 data/2022/iclr/Message Passing Neural PDE Solvers create mode 100644 data/2022/iclr/Meta Discovery: Learning to Discover Novel Classes given Very Limited Data create mode 100644 data/2022/iclr/Meta Learning Low Rank Covariance Factors for Energy Based Deterministic Uncertainty create mode 100644 data/2022/iclr/Meta-Imitation Learning by Watching Video Demonstrations create mode 100644 data/2022/iclr/Meta-Learning with Fewer Tasks through Task Interpolation create mode 100644 data/2022/iclr/MetaMorph: Learning Universal Controllers with Transformers create mode 100644 data/2022/iclr/MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts create mode 100644 data/2022/iclr/Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks create mode 100644 data/2022/iclr/Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond create mode 100644 data/2022/iclr/Minimax Optimality (Probably) Doesn't Imply Distribution Learning for GANs create mode 100644 data/2022/iclr/Minimax Optimization with Smooth Algorithmic Adversaries create mode 100644 data/2022/iclr/Mirror Descent Policy Optimization create mode 100644 data/2022/iclr/Missingness Bias in Model Debugging create mode 100644 data/2022/iclr/MoReL: Multi-omics 
Relational Learning create mode 100644 data/2022/iclr/MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer create mode 100644 data/2022/iclr/Model Agnostic Interpretability for Multiple Instance Learning create mode 100644 data/2022/iclr/Model Zoo: A Growing Brain That Learns Continually create mode 100644 data/2022/iclr/Model-Based Offline Meta-Reinforcement Learning with Regularization create mode 100644 data/2022/iclr/Model-augmented Prioritized Experience Replay create mode 100644 data/2022/iclr/Modeling Label Space Interactions in Multi-label Classification using Box Embeddings create mode 100644 data/2022/iclr/Modular Lifelong Reinforcement Learning via Neural Composition create mode 100644 data/2022/iclr/MonoDistill: Learning Spatial Features for Monocular 3D Object Detection create mode 100644 data/2022/iclr/Monotonic Differentiable Sorting Networks create mode 100644 data/2022/iclr/Multi-Agent MDP Homomorphic Networks create mode 100644 data/2022/iclr/Multi-Critic Actor Learning: Teaching RL Policies to Act with Style create mode 100644 data/2022/iclr/Multi-Mode Deep Matrix and Tensor Factorization create mode 100644 data/2022/iclr/Multi-Stage Episodic Control for Strategic Exploration in Text Games create mode 100644 data/2022/iclr/Multi-Task Processes create mode 100644 data/2022/iclr/Multi-objective Optimization by Learning Space Partition create mode 100644 data/2022/iclr/Multimeasurement Generative Models create mode 100644 data/2022/iclr/Multiset-Equivariant Set Prediction with Approximate Implicit Differentiation create mode 100644 data/2022/iclr/Multitask Prompted Training Enables Zero-Shot Task Generalization create mode 100644 data/2022/iclr/NAS-Bench-Suite: NAS Evaluation is (Now) Surprisingly Easy create mode 100644 data/2022/iclr/NASI: Label- and Data-agnostic Neural Architecture Search at Initialization create mode 100644 data/2022/iclr/NASPY: Automated Extraction of Automated Machine Learning Models create mode 
100644 data/2022/iclr/NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training create mode 100644 data/2022/iclr/NODE-GAM: Neural Generalized Additive Model for Interpretable Deep Learning create mode 100644 data/2022/iclr/Natural Language Descriptions of Deep Visual Features create mode 100644 data/2022/iclr/Natural Posterior Network: Deep Bayesian Predictive Uncertainty for Exponential Family Distributions create mode 100644 data/2022/iclr/Near-Optimal Reward-Free Exploration for Linear Mixture MDPs with Plug-in Solver create mode 100644 data/2022/iclr/Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism create mode 100644 data/2022/iclr/Network Augmentation for Tiny Deep Learning create mode 100644 data/2022/iclr/Network Insensitivity to Parameter Noise via Parameter Attack During Training create mode 100644 data/2022/iclr/NeuPL: Neural Population Learning create mode 100644 data/2022/iclr/Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path create mode 100644 data/2022/iclr/Neural Contextual Bandits with Deep Representation and Shallow Exploration create mode 100644 data/2022/iclr/Neural Deep Equilibrium Solvers create mode 100644 data/2022/iclr/Neural Link Prediction with Walk Pooling create mode 100644 data/2022/iclr/Neural Markov Controlled SDE: Stochastic Optimization for Continuous-Time Data create mode 100644 data/2022/iclr/Neural Methods for Logical Reasoning over Knowledge Graphs create mode 100644 data/2022/iclr/Neural Models for Output-Space Invariance in Combinatorial Problems create mode 100644 data/2022/iclr/Neural Network Approximation based on Hausdorff distance of Tropical Zonotopes create mode 100644 data/2022/iclr/Neural Networks as Kernel Learners: The Silent Alignment Effect create mode 100644 data/2022/iclr/Neural Parameter Allocation Search create mode 100644 data/2022/iclr/Neural Processes 
with Stochastic Attention: Paying more attention to the context dataset create mode 100644 data/2022/iclr/Neural Program Synthesis with Query create mode 100644 data/2022/iclr/Neural Relational Inference with Node-Specific Information create mode 100644 data/2022/iclr/Neural Solvers for Fast and Accurate Numerical Optimal Control create mode 100644 data/2022/iclr/Neural Spectral Marked Point Processes create mode 100644 data/2022/iclr/Neural Stochastic Dual Dynamic Programming create mode 100644 data/2022/iclr/Neural Structured Prediction for Inductive Node Classification create mode 100644 data/2022/iclr/Neural Variational Dropout Processes create mode 100644 data/2022/iclr/Neural graphical modelling in continuous-time: consistency guarantees and algorithms create mode 100644 data/2022/iclr/New Insights on Reducing Abrupt Representation Change in Online Continual Learning create mode 100644 data/2022/iclr/No One Representation to Rule Them All: Overlapping Features of Training Methods create mode 100644 data/2022/iclr/No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models create mode 100644 data/2022/iclr/Node Feature Extraction by Self-Supervised Multi-scale Neighborhood Prediction create mode 100644 data/2022/iclr/NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs create mode 100644 data/2022/iclr/Noisy Feature Mixup create mode 100644 data/2022/iclr/Non-Linear Operator Approximations for Initial Value Problems create mode 100644 data/2022/iclr/Non-Parallel Text Style Transfer with Self-Parallel Supervision create mode 100644 data/2022/iclr/Non-Transferable Learning: A New Approach for Model Ownership Verification and Applicability Authorization create mode 100644 data/2022/iclr/Nonlinear ICA Using Volume-Preserving Transformations create mode 100644 data/2022/iclr/Normalization of Language Embeddings for Cross-Lingual Alignment create mode 100644 data/2022/iclr/Object 
Dynamics Distillation for Scene Decomposition and Representation create mode 100644 data/2022/iclr/Object Pursuit: Building a Space of Objects via Discriminative Weight Generation create mode 100644 data/2022/iclr/Objects in Semantic Topology create mode 100644 data/2022/iclr/Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization create mode 100644 data/2022/iclr/Offline Reinforcement Learning with Implicit Q-Learning create mode 100644 data/2022/iclr/Offline Reinforcement Learning with Value-based Episodic Memory create mode 100644 data/2022/iclr/Omni-Dimensional Dynamic Convolution create mode 100644 data/2022/iclr/Omni-Scale CNNs: a simple and effective kernel size configuration for time series classification create mode 100644 data/2022/iclr/On Bridging Generic and Personalized Federated Learning for Image Classification create mode 100644 data/2022/iclr/On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning create mode 100644 data/2022/iclr/On Distributed Adaptive Optimization with Gradient Compression create mode 100644 data/2022/iclr/On Evaluation Metrics for Graph Generative Models create mode 100644 data/2022/iclr/On Improving Adversarial Transferability of Vision Transformers create mode 100644 data/2022/iclr/On Incorporating Inductive Biases into VAEs create mode 100644 data/2022/iclr/On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning create mode 100644 data/2022/iclr/On Non-Random Missing Labels in Semi-Supervised Learning create mode 100644 data/2022/iclr/On Predicting Generalization using GANs create mode 100644 data/2022/iclr/On Redundancy and Diversity in Cell-based Neural Architecture Search create mode 100644 data/2022/iclr/On Robust Prefix-Tuning for Text Classification create mode 100644 data/2022/iclr/On feature learning in neural networks with global convergence guarantees create mode 100644 data/2022/iclr/On the Certified Robustness for Ensemble Models and Beyond 
create mode 100644 data/2022/iclr/On the Connection between Local Attention and Dynamic Depth-wise Convolution
create mode 100644 data/2022/iclr/On the Convergence of Certified Robust Training with Interval Bound Propagation
create mode 100644 data/2022/iclr/On the Convergence of mSGD and AdaGrad for Stochastic Optimization
create mode 100644 data/2022/iclr/On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning
create mode 100644 data/2022/iclr/On the Existence of Universal Lottery Tickets
create mode 100644 data/2022/iclr/On the Generalization of Models Trained with SGD: Information-Theoretic Bounds and Implications
create mode 100644 data/2022/iclr/On the Importance of Difficulty Calibration in Membership Inference Attacks
create mode 100644 data/2022/iclr/On the Importance of Firth Bias Reduction in Few-Shot Classification
create mode 100644 data/2022/iclr/On the Learning and Learnability of Quasimetrics
create mode 100644 data/2022/iclr/On the Limitations of Multimodal VAEs
create mode 100644 data/2022/iclr/On the Optimal Memorization Power of ReLU Neural Networks
create mode 100644 data/2022/iclr/On the Pitfalls of Analyzing Individual Neurons in Language Models
create mode 100644 data/2022/iclr/On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks
create mode 100644 data/2022/iclr/On the Role of Neural Collapse in Transfer Learning
create mode 100644 data/2022/iclr/On the Uncomputability of Partition Functions in Energy-Based Sequence Models
create mode 100644 data/2022/iclr/On the approximation properties of recurrent encoder-decoder architectures
create mode 100644 data/2022/iclr/On the benefits of maximum likelihood estimation for Regression and Forecasting
create mode 100644 data/2022/iclr/On the relation between statistical learning and perceptual distances
create mode 100644 data/2022/iclr/On the role of population heterogeneity in emergent communication
create mode 100644 data/2022/iclr/On-Policy Model Errors in Reinforcement Learning
create mode 100644 data/2022/iclr/One After Another: Learning Incremental Skills for a Changing World
create mode 100644 data/2022/iclr/Online Ad Hoc Teamwork under Partial Observability
create mode 100644 data/2022/iclr/Online Adversarial Attacks
create mode 100644 data/2022/iclr/Online Continual Learning on Class Incremental Blurry Task Configuration with Anytime Inference
create mode 100644 data/2022/iclr/Online Coreset Selection for Rehearsal-based Continual Learning
create mode 100644 data/2022/iclr/Online Facility Location with Predictions
create mode 100644 data/2022/iclr/Online Hyperparameter Meta-Learning with Hypergradient Distillation
create mode 100644 data/2022/iclr/Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs
create mode 100644 data/2022/iclr/OntoProtein: Protein Pretraining With Gene Ontology Embedding
create mode 100644 data/2022/iclr/Open-Set Recognition: A Good Closed-Set Classifier is All You Need
create mode 100644 data/2022/iclr/Open-World Semi-Supervised Learning
create mode 100644 data/2022/iclr/Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
create mode 100644 data/2022/iclr/Optimal ANN-SNN Conversion for High-accuracy and Ultra-low-latency Spiking Neural Networks
create mode 100644 data/2022/iclr/Optimal Representations for Covariate Shift
create mode 100644 data/2022/iclr/Optimal Transport for Causal Discovery
create mode 100644 data/2022/iclr/Optimal Transport for Long-Tailed Recognition with Learnable Cost Matrix
create mode 100644 data/2022/iclr/Optimization and Adaptive Generalization of Three layer Neural Networks
create mode 100644 data/2022/iclr/Optimization inspired Multi-Branch Equilibrium Models
create mode 100644 data/2022/iclr/Optimizer Amalgamation
create mode 100644 data/2022/iclr/Optimizing Neural Networks with Gradient Lexicase Selection
create mode 100644 data/2022/iclr/Orchestrated Value Mapping for Reinforcement Learning
create mode 100644 data/2022/iclr/Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations
create mode 100644 data/2022/iclr/Overcoming The Spectral Bias of Neural Value Approximation
create mode 100644 data/2022/iclr/P-Adapters: Robustly Extracting Factual Information from Language Models with Diverse Prompts
create mode 100644 data/2022/iclr/PAC Prediction Sets Under Covariate Shift
create mode 100644 data/2022/iclr/PAC-Bayes Information Bottleneck
create mode 100644 data/2022/iclr/PEARL: Data Synthesis via Private Embeddings and Adversarial Reconstruction Learning
create mode 100644 data/2022/iclr/PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method
create mode 100644 data/2022/iclr/PF-GNN: Differentiable particle filtering based approximation of universal graph representations
create mode 100644 data/2022/iclr/PI3NN: Out-of-distribution-aware Prediction Intervals from Three Neural Networks
create mode 100644 data/2022/iclr/POETREE: Interpretable Policy Learning with Adaptive Decision Trees
create mode 100644 data/2022/iclr/PSA-GAN: Progressive Self Attention GANs for Synthetic Time Series
create mode 100644 data/2022/iclr/Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences
create mode 100644 data/2022/iclr/Pareto Policy Adaptation
create mode 100644 data/2022/iclr/Pareto Policy Pool for Model-based Offline Reinforcement Learning
create mode 100644 data/2022/iclr/Pareto Set Learning for Neural Multi-Objective Combinatorial Optimization
create mode 100644 data/2022/iclr/Partial Wasserstein Adversarial Network for Non-rigid Point Set Registration
create mode 100644 data/2022/iclr/Particle Stochastic Dual Coordinate Ascent: Exponential convergent algorithm for mean field neural network optimization
create mode 100644 data/2022/iclr/Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?
create mode 100644 data/2022/iclr/Path Auxiliary Proposal for MCMC in Discrete Space
create mode 100644 data/2022/iclr/Path Integral Sampler: A Stochastic Control Approach For Sampling
create mode 100644 data/2022/iclr/Peek-a-Boo: What (More) is Disguised in a Randomly Weighted Neural Network, and How to Find It Efficiently
create mode 100644 data/2022/iclr/Perceiver IO: A General Architecture for Structured Inputs & Outputs
create mode 100644 data/2022/iclr/Permutation Compressors for Provably Faster Distributed Nonconvex Optimization
create mode 100644 data/2022/iclr/Permutation-Based SGD: Is Random Optimal?
create mode 100644 data/2022/iclr/Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning
create mode 100644 data/2022/iclr/Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage
create mode 100644 data/2022/iclr/Phase Collapse in Neural Networks
create mode 100644 data/2022/iclr/Phenomenology of Double Descent in Finite-Width Neural Networks
create mode 100644 data/2022/iclr/PiCO: Contrastive Label Disambiguation for Partial Label Learning
create mode 100644 data/2022/iclr/PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication
create mode 100644 data/2022/iclr/Pix2seq: A Language Modeling Framework for Object Detection
create mode 100644 data/2022/iclr/Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models
create mode 100644 data/2022/iclr/Planning in Stochastic Environments with a Learned Model
create mode 100644 data/2022/iclr/Plant 'n' Seek: Can You Find the Winning Ticket?
create mode 100644 data/2022/iclr/PoNet: Pooling Network for Efficient Token Mixing in Long Sequences
create mode 100644 data/2022/iclr/Poisoning and Backdooring Contrastive Learning
create mode 100644 data/2022/iclr/Policy Gradients Incorporating the Future
create mode 100644 data/2022/iclr/Policy Smoothing for Provably Robust Reinforcement Learning
create mode 100644 data/2022/iclr/Policy improvement by planning with Gumbel
create mode 100644 data/2022/iclr/PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
create mode 100644 data/2022/iclr/Possibility Before Utility: Learning And Using Hierarchical Affordances
create mode 100644 data/2022/iclr/Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation
create mode 100644 data/2022/iclr/Post-Training Detection of Backdoor Attacks for Two-Class and Multi-Attack Scenarios
create mode 100644 data/2022/iclr/Practical Conditional Neural Process Via Tractable Dependent Predictions
create mode 100644 data/2022/iclr/Practical Integration via Separable Bijective Networks
create mode 100644 data/2022/iclr/Pre-training Molecular Graph Representation with 3D Geometry
create mode 100644 data/2022/iclr/Predicting Physics in Mesh-reduced Space with Temporal Attention
create mode 100644 data/2022/iclr/Pretrained Language Model in Continual Learning: A Comparative Study
create mode 100644 data/2022/iclr/Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators
create mode 100644 data/2022/iclr/PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
create mode 100644 data/2022/iclr/Privacy Implications of Shuffling
create mode 100644 data/2022/iclr/Probabilistic Implicit Scene Completion
create mode 100644 data/2022/iclr/Procedural generalization by planning with self-supervised world models
create mode 100644 data/2022/iclr/Programmatic Reinforcement Learning without Oracles
create mode 100644 data/2022/iclr/Progressive Distillation for Fast Sampling of Diffusion Models
create mode 100644 data/2022/iclr/Promoting Saliency From Depth: Deep Unsupervised RGB-D Saliency Detection
create mode 100644 data/2022/iclr/Proof Artifact Co-Training for Theorem Proving with Language Models
create mode 100644 data/2022/iclr/Properties from mechanisms: an equivariance perspective on identifiable representation learning
create mode 100644 data/2022/iclr/Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients
create mode 100644 data/2022/iclr/ProtoRes: Proto-Residual Network for Pose Authoring via Learned Inverse Kinematics
create mode 100644 data/2022/iclr/Prototype memory and attention mechanisms for few shot image generation
create mode 100644 data/2022/iclr/Prototypical Contrastive Predictive Coding
create mode 100644 data/2022/iclr/Provable Adaptation across Multiway Domains via Representation Learning
create mode 100644 data/2022/iclr/Provable Learning-based Algorithm For Sparse Recovery
create mode 100644 data/2022/iclr/Provably Filtering Exogenous Distractors using Multistep Inverse Dynamics
create mode 100644 data/2022/iclr/Provably Robust Adversarial Examples
create mode 100644 data/2022/iclr/Provably convergent quasistatic dynamics for mean-field two-player zero-sum games
create mode 100644 data/2022/iclr/Proving the Lottery Ticket Hypothesis for Convolutional Neural Networks
create mode 100644 data/2022/iclr/Pseudo Numerical Methods for Diffusion Models on Manifolds
create mode 100644 data/2022/iclr/Pseudo-Labeled Auto-Curriculum Learning for Semi-Supervised Keypoint Localization
create mode 100644 data/2022/iclr/Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting
create mode 100644 data/2022/iclr/QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization
create mode 100644 data/2022/iclr/Quadtree Attention for Vision Transformers
create mode 100644 data/2022/iclr/Quantitative Performance Assessment of CNN Units via Topological Entropy Calculation
create mode 100644 data/2022/iclr/Query Efficient Decision Based Sparse Attacks Against Black-Box Deep Learning Models
create mode 100644 data/2022/iclr/Query Embedding on Hyper-Relational Knowledge Graphs
create mode 100644 data/2022/iclr/R4D: Utilizing Reference Objects for Long-Range Distance Estimation
create mode 100644 data/2022/iclr/R5: Rule Discovery with Reinforced and Recurrent Relational Reasoning
create mode 100644 data/2022/iclr/RISP: Rendering-Invariant State Predictor with Differentiable Simulation and Rendering for Cross-Domain Parameter Estimation
create mode 100644 data/2022/iclr/Random matrices in service of ML footprint: ternary random features with no performance loss
create mode 100644 data/2022/iclr/Real-Time Neural Voice Camouflage
create mode 100644 data/2022/iclr/Recursive Disentanglement Network
create mode 100644 data/2022/iclr/Recycling Model Updates in Federated Learning: Are Gradient Subspaces Low-Rank?
create mode 100644 data/2022/iclr/Reducing Excessive Margin to Achieve a Better Accuracy vs. Robustness Trade-off
create mode 100644 data/2022/iclr/RegionViT: Regional-to-Local Attention for Vision Transformers
create mode 100644 data/2022/iclr/Regularized Autoencoders for Isometric Representation Learning
create mode 100644 data/2022/iclr/Reinforcement Learning in Presence of Discrete Markovian Context Evolution
create mode 100644 data/2022/iclr/Reinforcement Learning under a Multi-agent Predictive State Representation Model: Method and Theory
create mode 100644 data/2022/iclr/Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration
create mode 100644 data/2022/iclr/RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
create mode 100644 data/2022/iclr/Relating transformers to models and neural representations of the hippocampal formation
create mode 100644 data/2022/iclr/Relational Learning with Variational Bayes
create mode 100644 data/2022/iclr/Relational Multi-Task Learning: Modeling Relations between Data and Tasks
create mode 100644 data/2022/iclr/Relational Surrogate Loss Learning
create mode 100644 data/2022/iclr/RelaxLoss: Defending Membership Inference Attacks without Losing Utility
create mode 100644 data/2022/iclr/Reliable Adversarial Distillation with Unreliable Teachers
create mode 100644 data/2022/iclr/Representation Learning for Online and Offline RL in Low-rank MDPs
create mode 100644 data/2022/iclr/Representation-Agnostic Shape Fields
create mode 100644 data/2022/iclr/Representational Continuity for Unsupervised Continual Learning
create mode 100644 data/2022/iclr/Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings
create mode 100644 data/2022/iclr/Resolving Training Biases via Influence-based Data Relabeling
create mode 100644 data/2022/iclr/Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD with Momentum
create mode 100644 data/2022/iclr/Responsible Disclosure of Generative Models Using Scalable Fingerprinting
create mode 100644 data/2022/iclr/Rethinking Adversarial Transferability from a Data Distribution Perspective
create mode 100644 data/2022/iclr/Rethinking Class-Prior Estimation for Positive-Unlabeled Learning
create mode 100644 data/2022/iclr/Rethinking Goal-Conditioned Supervised Learning and Its Connection to Offline RL
create mode 100644 data/2022/iclr/Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework
create mode 100644 data/2022/iclr/Rethinking Supervised Pre-Training for Better Downstream Transferring
create mode 100644 data/2022/iclr/Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph
create mode 100644 data/2022/iclr/Reverse Engineering of Imperceptible Adversarial Image Perturbations
create mode 100644 data/2022/iclr/Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift
create mode 100644 data/2022/iclr/Revisit Kernel Pruning with Lottery Regulated Grouped Convolutions
create mode 100644 data/2022/iclr/Revisiting Design Choices in Offline Model Based Reinforcement Learning
create mode 100644 data/2022/iclr/Revisiting Over-smoothing in BERT from the Perspective of Graph
create mode 100644 data/2022/iclr/Revisiting flow generative models for Out-of-distribution detection
create mode 100644 data/2022/iclr/Reward Uncertainty for Exploration in Preference-based Reinforcement Learning
create mode 100644 data/2022/iclr/Robbing the Fed: Directly Obtaining Private Data in Federated Learning with Modified Models
create mode 100644 data/2022/iclr/Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness?
create mode 100644 data/2022/iclr/Robust Unlearnable Examples: Protecting Data Privacy Against Adversarial Learning
create mode 100644 data/2022/iclr/Robust and Scalable SDE Learning: A Functional Perspective
create mode 100644 data/2022/iclr/RotoGrad: Gradient Homogenization in Multitask Learning
create mode 100644 data/2022/iclr/RvS: What is Essential for Offline RL via Supervised Learning?
create mode 100644 data/2022/iclr/SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
create mode 100644 data/2022/iclr/SGD Can Converge to Local Maxima
create mode 100644 data/2022/iclr/SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models
create mode 100644 data/2022/iclr/SOSP: Efficiently Capturing Global Correlations by Second-Order Structured Pruning
create mode 100644 data/2022/iclr/SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training
create mode 100644 data/2022/iclr/SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation
create mode 100644 data/2022/iclr/SUMNAS: Supernet with Unbiased Meta-Features for Neural Architecture Search
create mode 100644 data/2022/iclr/SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning
create mode 100644 data/2022/iclr/Safe Neurosymbolic Learning with Differentiable Symbolic Execution
create mode 100644 data/2022/iclr/Salient ImageNet: How to discover spurious features in Deep Learning?
create mode 100644 data/2022/iclr/Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation
create mode 100644 data/2022/iclr/Sample Efficient Stochastic Policy Extragradient Algorithm for Zero-Sum Markov Game
create mode 100644 data/2022/iclr/Sample Selection with Uncertainty of Losses for Learning with Noisy Labels
create mode 100644 data/2022/iclr/Sample and Computation Redistribution for Efficient Face Detection
create mode 100644 data/2022/iclr/Sampling with Mirrored Stein Operators
create mode 100644 data/2022/iclr/Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation
create mode 100644 data/2022/iclr/Scalable Sampling for Nonsymmetric Determinantal Point Processes
create mode 100644 data/2022/iclr/Scale Efficiently: Insights from Pretraining and Finetuning Transformers
create mode 100644 data/2022/iclr/Scale Mixtures of Neural Network Gaussian Processes
create mode 100644 data/2022/iclr/Scaling Laws for Neural Machine Translation
create mode 100644 data/2022/iclr/Scarf: Self-Supervised Contrastive Learning using Random Feature Corruption
create mode 100644 data/2022/iclr/Scattering Networks on the Sphere for Scalable and Rotationally Equivariant Spherical CNNs
create mode 100644 data/2022/iclr/Scene Transformer: A unified architecture for predicting future trajectories of multiple agents
create mode 100644 data/2022/iclr/Score-Based Generative Modeling with Critically-Damped Langevin Diffusion
create mode 100644 data/2022/iclr/Selective Ensembles for Consistent Predictions
create mode 100644 data/2022/iclr/Self-Joint Supervised Learning
create mode 100644 data/2022/iclr/Self-Supervised Graph Neural Networks for Improved Electroencephalographic Seizure Analysis
create mode 100644 data/2022/iclr/Self-Supervised Inference in State-Space Models
create mode 100644 data/2022/iclr/Self-Supervision Enhanced Feature Selection with Correlated Gates
create mode 100644 data/2022/iclr/Self-ensemble Adversarial Training for Improved Robustness
create mode 100644 data/2022/iclr/Self-supervised Learning is More Robust to Dataset Imbalance
create mode 100644 data/2022/iclr/Semi-relaxed Gromov-Wasserstein divergence and applications on graphs
create mode 100644 data/2022/iclr/Sequence Approximation using Feedforward Spiking Neural Network for Spatiotemporal Learning: Theory and Optimization Methods
create mode 100644 data/2022/iclr/Sequential Reptile: Inter-Task Gradient Alignment for Multilingual Learning
create mode 100644 data/2022/iclr/Shallow and Deep Networks are Near-Optimal Approximators of Korobov Functions
create mode 100644 data/2022/iclr/Should I Run Offline Reinforcement Learning or Behavioral Cloning?
create mode 100644 data/2022/iclr/Should We Be Pre-training? An Argument for End-task Aware Training as an Alternative
create mode 100644 data/2022/iclr/Shuffle Private Stochastic Convex Optimization
create mode 100644 data/2022/iclr/Signing the Supermask: Keep, Hide, Invert
create mode 100644 data/2022/iclr/SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
create mode 100644 data/2022/iclr/Simple GNN Regularisation for 3D Molecular Property Prediction and Beyond
create mode 100644 data/2022/iclr/SketchODE: Learning neural sketch representation in continuous time
create mode 100644 data/2022/iclr/Skill-based Meta-Reinforcement Learning
create mode 100644 data/2022/iclr/Solving Inverse Problems in Medical Imaging with Score-Based Generative Models
create mode 100644 data/2022/iclr/Sound Adversarial Audio-Visual Navigation
create mode 100644 data/2022/iclr/Sound and Complete Neural Network Repair with Minimality and Locality Guarantees
create mode 100644 data/2022/iclr/Source-Free Adaptation to Measurement Shift via Bottom-Up Feature Restoration
create mode 100644 data/2022/iclr/Space-Time Graph Neural Networks
create mode 100644 data/2022/iclr/Spanning Tree-based Graph Generation for Molecules
create mode 100644 data/2022/iclr/Sparse Attention with Learning to Hash
create mode 100644 data/2022/iclr/Sparse Communication via Mixed Distributions
create mode 100644 data/2022/iclr/Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity
create mode 100644 data/2022/iclr/Sparsity Winning Twice: Better Robust Generalization from More Efficient Training
create mode 100644 data/2022/iclr/Spatial Graph Attention and Curiosity-driven Policy for Antiviral Drug Discovery
create mode 100644 data/2022/iclr/SphereFace2: Binary Classification is All You Need for Deep Face Recognition
create mode 100644 data/2022/iclr/Spherical Message Passing for 3D Molecular Graphs
create mode 100644 data/2022/iclr/Spike-inspired rank coding for fast and accurate recurrent neural networks
create mode 100644 data/2022/iclr/Spread Spurious Attribute: Improving Worst-group Accuracy with Spurious Attribute Estimation
create mode 100644 data/2022/iclr/Sqrt(d) Dimension Dependence of Langevin Monte Carlo
create mode 100644 data/2022/iclr/Stability Regularization for Discrete Representation Learning
create mode 100644 data/2022/iclr/Steerable Partial Differential Operators for Equivariant Neural Networks
create mode 100644 data/2022/iclr/Stein Latent Optimization for Generative Adversarial Networks
create mode 100644 data/2022/iclr/Step-unrolled Denoising Autoencoders for Text Generation
create mode 100644 data/2022/iclr/Stiffness-aware neural network for learning Hamiltonian systems
create mode 100644 data/2022/iclr/Stochastic Training is Not Necessary for Generalization
create mode 100644 data/2022/iclr/Strength of Minibatch Noise in SGD
create mode 100644 data/2022/iclr/Structure-Aware Transformer Policy for Inhomogeneous Multi-Task Reinforcement Learning
create mode 100644 data/2022/iclr/StyleAlign: Analysis and Applications of Aligned StyleGAN Models
create mode 100644 data/2022/iclr/StyleNeRF: A Style-based 3D Aware Generator for High-resolution Image Synthesis
create mode 100644 data/2022/iclr/Subspace Regularizers for Few-Shot Class Incremental Learning
create mode 100644 data/2022/iclr/Superclass-Conditional Gaussian Mixture Model For Learning Fine-Grained Embeddings
create mode 100644 data/2022/iclr/Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
create mode 100644 data/2022/iclr/Surreal-GAN: Semi-Supervised Representation Learning via GAN for uncovering heterogeneous disease-related imaging patterns
create mode 100644 data/2022/iclr/Surrogate Gap Minimization Improves Sharpness-Aware Training
create mode 100644 data/2022/iclr/Surrogate NAS Benchmarks: Going Beyond the Limited Search Spaces of Tabular NAS Benchmarks
create mode 100644 data/2022/iclr/Switch to Generalize: Domain-Switch Learning for Cross-Domain Few-Shot Classification
create mode 100644 data/2022/iclr/Symbolic Learning to Optimize: Towards Interpretability and Scalability
create mode 100644 data/2022/iclr/Synchromesh: Reliable Code Generation from Pre-trained Language Models
create mode 100644 data/2022/iclr/T-WaveNet: A Tree-Structured Wavelet Neural Network for Time Series Signal Analysis
create mode 100644 data/2022/iclr/TAMP-S2GCNets: Coupling Time-Aware Multipersistence Knowledge Representation with Spatio-Supra Graph Convolutional Networks for Time-Series Forecasting
create mode 100644 data/2022/iclr/TAPEX: Table Pre-training via Learning a Neural SQL Executor
create mode 100644 data/2022/iclr/TAda! Temporally-Adaptive Convolutions for Video Understanding
create mode 100644 data/2022/iclr/THOMAS: Trajectory Heatmap Output with learned Multi-Agent Sampling
create mode 100644 data/2022/iclr/TPU-GAN: Learning temporal coherence from dynamic point cloud sequences
create mode 100644 data/2022/iclr/TRAIL: Near-Optimal Imitation Learning with Suboptimal Data
create mode 100644 data/2022/iclr/TRGP: Trust Region Gradient Projection for Continual Learning
create mode 100644 data/2022/iclr/Tackling the Generative Learning Trilemma with Denoising Diffusion GANs
create mode 100644 data/2022/iclr/Taming Sparsely Activated Transformer with Stochastic Experts
create mode 100644 data/2022/iclr/Target-Side Input Augmentation for Sequence to Sequence Generation
create mode 100644 data/2022/iclr/Task Affinity with Maximum Bipartite Matching in Few-Shot Learning
create mode 100644 data/2022/iclr/Task Relatedness-Based Generalization Bounds for Meta Learning
create mode 100644 data/2022/iclr/Task-Induced Representation Learning
create mode 100644 data/2022/iclr/Temporal Alignment Prediction for Supervised Representation Learning and Few-Shot Sequence Classification
create mode 100644 data/2022/iclr/Temporal Efficient Training of Spiking Neural Network via Gradient Re-weighting
create mode 100644 data/2022/iclr/The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models
create mode 100644 data/2022/iclr/The Close Relationship Between Contrastive Learning and Meta-Learning
create mode 100644 data/2022/iclr/The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program
create mode 100644 data/2022/iclr/The Effects of Invertibility on the Representational Complexity of Encoders in Variational Autoencoders
create mode 100644 data/2022/iclr/The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
create mode 100644 data/2022/iclr/The Efficiency Misnomer
create mode 100644 data/2022/iclr/The Evolution of Uncertainty of Learning in Games
create mode 100644 data/2022/iclr/The Geometry of Memoryless Stochastic Policy Optimization in Infinite-Horizon POMDPs
create mode 100644 data/2022/iclr/The Hidden Convex Optimization Landscape of Regularized Two-Layer ReLU Networks: an Exact Characterization of Optimal Solutions
create mode 100644 data/2022/iclr/The Inductive Bias of In-Context Learning: Rethinking Pretraining Example Design
create mode 100644 data/2022/iclr/The Information Geometry of Unsupervised Reinforcement Learning
create mode 100644 data/2022/iclr/The MultiBERTs: BERT Reproductions for Robustness Analysis
create mode 100644 data/2022/iclr/The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization
create mode 100644 data/2022/iclr/The Rich Get Richer: Disparate Impact of Semi-Supervised Learning
create mode 100644 data/2022/iclr/The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks
create mode 100644 data/2022/iclr/The Role of Pretrained Representations for the OOD Generalization of RL Agents
create mode 100644 data/2022/iclr/The Spectral Bias of Polynomial Neural Networks
create mode 100644 data/2022/iclr/The Three Stages of Learning Dynamics in High-dimensional Kernel Methods
create mode 100644 data/2022/iclr/The Uncanny Similarity of Recurrence and Depth
create mode 100644 data/2022/iclr/The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training
create mode 100644 data/2022/iclr/Tighter Sparse Approximation Bounds for ReLU Neural Networks
create mode 100644 data/2022/iclr/ToM2C: Target-oriented Multi-agent Communication and Cooperation with Theory of Mind
create mode 100644 data/2022/iclr/Top-N: Equivariant Set and Graph Generation without Exchangeability
create mode 100644 data/2022/iclr/Top-label calibration and multiclass-to-binary reductions
create mode 100644 data/2022/iclr/Topological Experience Replay
create mode 100644 data/2022/iclr/Topological Graph Neural Networks
create mode 100644 data/2022/iclr/Topologically Regularized Data Embeddings
create mode 100644 data/2022/iclr/Toward Efficient Low-Precision Training: Data Format Optimization and Hysteresis Quantization
create mode 100644 data/2022/iclr/Toward Faithful Case-based Reasoning through Learning Prototypes in a Nearest Neighbor-friendly Space
create mode 100644 data/2022/iclr/Towards Better Understanding and Better Generalization of Low-shot Classification in Histology Images with Contrastive Learning
create mode 100644 data/2022/iclr/Towards Building A Group-based Unsupervised Representation Disentanglement Framework
create mode 100644 data/2022/iclr/Towards Continual Knowledge Learning of Language Models
create mode 100644 data/2022/iclr/Towards Deepening Graph Neural Networks: A GNTK-based Optimization Perspective
create mode 100644 data/2022/iclr/Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality
create mode 100644 data/2022/iclr/Towards Empirical Sandwich Bounds on the Rate-Distortion Function
create mode 100644 data/2022/iclr/Towards Evaluating the Robustness of Neural Networks Learned by Transduction
create mode 100644 data/2022/iclr/Towards General Function Approximation in Zero-Sum Markov Games
create mode 100644 data/2022/iclr/Towards Model Agnostic Federated Learning Using Knowledge Distillation
create mode 100644 data/2022/iclr/Towards Training Billion Parameter Graph Neural Networks for Atomic Simulations
create mode 100644 data/2022/iclr/Towards Understanding Generalization via Decomposing Excess Risk Dynamics
create mode 100644 data/2022/iclr/Towards Understanding the Data Dependency of Mixup-style Training
create mode 100644 data/2022/iclr/Towards Understanding the Robustness Against Evasion Attack on Categorical Data
create mode 100644 data/2022/iclr/Towards a Unified View of Parameter-Efficient Transfer Learning
create mode 100644 data/2022/iclr/Tracking the risk of a deployed model and detecting harmful distribution shifts
create mode 100644 data/2022/iclr/Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
create mode 100644 data/2022/iclr/Training Data Generating Networks: Shape Reconstruction via Bi-level Optimization
create mode 100644 data/2022/iclr/Training Structured Neural Networks Through Manifold Identification and Variance Reduction
create mode 100644 data/2022/iclr/Training Transition Policies via Distribution Matching for Complex Tasks
create mode 100644 data/2022/iclr/Training invariances and the low-rank phenomenon: beyond linear networks
create mode 100644 data/2022/iclr/Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations
create mode 100644 data/2022/iclr/Transfer RL across Observation Feature Spaces via Model-Based Regularization
create mode 100644 data/2022/iclr/Transferable Adversarial Attack based on Integrated Gradients
create mode 100644 data/2022/iclr/Transform2Act: Learning a Transform-and-Control Policy for Efficient Agent Design
create mode 100644 data/2022/iclr/Transformer Embeddings of Irregularly Spaced Events and Their Participants
create mode 100644 data/2022/iclr/Transformer-based Transform Coding
create mode 100644 data/2022/iclr/Transformers Can Do Bayesian Inference
create mode 100644 data/2022/iclr/Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models
create mode 100644 data/2022/iclr/Triangle and Four Cycle Counting with Predictions in Graph Streams
create mode 100644 data/2022/iclr/Trigger Hunting with a Topological Prior for Trojan Detection
create mode 100644 data/2022/iclr/Trivial or Impossible --- dichotomous data difficulty masks model differences (on ImageNet and beyond)
create mode 100644 data/2022/iclr/Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning
create mode 100644 data/2022/iclr/Tuformer: Data-driven Design of Transformers for Improved Generalization or Efficiency
create mode 100644 data/2022/iclr/Uncertainty Modeling for Out-of-Distribution Generalization
create mode 100644 data/2022/iclr/Understanding Dimensional Collapse in Contrastive Self-supervised Learning
create mode 100644 data/2022/iclr/Understanding Domain Randomization for Sim-to-real Transfer
create mode 100644 data/2022/iclr/Understanding Intrinsic Robustness Using Label Uncertainty
create mode 100644 data/2022/iclr/Understanding Latent Correlation-Based Multiview Learning and Self-Supervision: An Identifiability Perspective
create mode 100644 data/2022/iclr/Understanding and Improving Graph Injection Attack by Promoting Unnoticeability
create mode 100644 data/2022/iclr/Understanding and Leveraging Overparameterization in Recursive Value Estimation
create mode 100644 data/2022/iclr/Understanding and Preventing Capacity Loss in Reinforcement Learning
create mode 100644 data/2022/iclr/Understanding approximate and unrolled dictionary learning for pattern recovery
create mode 100644 data/2022/iclr/Understanding over-squashing and bottlenecks on graphs via curvature
create mode 100644 data/2022/iclr/Understanding the Role of Self Attention for Efficient Speech Recognition
create mode 100644 data/2022/iclr/Understanding the Variance Collapse of SVGD in High Dimensions
create mode 100644 data/2022/iclr/UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning
create mode 100644 data/2022/iclr/Unified Visual Transformer Compression
create mode 100644 data/2022/iclr/Unifying Likelihood-free Inference with Black-box Optimization and Beyond
create mode 100644 data/2022/iclr/Universal Approximation Under Constraints is Possible with Transformers
create mode 100644 data/2022/iclr/Universalizing Weak Supervision
create mode 100644 data/2022/iclr/Unraveling Model-Agnostic Meta-Learning via The Adaptation Learning Rate
create mode 100644 data/2022/iclr/Unrolling PALM for Sparse Semi-Blind Source Separation
create mode 100644 data/2022/iclr/Unsupervised Discovery of Object Radiance Fields create mode 100644 data/2022/iclr/Unsupervised Disentanglement with Tensor Product Representations on the Torus create mode 100644 data/2022/iclr/Unsupervised Learning of Full-Waveform Inversion: Connecting CNN and Partial Differential Equation in a Loop create mode 100644 data/2022/iclr/Unsupervised Semantic Segmentation by Distilling Feature Correspondences create mode 100644 data/2022/iclr/Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling create mode 100644 data/2022/iclr/Using Graph Representation Learning with Schema Encoders to Measure the Severity of Depressive Symptoms create mode 100644 data/2022/iclr/VAE Approximation Error: ELBO and Exponential Families create mode 100644 data/2022/iclr/VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects create mode 100644 data/2022/iclr/VC dimension of partially quantized neural networks in the overparametrized regime create mode 100644 data/2022/iclr/VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning create mode 100644 data/2022/iclr/VOS: Learning What You Don't Know by Virtual Outlier Synthesis create mode 100644 data/2022/iclr/Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning create mode 100644 data/2022/iclr/Value Gradient weighted Model-Based Reinforcement Learning create mode 100644 data/2022/iclr/Variational Inference for Discriminative Learning with Generative Modeling of Feature Incompletion create mode 100644 data/2022/iclr/Variational Neural Cellular Automata create mode 100644 data/2022/iclr/Variational Predictive Routing with Nested Subjective Timescales create mode 100644 data/2022/iclr/Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias create mode 100644 data/2022/iclr/Variational methods for simulation-based inference create mode 100644 
data/2022/iclr/Variational oracle guiding for reinforcement learning create mode 100644 data/2022/iclr/Vector-quantized Image Modeling with Improved VQGAN create mode 100644 data/2022/iclr/ViDT: An Efficient and Effective Fully Transformer-based Object Detector create mode 100644 data/2022/iclr/ViTGAN: Training GANs with Vision Transformers create mode 100644 data/2022/iclr/Vision-Based Manipulators Need to Also See from Their Hands create mode 100644 data/2022/iclr/Visual Correspondence Hallucination create mode 100644 data/2022/iclr/Visual Representation Learning Does Not Generalize Strongly Within the Same Domain create mode 100644 data/2022/iclr/Visual Representation Learning over Latent Domains create mode 100644 data/2022/iclr/Visual hyperacuity with moving sensor and recurrent neural computations create mode 100644 data/2022/iclr/Vitruvion: A Generative Model of Parametric CAD Sketches create mode 100644 data/2022/iclr/W-CTC: a Connectionist Temporal Classification Loss with Wild Cards create mode 100644 data/2022/iclr/WeakM3D: Towards Weakly Supervised Monocular 3D Object Detection create mode 100644 data/2022/iclr/What Do We Mean by Generalization in Federated Learning? create mode 100644 data/2022/iclr/What Happens after SGD Reaches Zero Loss? --A Mathematical Framework create mode 100644 data/2022/iclr/What Makes Better Augmentation Strategies? Augment Difficult but Not too Different create mode 100644 data/2022/iclr/What's Wrong with Deep Learning in Tree Search for Combinatorial Optimization create mode 100644 data/2022/iclr/When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently? create mode 100644 data/2022/iclr/When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations create mode 100644 data/2022/iclr/When should agents explore? create mode 100644 data/2022/iclr/When, Why, and Which Pretrained GANs Are Useful? 
create mode 100644 data/2022/iclr/Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space Perspective create mode 100644 data/2022/iclr/Who Is Your Right Mixup Partner in Positive and Unlabeled Learning create mode 100644 data/2022/iclr/Who Is the Strongest Enemy? Towards Optimal and Efficient Evasion Attacks in Deep RL create mode 100644 data/2022/iclr/Why Propagate Alone? Parallel Use of Labels and Features on Graphs create mode 100644 data/2022/iclr/Wiring Up Vision: Minimizing Supervised Synaptic Updates Needed to Produce a Primate Ventral Stream create mode 100644 data/2022/iclr/Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models create mode 100644 data/2022/iclr/Wish you were here: Hindsight Goal Selection for long-horizon dexterous manipulation create mode 100644 data/2022/iclr/X-model: Improving Data Efficiency in Deep Learning with A Minimax Model create mode 100644 data/2022/iclr/You Mostly Walk Alone: Analyzing Feature Attribution in Trajectory Prediction create mode 100644 data/2022/iclr/You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks create mode 100644 data/2022/iclr/Zero Pixel Directional Boundary by Vector Transform create mode 100644 data/2022/iclr/Zero-CL: Instance and Feature decorrelation for negative-free symmetric contrastive learning create mode 100644 data/2022/iclr/Zero-Shot Self-Supervised Learning for MRI Reconstruction create mode 100644 data/2022/iclr/ZeroFL: Efficient On-Device Training for Federated Learning with Local Sparsity create mode 100644 data/2022/iclr/cosFormer: Rethinking Softmax In Attention create mode 100644 data/2022/iclr/iFlood: A Stable and Effective Regularizer create mode 100644 data/2022/iclr/iLQR-VAE : control-based learning of input-driven dynamics with applications to neural data create mode 100644 data/2022/iclr/miniF2F: a cross-system benchmark for formal Olympiad-level mathematics create mode 100644 data/2022/iclr/switch-GLAT: Multilingual 
Parallel Machine Translation Via Code-Switch Decoder create mode 100644 data/2023/iclr/A Multi-Grained Self-Interpretable Symbolic-Neural Model For Single Multi-Labeled Text Classification create mode 100644 data/2023/iclr/A Unified Framework for Soft Threshold Pruning create mode 100644 data/2023/iclr/Achieve the Minimum Width of Neural Networks for Universal Approximation create mode 100644 data/2023/iclr/BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection create mode 100644 data/2023/iclr/Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining create mode 100644 data/2023/iclr/Continuous-Discrete Convolution for Geometry-Sequence Modeling in Proteins create mode 100644 data/2023/iclr/DAG Matters! GFlowNets Enhanced Explainer for Graph Neural Networks create mode 100644 data/2023/iclr/Delving into Semantic Scale Imbalance create mode 100644 data/2023/iclr/Diagnosing and Rectifying Vision Models using Language create mode 100644 data/2023/iclr/Diversify and Disambiguate: Out-of-Distribution Robustness via Disagreement create mode 100644 data/2023/iclr/DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training create mode 100644 data/2023/iclr/DualAfford: Learning Collaborative Visual Affordance for Dual-gripper Manipulation create mode 100644 data/2023/iclr/Guiding Safe Exploration with Weakest Preconditions create mode 100644 data/2023/iclr/H2RBox: Horizontal Box Annotation is All You Need for Oriented Object Detection create mode 100644 data/2023/iclr/Harnessing Out-Of-Distribution Examples via Augmenting Content and Style create mode 100644 data/2023/iclr/IDEAL: Query-Efficient Data-Free Learning from Black-Box Models create mode 100644 data/2023/iclr/Learning Domain-Agnostic Representation for Disease Diagnosis create mode 100644 data/2023/iclr/Logical Entity Representation in Knowledge-Graphs for Differentiable Rule Learning create mode 100644 
data/2023/iclr/Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching create mode 100644 data/2023/iclr/On amortizing convex conjugates for optimal transport create mode 100644 data/2023/iclr/Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning create mode 100644 data/2023/iclr/Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: Graphcore create mode 100644 data/2023/iclr/Representation Learning for Low-rank General-sum Markov Games create mode 100644 data/2023/iclr/SIMPLE: Specialized Model-Sample Matching for Domain Generalization create mode 100644 data/2023/iclr/Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation create mode 100644 data/2023/iclr/Surgical Fine-Tuning Improves Adaptation to Distribution Shifts create mode 100644 data/2023/iclr/TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding create mode 100644 data/2023/iclr/The Augmented Image Prior: Distilling 1000 Classes by Extrapolating from a Single Image create mode 100644 data/2023/iclr/Trainability Preserving Neural Pruning diff --git a/data/2020/iclr/A Constructive Prediction of the Generalization Error Across Scales b/data/2020/iclr/A Constructive Prediction of the Generalization Error Across Scales new file mode 100644 index 0000000000..80d725bd5c --- /dev/null +++ b/data/2020/iclr/A Constructive Prediction of the Generalization Error Across Scales @@ -0,0 +1 @@ +The dependency of the generalization error of neural networks on model and dataset size is of critical importance both in practice and for understanding the theory of neural networks. Nevertheless, the functional form of this dependency remains elusive. In this work, we present a functional form which approximates well the generalization error in practice. 
Capitalizing on the successful concept of model scaling (e.g., width, depth), we are able to simultaneously construct such a form and specify the exact models which can attain it across model/data scales. Our construction follows insights obtained from observations conducted over a range of model/data scales, in various model types and datasets, in vision and language tasks. We show that the form both fits the observations well across scales, and provides accurate predictions from small- to large-scale models and data. \ No newline at end of file diff --git a/data/2020/iclr/A Fair Comparison of Graph Neural Networks for Graph Classification b/data/2020/iclr/A Fair Comparison of Graph Neural Networks for Graph Classification new file mode 100644 index 0000000000..9e2ebcee28 --- /dev/null +++ b/data/2020/iclr/A Fair Comparison of Graph Neural Networks for Graph Classification @@ -0,0 +1 @@ +Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about the lack of both in scientific publications, in an effort to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigor and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines we provide convincing evidence that, on some datasets, structural information has not been exploited yet. 
We believe that this work can contribute to the development of the graph learning field, by providing a much needed grounding for rigorous evaluations of graph classification models. \ No newline at end of file diff --git a/data/2020/iclr/A Learning-based Iterative Method for Solving Vehicle Routing Problems b/data/2020/iclr/A Learning-based Iterative Method for Solving Vehicle Routing Problems new file mode 100644 index 0000000000..916201a647 --- /dev/null +++ b/data/2020/iclr/A Learning-based Iterative Method for Solving Vehicle Routing Problems @@ -0,0 +1 @@ +This paper is concerned with solving combinatorial optimization problems, in particular, the capacitated vehicle routing problems (CVRP). Classical Operations Research (OR) algorithms such as LKH3 (Helsgaun, 2017) are extremely inefficient (e.g., 13 hours on CVRP of only size 100) and difficult to scale to larger-size problems. Machine learning based approaches have recently been shown to be promising, partly because of their efficiency (once trained, they can perform solving within minutes or even seconds). However, there is still a considerable gap between the quality of a machine learned solution and what OR methods can offer (e.g., on CVRP-100, the best result of learned solutions is between 16.10-16.80, significantly worse than LKH3's 15.65). In this paper, we present the first learning based approach for CVRP that is efficient in solving speed and at the same time outperforms OR methods. Starting with a random initial solution, our algorithm learns to iteratively refine the solution with an improvement operator, selected by a reinforcement learning based controller. The improvement operator is selected from a pool of powerful operators that are customized for routing problems. By combining the strengths of the two worlds, our approach achieves new state-of-the-art results on CVRP, e.g., an average cost of 15.57 on CVRP-100. 
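The learn-to-improve scheme described in the CVRP abstract above (start from a random solution, then repeatedly apply an improvement operator) can be illustrated in a few lines. The sketch below is purely illustrative and not the paper's method: a classical 2-opt reversal operator with a greedy acceptance rule stands in for the learned, RL-selected operator pool, and a plain TSP-style tour stands in for a capacitated routing solution.

```python
import math
import random

def tour_length(points, tour):
    """Total length of a closed tour over 2-D points."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def two_opt_step(points, tour):
    """Apply the first improving 2-opt segment reversal; return (tour, improved?)."""
    best = tour_length(points, tour)
    n = len(tour)
    for i in range(n - 1):
        for j in range(i + 2, n):
            # Reverse the segment between positions i+1 and j (a 2-opt move).
            cand = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
            if tour_length(points, cand) < best:
                return cand, True
    return tour, False

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(12)]
tour = list(range(12))           # random initial solution
improved = True
while improved:                  # iterative refinement loop
    tour, improved = two_opt_step(pts, tour)
```

In the paper, the choice of which operator to apply would be made by a learned controller rather than this fixed greedy rule.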
\ No newline at end of file diff --git a/data/2020/iclr/A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning b/data/2020/iclr/A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning new file mode 100644 index 0000000000..2790a87f00 --- /dev/null +++ b/data/2020/iclr/A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning @@ -0,0 +1 @@ +Due to insufficient training data and the high computational cost to train a deep neural network from scratch, transfer learning has been extensively used in many deep-neural-network-based applications. A commonly used transfer learning approach involves taking a part of a pre-trained model, adding a few layers at the end, and re-training the new layers with a small dataset. This approach, while efficient and widely used, introduces a security vulnerability because the pre-trained model used in transfer learning is usually publicly available, including to potential attackers. In this paper, we show that without any additional knowledge other than the pre-trained model, an attacker can launch an effective and efficient brute force attack that can craft instances of input to trigger each target class with high confidence. We assume that the attacker has no access to any target-specific information, including samples from target classes, the re-trained model, and the probabilities assigned by Softmax to each class, thus making the attack target-agnostic. These assumptions render all previous attack models inapplicable, to the best of our knowledge. To evaluate the proposed attack, we perform a set of experiments on face recognition and speech recognition tasks and show the effectiveness of the attack. Our work reveals a fundamental security weakness of the Softmax layer when used in transfer learning settings. 
\ No newline at end of file diff --git a/data/2020/iclr/A Theoretical Analysis of the Number of Shots in Few-Shot Learning b/data/2020/iclr/A Theoretical Analysis of the Number of Shots in Few-Shot Learning new file mode 100644 index 0000000000..70348df253 --- /dev/null +++ b/data/2020/iclr/A Theoretical Analysis of the Number of Shots in Few-Shot Learning @@ -0,0 +1 @@ +Few-shot classification is the task of predicting the category of an example from a set of few labeled examples. The number of labeled examples per category is called the number of shots (or shot number). Recent works tackle this task through meta-learning, where a meta-learner extracts information from observed tasks during meta-training to quickly adapt to new tasks during meta-testing. In this formulation, the number of shots exploited during meta-training has an impact on the recognition performance at meta-test time. Generally, the shot number used in meta-training should match the one used in meta-testing to obtain the best performance. We introduce a theoretical analysis of the impact of the shot number on Prototypical Networks, a state-of-the-art few-shot classification method. From our analysis, we propose a simple method that is robust to the choice of shot number used during meta-training, which is a crucial hyperparameter. Our model, trained with an arbitrary meta-training shot number, performs well across different values of the meta-testing shot number. We experimentally demonstrate our approach on different few-shot classification benchmarks. 
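Since the analysis above targets Prototypical Networks, a minimal sketch of that method's classification rule may help: each class prototype is the mean embedding of that class's shot examples, and each query is assigned to its nearest prototype. This is a generic NumPy illustration on made-up toy embeddings; all names and data here are assumptions, not from the paper.

```python
import numpy as np

def prototypes(support_x, support_y):
    """Class prototypes = mean embedding of each class's support (shot) examples."""
    classes = np.unique(support_y)
    return classes, np.stack([support_x[support_y == c].mean(axis=0) for c in classes])

def classify(query_x, classes, protos):
    """Assign each query to the class of its nearest prototype (squared Euclidean)."""
    d = ((query_x[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

rng = np.random.default_rng(0)
# Toy episode: 2 classes, 5 shots each, in a 4-d embedding space.
support_x = np.concatenate([rng.normal(0, 0.1, (5, 4)), rng.normal(1, 0.1, (5, 4))])
support_y = np.array([0] * 5 + [1] * 5)
classes, protos = prototypes(support_x, support_y)
pred = classify(np.array([[0.0, 0, 0, 0], [1.0, 1, 1, 1]]), classes, protos)
```

The shot number enters exactly through how many support examples are averaged into each prototype, which is the quantity the paper's analysis studies.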
\ No newline at end of file diff --git a/data/2020/iclr/A critical analysis of self-supervision, or what we can learn from a single image b/data/2020/iclr/A critical analysis of self-supervision, or what we can learn from a single image new file mode 100644 index 0000000000..7f44c17d0e --- /dev/null +++ b/data/2020/iclr/A critical analysis of self-supervision, or what we can learn from a single image @@ -0,0 +1 @@ +We look critically at popular self-supervision techniques for learning deep convolutional neural networks without manual labels. We show that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used. However, for deeper layers the gap with manual supervision cannot be closed even if millions of unlabelled images are used for training. We conclude that: (1) the weights of the early layers of deep networks contain limited information about the statistics of natural images, that (2) such low-level statistics can be learned through self-supervision just as well as through strong supervision, and that (3) the low-level statistics can be captured via synthetic transformations instead of using a large image dataset. \ No newline at end of file diff --git a/data/2020/iclr/AMRL: Aggregated Memory For Reinforcement Learning b/data/2020/iclr/AMRL: Aggregated Memory For Reinforcement Learning new file mode 100644 index 0000000000..659572235a --- /dev/null +++ b/data/2020/iclr/AMRL: Aggregated Memory For Reinforcement Learning @@ -0,0 +1 @@ +In many partially observable scenarios, Reinforcement Learning (RL) agents must rely on long-term memory in order to learn an optimal policy. We demonstrate that using techniques from NLP and supervised learning fails at RL tasks due to stochasticity from the environment and from exploration. 
Utilizing our insights on the limitations of traditional memory methods in RL, we propose AMRL, a class of models that can learn better policies with greater sample efficiency and are resilient to noisy inputs. Specifically, our models use a standard memory module to summarize short-term context, and then aggregate all prior states from the standard model without respect to order. We show that this provides advantages both in terms of gradient decay and signal-to-noise ratio over time. Evaluating in Minecraft and maze environments that test long-term memory, we find that our model improves average return by 19% over a baseline that has the same number of parameters and by 9% over a stronger baseline that has far more parameters. \ No newline at end of file diff --git a/data/2020/iclr/Accelerating SGD with momentum for over-parameterized learning b/data/2020/iclr/Accelerating SGD with momentum for over-parameterized learning new file mode 100644 index 0000000000..a6c8adbb0a --- /dev/null +++ b/data/2020/iclr/Accelerating SGD with momentum for over-parameterized learning @@ -0,0 +1,4 @@ +Nesterov SGD is widely used for training modern neural networks and other machine learning models. Yet, its advantages over SGD have not been theoretically clarified. Indeed, as we show in our paper, both theoretically and empirically, Nesterov SGD with any parameter selection does not in general provide acceleration over ordinary SGD. Furthermore, Nesterov SGD may diverge for step sizes that ensure convergence of ordinary SGD. This is in contrast to the classical results in the deterministic scenario, where the same step size ensures accelerated convergence of Nesterov's method over optimal gradient descent. +To address the non-acceleration issue, we introduce a compensation term to Nesterov SGD. The resulting algorithm, which we call MaSS, converges for the same step sizes as SGD. We prove that MaSS obtains an accelerated convergence rate over SGD for any mini-batch size in the linear setting. For full batch, the convergence rate of MaSS matches the well-known accelerated rate of Nesterov's method. +We also analyze the practically important question of the dependence of the convergence rate and optimal hyper-parameters on the mini-batch size, demonstrating three distinct regimes: linear scaling, diminishing returns and saturation. +Experimental evaluation of MaSS for several standard architectures of deep networks, including ResNet and convolutional networks, shows improved performance over SGD, Nesterov SGD and Adam. \ No newline at end of file diff --git a/data/2020/iclr/Action Semantics Network: Considering the Effects of Actions in Multiagent Systems b/data/2020/iclr/Action Semantics Network: Considering the Effects of Actions in Multiagent Systems new file mode 100644 index 0000000000..ce017c4309 --- /dev/null +++ b/data/2020/iclr/Action Semantics Network: Considering the Effects of Actions in Multiagent Systems @@ -0,0 +1 @@ +In multiagent systems (MASs), each agent makes individual decisions but all of them contribute globally to the system evolution. Learning in MASs is difficult since each agent's selection of actions must take place in the presence of other co-learning agents. Moreover, the environmental stochasticity and uncertainties increase exponentially with the increase in the number of agents. Previous works have incorporated various multiagent coordination mechanisms into deep learning architectures to facilitate multiagent coordination. However, none of them explicitly considers the action semantics between agents, i.e., that different actions have different influences on other agents. In this paper, we propose a novel network architecture, named Action Semantics Network (ASN), that explicitly represents such action semantics between agents. 
ASN characterizes different actions' influence on other agents using neural networks based on the action semantics between them. ASN can be easily combined with existing deep reinforcement learning (DRL) algorithms to boost their performance. Experimental results on StarCraft II micromanagement and Neural MMO show ASN significantly improves the performance of state-of-the-art DRL approaches compared with several network architectures. \ No newline at end of file diff --git a/data/2020/iclr/Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games b/data/2020/iclr/Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games new file mode 100644 index 0000000000..a2a8b4ae62 --- /dev/null +++ b/data/2020/iclr/Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games @@ -0,0 +1 @@ +We study discrete-time mean-field Markov games with an infinite number of agents where each agent aims to minimize its ergodic cost. We consider the setting where the agents have identical linear state transitions and quadratic cost functions, while the aggregated effect of the agents is captured by the population mean of their states, namely, the mean-field state. For such a game, based on the Nash certainty equivalence principle, we provide sufficient conditions for the existence and uniqueness of its Nash equilibrium. Moreover, to find the Nash equilibrium, we propose a mean-field actor-critic algorithm with linear function approximation, which does not require knowing the model of dynamics. Specifically, at each iteration of our algorithm, we use the single-agent actor-critic algorithm to approximately obtain the optimal policy of each agent given the current mean-field state, and then update the mean-field state. In particular, we prove that our algorithm converges to the Nash equilibrium at a linear rate. 
To the best of our knowledge, this is the first success of applying model-free reinforcement learning with function approximation to discrete-time mean-field Markov games with provable non-asymptotic global convergence guarantees. \ No newline at end of file diff --git a/data/2020/iclr/Adaptive Structural Fingerprints for Graph Attention Networks b/data/2020/iclr/Adaptive Structural Fingerprints for Graph Attention Networks new file mode 100644 index 0000000000..9c6c3c3eac --- /dev/null +++ b/data/2020/iclr/Adaptive Structural Fingerprints for Graph Attention Networks @@ -0,0 +1 @@ +Many real-world data sets are represented as graphs, such as citation links, social media, and biological interaction. The volatile graph structure makes it non-trivial to employ convolutional neural networks (CNNs) for graph data processing. Recently, graph attention network (GAT) has proven a promising attempt by combining graph neural networks with an attention mechanism, so as to achieve message passing in graphs with arbitrary structures. However, the attention in GAT is computed mainly based on the similarity between node contents, while the structure of the graph remains largely unexploited (except in masking the attention out of one-hop neighbors). In this paper, we propose an "ADaptive Structural Fingerprint" (ADSF) model to fully exploit both topological details of the graph and content features of the nodes. The key idea is to contextualize each node with a weighted, learnable receptive field encoding rich and diverse local graph structures. By doing this, structural interactions between the nodes can be inferred accurately, thus improving the subsequent attention layer as well as the convergence of learning. 
Furthermore, our model provides a useful platform for different subspaces of node features and various scales of graph structures to "cross-talk" with each other through the learning of multi-head attention, being particularly useful in handling complex real-world data. Encouraging performance is observed on a number of benchmark data sets in node classification. \ No newline at end of file diff --git a/data/2020/iclr/Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks b/data/2020/iclr/Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks new file mode 100644 index 0000000000..79d34cc752 --- /dev/null +++ b/data/2020/iclr/Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks @@ -0,0 +1 @@ +We propose Additive Powers-of-Two (APoT) quantization, an efficient non-uniform quantization scheme for the bell-shaped and long-tailed distribution of weights and activations in neural networks. By constraining all quantization levels as the sum of Powers-of-Two terms, APoT quantization enjoys high computational efficiency and a good match with the distribution of weights. A simple reparameterization of the clipping function is applied to generate a better-defined gradient for learning the clipping threshold. Moreover, weight normalization is presented to refine the distribution of weights to make the training more stable and consistent. Experimental results show that our proposed method outperforms state-of-the-art methods, and is even competitive with the full-precision models, demonstrating the effectiveness of our proposed APoT quantization. For example, our 4-bit quantized ResNet-50 on ImageNet achieves 76.6% top-1 accuracy without bells and whistles; meanwhile, our model reduces computational cost by 22% compared with its uniformly quantized counterpart. 
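The additive construction behind APoT can be sketched as follows: every quantization level is a sum of a fixed number of power-of-two terms, and each weight is mapped to the nearest level. This is a simplified illustration only; the exact term sets, bit allocation, and clipping scheme in the paper differ, and all names here are made up.

```python
import itertools
import numpy as np

def apot_levels(n_terms=2, powers_per_term=4):
    """All non-negative levels expressible as a sum of n power-of-two terms.

    Each term is 0 or 2**-i; the exact term sets in the paper differ, this
    just illustrates the additive powers-of-two construction.
    """
    term = [0.0] + [2.0 ** -i for i in range(1, powers_per_term + 1)]
    sums = {round(sum(c), 8) for c in itertools.product(term, repeat=n_terms)}
    levels = np.array(sorted(sums))
    return levels / levels.max()      # normalize levels to [0, 1]

def quantize(w, levels):
    """Map each |w| to its nearest level, keeping the sign (weights in [-1, 1])."""
    idx = np.abs(np.abs(w)[..., None] - levels).argmin(-1)
    return np.sign(w) * levels[idx]

levels = apot_levels()
wq = quantize(np.array([0.37, -0.8, 0.02]), levels)
```

Because the levels are sums of powers of two, multiplication by a quantized weight reduces to a few shift-and-add operations, which is the source of the computational efficiency the abstract mentions.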
\ No newline at end of file diff --git a/data/2020/iclr/Adjustable Real-time Style Transfer b/data/2020/iclr/Adjustable Real-time Style Transfer new file mode 100644 index 0000000000..94f7149ab6 --- /dev/null +++ b/data/2020/iclr/Adjustable Real-time Style Transfer @@ -0,0 +1 @@ +Artistic style transfer is the problem of synthesizing an image with content similar to a given image and style similar to another. Although recent feed-forward neural networks can generate stylized images in real-time, these models produce a single stylization given a pair of style/content images, and the user doesn't have control over the synthesized output. Moreover, the style transfer depends on the hyper-parameters of the model with varying "optimum" for different input images. Therefore, if the stylized output is not appealing to the user, she/he has to try multiple models or retrain one with different hyper-parameters to get a favorite stylization. In this paper, we address these issues by proposing a novel method which allows adjustment of crucial hyper-parameters, after the training and in real-time, through a set of manually adjustable parameters. These parameters enable the user to modify the synthesized outputs from the same pair of style/content images, in search of a favorite stylized image. Our quantitative and qualitative experiments indicate how adjusting these parameters is comparable to retraining the model with different hyper-parameters. We also demonstrate how these parameters can be randomized to generate results which are diverse but still very similar in style and content. 
\ No newline at end of file diff --git a/data/2020/iclr/Adversarial Policies: Attacking Deep Reinforcement Learning b/data/2020/iclr/Adversarial Policies: Attacking Deep Reinforcement Learning new file mode 100644 index 0000000000..7c119759c1 --- /dev/null +++ b/data/2020/iclr/Adversarial Policies: Attacking Deep Reinforcement Learning @@ -0,0 +1 @@ +Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent's observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial? We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent. Videos are available at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/Adversarially Robust Representations with Smooth Encoders b/data/2020/iclr/Adversarially Robust Representations with Smooth Encoders new file mode 100644 index 0000000000..7ab9fba6b5 --- /dev/null +++ b/data/2020/iclr/Adversarially Robust Representations with Smooth Encoders @@ -0,0 +1 @@ +This paper studies the undesired phenomenon of over-sensitivity of representations learned by deep networks to semantically-irrelevant changes in data. 
We identify a cause for this shortcoming in the classical Variational Auto-encoder (VAE) objective, the evidence lower bound (ELBO). We show that the ELBO fails to control the behaviour of the encoder out of the support of the empirical data distribution and this behaviour of the VAE can lead to extreme errors in the learned representation. This is a key hurdle in the effective use of representations for data-efficient learning and transfer. To address this problem, we propose to augment the data with specifications that enforce insensitivity of the representation with respect to families of transformations. To incorporate these specifications, we propose a regularization method that is based on a selection mechanism that creates a fictive data point by explicitly perturbing an observed true data point. For certain choices of parameters, our formulation naturally leads to the minimization of the entropy regularized Wasserstein distance between representations. We illustrate our approach on standard datasets and experimentally show that significant improvements in the downstream adversarial accuracy can be achieved by learning robust representations completely in an unsupervised manner, without a reference to a particular downstream task and without a costly supervised adversarial training procedure. \ No newline at end of file diff --git a/data/2020/iclr/Adversarially robust transfer learning b/data/2020/iclr/Adversarially robust transfer learning new file mode 100644 index 0000000000..c91c0a96bb --- /dev/null +++ b/data/2020/iclr/Adversarially robust transfer learning @@ -0,0 +1 @@ +Transfer learning, in which a network is trained on one task and re-purposed on another, is often used to produce neural network classifiers when data is scarce or full-scale training is too costly. When the goal is to produce a model that is not only accurate but also adversarially robust, data scarcity and computational limitations become even more cumbersome. 
We consider robust transfer learning, in which we transfer not only performance but also robustness from a source model to a target domain. We start by observing that robust networks contain robust feature extractors. By training classifiers on top of these feature extractors, we produce new models that inherit the robustness of their parent networks. We then consider the case of fine-tuning a network by re-training end-to-end in the target domain. When using lifelong learning strategies, this process preserves the robustness of the source network while achieving high accuracy. By using such strategies, it is possible to produce accurate and robust models with little data, and without the cost of adversarial training. Additionally, we can improve the generalization of adversarially trained models, while maintaining their robustness. \ No newline at end of file diff --git a/data/2020/iclr/Ae-OT: a New Generative Model based on Extended Semi-discrete Optimal transport b/data/2020/iclr/Ae-OT: a New Generative Model based on Extended Semi-discrete Optimal transport new file mode 100644 index 0000000000..754be15559 --- /dev/null +++ b/data/2020/iclr/Ae-OT: a New Generative Model based on Extended Semi-discrete Optimal transport @@ -0,0 +1 @@ +Generative adversarial networks (GANs) have attracted huge attention due to their capability to generate visually realistic images. However, most of the existing models suffer from the mode collapse or mode mixture problems. In this work, we give a theoretical explanation of both problems by Figalli’s regularity theory of optimal transportation maps. Basically, the generator computes the transportation maps between the white noise distributions and the data distributions, which are in general discontinuous. However, DNNs can only represent continuous maps. This intrinsic conflict induces mode collapse and mode mixture.
To tackle both problems, we explicitly separate the manifold embedding and the optimal transportation; the first part is carried out using an autoencoder to map the images onto the latent space; the second part is accomplished using GPU-based convex optimization to find the discontinuous transportation maps. Composing the extended OT map and the decoder, we can finally generate new images from the white noise. This AE-OT model avoids representing discontinuous maps by DNNs, and therefore effectively prevents mode collapse and mode mixture. \ No newline at end of file diff --git a/data/2020/iclr/An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality b/data/2020/iclr/An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality new file mode 100644 index 0000000000..b45b915641 --- /dev/null +++ b/data/2020/iclr/An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality @@ -0,0 +1 @@ +Distances are pervasive in machine learning. They serve as similarity measures, loss functions, and learning targets; it is said that a good distance measure solves a task. When defining distances, the triangle inequality has proven to be a useful constraint, both theoretically (to prove convergence and optimality guarantees) and empirically (as an inductive bias). Deep metric learning architectures that respect the triangle inequality rely, almost exclusively, on Euclidean distance in the latent space. Though effective, this fails to model two broad classes of subadditive distances, common in graphs and reinforcement learning: asymmetric metrics, and metrics that cannot be embedded into Euclidean space. To address these problems, we introduce novel architectures that are guaranteed to satisfy the triangle inequality. We prove our architectures universally approximate norm-induced metrics on $\mathbb{R}^n$, and present a similar result for modified Input Convex Neural Networks.
We show that our architectures outperform existing metric approaches when modeling graph distances and have a better inductive bias than non-metric approaches when training data is limited in the multi-goal reinforcement learning setting. \ No newline at end of file diff --git a/data/2020/iclr/Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification b/data/2020/iclr/Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification new file mode 100644 index 0000000000..7e1cac583f --- /dev/null +++ b/data/2020/iclr/Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification @@ -0,0 +1 @@ +Semmelhack et al. (2014) have achieved high classification accuracy in distinguishing swim bouts of zebrafish using a Support Vector Machine (SVM). Convolutional Neural Networks (CNNs) have reached superior performance in various image recognition tasks over SVMs, but these powerful networks remain a black box. Reaching better transparency helps to build trust in their classifications and makes learned features interpretable to experts. Using a recently developed technique called Deep Taylor Decomposition, we generated heatmaps to highlight input regions of high relevance for predictions. We find that our CNN makes predictions by analyzing the steadiness of the tail's trunk, which markedly differs from the manually extracted features used by Semmelhack et al. (2014). We further uncovered that the network paid attention to experimental artifacts. Removing these artifacts ensured the validity of predictions. After correction, our best CNN beats the SVM by 6.12%, achieving a classification accuracy of 96.32%. Our work thus demonstrates the utility of AI explainability for CNNs. \ No newline at end of file diff --git a/data/2020/iclr/Are Pre-trained Language Models Aware of Phrases? 
Simple but Strong Baselines for Grammar Induction b/data/2020/iclr/Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction new file mode 100644 index 0000000000..ac5ee5c36d --- /dev/null +++ b/data/2020/iclr/Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction @@ -0,0 +1 @@ +With the recent success and popularity of pre-trained language models (LMs) in natural language processing, there has been a rise in efforts to understand their inner workings. In line with such interest, we propose a novel method that assists us in investigating the extent to which pre-trained LMs capture the syntactic notion of constituency. Our method provides an effective way of extracting constituency trees from the pre-trained LMs without training. In addition, we report intriguing findings in the induced trees, including the fact that pre-trained LMs outperform other approaches in correctly demarcating adverb phrases in sentences. \ No newline at end of file diff --git a/data/2020/iclr/Are Transformers universal approximators of sequence-to-sequence functions? b/data/2020/iclr/Are Transformers universal approximators of sequence-to-sequence functions? new file mode 100644 index 0000000000..c723898710 --- /dev/null +++ b/data/2020/iclr/Are Transformers universal approximators of sequence-to-sequence functions? @@ -0,0 +1 @@ +Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the number of shared parameters in these models.
Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed-width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other architectures that can compute contextual mappings and empirically evaluate them. \ No newline at end of file diff --git a/data/2020/iclr/AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures b/data/2020/iclr/AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures new file mode 100644 index 0000000000..199d3c56eb --- /dev/null +++ b/data/2020/iclr/AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures @@ -0,0 +1 @@ +Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to include the time dimension, using modules such as 3D convolutions, or by using a two-stream design to capture both appearance and motion in videos. We interpret a video CNN as a collection of multi-stream convolutional blocks connected to each other, and propose the approach of automatically finding neural architectures with better connectivity and spatio-temporal interactions for video understanding. This is done by evolving a population of overly-connected architectures guided by connection weight learning.
Architectures combining representations that abstract different input types (i.e., RGB and optical flow) at multiple temporal resolutions are searched for, allowing different types or sources of information to interact with each other. Our method, referred to as AssembleNet, outperforms prior approaches on public video datasets, in some cases by a great margin. We obtain 58.6% mAP on Charades and 34.27% accuracy on Moments-in-Time. \ No newline at end of file diff --git a/data/2020/iclr/Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space b/data/2020/iclr/Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space new file mode 100644 index 0000000000..e8659d9bb0 --- /dev/null +++ b/data/2020/iclr/Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space @@ -0,0 +1 @@ +Challenges in natural sciences can often be phrased as optimization problems. Machine learning techniques have recently been applied to solve such problems. One example in chemistry is the design of tailor-made organic materials and molecules, which requires efficient methods to explore the chemical space. We present a genetic algorithm (GA) that is enhanced with a neural network (DNN) based discriminator model to improve the diversity of generated molecules and at the same time steer the GA. We show that our algorithm outperforms other generative models in optimization tasks. We furthermore present a way to increase interpretability of genetic algorithms, which helped us to derive design principles. 
\ No newline at end of file diff --git a/data/2020/iclr/AutoQ: Automated Kernel-Wise Neural Network Quantization b/data/2020/iclr/AutoQ: Automated Kernel-Wise Neural Network Quantization new file mode 100644 index 0000000000..a56e8fc93e --- /dev/null +++ b/data/2020/iclr/AutoQ: Automated Kernel-Wise Neural Network Quantization @@ -0,0 +1 @@ +Network quantization is one of the most hardware-friendly techniques to enable the deployment of convolutional neural networks (CNNs) on low-power mobile devices. Recent network quantization techniques quantize each weight kernel in a convolutional layer independently for higher inference accuracy, since the weight kernels in a layer exhibit different variances and hence have different amounts of redundancy. The quantization bitwidth or bit number (QBN) directly decides the inference accuracy, latency, energy and hardware overhead. To effectively reduce the redundancy and accelerate CNN inferences, various weight kernels should be quantized with different QBNs. However, prior works use only one QBN to quantize each convolutional layer or the entire CNN, because the design space of searching a QBN for each weight kernel is too large. The hand-crafted heuristic of the kernel-wise QBN search is so sophisticated that domain experts can obtain only sub-optimal results. It is difficult even for deep reinforcement learning (DRL) agents based on Deep Deterministic Policy Gradient (DDPG) to find a kernel-wise QBN configuration that can achieve reasonable inference accuracy. In this paper, we propose a hierarchical-DRL-based kernel-wise network quantization technique, AutoQ, to automatically search a QBN for each weight kernel, and choose another QBN for each activation layer. Compared to the models quantized by the state-of-the-art DRL-based schemes, on average, the same models quantized by AutoQ reduce the inference latency by 54.06%, and decrease the inference energy consumption by 50.69%, while achieving the same inference accuracy.
\ No newline at end of file diff --git a/data/2020/iclr/Automated Relational Meta-learning b/data/2020/iclr/Automated Relational Meta-learning new file mode 100644 index 0000000000..c0cda39050 --- /dev/null +++ b/data/2020/iclr/Automated Relational Meta-learning @@ -0,0 +1 @@ +In order to learn efficiently with a small amount of data on new tasks, meta-learning transfers knowledge learned from previous tasks to the new ones. However, a critical challenge in meta-learning is task heterogeneity, which cannot be well handled by traditional globally shared meta-learning methods. In addition, current task-specific meta-learning methods may either suffer from hand-crafted structure design or lack the capability to capture complex relations between tasks. In this paper, motivated by the way of knowledge organization in knowledge bases, we propose an automated relational meta-learning (ARML) framework that automatically extracts the cross-task relations and constructs the meta-knowledge graph. When a new task arrives, it can quickly find the most relevant structure and tailor the learned structure knowledge to the meta-learner. As a result, the proposed framework not only addresses the challenge of task heterogeneity by a learned meta-knowledge graph, but also increases the model interpretability. We conduct extensive experiments on 2D toy regression and few-shot image classification and the results demonstrate the superiority of ARML over state-of-the-art baselines. \ No newline at end of file diff --git a/data/2020/iclr/Automated curriculum generation through setter-solver interactions b/data/2020/iclr/Automated curriculum generation through setter-solver interactions new file mode 100644 index 0000000000..b3771d5996 --- /dev/null +++ b/data/2020/iclr/Automated curriculum generation through setter-solver interactions @@ -0,0 +1 @@ +Reinforcement learning algorithms use correlations between policies and rewards to improve agent performance.
But in dynamic or sparsely rewarding environments these correlations are often too small, or rewarding events are too infrequent to make learning feasible. Human education instead relies on curricula (the breakdown of tasks into simpler, static challenges with dense rewards) to build up to complex behaviors. While curricula are also useful for artificial agents, hand-crafting them is time-consuming. This has led researchers to explore automatic curriculum generation. Here we explore automatic curriculum generation in rich, dynamic environments. Using a setter-solver paradigm, we show the importance of considering goal validity, goal feasibility, and goal coverage to construct useful curricula. We demonstrate the success of our approach in rich but sparsely rewarding 2D and 3D environments, where an agent is tasked to achieve a single goal selected from a set of possible goals that varies between episodes, and identify challenges for future work. Finally, we demonstrate the value of a novel technique that guides agents towards a desired goal distribution. Altogether, these results represent a substantial step towards applying automatic task curricula to learn complex, otherwise unlearnable goals, and to our knowledge are the first to demonstrate automated curriculum generation for goal-conditioned agents in environments where the possible goals vary between episodes. \ No newline at end of file diff --git a/data/2020/iclr/Automatically Discovering and Learning New Visual Categories with Ranking Statistics b/data/2020/iclr/Automatically Discovering and Learning New Visual Categories with Ranking Statistics new file mode 100644 index 0000000000..76e3bffdab --- /dev/null +++ b/data/2020/iclr/Automatically Discovering and Learning New Visual Categories with Ranking Statistics @@ -0,0 +1 @@ +We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes.
This setting is similar to semi-supervised learning, but significantly harder because there are no labelled examples for the new classes. The challenge, then, is to leverage the information contained in the labelled images in order to learn a general-purpose clustering model and use the latter to identify the new classes in the unlabelled data. In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. We evaluate our approach on standard classification benchmarks and outperform current methods for novel category discovery by a significant margin. \ No newline at end of file diff --git a/data/2020/iclr/Black-Box Adversarial Attack with Transferable Model-based Embedding b/data/2020/iclr/Black-Box Adversarial Attack with Transferable Model-based Embedding new file mode 100644 index 0000000000..ff3364ce40 --- /dev/null +++ b/data/2020/iclr/Black-Box Adversarial Attack with Transferable Model-based Embedding @@ -0,0 +1 @@ +We present a new method for black-box adversarial attack. Unlike previous methods that combined transfer-based and scored-based methods by using the gradient or initialization of a surrogate white-box model, this new method tries to learn a low-dimensional embedding using a pretrained model, and then performs efficient search within the embedding space to attack an unknown target network. 
The method produces adversarial perturbations with high-level semantic patterns that are easily transferable. We show that this approach can greatly improve the query efficiency of black-box adversarial attack across different target network architectures. We evaluate our approach on MNIST, ImageNet and Google Cloud Vision API, resulting in a significant reduction in the number of queries. We also attack adversarially defended networks on CIFAR10 and ImageNet, where our method not only reduces the number of queries, but also improves the attack success rate. \ No newline at end of file diff --git a/data/2020/iclr/Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks b/data/2020/iclr/Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks new file mode 100644 index 0000000000..8030cf08da --- /dev/null +++ b/data/2020/iclr/Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks @@ -0,0 +1 @@ +We study the landscape of squared loss in neural networks with one hidden layer and ReLU activation functions. Let $m$ and $d$ be the widths of hidden and input layers, respectively. We show that there exist poor local minima with positive curvature for some training sets of size $n\geq m+2d-2$. By positive curvature of a local minimum, we mean that within a small neighborhood the loss function is strictly increasing in all directions. Consequently, for such training sets, there are initializations of weights from which there is no descent path to global optima. It is known that for $n\le m$, there always exist descent paths to global optima from all initial weights. From this perspective, our results provide a somewhat sharp characterization of the over-parameterization required for "existence of descent paths" in the loss landscape.
\ No newline at end of file diff --git a/data/2020/iclr/Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness b/data/2020/iclr/Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness new file mode 100644 index 0000000000..6e2114eea9 --- /dev/null +++ b/data/2020/iclr/Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness @@ -0,0 +1 @@ +Mode connectivity provides novel geometric insights on analyzing loss landscapes and enables building high-accuracy pathways between well-trained neural networks. In this work, we propose to employ mode connectivity in loss landscapes to study the adversarial robustness of deep neural networks, and provide novel methods for improving this robustness. Our experiments cover various types of adversarial attacks applied to different network architectures and datasets. When network models are tampered with backdoor or error-injection attacks, our results demonstrate that the path connection learned using limited amount of bonafide data can effectively mitigate adversarial effects while maintaining the original accuracy on clean data. Therefore, mode connectivity provides users with the power to repair backdoored or error-injected models. We also use mode connectivity to investigate the loss landscapes of regular and robust models against evasion attacks. Experiments show that there exists a barrier in adversarial robustness loss on the path connecting regular and adversarially-trained models. A high correlation is observed between the adversarial robustness loss and the largest eigenvalue of the input Hessian matrix, for which theoretical justifications are provided. Our results suggest that mode connectivity offers a holistic tool and practical means for evaluating and improving adversarial robustness. 
\ No newline at end of file diff --git a/data/2020/iclr/Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints b/data/2020/iclr/Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints new file mode 100644 index 0000000000..83c2cbd400 --- /dev/null +++ b/data/2020/iclr/Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints @@ -0,0 +1 @@ +In most practical settings and theoretical analyses, one assumes that a model can be trained until convergence. However, the growing complexity of machine learning datasets and models may violate such assumptions. Indeed, current approaches for hyper-parameter tuning and neural architecture search tend to be limited by practical resource constraints. Therefore, we introduce a formal setting for studying training under the non-asymptotic, resource-constrained regime, i.e., budgeted training. We analyze the following problem: "given a dataset, algorithm, and fixed resource budget, what is the best achievable performance?" We focus on the number of optimization iterations as the representative resource. Under such a setting, we show that it is critical to adjust the learning rate schedule according to the given budget. Among budget-aware learning schedules, we find simple linear decay to be both robust and high-performing. We support our claim through extensive experiments with state-of-the-art models on ImageNet (image classification), Kinetics (video classification), MS COCO (object detection and instance segmentation), and Cityscapes (semantic segmentation). We also analyze our results and find that the key to a good schedule is budgeted convergence, a phenomenon whereby the gradient vanishes at the end of each allowed budget. We also revisit existing approaches for fast convergence and show that budget-aware learning schedules readily outperform such approaches under (the practical but under-explored) budgeted training setting. 
\ No newline at end of file diff --git a/data/2020/iclr/CAQL: Continuous Action Q-Learning b/data/2020/iclr/CAQL: Continuous Action Q-Learning new file mode 100644 index 0000000000..5c2f99b644 --- /dev/null +++ b/data/2020/iclr/CAQL: Continuous Action Q-Learning @@ -0,0 +1 @@ +Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically. \ No newline at end of file diff --git a/data/2020/iclr/CLN2INV: Learning Loop Invariants with Continuous Logic Networks b/data/2020/iclr/CLN2INV: Learning Loop Invariants with Continuous Logic Networks new file mode 100644 index 0000000000..29c17c123d --- /dev/null +++ b/data/2020/iclr/CLN2INV: Learning Loop Invariants with Continuous Logic Networks @@ -0,0 +1 @@ +Program verification offers a framework for ensuring program correctness and therefore systematically eliminating different classes of bugs. 
Inferring loop invariants is one of the main challenges behind automated verification of real-world programs which often contain many loops. In this paper, we present Continuous Logic Network (CLN), a novel neural architecture for automatically learning loop invariants directly from program execution traces. Unlike existing neural networks, CLNs can learn precise and explicit representations of formulas in Satisfiability Modulo Theories (SMT) for loop invariants from program execution traces. We develop a new sound and complete semantic mapping for assigning SMT formulas to continuous truth values that allows CLNs to be trained efficiently. We use CLNs to implement a new inference system for loop invariants, CLN2INV, that significantly outperforms existing approaches on the popular Code2Inv dataset. CLN2INV is the first tool to solve all 124 theoretically solvable problems in the Code2Inv dataset. Moreover, CLN2INV takes only 1.1 seconds on average for each problem, which is 40 times faster than existing approaches. We further demonstrate that CLN2INV can even learn 12 significantly more complex loop invariants than the ones required for the Code2Inv dataset. \ No newline at end of file diff --git a/data/2020/iclr/CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning b/data/2020/iclr/CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning new file mode 100644 index 0000000000..ee7f0523da --- /dev/null +++ b/data/2020/iclr/CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning @@ -0,0 +1 @@ +A variety of cooperative multi-agent control problems require agents to achieve individual goals while contributing to collective success. 
This multi-goal multi-agent setting poses difficulties for recent algorithms, which primarily target settings with a single global reward, due to two new challenges: efficient exploration for learning both individual goal attainment and cooperation for others' success, and credit-assignment for interactions between actions and goals of different agents. To address both challenges, we restructure the problem into a novel two-stage curriculum, in which single-agent goal attainment is learned prior to learning multi-agent cooperation, and we derive a new multi-goal multi-agent policy gradient with a credit function for localized credit assignment. We use a function augmentation scheme to bridge value and policy functions across the curriculum. The complete architecture, called CM3, learns significantly faster than direct adaptations of existing algorithms on three challenging multi-goal multi-agent problems: cooperative navigation in difficult formations, negotiating multi-vehicle lane changes in the SUMO traffic simulator, and strategic cooperation in a Checkers environment. \ No newline at end of file diff --git a/data/2020/iclr/Can gradient clipping mitigate label noise? b/data/2020/iclr/Can gradient clipping mitigate label noise? new file mode 100644 index 0000000000..446c5cd7da --- /dev/null +++ b/data/2020/iclr/Can gradient clipping mitigate label noise? @@ -0,0 +1 @@ +Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum. This intuition has been made precise in a line of recent works, which show that suitable clipping can yield significantly faster convergence than vanilla gradient descent. 
In this paper, we propose a new lens for studying gradient clipping, namely, robustness: informally, one expects clipping to provide robustness to noise, since one does not overly trust any single sample. Surprisingly, we prove that for the common problem of label noise in classification, standard gradient clipping does not in general provide robustness. On the other hand, we show that a simple variant of gradient clipping is provably robust, and corresponds to suitably modifying the underlying loss function. This yields a simple, noise-robust alternative to the standard cross-entropy loss which performs well empirically. \ No newline at end of file diff --git a/data/2020/iclr/Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing b/data/2020/iclr/Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing new file mode 100644 index 0000000000..1da8831b27 --- /dev/null +++ b/data/2020/iclr/Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing @@ -0,0 +1 @@ +It is well-known that classifiers are vulnerable to adversarial perturbations. To defend against adversarial perturbations, various certified robustness results have been derived. However, existing certified robustnesses are limited to top-1 predictions. In many real-world applications, top-$k$ predictions are more relevant. In this work, we aim to derive certified robustness for top-$k$ predictions. In particular, our certified robustness is based on randomized smoothing, which turns any classifier to a new classifier via adding noise to an input example. We adopt randomized smoothing because it is scalable to large-scale neural networks and applicable to any classifier. We derive a tight robustness in $\ell_2$ norm for top-$k$ predictions when using randomized smoothing with Gaussian noise. 
We find that generalizing the certified robustness from top-1 to top-$k$ predictions faces significant technical challenges. We also empirically evaluate our method on CIFAR10 and ImageNet. For example, our method can obtain an ImageNet classifier with a certified top-5 accuracy of 62.8\% when the $\ell_2$-norms of the adversarial perturbations are less than 0.5 (=127/255). Our code is publicly available at: \url{https://github.com/jjy1994/Certify_Topk}. \ No newline at end of file diff --git a/data/2020/iclr/Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation b/data/2020/iclr/Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation new file mode 100644 index 0000000000..b790bc99e9 --- /dev/null +++ b/data/2020/iclr/Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation @@ -0,0 +1 @@ +Achieving faster execution with shorter compilation time can foster further diversity and innovation in neural networks. However, the current paradigm of executing neural networks either relies on hand-optimized libraries, traditional compilation heuristics, or very recently genetic algorithms and other stochastic methods. These methods suffer from frequent costly hardware measurements rendering them not only too time consuming but also suboptimal. As such, we devise a solution that can learn to quickly adapt to a previously unseen design space for code optimization, both accelerating the search and improving the output performance. This solution dubbed CHAMELEON leverages reinforcement learning whose solution takes fewer steps to converge, and develops an adaptive sampling algorithm that not only focuses on the costly samples (real hardware measurements) on representative points but also uses a domain knowledge inspired logic to improve the samples itself. 
Experimentation with real hardware shows that CHAMELEON provides a 4.45× speedup in optimization time over AutoTVM, while also improving the inference time of modern deep networks by 5.6%. \ No newline at end of file diff --git a/data/2020/iclr/Compositional languages emerge in a neural iterated learning model b/data/2020/iclr/Compositional languages emerge in a neural iterated learning model new file mode 100644 index 0000000000..10326940de --- /dev/null +++ b/data/2020/iclr/Compositional languages emerge in a neural iterated learning model @@ -0,0 +1 @@ +The principle of compositionality, which enables natural language to represent complex concepts via a structured combination of simpler ones, allows us to convey an open-ended set of messages using a limited vocabulary. If compositionality is indeed a natural property of language, we may expect it to appear in communication protocols that are created by neural agents via grounded language learning. Inspired by the iterated learning framework, which simulates the process of language evolution, we propose an effective neural iterated learning algorithm that, when applied to interacting neural agents, facilitates the emergence of a more structured type of language. Indeed, these languages provide specific advantages to neural agents during training, which translates into a larger posterior probability, which is then incrementally amplified via the iterated learning procedure. Our experiments confirm our analysis, and also demonstrate that the emergent languages largely improve the generalization of neural agent communication.
\ No newline at end of file diff --git a/data/2020/iclr/Computation Reallocation for Object Detection b/data/2020/iclr/Computation Reallocation for Object Detection new file mode 100644 index 0000000000..5ed1d5181b --- /dev/null +++ b/data/2020/iclr/Computation Reallocation for Object Detection @@ -0,0 +1 @@ +The allocation of computation resources in the backbone is a crucial issue in object detection. However, the allocation pattern used for classification is usually adopted directly for object detectors, which proves to be sub-optimal. In order to reallocate the engaged computation resources in a more efficient way, we present CR-NAS (Computation Reallocation Neural Architecture Search) that can learn computation reallocation strategies across different feature resolutions and spatial positions directly on the target detection dataset. A two-level reallocation space is proposed for both stage and spatial reallocation. A novel hierarchical search procedure is adopted to cope with the complex search space. We apply CR-NAS to multiple backbones and achieve consistent improvements. Our CR-ResNet50 and CR-MobileNetV2 outperform the baseline by 1.9% and 1.7% COCO AP respectively, without any additional computation budget. The models discovered by CR-NAS can be equipped with other powerful detection necks/heads and easily transferred to other datasets, e.g. PASCAL VOC, and other vision tasks, e.g. instance segmentation. Our CR-NAS can be used as a plugin to improve the performance of various networks, which is in high demand. \ No newline at end of file diff --git a/data/2020/iclr/Continual Learning with Adaptive Weights (CLAW) b/data/2020/iclr/Continual Learning with Adaptive Weights (CLAW) new file mode 100644 index 0000000000..f9da99e189 --- /dev/null +++ b/data/2020/iclr/Continual Learning with Adaptive Weights (CLAW) @@ -0,0 +1 @@ +Approaches to continual learning aim to successfully learn a set of related tasks that arrive in an online manner.
Recently, several frameworks have been developed which enable deep learning to be deployed in this learning scenario. A key modelling decision is to what extent the architecture should be shared across tasks. On the one hand, separately modelling each task avoids catastrophic forgetting but it does not support transfer learning and leads to large models. On the other hand, rigidly specifying a shared component and a task-specific part enables task transfer and limits the model size, but it is vulnerable to catastrophic forgetting and restricts the form of task-transfer that can occur. Ideally, the network should adaptively identify which parts of the network to share in a data driven way. Here we introduce such an approach called Continual Learning with Adaptive Weights (CLAW), which is based on probabilistic modelling and variational inference. Experiments show that CLAW achieves state-of-the-art performance on six benchmarks in terms of overall continual learning performance, as measured by classification accuracy, and in terms of addressing catastrophic forgetting. \ No newline at end of file diff --git a/data/2020/iclr/Continual Learning with Bayesian Neural Networks for Non-Stationary Data b/data/2020/iclr/Continual Learning with Bayesian Neural Networks for Non-Stationary Data new file mode 100644 index 0000000000..c4033ad794 --- /dev/null +++ b/data/2020/iclr/Continual Learning with Bayesian Neural Networks for Non-Stationary Data @@ -0,0 +1 @@ +This work addresses continual learning for non-stationary data, using Bayesian neural networks and memory-based online variational Bayes. We represent the posterior approximation of the network weights by a diagonal Gaussian distribution and a complementary memory of raw data. This raw data corresponds to likelihood terms that cannot be well approximated by the Gaussian. We introduce a novel method for sequentially updating both components of the posterior approximation. 
Furthermore, we propose Bayesian forgetting and a Gaussian diffusion process for adapting to non-stationary data. The experimental results show that our update method improves on existing approaches for streaming data. Additionally, the adaptation methods lead to better predictive performance for non-stationary data. \ No newline at end of file diff --git a/data/2020/iclr/Counterfactuals uncover the modular structure of deep generative models b/data/2020/iclr/Counterfactuals uncover the modular structure of deep generative models new file mode 100644 index 0000000000..4dda48b4f6 --- /dev/null +++ b/data/2020/iclr/Counterfactuals uncover the modular structure of deep generative models @@ -0,0 +1 @@ +Deep generative models can emulate the perceptual properties of complex image datasets, providing a latent representation of the data. However, manipulating such representation to perform meaningful and controllable transformations in the data space remains challenging without some form of supervision. While previous work has focused on exploiting statistical independence to disentangle latent factors, we argue that such requirement is too restrictive and propose instead a non-statistical framework that relies on counterfactual manipulations to uncover a modular structure of the network composed of disentangled groups of internal variables. Experiments with a variety of generative models trained on complex image datasets show the obtained modules can be used to design targeted interventions. This opens the way to applications such as computationally efficient style transfer and the automated assessment of robustness to contextual changes in pattern recognition systems. \ No newline at end of file diff --git a/data/2020/iclr/Curvature Graph Network b/data/2020/iclr/Curvature Graph Network new file mode 100644 index 0000000000..ba9decd6d5 --- /dev/null +++ b/data/2020/iclr/Curvature Graph Network @@ -0,0 +1 @@ +Graph-structured data is prevalent in many domains. 
Despite the widely celebrated success of deep neural networks, their power on graph-structured data is yet to be fully explored. We propose a novel network architecture that incorporates advanced graph structural features. In particular, we leverage discrete graph curvature, which measures how the neighborhoods of a pair of nodes are structurally related. The curvature of an edge (x, y) defines the distance taken to travel from neighbors of x to neighbors of y, compared with the length of edge (x, y). It is a much more descriptive feature compared to previously used features that only focus on node-specific attributes or limited topological information such as degree. Our curvature graph convolution network outperforms the state of the art on various synthetic and real-world graphs, especially the larger and denser ones. \ No newline at end of file diff --git a/data/2020/iclr/DBA: Distributed Backdoor Attacks against Federated Learning b/data/2020/iclr/DBA: Distributed Backdoor Attacks against Federated Learning new file mode 100644 index 0000000000..44ffc1af61 --- /dev/null +++ b/data/2020/iclr/DBA: Distributed Backdoor Attacks against Federated Learning @@ -0,0 +1 @@ +Backdoor attacks aim to manipulate a subset of training data by injecting adversarial triggers such that machine learning models trained on the tampered dataset will make arbitrary (targeted) incorrect predictions on the test set with the same trigger embedded. While federated learning (FL) is capable of aggregating information provided by different parties for training a better model, its distributed learning methodology and inherently heterogeneous data distribution across parties may bring new vulnerabilities. In addition to recent centralized backdoor attacks on FL where each party embeds the same global trigger during training, we propose the distributed backdoor attack (DBA) --- a novel threat assessment framework developed by fully exploiting the distributed nature of FL.
DBA decomposes a global trigger pattern into separate local patterns and embeds them into the training sets of different adversarial parties, respectively. Compared to standard centralized backdoors, we show that DBA is substantially more persistent and stealthy against FL on diverse datasets such as finance and image data. We conduct extensive experiments to show that the attack success rate of DBA is significantly higher than centralized backdoors under different settings. Moreover, we find that distributed attacks are indeed more insidious, as DBA can evade two state-of-the-art robust FL algorithms against centralized backdoors. We also provide explanations for the effectiveness of DBA via feature visual interpretation and feature importance ranking. To further explore the properties of DBA, we test the attack performance by varying different trigger factors, including local trigger variations (size, gap, and location), scaling factor in FL, data distribution, and poison ratio and interval. Our proposed DBA and thorough evaluation results shed light on characterizing the robustness of FL. \ No newline at end of file diff --git a/data/2020/iclr/DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames b/data/2020/iclr/DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames new file mode 100644 index 0000000000..4a3e94da18 --- /dev/null +++ b/data/2020/iclr/DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames @@ -0,0 +1,3 @@ +We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever "stale"), making it conceptually simple and easy to implement.
In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. + +This massive-scale training not only sets the state of the art on the Habitat Autonomous Navigation Challenge 2019, but essentially "solves" the task -- near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs. computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of "ImageNet pre-training + task-specific fine-tuning" for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available). \ No newline at end of file diff --git a/data/2020/iclr/Data-Independent Neural Pruning via Coresets b/data/2020/iclr/Data-Independent Neural Pruning via Coresets new file mode 100644 index 0000000000..150f74b0af --- /dev/null +++ b/data/2020/iclr/Data-Independent Neural Pruning via Coresets @@ -0,0 +1 @@ +Previous work showed empirically that large neural networks can be significantly reduced in size while preserving their accuracy. Model compression became a central research topic, as it is crucial for deployment of neural networks on devices with limited computational and memory resources.
The majority of the compression methods are based on heuristics and offer no worst-case guarantees on the trade-off between the compression rate and the approximation error for an arbitrarily new sample. We propose the first efficient, data-independent neural pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample. Our method is based on the coreset framework, which finds a small weighted subset of points that provably approximates the original inputs. Specifically, we approximate the output of a layer of neurons by a coreset of neurons in the previous layer and discard the rest. We apply this framework in a layer-by-layer fashion from the top to the bottom. Unlike previous works, our coreset is data independent, meaning that it provably guarantees the accuracy of the function for any input $x\in \mathbb{R}^d$, including an adversarial one. We demonstrate the effectiveness of our method on popular network architectures. In particular, our coresets yield 90\% compression of the LeNet-300-100 architecture on MNIST while improving the accuracy. \ No newline at end of file diff --git a/data/2020/iclr/DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling b/data/2020/iclr/DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling new file mode 100644 index 0000000000..c00eccb683 --- /dev/null +++ b/data/2020/iclr/DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling @@ -0,0 +1 @@ +For sequence models with large vocabularies, a majority of network parameters lie in the input and output layers. In this work, we describe a new method, DeFINE, for learning deep token representations efficiently. 
Our architecture uses a hierarchical structure with novel skip-connections which allow for the use of low dimensional input and output layers, reducing total parameters and training time while delivering similar or better performance versus existing methods. DeFINE can be incorporated easily in new or existing sequence models. Compared to state-of-the-art methods including adaptive input representations, this technique results in a 6% to 20% drop in perplexity. On WikiText-103, DeFINE reduces the total parameters of Transformer-XL by half with minimal impact on performance. On the Penn Treebank, DeFINE improves AWD-LSTM by 4 points with a 17% reduction in parameters, achieving comparable performance to state-of-the-art methods with fewer parameters. For machine translation, DeFINE improves the efficiency of the Transformer model by about 1.4 times while delivering similar performance. \ No newline at end of file diff --git "a/data/2020/iclr/Deep 3D Pan via local adaptive \"t-shaped\" convolutions with global and local adaptive dilations" "b/data/2020/iclr/Deep 3D Pan via local adaptive \"t-shaped\" convolutions with global and local adaptive dilations" new file mode 100644 index 0000000000..560f317b68 --- /dev/null +++ "b/data/2020/iclr/Deep 3D Pan via local adaptive \"t-shaped\" convolutions with global and local adaptive dilations" @@ -0,0 +1 @@ +Recent advances in deep learning have shown promising results in many low-level vision tasks. However, solving single-image-based view synthesis is still an open problem. In particular, the generation of new images at parallel camera views given a single input image is of great interest, as it enables 3D visualization of the 2D input scenery. We propose a novel network architecture to perform stereoscopic view synthesis at arbitrary camera positions along the X-axis, or Deep 3D Pan, with "t-shaped" adaptive kernels equipped with globally and locally adaptive dilations.
Our proposed network architecture, the monster-net, is devised with a novel t-shaped adaptive kernel with globally and locally adaptive dilation, which can efficiently incorporate global camera shift and handle local 3D geometries of the target image's pixels for the synthesis of natural-looking 3D panned views given a 2D input image. Extensive experiments were performed on the KITTI, CityScapes, and our VXXLXX_STEREO indoor datasets to prove the efficacy of our method. Our monster-net significantly outperforms the state-of-the-art method by a large margin on all metrics (RMSE, PSNR, and SSIM). Our proposed monster-net is capable of reconstructing more reliable image structures in synthesized images with coherent geometry. Moreover, the disparity information that can be extracted from the "t-shaped" kernel is much more reliable than that of the state-of-the-art method for the unsupervised monocular depth estimation task, confirming the effectiveness of our method. \ No newline at end of file diff --git a/data/2020/iclr/Deep Imitative Models for Flexible Inference, Planning, and Control b/data/2020/iclr/Deep Imitative Models for Flexible Inference, Planning, and Control new file mode 100644 index 0000000000..cd00d903d4 --- /dev/null +++ b/data/2020/iclr/Deep Imitative Models for Flexible Inference, Planning, and Control @@ -0,0 +1 @@ +Imitation Learning (IL) is an appealing approach to learn desirable autonomous behavior. However, directing IL to achieve arbitrary goals is difficult. In contrast, planning-based algorithms use dynamics models and reward functions to achieve goals. Yet, reward functions that evoke desirable behavior are often difficult to specify. In this paper, we propose Imitative Models to combine the benefits of IL and goal-directed planning. Imitative Models are probabilistic predictive models of desirable behavior able to plan interpretable expert-like trajectories to achieve specified goals.
We derive families of flexible goal objectives, including constrained goal regions, unconstrained goal sets, and energy-based goals. We show that our method can use these objectives to successfully direct behavior. Our method substantially outperforms six IL approaches and a planning-based approach in a dynamic simulated autonomous driving task, and is efficiently learned from expert demonstrations without online data collection. We also show our approach is robust to poorly specified goals, such as goals on the wrong side of the road. \ No newline at end of file diff --git a/data/2020/iclr/Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient b/data/2020/iclr/Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient new file mode 100644 index 0000000000..52adaf5b3c --- /dev/null +++ b/data/2020/iclr/Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient @@ -0,0 +1 @@ +Determinantal point processes (DPPs) are an effective tool for delivering diversity in multiple machine learning and computer vision tasks. Under the deep learning framework, DPPs are typically optimized via approximation, which is not straightforward and conflicts with the diversity requirement. We note, however, that there have been no deep learning paradigms that optimize DPPs directly, since doing so involves matrix inversion, which may result in high computational instability. This fact greatly hinders the wide use of DPPs on specific objectives where a DPP serves as a term to measure feature diversity. In this paper, we devise a simple but effective algorithm to address this issue and optimize the DPP term directly, expressed with an L-ensemble in the spectral domain over the Gram matrix, which is more flexible than learning on parametric kernels. By further taking into account some geometric constraints, our algorithm seeks to generate valid sub-gradients of the DPP term in cases where the DPP Gram matrix is not invertible (no gradients exist in this case).
In this sense, our algorithm can be easily incorporated with multiple deep learning tasks. Experiments show the effectiveness of our algorithm, indicating promising performance for practical learning problems. \ No newline at end of file diff --git a/data/2020/iclr/Deep Network Classification by Scattering and Homotopy Dictionary Learning b/data/2020/iclr/Deep Network Classification by Scattering and Homotopy Dictionary Learning new file mode 100644 index 0000000000..cfd03de9ef --- /dev/null +++ b/data/2020/iclr/Deep Network Classification by Scattering and Homotopy Dictionary Learning @@ -0,0 +1 @@ +We introduce a sparse scattering deep convolutional neural network, which provides a simple model to analyze properties of deep representation learning for classification. Learning a single dictionary matrix with a classifier yields a higher classification accuracy than AlexNet over the ImageNet 2012 dataset. The network first applies a scattering transform that linearizes variabilities due to geometric transformations such as translations and small deformations. A sparse $\ell^1$ dictionary coding reduces intra-class variability while preserving class separation through projections over unions of linear spaces. It is implemented in a deep convolutional network with a homotopy algorithm having an exponential convergence. A convergence proof is given in a general framework that includes ALISTA. Classification results are analyzed on ImageNet. \ No newline at end of file diff --git a/data/2020/iclr/Deep Semi-Supervised Anomaly Detection b/data/2020/iclr/Deep Semi-Supervised Anomaly Detection new file mode 100644 index 0000000000..e117396cdf --- /dev/null +++ b/data/2020/iclr/Deep Semi-Supervised Anomaly Detection @@ -0,0 +1 @@ +Deep approaches to anomaly detection have recently shown promising results over shallow methods on large and complex datasets. Typically anomaly detection is treated as an unsupervised learning problem. 
In practice, however, one may have---in addition to a large set of unlabeled samples---access to a small pool of labeled samples, e.g. a subset verified by some domain expert as being normal or anomalous. Semi-supervised approaches to anomaly detection aim to utilize such labeled samples, but most proposed methods are limited to merely including labeled normal samples. Only a few methods take advantage of labeled anomalies, with existing deep approaches being domain-specific. In this work, we present Deep SAD, an end-to-end deep methodology for general semi-supervised anomaly detection. We further introduce an information-theoretic framework for deep anomaly detection based on the idea that the entropy of the latent distribution for normal data should be lower than the entropy of the anomalous distribution, which can serve as a theoretical interpretation for our method. In extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10, along with other anomaly detection benchmark datasets, we demonstrate that our method is on par with or outperforms shallow, hybrid, and deep competitors, yielding appreciable performance improvements even when provided with only a little labeled data. \ No newline at end of file diff --git a/data/2020/iclr/DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures b/data/2020/iclr/DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures new file mode 100644 index 0000000000..2d1019493b --- /dev/null +++ b/data/2020/iclr/DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures @@ -0,0 +1 @@ +In seeking sparse and efficient neural network models, many previous works investigated enforcing L1 or L0 regularizers to encourage weight sparsity during training.
The L0 regularizer measures the parameter sparsity directly and is invariant to the scaling of parameter values, but it cannot provide useful gradients, and therefore requires complex optimization techniques. The L1 regularizer is almost everywhere differentiable and can be easily optimized with gradient descent. Yet it is not scale-invariant, applying the same shrinking rate to all parameters, which is inefficient for increasing sparsity. Inspired by the Hoyer measure (the ratio between L1 and L2 norms) used in traditional compressed sensing problems, we present DeepHoyer, a set of sparsity-inducing regularizers that are both differentiable almost everywhere and scale-invariant. Our experiments show that enforcing DeepHoyer regularizers can produce even sparser neural network models than previous works, under the same accuracy level. We also show that DeepHoyer can be applied to both element-wise and structural pruning. \ No newline at end of file diff --git a/data/2020/iclr/DeepV2D: Video to Depth with Differentiable Structure from Motion b/data/2020/iclr/DeepV2D: Video to Depth with Differentiable Structure from Motion new file mode 100644 index 0000000000..9d447aba77 --- /dev/null +++ b/data/2020/iclr/DeepV2D: Video to Depth with Differentiable Structure from Motion @@ -0,0 +1 @@ +We propose DeepV2D, an end-to-end deep learning architecture for predicting depth from video. DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation. We compose a collection of classical geometric algorithms, which are converted into trainable modules and combined into an end-to-end differentiable architecture. DeepV2D interleaves two stages: motion estimation and depth estimation. During inference, motion and depth estimation are alternated and converge to accurate depth. Code is available at this https URL.
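The Hoyer measure that the DeepHoyer abstract above builds on (the ratio between L1 and L2 norms) is easy to compute; here is a hedged NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def hoyer(w):
    """Ratio of L1 to L2 norm: scale-invariant, and smaller for sparser w."""
    w = np.asarray(w, dtype=float)
    return np.abs(w).sum() / np.linalg.norm(w)

w_sparse = np.array([1.0, 0.0, 0.0, 0.0])
w_dense = np.array([1.0, 1.0, 1.0, 1.0])
print(hoyer(w_sparse))  # 1.0 for a 1-hot vector (minimal)
print(hoyer(w_dense))   # 2.0 = sqrt(4) for a uniform vector (maximal)
```

Because both norms scale linearly, `hoyer(c * w) == hoyer(w)` for any nonzero scalar `c`, which is the scale-invariance property the abstract contrasts with plain L1 regularization.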
\ No newline at end of file diff --git a/data/2020/iclr/Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation b/data/2020/iclr/Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation new file mode 100644 index 0000000000..4a4951e5fb --- /dev/null +++ b/data/2020/iclr/Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation @@ -0,0 +1,2 @@ +Convolutional networks are not aware of an object's geometric variations, which leads to inefficient utilization of model and data capacity. To overcome this issue, recent works on deformation modeling seek to spatially reconfigure the data towards a common arrangement such that semantic recognition suffers less from deformation. This is typically done by augmenting static operators with learned free-form sampling grids in the image space, dynamically tuned to the data and task for adapting the receptive field. Yet adapting the receptive field does not quite reach the actual goal -- what really matters to the network is the "effective" receptive field (ERF), which reflects how much each pixel contributes. It is thus natural to design other approaches to adapt the ERF directly during runtime. +In this work, we instantiate one possible solution as Deformable Kernels (DKs), a family of novel and generic convolutional operators for handling object deformations by directly adapting the ERF while leaving the receptive field untouched. At the heart of our method is the ability to resample the original kernel space towards recovering the deformation of objects. This approach is justified with theoretical insights that the ERF is strictly determined by data sampling locations and kernel values. We implement DKs as generic drop-in replacements of rigid kernels and conduct a series of empirical studies whose results conform with our theories. Over several tasks and standard base models, our approach compares favorably against prior works that adapt during runtime. 
In addition, further experiments suggest a working mechanism orthogonal and complementary to previous works. \ No newline at end of file diff --git a/data/2020/iclr/Depth-Adaptive Transformer b/data/2020/iclr/Depth-Adaptive Transformer new file mode 100644 index 0000000000..d574342554 --- /dev/null +++ b/data/2020/iclr/Depth-Adaptive Transformer @@ -0,0 +1 @@ +State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence. Unlike dynamic computation in Universal Transformers, which applies the same set of layers iteratively, we apply different layers at every step to adjust both the amount of computation as well as the model capacity. On IWSLT German-English translation our approach matches the accuracy of a well tuned baseline Transformer while using less than a quarter of the decoder layers. \ No newline at end of file diff --git a/data/2020/iclr/Detecting Extrapolation with Local Ensembles b/data/2020/iclr/Detecting Extrapolation with Local Ensembles new file mode 100644 index 0000000000..8a680d7c7b --- /dev/null +++ b/data/2020/iclr/Detecting Extrapolation with Local Ensembles @@ -0,0 +1 @@ +We present local ensembles, a method for detecting extrapolation at test time in a pre-trained model. We focus on underdetermination as a key component of extrapolation: we aim to detect when many possible predictions are consistent with the training data and model class. Our method uses local second-order information to approximate the variance of predictions across an ensemble of models from the same class. 
We compute this approximation by estimating the norm of the component of a test point's gradient that aligns with the low-curvature directions of the Hessian, and provide a tractable method for estimating this quantity. Experimentally, we show that our method is capable of detecting when a pre-trained model is extrapolating on test data, with applications to out-of-distribution detection, detecting spurious correlates, and active learning. \ No newline at end of file diff --git a/data/2020/iclr/Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions b/data/2020/iclr/Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions new file mode 100644 index 0000000000..24543dcaa3 --- /dev/null +++ b/data/2020/iclr/Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions @@ -0,0 +1 @@ +Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack, which seeks both to cause a misclassification and to achieve a low reconstruction error. This reconstructive attack produces undetected adversarial examples, but with a much lower success rate. Among all these attacks, we find that CapsNets always perform better than convolutional networks. Then, we diagnose the adversarial examples for CapsNets and find that the success of the reconstructive attack is highly related to the visual similarity between the source and target class. Additionally, the resulting perturbations can cause the input image to appear visually more like the target class and hence become non-adversarial.
This suggests that CapsNets use features that are more aligned with human perception and have the potential to address the central issue raised by adversarial examples. \ No newline at end of file diff --git a/data/2020/iclr/Difference-Seeking Generative Adversarial Network-Unseen Sample Generation b/data/2020/iclr/Difference-Seeking Generative Adversarial Network-Unseen Sample Generation new file mode 100644 index 0000000000..db5df43176 --- /dev/null +++ b/data/2020/iclr/Difference-Seeking Generative Adversarial Network-Unseen Sample Generation @@ -0,0 +1 @@ +Unseen data, which are not samples from the distribution of training data and are difficult to collect, have proven important in numerous applications ({\em e.g.,} novelty detection, semi-supervised learning, and adversarial training). In this paper, we introduce a general framework called \textbf{d}ifference-\textbf{s}eeking \textbf{g}enerative \textbf{a}dversarial \textbf{n}etwork (DSGAN) to generate various types of unseen data. Its novelty lies in treating the probability density of the unseen data distribution as the difference between two distributions $p_{\bar{d}}$ and $p_{d}$ whose samples are relatively easy to collect. The DSGAN can learn the target distribution, $p_{t}$ (or the unseen data distribution), from only the samples from the two distributions, $p_{d}$ and $p_{\bar{d}}$. In our scenario, $p_d$ is the distribution of the seen data, and $p_{\bar{d}}$ can be obtained from $p_{d}$ via simple operations, so that we only need the samples of $p_{d}$ during the training. Two key applications, semi-supervised learning and novelty detection, are taken as case studies to illustrate that the DSGAN enables the production of various unseen data. We also provide theoretical analyses about the convergence of the DSGAN.
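To make the difference-seeking premise concrete, here is a toy 1-D numerical sketch (my own construction for illustration, not the DSGAN training procedure): p_dbar is obtained from the seen distribution p_d by a simple operation (adding noise), and the positive part of p_dbar - p_d, an unnormalized "unseen" density, indeed concentrates outside the bulk of the seen data.

```python
import numpy as np

# Toy 1-D illustration of the difference-seeking premise (an
# illustrative construction, not the DSGAN training procedure).
rng = np.random.default_rng(0)
seen = rng.normal(0.0, 1.0, 100_000)              # samples from p_d
blurred = seen + rng.normal(0.0, 3.0, seen.size)  # samples from p_dbar

bins = np.linspace(-10, 10, 81)
p_d, _ = np.histogram(seen, bins, density=True)
p_dbar, _ = np.histogram(blurred, bins, density=True)
diff = np.clip(p_dbar - p_d, 0.0, None)  # positive part: "unseen" mass

centers = (bins[:-1] + bins[1:]) / 2
frac_outside = diff[np.abs(centers) > 2].sum() / diff.sum()
print(frac_outside)  # most of the difference lies outside the seen bulk
```

Because the two Gaussian densities cross near |x| = 1.6, nearly all of the clipped difference sits outside the seen data's bulk, which is exactly the region where "unseen" samples should live.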
\ No newline at end of file diff --git a/data/2020/iclr/Differentially Private Meta-Learning b/data/2020/iclr/Differentially Private Meta-Learning new file mode 100644 index 0000000000..12d5520f46 --- /dev/null +++ b/data/2020/iclr/Differentially Private Meta-Learning @@ -0,0 +1 @@ +Parameter-transfer is a well-known and versatile approach for meta-learning, with applications including few-shot learning, federated learning, and reinforcement learning. However, parameter-transfer algorithms often require sharing models that have been trained on the samples from specific tasks, thus leaving the task-owners susceptible to breaches of privacy. We conduct the first formal study of privacy in this setting and formalize the notion of task-global differential privacy as a practical relaxation of more commonly studied threat models. We then propose a new differentially private algorithm for gradient-based parameter transfer that not only satisfies this privacy requirement but also retains provable transfer learning guarantees in convex settings. Empirically, we apply our analysis to the problems of federated learning with personalization and few-shot classification, showing that allowing the relaxation to task-global privacy from the more commonly studied notion of local privacy leads to dramatically increased performance in recurrent neural language modeling and image classification. \ No newline at end of file diff --git a/data/2020/iclr/Disentangling Factors of Variations Using Few Labels b/data/2020/iclr/Disentangling Factors of Variations Using Few Labels new file mode 100644 index 0000000000..38cc8254c4 --- /dev/null +++ b/data/2020/iclr/Disentangling Factors of Variations Using Few Labels @@ -0,0 +1 @@ +Learning disentangled representations is considered a cornerstone problem in representation learning. Recently, Locatello et al. 
(2019) demonstrated that unsupervised disentanglement learning without inductive biases is theoretically impossible and that existing inductive biases and unsupervised methods do not allow one to consistently learn disentangled representations. However, in many practical settings, one might have access to a limited amount of supervision, for example through manual labeling of (some) factors of variation in a few training examples. In this paper, we investigate the impact of such supervision on state-of-the-art disentanglement methods and perform a large-scale study, training over 52000 models under well-defined and reproducible experimental conditions. We observe that a small number of labeled examples (0.01--0.5% of the data set), with potentially imprecise and incomplete labels, is sufficient to perform model selection on state-of-the-art unsupervised models. Further, we investigate the benefit of incorporating supervision into the training process. Overall, we empirically validate that with little and imprecise supervision it is possible to reliably learn disentangled representations. \ No newline at end of file diff --git a/data/2020/iclr/Distance-Based Learning from Errors for Confidence Calibration b/data/2020/iclr/Distance-Based Learning from Errors for Confidence Calibration new file mode 100644 index 0000000000..94d8ee3d27 --- /dev/null +++ b/data/2020/iclr/Distance-Based Learning from Errors for Confidence Calibration @@ -0,0 +1 @@ +Deep neural networks (DNNs) are poorly calibrated when trained in conventional ways. To improve confidence calibration of DNNs, we propose a novel training method, distance-based learning from errors (DBLE). DBLE bases its confidence estimation on distances in the representation space. We first adapt prototypical learning to train a classification model for DBLE. It yields a representation space where a test sample's distance to its ground-truth class center can calibrate the model's performance.
At inference, however, these distances are not available due to the lack of ground-truth labels. To circumvent this, we approximately infer the distance for every test sample by training a confidence model jointly with the classification model, learning merely from mis-classified training samples, which we show to be highly beneficial for effective learning. On multiple data sets and DNN architectures, we demonstrate that DBLE outperforms alternative single-modal confidence calibration approaches. DBLE also achieves performance comparable to computationally expensive ensemble approaches, at lower computational cost and with fewer parameters. \ No newline at end of file diff --git a/data/2020/iclr/Diverse Trajectory Forecasting with Determinantal Point Processes b/data/2020/iclr/Diverse Trajectory Forecasting with Determinantal Point Processes new file mode 100644 index 0000000000..dd30743f9e --- /dev/null +++ b/data/2020/iclr/Diverse Trajectory Forecasting with Determinantal Point Processes @@ -0,0 +1 @@ +The ability to forecast a set of likely yet diverse possible future behaviors of an agent (e.g., future trajectories of a pedestrian) is essential for safety-critical perception systems (e.g., autonomous vehicles). In particular, a set of possible future behaviors generated by the system must be diverse to account for all possible outcomes in order to take necessary safety precautions. It is not sufficient to maintain a set of the most likely future outcomes because the set may only contain perturbations of a single outcome. While generative models such as variational autoencoders (VAEs) have been shown to be a powerful tool for learning a distribution over future trajectories, randomly drawn samples from the learned implicit likelihood model may not be diverse -- the likelihood model is derived from the training data distribution and the samples will concentrate around the major mode that has the most data.
In this work, we propose to learn a diversity sampling function (DSF) that generates a diverse and likely set of future trajectories. The DSF maps forecasting context features to a set of latent codes which can be decoded by a generative model (e.g., VAE) into a set of diverse trajectory samples. Concretely, the process of identifying the diverse set of samples is posed as a parameter estimation of the DSF. To learn the parameters of the DSF, the diversity of the trajectory samples is evaluated by a diversity loss based on a determinantal point process (DPP). Gradient descent is performed over the DSF parameters, which in turn move the latent codes of the sample set to find an optimal diverse and likely set of trajectories. Our method is a novel application of DPPs to optimize a set of items (trajectories) in continuous space. We demonstrate the diversity of the trajectories produced by our approach on both low-dimensional 2D trajectory data and high-dimensional human motion data. \ No newline at end of file diff --git a/data/2020/iclr/DivideMix: Learning with Noisy Labels as Semi-supervised Learning b/data/2020/iclr/DivideMix: Learning with Noisy Labels as Semi-supervised Learning new file mode 100644 index 0000000000..eebbffc0f5 --- /dev/null +++ b/data/2020/iclr/DivideMix: Learning with Noisy Labels as Semi-supervised Learning @@ -0,0 +1 @@ +Deep neural networks are known to be annotation-hungry. Numerous efforts have been devoted to reducing the annotation cost when learning with deep networks. Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data. In this work, we propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques. 
In particular, DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner. To avoid confirmation bias, we simultaneously train two diverged networks where each network uses the dataset division from the other network. During the semi-supervised training phase, we improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively. Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods. Code is available at this https URL . \ No newline at end of file diff --git a/data/2020/iclr/Dynamic Time Lag Regression: Predicting What & When b/data/2020/iclr/Dynamic Time Lag Regression: Predicting What & When new file mode 100644 index 0000000000..3b3bd91ce2 --- /dev/null +++ b/data/2020/iclr/Dynamic Time Lag Regression: Predicting What & When @@ -0,0 +1 @@ +This paper tackles a new regression problem, called Dynamic Time-Lag Regression (DTLR), where a cause signal drives an effect signal with an unknown time delay. The motivating application, pertaining to space weather modelling, aims to predict the near-Earth solar wind speed based on estimates of the Sun's coronal magnetic field. DTLR differs from mainstream regression and from sequence-to-sequence learning in two respects: firstly, no ground truth (e.g., pairs of associated sub-sequences) is available; secondly, the cause signal contains much information irrelevant to the effect signal (the solar magnetic field governs the solar wind propagation in the heliosphere, of which the Earth's magnetosphere is but a minuscule region). A Bayesian approach is presented to tackle the specifics of the DTLR problem, with theoretical justifications based on linear stability analysis. 
A proof of concept on synthetic problems is presented. Finally, the empirical results on the solar wind modelling task improve on the state of the art in solar wind forecasting. \ No newline at end of file diff --git a/data/2020/iclr/Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery b/data/2020/iclr/Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery new file mode 100644 index 0000000000..f6f0678021 --- /dev/null +++ b/data/2020/iclr/Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery @@ -0,0 +1 @@ +Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both on a real-world robot and in simulation. We show that our method can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and just ten preference labels, without any other supervision. 
Videos of the learned skills can be found on the project website: this https URL. \ No newline at end of file diff --git a/data/2020/iclr/Dynamically Pruned Message Passing Networks for Large-scale Knowledge Graph Reasoning b/data/2020/iclr/Dynamically Pruned Message Passing Networks for Large-scale Knowledge Graph Reasoning new file mode 100644 index 0000000000..7f29b7de46 --- /dev/null +++ b/data/2020/iclr/Dynamically Pruned Message Passing Networks for Large-scale Knowledge Graph Reasoning @@ -0,0 +1 @@ +We propose Dynamically Pruned Message Passing Networks (DPMPN) for large-scale knowledge graph reasoning. In contrast to existing models, embedding-based or path-based, we learn an input-dependent subgraph to explicitly model the reasoning process. Subgraphs are dynamically constructed and expanded by applying a graphical attention mechanism conditioned on input queries. In this way, we not only construct graph-structured explanations but also enable message passing designed in Graph Neural Networks (GNNs) to scale with graph sizes. We take inspiration from the consciousness prior proposed by Bengio (2017) and develop a two-GNN framework to simultaneously encode an input-agnostic full-graph representation and learn an input-dependent local one, coordinated by an attention module. Experiments demonstrate the reasoning capability of our model, which provides clear graphical explanations as well as accurate predictions, outperforming most state-of-the-art methods in knowledge base completion tasks. \ No newline at end of file diff --git a/data/2020/iclr/ES-MAML: Simple Hessian-Free Meta Learning b/data/2020/iclr/ES-MAML: Simple Hessian-Free Meta Learning new file mode 100644 index 0000000000..490ff0efde --- /dev/null +++ b/data/2020/iclr/ES-MAML: Simple Hessian-Free Meta Learning @@ -0,0 +1 @@ +We introduce ES-MAML, a new framework for solving the model-agnostic meta-learning (MAML) problem based on Evolution Strategies (ES).
Existing algorithms for MAML are based on policy gradients, and incur significant difficulties when attempting to estimate second derivatives using backpropagation on stochastic policies. We show how ES can be applied to MAML to obtain an algorithm which avoids the problem of estimating second derivatives, and is also conceptually simple and easy to implement. Moreover, ES-MAML can handle new types of nonsmooth adaptation operators, and other techniques for improving performance and estimation of ES methods become applicable. We show empirically that ES-MAML is competitive with existing methods and often yields better adaptation with fewer queries. \ No newline at end of file diff --git a/data/2020/iclr/Editable Neural Networks b/data/2020/iclr/Editable Neural Networks new file mode 100644 index 0000000000..eb837c84db --- /dev/null +++ b/data/2020/iclr/Editable Neural Networks @@ -0,0 +1 @@ +These days deep neural networks are ubiquitously used in a wide range of tasks, from image classification and machine translation to face identification and self-driving cars. In many applications, a single model error can lead to devastating financial, reputational and even life-threatening consequences. Therefore, it is crucially important to correct model mistakes quickly as they appear. In this work, we investigate the problem of neural network editing - how one can efficiently patch a mistake of the model on a particular sample, without influencing the model behavior on other samples. Namely, we propose Editable Training, a model-agnostic training technique that encourages fast editing of the trained model. We empirically demonstrate the effectiveness of this method on large-scale image classification and machine translation tasks. 
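The editing problem from "Editable Neural Networks" above can be sketched on a toy logistic model (a hypothetical illustration; Editable Training additionally meta-trains the model so that such edits converge in few steps with little drift): take gradient steps on the mistaken sample while penalizing prediction drift on a reference batch.

```python
import numpy as np

# Toy sketch of editing: fix one mistake, preserve behavior elsewhere.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def edit(w, x_err, y_err, X_ref, lr=0.5, lam=0.3, steps=300):
    """Correct the mistake on (x_err, y_err) while keeping predictions
    on the reference batch X_ref close to their pre-edit values."""
    p_ref = sigmoid(X_ref @ w)  # predictions to preserve ("locality")
    for _ in range(steps):
        g_err = (sigmoid(x_err @ w) - y_err) * x_err            # fix the mistake
        g_drift = lam * X_ref.T @ (sigmoid(X_ref @ w) - p_ref)  # stay close elsewhere
        w = w - lr * (g_err + g_drift)
    return w

w = np.array([1.0, -1.0])
x_err = np.array([1.0, 1.0])  # scored at 0.5 by w, but the true label is 1
X_ref = np.array([[2.0, 0.0], [0.0, 2.0]])
before = sigmoid(X_ref @ w)
w = edit(w, x_err, 1.0, X_ref)
print(sigmoid(x_err @ w))                         # mistake now scored correctly
print(np.abs(sigmoid(X_ref @ w) - before).max())  # drift elsewhere stays small
```

The penalty weight lam trades off how aggressively the mistake is corrected against how much the model is allowed to move on the reference batch; the paper's contribution is training the model so that this trade-off is easy at edit time.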
\ No newline at end of file diff --git a/data/2020/iclr/Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform b/data/2020/iclr/Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform new file mode 100644 index 0000000000..660a6d983b --- /dev/null +++ b/data/2020/iclr/Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform @@ -0,0 +1 @@ +Strictly enforcing orthonormality constraints on parameter matrices has been shown advantageous in deep learning. This amounts to Riemannian optimization on the Stiefel manifold, which, however, is computationally expensive. To address this challenge, we present two main contributions: (1) A new efficient retraction map based on an iterative Cayley transform for optimization updates, and (2) An implicit vector transport mechanism based on the combination of a projection of the momentum and the Cayley transform on the Stiefel manifold. We specify two new optimization algorithms: Cayley SGD with momentum, and Cayley ADAM on the Stiefel manifold. Convergence of Cayley SGD is theoretically analyzed. Our experiments for CNN training demonstrate that both algorithms: (a) Use less running time per iteration relative to existing approaches that enforce orthonormality of CNN parameters; and (b) Achieve faster convergence rates than the baseline SGD and ADAM algorithms without compromising the performance of the CNN. Cayley SGD and Cayley ADAM are also shown to reduce the training time for optimizing the unitary transition matrices in RNNs. 
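A single Cayley-transform update from the abstract above can be sketched as follows (using a direct matrix solve for clarity; the paper's efficiency gains come from an iterative approximation of this inverse plus an implicit vector transport for momentum). Since the direction matrix is skew-symmetric, the Cayley factor is orthogonal and the update stays on the Stiefel manifold.

```python
import numpy as np

# One Cayley-transform step on the Stiefel manifold (direct solve for
# clarity; the paper approximates the inverse iteratively).
def cayley_step(W, G, lr=0.1):
    """W: n x p with W.T @ W = I; G: Euclidean gradient at W."""
    A = G @ W.T - W @ G.T  # skew-symmetric direction matrix
    n = W.shape[0]
    # Q = (I + lr/2 A)^-1 (I - lr/2 A) is orthogonal because A is skew.
    Q = np.linalg.solve(np.eye(n) + lr / 2 * A,
                        np.eye(n) - lr / 2 * A)
    return Q @ W

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((5, 3)))  # point on the Stiefel manifold
G = rng.standard_normal((5, 3))                   # arbitrary Euclidean gradient
W_new = cayley_step(W, G)
print(np.allclose(W_new.T @ W_new, np.eye(3)))  # orthonormality is preserved
```

This is why no explicit re-orthogonalization is needed after each update: the constraint is maintained by construction, up to floating-point error.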
\ No newline at end of file diff --git a/data/2020/iclr/Efficient and Information-Preserving Future Frame Prediction and Beyond b/data/2020/iclr/Efficient and Information-Preserving Future Frame Prediction and Beyond new file mode 100644 index 0000000000..4e8a57e7ab --- /dev/null +++ b/data/2020/iclr/Efficient and Information-Preserving Future Frame Prediction and Beyond @@ -0,0 +1 @@ +Applying resolution-preserving blocks is a common practice to maximize information preservation in video prediction, yet their high memory consumption greatly limits their application scenarios. We propose CrevNet, a Conditionally Reversible Network that uses reversible architectures to build a bijective two-way autoencoder and its complementary recurrent predictor. Our model enjoys the theoretically guaranteed property of no information loss during the feature extraction, much lower memory consumption and computational efficiency. The lightweight nature of our model enables us to incorporate 3D convolutions without concern of memory bottleneck, enhancing the model's ability to capture both short-term and long-term temporal dependencies. Our proposed approach achieves state-of-the-art results on Moving MNIST, Traffic4cast and KITTI datasets. We further demonstrate the transferability of our self-supervised learning method by exploiting its learnt features for object detection on KITTI. Our competitive results indicate the potential of using CrevNet as a generative pre-training strategy to guide downstream tasks. 
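CrevNet's information-preservation guarantee rests on reversible (bijective) blocks. A minimal dense additive-coupling block, sketched here as a stand-in for the paper's convolutional couplings, can be inverted exactly, so feature extraction loses no information:

```python
import numpy as np

# Additive-coupling reversible block (toy dense version; CrevNet uses
# convolutional couplings). The sub-network f never needs inverting.
def f(h):
    return np.tanh(1.7 * h + 0.3)

def forward(x1, x2):
    y1 = x1 + f(x2)
    y2 = x2 + f(y1)
    return y1, y2

def inverse(y1, y2):
    # Undo the coupling in reverse order by subtracting the same terms.
    x2 = y2 - f(y1)
    x1 = y1 - f(x2)
    return x1, x2

x1, x2 = np.random.default_rng(0).standard_normal((2, 4))
r1, r2 = inverse(*forward(x1, x2))
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # bijective: exact reconstruction
```

Because activations can be recomputed from the block's outputs rather than stored, memory cost stays low, which is what lets the paper add 3D convolutions without hitting a memory bottleneck.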
\ No newline at end of file diff --git a/data/2020/iclr/Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier b/data/2020/iclr/Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier new file mode 100644 index 0000000000..31ef9c7efe --- /dev/null +++ b/data/2020/iclr/Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier @@ -0,0 +1 @@ +Adversarial attacks on convolutional neural networks (CNNs) have gained significant attention, and there have been active research efforts on defense mechanisms. Stochastic input transformation methods have been proposed, where the idea is to recover the image from an adversarial attack by random transformation, and to take the majority vote as consensus among the random samples. However, the transformation improves the accuracy on adversarial images at the expense of the accuracy on clean images. While it is intuitive that the accuracy on clean images would deteriorate, the exact mechanism by which this occurs is unclear. In this paper, we study the distribution of softmax induced by stochastic transformations. We observe that with random transformations on the clean images, although the mass of the softmax distribution could shift to the wrong class, the resulting distribution of softmax could be used to correct the prediction. Furthermore, on the adversarial counterparts, with the image transformation, the resulting shapes of the distribution of softmax are similar to the distributions from the clean images. With these observations, we propose a method to improve existing transformation-based defenses. We train a separate lightweight distribution classifier to recognize distinct features in the distributions of softmax outputs of transformed images.
Our empirical studies show that our distribution classifier, by training on distributions obtained from clean images only, outperforms majority voting for both clean and adversarial images. Our method is generic and can be integrated with existing transformation-based defenses. \ No newline at end of file diff --git a/data/2020/iclr/Ensemble Distribution Distillation b/data/2020/iclr/Ensemble Distribution Distillation new file mode 100644 index 0000000000..c90275b749 --- /dev/null +++ b/data/2020/iclr/Ensemble Distribution Distillation @@ -0,0 +1 @@ +Ensembles of models often yield improvements in system performance. These ensemble approaches have also been empirically shown to yield robust measures of uncertainty, and are capable of distinguishing between different \emph{forms} of uncertainty. However, ensembles come at a computational and memory cost which may be prohibitive for many applications. There has been significant work done on the distillation of an ensemble into a single model. Such approaches decrease computational cost and allow a single model to achieve an accuracy comparable to that of an ensemble. However, information about the \emph{diversity} of the ensemble, which can yield estimates of different forms of uncertainty, is lost. This work considers the novel task of \emph{Ensemble Distribution Distillation} (EnD$^2$) --- distilling the distribution of the predictions from an ensemble, rather than just the average prediction, into a single model. EnD$^2$ enables a single model to retain both the improved classification performance of ensemble distillation as well as information about the diversity of the ensemble, which is useful for uncertainty estimation. A solution for EnD$^2$ based on Prior Networks, a class of models which allow a single neural network to explicitly model a distribution over output distributions, is proposed in this work. 
The properties of EnD$^2$ are investigated both on an artificial dataset and on the CIFAR-10, CIFAR-100, and TinyImageNet datasets, where it is shown that EnD$^2$ can approach the classification performance of an ensemble, and outperforms both standard DNNs and Ensemble Distillation on the tasks of misclassification and out-of-distribution input detection. \ No newline at end of file diff --git a/data/2020/iclr/Escaping Saddle Points Faster with Stochastic Momentum b/data/2020/iclr/Escaping Saddle Points Faster with Stochastic Momentum new file mode 100644 index 0000000000..4e6c1cb3ed --- /dev/null +++ b/data/2020/iclr/Escaping Saddle Points Faster with Stochastic Momentum @@ -0,0 +1 @@ +Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by improving along the path of the gradient at the current iterate on a batch of examples, where the addition of a ``momentum'' term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely-observed empirical phenomenon is that in training deep networks, stochastic momentum appears to significantly improve convergence time, and variants of it have flourished in the development of other popular update methods, e.g., ADAM and AMSGrad. Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to more quickly find a second-order stationary point.
Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter--our analysis suggests that $\beta \in [0,1)$ should be large (close to 1), which comports with empirical findings. We also provide experimental findings that further validate these conclusions. \ No newline at end of file diff --git a/data/2020/iclr/Evaluating The Search Phase of Neural Architecture Search b/data/2020/iclr/Evaluating The Search Phase of Neural Architecture Search new file mode 100644 index 0000000000..5903855ed4 --- /dev/null +++ b/data/2020/iclr/Evaluating The Search Phase of Neural Architecture Search @@ -0,0 +1 @@ +Neural Architecture Search (NAS) aims to facilitate the design of deep networks for new tasks. Existing techniques rely on two stages: searching over the architecture space and validating the best architecture. NAS algorithms are currently compared solely based on their results on the downstream task. While intuitive, this fails to explicitly evaluate the effectiveness of their search strategies. In this paper, we propose to evaluate the NAS search phase. To this end, we compare the quality of the solutions obtained by NAS search policies with that of random architecture selection. We find that: (i) On average, the state-of-the-art NAS algorithms perform similarly to the random policy; (ii) the widely-used weight sharing strategy degrades the ranking of the NAS candidates to the point of not reflecting their true performance, thus reducing the effectiveness of the search process. We believe that our evaluation framework will be key to designing NAS strategies that consistently discover architectures superior to random ones. 
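The heavy-ball update analyzed in "Escaping Saddle Points Faster with Stochastic Momentum" above is simple to state; here is a toy sketch on a convex quadratic (illustrative only; the paper's results concern saddle-point escape in nonconvex stochastic settings, with beta close to 1):

```python
import numpy as np

# SGD with (heavy-ball) momentum on a toy quadratic objective.
def sgd_momentum(grad, x0, lr=0.01, beta=0.9, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x) + 0.01 * rng.standard_normal(x.shape)  # stochastic gradient
        v = beta * v + g   # momentum buffer: previous change biases the update
        x = x - lr * v
    return x

quad_grad = lambda x: 2 * x  # gradient of f(x) = ||x||^2
x = sgd_momentum(quad_grad, [5.0, -3.0])
print(np.linalg.norm(x))  # iterate ends near the minimum at the origin
```

The buffer `v` accumulates past gradients geometrically, which is the "bias in the direction of the previous change in parameters" described in the abstract.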
\ No newline at end of file diff --git a/data/2020/iclr/Exploration in Reinforcement Learning with Deep Covering Options b/data/2020/iclr/Exploration in Reinforcement Learning with Deep Covering Options new file mode 100644 index 0000000000..0f030c75ea --- /dev/null +++ b/data/2020/iclr/Exploration in Reinforcement Learning with Deep Covering Options @@ -0,0 +1 @@ +While many option discovery methods have been proposed to accelerate exploration in reinforcement learning, they are often heuristic. Recently, covering options was proposed to discover a set of options that provably reduce the upper bound of the environment's cover time, a measure of the difficulty of exploration. Covering options are computed using the eigenvectors of the graph Laplacian, but they are constrained to tabular tasks and are not applicable to tasks with large or continuous state-spaces. We introduce deep covering options, an online method that extends covering options to large state spaces, automatically discovering task-agnostic options that encourage exploration. We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward. \ No newline at end of file diff --git a/data/2020/iclr/Exploring Model-based Planning with Policy Networks b/data/2020/iclr/Exploring Model-based Planning with Policy Networks new file mode 100644 index 0000000000..31026dc402 --- /dev/null +++ b/data/2020/iclr/Exploring Model-based Planning with Policy Networks @@ -0,0 +1 @@ +Model-based reinforcement learning (MBRL) with model-predictive control or online planning has shown great potential for locomotion control tasks in terms of both sample efficiency and asymptotic performance. 
Despite their initial successes, the existing planning methods search from candidate sequences randomly generated in the action space, which is inefficient in complex high-dimensional environments. In this paper, we propose a novel MBRL algorithm, model-based policy planning (POPLIN), that combines policy networks with online planning. More specifically, we formulate action planning at each time-step as an optimization problem using neural networks. We experiment with both optimization w.r.t. the action sequences initialized from the policy network, and also online optimization directly w.r.t. the parameters of the policy network. We show that POPLIN obtains state-of-the-art performance in the MuJoCo benchmarking environments, being about 3x more sample efficient than the state-of-the-art algorithms, such as PETS, TD3 and SAC. To explain the effectiveness of our algorithm, we show that the optimization surface in parameter space is smoother than in action space. Furthermore, we found that the distilled policy network can be effectively applied without the expensive model predictive control at test time for some environments, such as Cheetah. Code is released at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/FSPool: Learning Set Representations with Featurewise Sort Pooling b/data/2020/iclr/FSPool: Learning Set Representations with Featurewise Sort Pooling new file mode 100644 index 0000000000..34228a29ed --- /dev/null +++ b/data/2020/iclr/FSPool: Learning Set Representations with Featurewise Sort Pooling @@ -0,0 +1 @@ +Traditional set prediction models can struggle with simple datasets due to an issue we call the responsibility problem. We introduce a pooling method for sets of feature vectors based on sorting features across elements of the set. This can be used to construct a permutation-equivariant auto-encoder that avoids this responsibility problem.
On a toy dataset of polygons and a set version of MNIST, we show that such an auto-encoder produces considerably better reconstructions and representations. Replacing the pooling function in existing set encoders with FSPool improves accuracy and convergence speed on a variety of datasets. \ No newline at end of file diff --git a/data/2020/iclr/Fast is better than free: Revisiting adversarial training b/data/2020/iclr/Fast is better than free: Revisiting adversarial training new file mode 100644 index 0000000000..fef813e939 --- /dev/null +++ b/data/2020/iclr/Fast is better than free: Revisiting adversarial training @@ -0,0 +1 @@ +Adversarial training, a method for learning robust deep networks, is typically assumed to be more expensive than traditional training due to the necessity of constructing adversarial examples via a first-order method like projected gradient descent (PGD). In this paper, we make the surprising discovery that it is possible to train empirically robust models using a much weaker and cheaper adversary, an approach that was previously believed to be ineffective, rendering the method no more costly than standard training in practice. Specifically, we show that adversarial training with the fast gradient sign method (FGSM), when combined with random initialization, is as effective as PGD-based training but has significantly lower cost. Furthermore, we show that FGSM adversarial training can be further accelerated by using standard techniques for efficient training of deep networks, allowing us to learn a robust CIFAR10 classifier with 45% robust accuracy to PGD attacks with $\epsilon=8/255$ in 6 minutes, and a robust ImageNet classifier with 43% robust accuracy at $\epsilon=2/255$ in 12 hours, in comparison to past work based on "free" adversarial training which took 10 and 50 hours to reach the same respective thresholds.
Finally, we identify a failure mode referred to as "catastrophic overfitting" which may have caused previous attempts to use FGSM adversarial training to fail. All code for reproducing the experiments in this paper as well as pretrained model weights are at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/FasterSeg: Searching for Faster Real-time Semantic Segmentation b/data/2020/iclr/FasterSeg: Searching for Faster Real-time Semantic Segmentation new file mode 100644 index 0000000000..c5f1756a92 --- /dev/null +++ b/data/2020/iclr/FasterSeg: Searching for Faster Real-time Semantic Segmentation @@ -0,0 +1 @@ +We present FasterSeg, an automatically designed semantic segmentation network with not only state-of-the-art performance but also faster speed than current methods. Utilizing neural architecture search (NAS), FasterSeg is discovered from a novel and broader search space integrating multi-resolution branches, which has recently been found to be vital in manually designed segmentation models. To better calibrate the balance between the goals of high accuracy and low latency, we propose a decoupled and fine-grained latency regularization, which effectively overcomes our observed phenomenon that the searched networks are prone to "collapsing" to low-latency yet poor-accuracy models. Moreover, we seamlessly extend FasterSeg to a new collaborative search (co-searching) framework, simultaneously searching for a teacher and a student network in the same single run. The teacher-student distillation further boosts the student model’s accuracy. Experiments on popular segmentation benchmarks demonstrate the competency of FasterSeg. For example, FasterSeg can run over 30% faster than the closest manually designed competitor on Cityscapes, while maintaining comparable accuracy.
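The random-initialized FGSM perturbation step described in the "Fast is better than free" abstract above can be sketched in NumPy. This is an illustrative sketch, not the authors' code: the toy gradient function and the step size `alpha` are assumptions for demonstration.

```python
import numpy as np

def fgsm_random_init_step(x, grad_fn, epsilon, alpha):
    """One FGSM perturbation with random initialization (a sketch of the
    scheme described in "Fast is better than free"): start from uniform
    noise in [-epsilon, epsilon], take a single signed-gradient step of
    size alpha, then project back onto the epsilon-ball."""
    delta = np.random.uniform(-epsilon, epsilon, size=x.shape)
    delta = delta + alpha * np.sign(grad_fn(x + delta))
    return np.clip(delta, -epsilon, epsilon)

# toy quadratic-loss gradient, purely for illustration
grad = lambda x: 2.0 * x
x = np.zeros(4)
d = fgsm_random_init_step(x, grad, epsilon=8 / 255, alpha=10 / 255)
# the returned perturbation always stays inside the epsilon-ball
```

In a real training loop this perturbation would be recomputed per minibatch and added to the inputs before the usual gradient update.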
\ No newline at end of file diff --git a/data/2020/iclr/Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection b/data/2020/iclr/Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection new file mode 100644 index 0000000000..5bdf24130a --- /dev/null +++ b/data/2020/iclr/Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection @@ -0,0 +1 @@ +Recommendation is a prevalent application of machine learning that affects many users; therefore, it is important for recommender models to be accurate and interpretable. In this work, we propose a method to both interpret and augment the predictions of black-box recommender systems. In particular, we propose to interpret feature interactions from a source recommender model and explicitly encode these interactions in a target recommender model, where both source and target models are black-boxes. By not assuming the structure of the recommender system, our approach can be used in general settings. In our experiments, we focus on a prominent use of machine learning recommendation: ad-click prediction. We found that our interaction interpretations are both informative and predictive, e.g., significantly outperforming existing recommender models. What's more, the same approach to interpret interactions can provide new insights into domains even beyond recommendation, such as text and image classification. 
\ No newline at end of file diff --git a/data/2020/iclr/Federated Adversarial Domain Adaptation b/data/2020/iclr/Federated Adversarial Domain Adaptation new file mode 100644 index 0000000000..90185826f5 --- /dev/null +++ b/data/2020/iclr/Federated Adversarial Domain Adaptation @@ -0,0 +1 @@ +Federated learning improves data privacy and efficiency in machine learning performed over networks of distributed devices, such as mobile phones, IoT and wearable devices. Yet models trained with federated learning can still fail to generalize to new devices due to the problem of domain shift. Domain shift occurs when the labeled data collected by source nodes statistically differs from the target node's unlabeled data. In this work, we present a principled approach to the problem of federated domain adaptation, which aims to align the representations learned among the different nodes with the data distribution of the target node. Our approach extends adversarial adaptation techniques to the constraints of the federated setting. In addition, we devise a dynamic attention mechanism and leverage feature disentanglement to enhance knowledge transfer. Empirically, we perform extensive experiments on several image and text classification tasks and show promising results under the unsupervised federated domain adaptation setting. \ No newline at end of file diff --git a/data/2020/iclr/Few-Shot Learning on graphs via super-Classes based on Graph spectral Measures b/data/2020/iclr/Few-Shot Learning on graphs via super-Classes based on Graph spectral Measures new file mode 100644 index 0000000000..24f5604b55 --- /dev/null +++ b/data/2020/iclr/Few-Shot Learning on graphs via super-Classes based on Graph spectral Measures @@ -0,0 +1 @@ +We propose to study the problem of few-shot graph classification in graph neural networks (GNNs) to recognize unseen classes, given limited labeled graph examples.
Despite several interesting GNN variants being proposed recently for node and graph classification tasks, when faced with scarce labeled examples in the few-shot setting, these GNNs exhibit significant loss in classification performance. Here, we present an approach where a probability measure is assigned to each graph based on the spectrum of the graph’s normalized Laplacian. This enables us to accordingly cluster the graph base-labels associated with each graph into super-classes, where the L^p Wasserstein distance serves as our underlying distance metric. Subsequently, a super-graph constructed based on the super-classes is then fed to our proposed GNN framework which exploits the latent inter-class relationships made explicit by the super-graph to achieve better class label separation among the graphs. We conduct exhaustive empirical evaluations of our proposed method and show that it outperforms both the adaptation of state-of-the-art graph classification methods to the few-shot scenario and our naive baseline GNNs. Additionally, we also extend our method to semi-supervised and active learning scenarios and study its behavior. \ No newline at end of file diff --git a/data/2020/iclr/Few-shot Text Classification with Distributional Signatures b/data/2020/iclr/Few-shot Text Classification with Distributional Signatures new file mode 100644 index 0000000000..453b6bc12f --- /dev/null +++ b/data/2020/iclr/Few-shot Text Classification with Distributional Signatures @@ -0,0 +1 @@ +In this paper, we explore meta-learning for few-shot text classification. Meta-learning has shown strong performance in computer vision, where low-level patterns are transferable across learning tasks. However, directly applying this approach to text is challenging--lexical features highly informative for one task may be insignificant for another.
Thus, rather than learning solely from words, our model also leverages their distributional signatures, which encode pertinent word occurrence patterns. Our model is trained within a meta-learning framework to map these signatures into attention scores, which are then used to weight the lexical representations of words. We demonstrate that our model consistently outperforms prototypical networks learned on lexical knowledge (Snell et al., 2017) in both few-shot text classification and relation classification by a significant margin across six benchmark datasets (20.0% on average in 1-shot classification). \ No newline at end of file diff --git a/data/2020/iclr/Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents b/data/2020/iclr/Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents new file mode 100644 index 0000000000..671e68ea5d --- /dev/null +++ b/data/2020/iclr/Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents @@ -0,0 +1 @@ +As deep reinforcement learning driven by visual perception becomes more widely used, there is a growing need to better understand and probe the learned agents. Understanding the decision making process and its relationship to visual inputs can be very valuable to identify problems in learned behavior. However, this topic has been relatively under-explored in the research community. In this work, we present a method for synthesizing visual inputs of interest for a trained agent. Such inputs or states could be situations in which specific actions are necessary. Further, critical states in which a very high or a very low reward can be achieved are often interesting to understand the situational awareness of the system as they can correspond to risky states. To this end, we learn a generative model over the state space of the environment and use its latent space to optimize a target function for the state of interest.
In our experiments, we show that this method can generate insights for a variety of environments and reinforcement learning methods. We explore results in the standard Atari benchmark games as well as in an autonomous driving simulator. Based on the efficiency with which we have been able to identify behavioural weaknesses with this technique, we believe this general approach could serve as an important tool for AI safety applications. \ No newline at end of file diff --git a/data/2020/iclr/Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking b/data/2020/iclr/Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking new file mode 100644 index 0000000000..4803d2a116 --- /dev/null +++ b/data/2020/iclr/Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking @@ -0,0 +1 @@ +Recent work in adversarial machine learning started to focus on the visual perception in autonomous driving and studied Adversarial Examples (AEs) for object detection models. However, in such a visual perception pipeline the detected objects must also be tracked, in a process called Multiple Object Tracking (MOT), to build the moving trajectories of surrounding obstacles. Since MOT is designed to be robust against errors in object detection, it poses a general challenge to existing attack techniques that blindly target object detection: we find that a success rate of over 98% is needed for them to actually affect the tracking results, a requirement that no existing attack technique can satisfy. In this paper, we are the first to study adversarial machine learning attacks against the complete visual perception pipeline in autonomous driving, and discover a novel attack technique, tracker hijacking, that can effectively fool MOT using AEs on object detection.
Using our technique, successful AEs on as few as one single frame can move an existing object into or out of the headway of an autonomous vehicle to cause potential safety hazards. We perform evaluation using the Berkeley Deep Drive dataset and find that on average when 3 frames are attacked, our attack can have a nearly 100% success rate while attacks that blindly target object detection only reach up to 25%. \ No newline at end of file diff --git a/data/2020/iclr/Four Things Everyone Should Know to Improve Batch Normalization b/data/2020/iclr/Four Things Everyone Should Know to Improve Batch Normalization new file mode 100644 index 0000000000..d5a90aab4c --- /dev/null +++ b/data/2020/iclr/Four Things Everyone Should Know to Improve Batch Normalization @@ -0,0 +1 @@ +A key component of most neural network architectures is the use of normalization layers, such as Batch Normalization. Despite its common use and large utility in optimizing deep architectures, it has been challenging both to generically improve upon Batch Normalization and to understand the circumstances that lend themselves to other enhancements. In this paper, we identify four improvements to the generic form of Batch Normalization and the circumstances under which they work, yielding performance gains across all batch sizes while requiring no additional computation during training. These contributions include proposing a method for reasoning about the current example in inference normalization statistics, fixing a training vs. inference discrepancy; recognizing and validating the powerful regularization effect of Ghost Batch Normalization for small and medium batch sizes; examining the effect of weight decay regularization on the scaling and shifting parameters gamma and beta; and identifying a new normalization algorithm for very small batch sizes by combining the strengths of Batch and Group Normalization.
We validate our results empirically on six datasets: CIFAR-100, SVHN, Caltech-256, Oxford Flowers-102, CUB-2011, and ImageNet. \ No newline at end of file diff --git a/data/2020/iclr/From Variational to Deterministic Autoencoders b/data/2020/iclr/From Variational to Deterministic Autoencoders new file mode 100644 index 0000000000..cd70f27809 --- /dev/null +++ b/data/2020/iclr/From Variational to Deterministic Autoencoders @@ -0,0 +1 @@ +Variational Autoencoders (VAEs) provide a theoretically-backed and popular framework for deep generative models. However, learning a VAE from data poses still unanswered theoretical questions and considerable practical challenges. In this work, we propose an alternative framework for generative modeling that is simpler, easier to train, and deterministic, yet has many of the advantages of VAEs. We observe that sampling a stochastic encoder in a Gaussian VAE can be interpreted as simply injecting noise into the input of a deterministic decoder. We investigate how substituting this kind of stochasticity, with other explicit and implicit regularization schemes, can lead to an equally smooth and meaningful latent space without forcing it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism to sample new data, we introduce an ex-post density estimation step that can be readily applied also to existing VAEs, improving their sample quality. We show, in a rigorous empirical study, that the proposed regularized deterministic autoencoders are able to generate samples that are comparable to, or better than, those of VAEs and more powerful alternatives when applied to images as well as to structured data such as molecules. \footnote{An implementation is available at: \url{this https URL}} \ No newline at end of file diff --git a/data/2020/iclr/Functional vs. parametric equivalence of ReLU networks b/data/2020/iclr/Functional vs. 
parametric equivalence of ReLU networks new file mode 100644 index 0000000000..3acf078a4a --- /dev/null +++ b/data/2020/iclr/Functional vs. parametric equivalence of ReLU networks @@ -0,0 +1 @@ +We address the following question: How redundant is the parameterisation of ReLU networks? Specifically, we consider transformations of the weight space which leave the function implemented by the network intact. Two such transformations are known for feed-forward architectures: permutation of neurons within a layer, and positive scaling of all incoming weights of a neuron coupled with inverse scaling of its outgoing weights. In this work, we show for architectures with non-increasing widths that permutation and scaling are in fact the only function-preserving weight transformations. For any eligible architecture we give an explicit construction of a neural network such that any other network that implements the same function can be obtained from the original one by the application of permutations and rescaling. The proof relies on a geometric understanding of boundaries between linear regions of ReLU networks, and we hope the developed mathematical tools are of independent interest. \ No newline at end of file diff --git a/data/2020/iclr/GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification b/data/2020/iclr/GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification new file mode 100644 index 0000000000..76f4811e40 --- /dev/null +++ b/data/2020/iclr/GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification @@ -0,0 +1 @@ +The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. 
Devising a definitive defense against such attacks has proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper, we consider the adversarial detection problem under the robust optimization framework. We partition the input space into subspaces and train adversarial robust subspace detectors using asymmetrical adversarial training (AAT). The integration of the classifier and detectors presents a detection mechanism that provides a performance guarantee with respect to the adversary it considers. We demonstrate that AAT promotes the learning of class-conditional distributions, which further gives rise to generative detection/classification approaches that are both robust and more interpretable. We provide comprehensive evaluations of the above methods, and demonstrate their competitive performances and compelling properties on adversarial detection and robust classification problems. \ No newline at end of file diff --git a/data/2020/iclr/GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations b/data/2020/iclr/GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations new file mode 100644 index 0000000000..bd4b3095ef --- /dev/null +++ b/data/2020/iclr/GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations @@ -0,0 +1 @@ +Generative latent-variable models are emerging as promising tools in robotics and reinforcement learning. Yet, even though tasks in these domains typically involve distinct objects, most state-of-the-art generative models do not explicitly capture the compositional nature of visual scenes. Two recent exceptions, MONet and IODINE, decompose scenes into objects in an unsupervised fashion. Their underlying generative processes, however, do not account for component interactions. Hence, neither of them allows for principled sampling of novel scenes.
Here we present GENESIS, the first object-centric generative model of 3D visual scenes capable of both decomposing and generating scenes by capturing relationships between scene components. GENESIS parameterises a spatial GMM over images which is decoded from a set of object-centric latent variables that are either inferred sequentially in an amortised fashion or sampled from an autoregressive prior. We train GENESIS on several publicly available datasets and evaluate its performance on scene generation, decomposition, and semi-supervised learning. \ No newline at end of file diff --git a/data/2020/iclr/GLAD: Learning Sparse Graph Recovery b/data/2020/iclr/GLAD: Learning Sparse Graph Recovery new file mode 100644 index 0000000000..bf043cf95c --- /dev/null +++ b/data/2020/iclr/GLAD: Learning Sparse Graph Recovery @@ -0,0 +1 @@ +Recovering sparse conditional independence graphs from data is a fundamental problem in machine learning with wide applications. A popular formulation of the problem is an $\ell_1$ regularized maximum likelihood estimation. Many convex optimization algorithms have been designed to solve this formulation to recover the graph structure. Recently, there is a surge of interest to learn algorithms directly based on data, and in this case, learn to map empirical covariance to the sparse precision matrix. However, it is a challenging task in this case, since the symmetric positive definiteness (SPD) and sparsity of the matrix are not easy to enforce in learned algorithms, and a direct mapping from data to precision matrix may contain many parameters. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization (AM) algorithm as our model inductive bias, and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data. 
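The sparsity-enforcing step inside an unrolled alternating-minimization solver like GLAD, described in the abstract above, can be illustrated with elementwise soft-thresholding. A minimal sketch, not the GLAD architecture itself: in GLAD the threshold would be a learned parameter, while here `lam` is a fixed illustrative constant.

```python
import numpy as np

def soft_threshold(theta, lam):
    """Elementwise soft-thresholding: shrink each entry of the
    precision-matrix iterate toward zero by lam, zeroing small entries.
    This is the classic sparsity-inducing step that unrolled
    alternating-minimization solvers apply at each layer."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

theta = np.array([[2.0, 0.3],
                  [0.3, 2.0]])
sparse = soft_threshold(theta, lam=0.5)
# the weak off-diagonal entries (0.3) are zeroed; the strong
# diagonal entries (2.0) shrink to 1.5 but survive
```

Stacking such steps, with the thresholds learned by supervision, is the inductive bias the abstract refers to.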
\ No newline at end of file diff --git a/data/2020/iclr/Gap-Aware Mitigation of Gradient Staleness b/data/2020/iclr/Gap-Aware Mitigation of Gradient Staleness new file mode 100644 index 0000000000..9c38a3a9fa --- /dev/null +++ b/data/2020/iclr/Gap-Aware Mitigation of Gradient Staleness @@ -0,0 +1 @@ +Cloud computing is becoming increasingly popular as a platform for distributed training of deep neural networks. Synchronous stochastic gradient descent (SSGD) suffers from substantial slowdowns due to stragglers if the environment is non-dedicated, as is common in cloud computing. Asynchronous SGD (ASGD) methods are immune to these slowdowns but are scarcely used due to gradient staleness, which encumbers the convergence process. Recent techniques have had limited success mitigating the gradient staleness when scaling up to many workers (computing nodes). In this paper, we define the Gap as a measure of gradient staleness and propose Gap-Aware (GA), a novel asynchronous-distributed method that penalizes stale gradients linearly in the Gap and performs well even when scaling to large numbers of workers. Our evaluation on the CIFAR, ImageNet, and WikiText-103 datasets shows that GA outperforms the currently accepted gradient penalization method in final test accuracy. We also provide a convergence rate proof for GA. Despite prior beliefs, we show that if GA is applied, momentum becomes beneficial in asynchronous environments, even when the number of workers scales up. \ No newline at end of file diff --git a/data/2020/iclr/Generalization bounds for deep convolutional neural networks b/data/2020/iclr/Generalization bounds for deep convolutional neural networks new file mode 100644 index 0000000000..bbb8c8c24f --- /dev/null +++ b/data/2020/iclr/Generalization bounds for deep convolutional neural networks @@ -0,0 +1 @@ +We prove bounds on the generalization error of convolutional networks.
The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss and the distance from the weights to the initial weights. They are independent of the number of pixels in the input, and the height and width of hidden feature maps. We present experiments using CIFAR-10 with varying hyperparameters of a deep convolutional network, comparing our bounds with practical generalization gaps. \ No newline at end of file diff --git a/data/2020/iclr/Generative Ratio Matching Networks b/data/2020/iclr/Generative Ratio Matching Networks new file mode 100644 index 0000000000..a7f9797ebb --- /dev/null +++ b/data/2020/iclr/Generative Ratio Matching Networks @@ -0,0 +1 @@ +Deep generative models can learn to generate realistic-looking images, but many of the most effective methods are adversarial and involve a saddlepoint optimization, which requires careful balancing of training between a generator network and a critic network. Maximum mean discrepancy networks (MMD-nets) avoid this issue by using a kernel as a fixed adversary, but unfortunately they have not on their own been able to match the generative quality of adversarial training. In this work, we take their insight of using kernels as fixed adversaries further and present a novel method for training deep generative models that does not involve saddlepoint optimization. We call our method generative ratio matching or GRAM for short. In GRAM, the generator and the critic networks do not play a zero-sum game against each other; instead, they do so against a fixed kernel. Thus GRAM networks are not only stable to train like MMD-nets but they also match and beat the generative quality of adversarially trained generative networks.
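The fixed-kernel adversary that MMD-nets use, and that GRAM (abstract above) builds on in ratio form, is the maximum mean discrepancy. A minimal NumPy sketch of the biased squared-MMD estimate with an RBF kernel; the bandwidth `sigma` and toy data are illustrative assumptions, and GRAM itself matches density ratios rather than this plain MMD.

```python
import numpy as np

def mmd2_rbf(x, y, sigma=1.0):
    """Biased estimate of squared maximum mean discrepancy between
    samples x and y under an RBF kernel:
    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
same = mmd2_rbf(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))
shifted = mmd2_rbf(rng.normal(size=(100, 2)),
                   rng.normal(3.0, 1.0, size=(100, 2)))
# samples from the same distribution give a much smaller MMD
# than samples from a shifted distribution
```

Because the kernel is fixed, no saddlepoint game is needed to compute this discrepancy, which is the stability property the abstract highlights.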
\ No newline at end of file diff --git a/data/2020/iclr/Geometric Insights into the Convergence of Nonlinear TD Learning b/data/2020/iclr/Geometric Insights into the Convergence of Nonlinear TD Learning new file mode 100644 index 0000000000..5f520acfa5 --- /dev/null +++ b/data/2020/iclr/Geometric Insights into the Convergence of Nonlinear TD Learning @@ -0,0 +1 @@ +While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. As the step-size converges to zero, these dynamics are defined by a nonlinear ODE which depends on the geometry of the space of function approximators, the structure of the underlying Markov chain, and their interaction. We find a set of function approximators that includes ReLU networks and has geometry amenable to TD learning regardless of environment, so that the solution performs about as well as linear TD in the worst case. Then, we show how environments that are more reversible induce dynamics that are better for TD learning and prove global convergence to the true value function for well-conditioned function approximators. Finally, we generalize a divergent counterexample to a family of divergent problems to demonstrate how the interaction between approximator and environment can go wrong and to motivate the assumptions needed to prove convergence. 
\ No newline at end of file diff --git a/data/2020/iclr/Global Relational Models of Source Code b/data/2020/iclr/Global Relational Models of Source Code new file mode 100644 index 0000000000..2c01dd4d76 --- /dev/null +++ b/data/2020/iclr/Global Relational Models of Source Code @@ -0,0 +1 @@ +Models of code can learn distributed representations of a program's syntax and semantics to predict many non-trivial properties of a program. Recent state-of-the-art models leverage highly structured representations of programs, such as trees, graphs and paths therein (e.g. data-flow relations), which are precise and abundantly available for code. This provides a strong inductive bias towards semantically meaningful relations, yielding more generalizable representations than classical sequence-based models. Unfortunately, these models primarily rely on graph-based message passing to represent relations in code, which makes them de facto local due to the high cost of message-passing steps, quite in contrast to modern, global sequence-based models, such as the Transformer. In this work, we bridge this divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers (GREAT for short), which bias traditional Transformers with relational information from graph edge types. By studying a popular, non-trivial program repair task, variable-misuse identification, we explore the relative merits of traditional and hybrid model families for code representation. Starting with a graph-based model that already improves upon the prior state-of-the-art for this task by 20%, we show that our proposed hybrid models improve an additional 10-15%, while training both faster and using fewer parameters. 
\ No newline at end of file diff --git a/data/2020/iclr/Graph inference learning for semi-supervised classification b/data/2020/iclr/Graph inference learning for semi-supervised classification new file mode 100644 index 0000000000..7197604584 --- /dev/null +++ b/data/2020/iclr/Graph inference learning for semi-supervised classification @@ -0,0 +1 @@ +In this work, we address the semi-supervised classification of graph data, where the categories of those unlabeled nodes are inferred from labeled nodes as well as graph structures. Recent works often solve this problem with the advanced graph convolution in a conventional supervised manner, but the performance could be heavily affected when labeled data is scarce. Here we propose a Graph Inference Learning (GIL) framework to boost the performance of node classification by learning the inference of node labels on graph topology. To bridge the connection of two nodes, we formally define a structure relation by encapsulating node attributes, between-node paths and local topological structures together, which can make inference conveniently deduced from one node to another node. For learning the inference process, we further introduce meta-optimization on structure relations from training nodes to validation nodes, such that the learnt graph inference capability can be better self-adapted into test nodes. Comprehensive evaluations on four benchmark datasets (including Cora, Citeseer, Pubmed and NELL) demonstrate the superiority of our GIL when compared with other state-of-the-art methods in the semi-supervised node classification task. 
\ No newline at end of file diff --git a/data/2020/iclr/Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation b/data/2020/iclr/Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation new file mode 100644 index 0000000000..d53106e592 --- /dev/null +++ b/data/2020/iclr/Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation @@ -0,0 +1 @@ +Video prediction models combined with planning algorithms have shown promise in enabling robots to learn to perform many vision-based tasks through only self-supervision, reaching novel goals in cluttered scenes with unseen objects. However, due to the compounding uncertainty in long horizon video prediction and poor scalability of sampling-based planning optimizers, one significant limitation of these approaches is their limited ability to plan over long horizons to reach distant goals. To that end, we propose a framework for subgoal generation and planning, hierarchical visual foresight (HVF), which generates subgoal images conditioned on a goal image, and uses them for planning. The subgoal images are directly optimized to decompose the task into easy-to-plan segments, and as a result, we observe that the method naturally identifies semantically meaningful states as subgoals. Across three out of four simulated vision-based manipulation tasks, we find that our method achieves nearly a 200% performance improvement over planning without subgoals and model-free RL approaches. Further, our experiments illustrate that our approach extends to real, cluttered visual scenes.
Project page: this https URL \ No newline at end of file diff --git a/data/2020/iclr/I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively b/data/2020/iclr/I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively new file mode 100644 index 0000000000..9c74e09e17 --- /dev/null +++ b/data/2020/iclr/I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively @@ -0,0 +1 @@ +The learning of hierarchical representations for image classification has experienced an impressive series of successes due in part to the availability of large-scale labeled data for training. On the other hand, the trained classifiers have traditionally been evaluated on small and fixed sets of test images, which are deemed to be extremely sparsely distributed in the space of all natural images. It is thus questionable whether recent performance improvements on the excessively re-used test sets generalize to real-world natural images with much richer content variations. Inspired by efficient stimulus selection for testing perceptual models in psychophysical and physiological studies, we present an alternative framework for comparing image classifiers, which we name the MAximum Discrepancy (MAD) competition. Rather than comparing image classifiers using fixed test images, we adaptively sample a small test set from an arbitrarily large corpus of unlabeled images so as to maximize the discrepancies between the classifiers, measured by the distance over WordNet hierarchy. Human labeling on the resulting model-dependent image sets reveals the relative performance of the competing classifiers, and provides useful insights on potential ways to improve them. We report the MAD competition results of eleven ImageNet classifiers while noting that the framework is readily extensible and cost-effective to add future classifiers into the competition. Codes can be found at this https URL. 
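The adaptive test-set selection at the core of the MAD competition can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the WordNet-hierarchy distance is replaced by a hypothetical 0/1 disagreement proxy, and the classifiers are toy stand-ins.

```python
import numpy as np

def mad_select(pool, clf_a, clf_b, dist, k):
    """Pick the k samples on which two classifiers disagree most.

    dist(y_a, y_b) measures the discrepancy between predicted labels;
    the paper uses a distance over the WordNet hierarchy, but any label
    metric slots in here.
    """
    scores = np.array([dist(clf_a(x), clf_b(x)) for x in pool])
    return np.argsort(scores)[::-1][:k]

# Toy demo: two linear classifiers and a 0/1 disagreement proxy.
rng = np.random.default_rng(0)
pool = rng.normal(size=(100, 8))
clf_a = lambda x: int(x[0] > 0)
clf_b = lambda x: int(x[0] + 0.5 * x[1] > 0)
idx = mad_select(pool, clf_a, clf_b, dist=lambda a, b: float(a != b), k=5)
print(idx.shape)  # (5,)
```

The selected samples are then the ones a human labeler would annotate to rank the competing classifiers.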
\ No newline at end of file diff --git a/data/2020/iclr/Identifying through Flows for Recovering Latent Representations b/data/2020/iclr/Identifying through Flows for Recovering Latent Representations new file mode 100644 index 0000000000..fc10f5ccf8 --- /dev/null +++ b/data/2020/iclr/Identifying through Flows for Recovering Latent Representations @@ -0,0 +1 @@ +Identifiability, or recovery of the true latent representations from which the observed data originates, is de facto a fundamental goal of representation learning. Yet, most deep generative models do not address the question of identifiability, and thus fail to deliver on the promise of the recovery of the true latent sources that generate the observations. Recent work proposed identifiable generative modelling using variational autoencoders (iVAE) with a theory of identifiability. Due to the intractablity of KL divergence between variational approximate posterior and the true posterior, however, iVAE has to maximize the evidence lower bound (ELBO) of the marginal likelihood, leading to suboptimal solutions in both theory and practice. In contrast, we propose an identifiable framework for estimating latent representations using a flow-based model (iFlow). Our approach directly maximizes the marginal likelihood, allowing for theoretical guarantees on identifiability, thereby dispensing with variational approximations. We derive its optimization objective in analytical form, making it possible to train iFlow in an end-to-end manner. Simulations on synthetic data validate the correctness and effectiveness of our proposed method and demonstrate its practical advantages over other existing methods. 
\ No newline at end of file diff --git a/data/2020/iclr/Identity Crisis: Memorization and Generalization Under Extreme Overparameterization b/data/2020/iclr/Identity Crisis: Memorization and Generalization Under Extreme Overparameterization new file mode 100644 index 0000000000..cf8c328fdd --- /dev/null +++ b/data/2020/iclr/Identity Crisis: Memorization and Generalization Under Extreme Overparameterization @@ -0,0 +1 @@ +We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task. We examine fully-connected and convolutional networks (FCN and CNN), both linear and nonlinear, initialized randomly and then trained to minimize the reconstruction error. The trained networks stereotypically take one of two forms: the constant function (memorization) and the identity function (generalization). We formally characterize generalization in single-layer FCNs and CNNs. We show empirically that different architectures exhibit strikingly different inductive biases. For example, CNNs of up to 10 layers are able to generalize from a single example, whereas FCNs cannot learn the identity function reliably from 60k examples. Deeper CNNs often fail, but nonetheless do astonishing work to memorize the training output: because CNN biases are location invariant, the model must progressively grow an output pattern from the image boundaries via the coordination of many layers. Our work helps to quantify and visualize the sensitivity of inductive biases to architectural choices such as depth, kernel width, and number of channels. 
\ No newline at end of file diff --git a/data/2020/iclr/Image-guided Neural Object Rendering b/data/2020/iclr/Image-guided Neural Object Rendering new file mode 100644 index 0000000000..d02e915185 --- /dev/null +++ b/data/2020/iclr/Image-guided Neural Object Rendering @@ -0,0 +1 @@ +We present a novel method for photo-realistic re-rendering of reconstructed objects. The digital reproduction of object appearances is of paramount importance nowadays. Augmented and virtual reality relies on such 3D content. It enables virtual showrooms, virtual tours & sightseeing, the digital inspection of historical artifacts and many other applications. Classical approaches use methods to reconstruct the geometry of an object and textures to capture the appearance properties. Instead, we propose a learned image-guided rendering technique that combines the benefits of image-based rendering and GAN-based image synthesis. A core component of our work is the handling of view-dependent effects. Specifically, we directly train an object-specific deep neural network to synthesize the view-dependent appearance of an object. As input data we are using an RGB video of the object. This video is used to reconstruct a proxy geometry of the object via multi-view stereo. Based on this 3D proxy, the appearance of a captured view can be warped into a new target view. This warping assumes diffuse surfaces, in case of view-dependent effects, such as specular highlights, it leads to artifacts. To this end, we propose EffectsNet, a deep neural network that predicts view-dependent effects. Based on these estimations, we are able to convert observed images to diffuse images. These diffuse images can be projected into other views. In the target view, our pipeline reinserts the new view-dependent effects. To composite multiple reprojected images to a final output, we learn a composition network that outputs photo-realistic results. 
Using this image-guided approach, the network does not have to allocate capacity on ``remembering'' object appearance; instead, it learns how to combine the appearance of captured images. We demonstrate the effectiveness of our approach both qualitatively and quantitatively on synthetic as well as on real data. \ No newline at end of file diff --git a/data/2020/iclr/Imitation Learning via Off-Policy Distribution Matching b/data/2020/iclr/Imitation Learning via Off-Policy Distribution Matching new file mode 100644 index 0000000000..29a1b0932a --- /dev/null +++ b/data/2020/iclr/Imitation Learning via Off-Policy Distribution Matching @@ -0,0 +1 @@ +When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, which has caused previous work to either be exorbitantly data-inefficient or alter the original objective in a manner that can drastically change its optimum. In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective. In addition to the data-efficiency that this provides, we are able to show that this objective also renders the use of a separate RL optimization unnecessary. Rather, an imitation policy may be learned directly from this objective without the use of explicit rewards. We call the resulting algorithm ValueDICE and evaluate it on a suite of popular imitation learning benchmarks, finding that it can achieve state-of-the-art sample efficiency and performance. 
\ No newline at end of file diff --git a/data/2020/iclr/Implicit Bias of Gradient Descent based Adversarial Training on Separable Data b/data/2020/iclr/Implicit Bias of Gradient Descent based Adversarial Training on Separable Data new file mode 100644 index 0000000000..eef3755eb8 --- /dev/null +++ b/data/2020/iclr/Implicit Bias of Gradient Descent based Adversarial Training on Separable Data @@ -0,0 +1 @@ +Adversarial training is a principled approach for training robust neural networks. Despite tremendous successes in practice, its theoretical properties still remain largely unexplored. In this paper, we provide new theoretical insights into gradient descent based adversarial training by studying its computational properties, specifically its implicit bias. We take the binary classification task on linearly separable data as an illustrative example, where the loss asymptotically attains its infimum as the parameter diverges to infinity along certain directions. Specifically, we show that for any fixed iteration $T$, when the adversarial perturbation during training has proper bounded L2 norm, the classifier learned by gradient descent based adversarial training converges in direction to the maximum L2 norm margin classifier at the rate of $O(1/\sqrt{T})$, significantly faster than the rate $O(1/\log T)$ of training with clean data. In addition, when the adversarial perturbation during training has bounded Lq norm, the resulting classifier converges in direction to a maximum mixed-norm margin classifier, which has a natural interpretation of robustness, as being the maximum L2 norm margin classifier under worst-case bounded Lq norm perturbation to the data. Our findings provide theoretical backing for adversarial training, showing that it indeed promotes robustness against adversarial perturbation. 
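The clean-data baseline the paper compares against, gradient descent on logistic loss converging in direction to the max-L2-margin classifier, can be observed numerically. A toy sketch; the dataset and step size are illustrative choices, not the paper's:

```python
import numpy as np

# Separable toy data whose max-L2-margin direction is (1, 1)/sqrt(2),
# with support vectors (1, 1) and (-1, -1).
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
for _ in range(5000):                    # plain GD on the logistic loss
    margins = y * (X @ w)
    grad = -((y / (1.0 + np.exp(margins))) @ X) / len(y)
    w -= 1.0 * grad

direction = w / np.linalg.norm(w)
print(np.round(direction, 3))            # approaches [0.707 0.707]
```

The norm of w keeps growing while the direction settles on the max-margin one, the slow O(1/log T) convergence that adversarial training accelerates to O(1/sqrt(T)).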
\ No newline at end of file diff --git a/data/2020/iclr/Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin b/data/2020/iclr/Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin new file mode 100644 index 0000000000..bffc81572a --- /dev/null +++ b/data/2020/iclr/Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin @@ -0,0 +1 @@ +For linear classifiers, the relationship between (normalized) output margin and generalization is captured in a clear and simple bound -- a large output margin implies good generalization. Unfortunately, for deep models, this relationship is less clear: existing analyses of the output margin give complicated bounds which sometimes depend exponentially on depth. In this work, we propose to instead analyze a new notion of margin, which we call the "all-layer margin." Our analysis reveals that the all-layer margin has a clear and direct relationship with generalization for deep models. This enables the following concrete applications of the all-layer margin: 1) by analyzing the all-layer margin, we obtain tighter generalization bounds for neural nets which depend on Jacobian and hidden layer norms and remove the exponential dependency on depth 2) our neural net results easily translate to the adversarially robust setting, giving the first direct analysis of robust test error for deep networks, and 3) we present a theoretically inspired training algorithm for increasing the all-layer margin and demonstrate that our algorithm improves test performance over strong baselines in practice. 
\ No newline at end of file diff --git a/data/2020/iclr/Improving Adversarial Robustness Requires Revisiting Misclassified Examples b/data/2020/iclr/Improving Adversarial Robustness Requires Revisiting Misclassified Examples new file mode 100644 index 0000000000..3ad7c4abf3 --- /dev/null +++ b/data/2020/iclr/Improving Adversarial Robustness Requires Revisiting Misclassified Examples @@ -0,0 +1 @@ +Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by imperceptible perturbations. A range of defense techniques have been proposed to improve DNN robustness to adversarial examples, among which adversarial training has been demonstrated to be the most effective. Adversarial training is often formulated as a min-max optimization problem, with the inner maximization for generating adversarial examples. However, there exists a simple, yet easily overlooked fact that adversarial examples are only defined on correctly classified (natural) examples, but inevitably, some (natural) examples will be misclassified during training. In this paper, we investigate the distinctive influence of misclassified and correctly classified examples on the final robustness of adversarial training. Specifically, we find that misclassified examples indeed have a significant impact on the final robustness. More surprisingly, we find that different maximization techniques on misclassified examples may have a negligible influence on the final robustness, while different minimization techniques are crucial. Motivated by the above discovery, we propose a new defense algorithm called {\em Misclassification Aware adveRsarial Training} (MART), which explicitly differentiates the misclassified and correctly classified examples during the training. We also propose a semi-supervised extension of MART, which can leverage the unlabeled data to further improve the robustness. 
Experimental results show that MART and its variant could significantly improve the state-of-the-art adversarial robustness. \ No newline at end of file diff --git a/data/2020/iclr/In Search for a SAT-friendly Binarized Neural Network Architecture b/data/2020/iclr/In Search for a SAT-friendly Binarized Neural Network Architecture new file mode 100644 index 0000000000..84511c8cdf --- /dev/null +++ b/data/2020/iclr/In Search for a SAT-friendly Binarized Neural Network Architecture @@ -0,0 +1 @@ +Analyzing the behavior of neural networks is one of the most pressing challenges in deep learning. Binarized Neural Networks are an important class of networks that allow equivalent representation in Boolean logic and can be analyzed formally with logic-based reasoning tools like SAT solvers. Such tools can be used to answer existential and probabilistic queries about the network, perform explanation generation, etc. However, the main bottleneck for all methods is their ability to reason about large BNNs efficiently. In this work, we analyze architectural design choices of BNNs and discuss how they affect the performance of logic-based reasoners. We propose changes to the BNN architecture and the training procedure to get a simpler network for SAT solvers without sacrificing accuracy on the primary task. Our experimental results demonstrate that our approach scales to larger deep neural networks compared to existing work for existential and probabilistic queries, leading to significant speed ups on all tested datasets. 
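The property that makes BNNs amenable to SAT-based reasoning is that a binarized neuron is an exact Boolean threshold function. A brute-force sketch (the weights and bias here are arbitrary illustrative values):

```python
import itertools

# A binarized neuron sign(w . x + b) over +/-1 activations is a Boolean
# threshold function, so its full truth table -- and hence an exact
# propositional encoding for a SAT solver -- always exists.
w, b = [1, -1, 1], 0

def neuron(x):
    """Binarized activation: +1 if the weighted sum clears the threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

truth_table = {x: neuron(x) for x in itertools.product([-1, 1], repeat=len(w))}
print(truth_table[(1, -1, 1)], truth_table[(-1, 1, -1)])  # 1 -1
```

A real encoding compiles the threshold into clauses (e.g. via sequential counters) rather than enumerating the table, which is exactly where the paper's architectural choices determine how hard the solver's job becomes.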
\ No newline at end of file diff --git a/data/2020/iclr/Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models b/data/2020/iclr/Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models new file mode 100644 index 0000000000..9cfab8f46a --- /dev/null +++ b/data/2020/iclr/Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models @@ -0,0 +1 @@ +Likelihood-based generative models are a promising resource to detect out-of-distribution (OOD) inputs which could compromise the robustness or reliability of a machine learning system. However, likelihoods derived from such models have been shown to be problematic for detecting certain types of inputs that significantly differ from training data. In this paper, we posit that this problem is due to the excessive influence that input complexity has on generative models' likelihoods. We report a set of experiments supporting this hypothesis, and use an estimate of input complexity to derive an efficient and parameter-free OOD score, which can be seen as a likelihood-ratio, akin to Bayesian model comparison. We find such a score to perform comparably to, or even better than, existing OOD detection approaches under a wide range of data sets, models, model sizes, and complexity estimates. \ No newline at end of file diff --git a/data/2020/iclr/Interpretable Complex-Valued Neural Networks for Privacy Protection b/data/2020/iclr/Interpretable Complex-Valued Neural Networks for Privacy Protection new file mode 100644 index 0000000000..08065f31ac --- /dev/null +++ b/data/2020/iclr/Interpretable Complex-Valued Neural Networks for Privacy Protection @@ -0,0 +1 @@ +Previous studies have found that an adversary can often infer unintended input information from intermediate-layer features. We study the possibility of preventing such adversarial inference, yet without too much accuracy degradation. 
We propose a generic method to revise the neural network to boost the challenge of inferring input attributes from features, while maintaining highly accurate outputs. In particular, the method transforms real-valued features into complex-valued ones, in which the input is hidden in a randomized phase of the transformed features. The knowledge of the phase acts like a key, with which any party can easily recover the output from the processing result, but without which the party can neither recover the output nor distinguish the original input. Preliminary experiments on various datasets and network structures have shown that our method significantly diminishes the adversary's ability in inferring about the input while largely preserves the resulting accuracy. \ No newline at end of file diff --git a/data/2020/iclr/Intrinsic Motivation for Encouraging Synergistic Behavior b/data/2020/iclr/Intrinsic Motivation for Encouraging Synergistic Behavior new file mode 100644 index 0000000000..c8872d0a1c --- /dev/null +++ b/data/2020/iclr/Intrinsic Motivation for Encouraging Synergistic Behavior @@ -0,0 +1 @@ +We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal they could not individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. 
While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior. Videos are available on the project webpage: https://sites.google.com/view/iclr2020-synergistic. \ No newline at end of file diff --git a/data/2020/iclr/Knowledge Consistency between Neural Networks and Beyond b/data/2020/iclr/Knowledge Consistency between Neural Networks and Beyond new file mode 100644 index 0000000000..92589caa69 --- /dev/null +++ b/data/2020/iclr/Knowledge Consistency between Neural Networks and Beyond @@ -0,0 +1 @@ +This paper aims to analyze knowledge consistency between pre-trained deep neural networks. We propose a generic definition for knowledge consistency between neural networks at different fuzziness levels. A task-agnostic method is designed to disentangle feature components, which represent the consistent knowledge, from raw intermediate-layer features of each neural network. As a generic tool, our method can be broadly used for different applications. In preliminary experiments, we have used knowledge consistency as a tool to diagnose knowledge representations of neural networks. Knowledge consistency provides new insights to explain the success of existing deep-learning techniques, such as knowledge distillation and network compression. More crucially, knowledge consistency can also be used to refine pre-trained networks and boost performance. 
\ No newline at end of file diff --git a/data/2020/iclr/LAMOL: LAnguage MOdeling for Lifelong Language Learning b/data/2020/iclr/LAMOL: LAnguage MOdeling for Lifelong Language Learning new file mode 100644 index 0000000000..69b29587e9 --- /dev/null +++ b/data/2020/iclr/LAMOL: LAnguage MOdeling for Lifelong Language Learning @@ -0,0 +1 @@ +Most research on lifelong learning applies to images or games, but not language. We present LAMOL, a simple yet effective method for lifelong language learning (LLL) based on language modeling. LAMOL replays pseudo-samples of previous tasks while requiring no extra memory or model capacity. Specifically, LAMOL is a language model that simultaneously learns to solve the tasks and generate training samples. When the model is trained for a new task, it generates pseudo-samples of previous tasks for training alongside data for the new task. The results show that LAMOL prevents catastrophic forgetting without any sign of intransigence and can perform five very different language tasks sequentially with only one model. Overall, LAMOL outperforms previous methods by a considerable margin and is only 2-3% worse than multitasking, which is usually considered the LLL upper bound. The source code is available at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/Language GANs Falling Short b/data/2020/iclr/Language GANs Falling Short new file mode 100644 index 0000000000..ca2f936236 --- /dev/null +++ b/data/2020/iclr/Language GANs Falling Short @@ -0,0 +1 @@ +Generating high-quality text with sufficient diversity is essential for a wide range of Natural Language Generation (NLG) tasks. 
Maximum-Likelihood (MLE) models trained with teacher forcing have consistently been reported as weak baselines, where poor performance is attributed to exposure bias (Bengio et al., 2015; Ranzato et al., 2015); at inference time, the model is fed its own prediction instead of a ground-truth token, which can lead to accumulating errors and poor samples. This line of reasoning has led to an outbreak of adversarial based approaches for NLG, on the account that GANs do not suffer from exposure bias. In this work, we make several surprising observations which contradict common beliefs. First, we revisit the canonical evaluation framework for NLG, and point out fundamental flaws with quality-only evaluation: we show that one can outperform such metrics using a simple, well-known temperature parameter to artificially reduce the entropy of the model's conditional distributions. Second, we leverage the control over the quality / diversity trade-off given by this parameter to evaluate models over the whole quality-diversity spectrum and find MLE models constantly outperform the proposed GAN variants over the whole quality-diversity space. Our results have several implications: 1) The impact of exposure bias on sample quality is less severe than previously thought, 2) temperature tuning provides a better quality / diversity trade-off than adversarial training while being easier to train, easier to cross-validate, and less computationally expensive. 
Code to reproduce the experiments is available at github.com/pclucas14/GansFallingShort \ No newline at end of file diff --git a/data/2020/iclr/Large Batch Optimization for Deep Learning: Training BERT in 76 minutes b/data/2020/iclr/Large Batch Optimization for Deep Learning: Training BERT in 76 minutes new file mode 100644 index 0000000000..48120d86d2 --- /dev/null +++ b/data/2020/iclr/Large Batch Optimization for Deep Learning: Training BERT in 76 minutes @@ -0,0 +1 @@ +Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge of interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning. In particular, for BERT training, our optimizer enables use of very large batch sizes of 32868 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes (Table 1). 
The LAMB implementation is available at this https URL \ No newline at end of file diff --git a/data/2020/iclr/Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information b/data/2020/iclr/Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information new file mode 100644 index 0000000000..3d8acfe902 --- /dev/null +++ b/data/2020/iclr/Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information @@ -0,0 +1 @@ +Counterfactual regret minimization (CFR) is the most popular algorithm on solving two-player zero-sum extensive games with imperfect information and achieves state-of-the-art performance in practice. However, the performance of CFR is not fully understood, since empirical results on the regret are much better than the upper bound proved in \cite{zinkevich2008regret}. Another issue is that CFR has to traverse the whole game tree in each round, which is time-consuming in large scale games. In this paper, we present a novel technique, lazy update, which can avoid traversing the whole game tree in CFR, as well as a novel analysis on the regret of CFR with lazy update. Our analysis can also be applied to the vanilla CFR, resulting in a much tighter regret bound than that in \cite{zinkevich2008regret}. Inspired by lazy update, we further present a novel CFR variant, named Lazy-CFR. Compared to traversing $O(|\mathcal{I}|)$ information sets in vanilla CFR, Lazy-CFR needs only to traverse $O(\sqrt{|\mathcal{I}|})$ information sets per round while keeping the regret bound almost the same, where $\mathcal{I}$ is the class of all information sets. As a result, Lazy-CFR shows better convergence result compared with vanilla CFR. Experimental results consistently show that Lazy-CFR outperforms the vanilla CFR significantly. 
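At each information set, CFR and variants like Lazy-CFR share the same local update, regret matching: play each action with probability proportional to its positive cumulative counterfactual regret. A minimal sketch of that update:

```python
import numpy as np

def regret_matching(cum_regret):
    """Current strategy from a vector of cumulative counterfactual regrets."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total == 0.0:                      # no positive regret: play uniformly
        return np.full(len(cum_regret), 1.0 / len(cum_regret))
    return pos / total

print(regret_matching(np.array([3.0, -1.0, 1.0])))  # [0.75 0.   0.25]
```

Lazy-CFR's saving is orthogonal to this update: it postpones applying it, touching only $O(\sqrt{|\mathcal{I}|})$ information sets per round instead of all $O(|\mathcal{I}|)$.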
\ No newline at end of file diff --git a/data/2020/iclr/Learned Step Size quantization b/data/2020/iclr/Learned Step Size quantization new file mode 100644 index 0000000000..7f358d6c5d --- /dev/null +++ b/data/2020/iclr/Learned Step Size quantization @@ -0,0 +1 @@ +Deep networks run with low precision operations at inference time offer power and space advantages over high precision alternatives, but need to overcome the challenge of maintaining high accuracy as precision decreases. Here, we present a method for training such networks, Learned Step Size Quantization, that achieves the highest accuracy to date on the ImageNet dataset when using models, from a variety of architectures, with weights and activations quantized to 2-, 3- or 4-bits of precision, and that can train 3-bit models that reach full precision baseline accuracy. Our approach builds upon existing methods for learning weights in quantized networks by improving how the quantizer itself is configured. Specifically, we introduce a novel means to estimate and scale the task loss gradient at each weight and activation layer's quantizer step size, such that it can be learned in conjunction with other network parameters. This approach works using different levels of precision as needed for a given system and requires only a simple modification of existing training code. \ No newline at end of file diff --git a/data/2020/iclr/Learning Disentangled Representations for CounterFactual Regression b/data/2020/iclr/Learning Disentangled Representations for CounterFactual Regression new file mode 100644 index 0000000000..aa65043ad7 --- /dev/null +++ b/data/2020/iclr/Learning Disentangled Representations for CounterFactual Regression @@ -0,0 +1 @@ +We consider the challenge of estimating treatment effects from observational data; and point out that, in general, only some factors based on the observed covariates X contribute to selection of the treatment T, and only some to determining the outcomes Y. 
We model this by considering three underlying sources of {X, T, Y} and show that explicitly modeling these sources offers great insight to guide designing models that better handle selection bias. This paper is an attempt to conceptualize this line of thought and provide a path to explore it further. In this work, we propose an algorithm to (1) identify disentangled representations of the above-mentioned underlying factors from any given observational dataset D and (2) leverage this knowledge to reduce, as well as account for, the negative impact of selection bias on estimating the treatment effects from D. Our empirical results show that the proposed method (i) achieves state-of-the-art performance in both individual and population based evaluation measures and (ii) is highly robust under various data generating scenarios. \ No newline at end of file diff --git a/data/2020/iclr/Learning Efficient Parameter Server Synchronization Policies for Distributed SGD b/data/2020/iclr/Learning Efficient Parameter Server Synchronization Policies for Distributed SGD new file mode 100644 index 0000000000..dd690814c9 --- /dev/null +++ b/data/2020/iclr/Learning Efficient Parameter Server Synchronization Policies for Distributed SGD @@ -0,0 +1 @@ +We apply a reinforcement learning (RL) based approach to learning optimal synchronization policies used for Parameter Server-based distributed training of machine learning models with Stochastic Gradient Descent (SGD). Utilizing a formal synchronization policy description in the PS-setting, we are able to derive a suitable and compact description of states and actions, allowing us to efficiently use the standard off-the-shelf deep Q-learning algorithm. 
As a result, we are able to learn synchronization policies which generalize to different cluster environments, different training datasets and small model variations and (most importantly) lead to considerable decreases in training time when compared to standard policies such as bulk synchronous parallel (BSP), asynchronous parallel (ASP), or stale synchronous parallel (SSP). To support our claims, we present extensive numerical results obtained from experiments performed in simulated cluster environments. In our experiments, training time is reduced by 44% on average and learned policies generalize to multiple unseen circumstances. \ No newline at end of file diff --git a/data/2020/iclr/Learning Execution through Neural Code fusion b/data/2020/iclr/Learning Execution through Neural Code fusion new file mode 100644 index 0000000000..3fca47656e --- /dev/null +++ b/data/2020/iclr/Learning Execution through Neural Code fusion @@ -0,0 +1 @@ +As the performance of computer systems stagnates due to the end of Moore's Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn representations of source code, these representations do not understand how code dynamically executes. In this work, we propose a new approach to use GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and complex data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. 
As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related task (algorithm classification). \ No newline at end of file diff --git a/data/2020/iclr/Learning Expensive Coordination: An Event-Based Deep RL Approach b/data/2020/iclr/Learning Expensive Coordination: An Event-Based Deep RL Approach new file mode 100644 index 0000000000..89b109b964 --- /dev/null +++ b/data/2020/iclr/Learning Expensive Coordination: An Event-Based Deep RL Approach @@ -0,0 +1 @@ +Existing works in deep Multi-Agent Reinforcement Learning (MARL) mainly focus on coordinating cooperative agents to complete certain tasks jointly. However, in many cases of the real world, agents are self-interested such as employees in a company and clubs in a league. Therefore, the leader, i.e., the manager of the company or the league, needs to provide bonuses to followers for efficient coordination, which we call expensive coordination. The main difficulties of expensive coordination are that i) the leader has to consider the long-term effect and predict the followers' behaviors when assigning bonuses and ii) the complex interactions between followers make the training process hard to converge, especially when the leader's policy changes with time. In this work, we address this problem through an event-based deep RL approach. Our main contributions are threefold. (1) We model the leader's decision-making process as a semi-Markov Decision Process and propose a novel multi-agent event-based policy gradient to learn the leader's long-term policy. 
(2) We exploit the leader-follower consistency scheme to design a follower-aware module and a follower-specific attention module to predict the followers' behaviors and respond accurately to them. (3) We propose an action abstraction-based policy gradient algorithm to reduce the followers' decision space and thus accelerate the training process of followers. Experiments in resource collections, navigation, and the predator-prey game reveal that our approach outperforms the state-of-the-art methods dramatically. \ No newline at end of file diff --git a/data/2020/iclr/Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning b/data/2020/iclr/Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning new file mode 100644 index 0000000000..22a681cabd --- /dev/null +++ b/data/2020/iclr/Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning @@ -0,0 +1 @@ +We demonstrate how to learn efficient heuristics for automated reasoning algorithms for quantified Boolean formulas through deep reinforcement learning. We focus on a backtracking search algorithm, which can already solve formulas of impressive size - up to hundreds of thousands of variables. The main challenge is to find a representation of these formulas that lends itself to making predictions in a scalable way. For a family of challenging problems, we learned a heuristic that solves significantly more formulas compared to the existing handwritten heuristics. 
\ No newline at end of file diff --git a/data/2020/iclr/Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling b/data/2020/iclr/Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling new file mode 100644 index 0000000000..f432012659 --- /dev/null +++ b/data/2020/iclr/Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling @@ -0,0 +1 @@ +Imitation learning, followed by reinforcement learning algorithms, is a promising paradigm to solve complex control tasks sample-efficiently. However, learning from demonstrations often suffers from the covariate shift problem, which results in cascading errors of the learned policy. We introduce a notion of conservatively-extrapolated value functions, which provably lead to policies with self-correction. We design an algorithm, Value Iteration with Negative Sampling (VINS), that practically learns such value functions with conservative extrapolation. We show that VINS can correct mistakes of the behavioral cloning policy on simulated robotics benchmark tasks. We also propose using VINS to initialize a reinforcement learning algorithm, which is shown to significantly outperform prior work in sample efficiency. \ No newline at end of file diff --git a/data/2020/iclr/Learning Space Partitions for Nearest Neighbor Search b/data/2020/iclr/Learning Space Partitions for Nearest Neighbor Search new file mode 100644 index 0000000000..d0d24d5983 --- /dev/null +++ b/data/2020/iclr/Learning Space Partitions for Nearest Neighbor Search @@ -0,0 +1 @@ +Space partitions of $\mathbb{R}^d$ underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces (Andoni et al. 
2018b,c), we develop a new framework for building space partitions, reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner (Sanders and Schulz 2013) and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS (Aumuller et al. 2017), our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH. \ No newline at end of file diff --git a/data/2020/iclr/Learning deep graph matching with channel-independent embedding and Hungarian attention b/data/2020/iclr/Learning deep graph matching with channel-independent embedding and Hungarian attention new file mode 100644 index 0000000000..88d5b2299f --- /dev/null +++ b/data/2020/iclr/Learning deep graph matching with channel-independent embedding and Hungarian attention @@ -0,0 +1 @@ +Graph matching aims to establish node-wise correspondence between two graphs, which is a classic combinatorial problem and in general NP-complete. Only very recently have deep graph matching methods started to resort to deep networks to achieve unprecedented matching accuracy. Along this direction, this paper makes two complementary contributions which can also be reused as plugins in existing works: i) a novel node and edge embedding strategy which emulates the multi-head strategy in attention models and allows the information in each channel to be merged independently. In contrast, only node embedding is accounted for in previous works; ii) a general masking mechanism over the loss function is devised to improve the smoothness of objective learning for graph matching. 
Using the Hungarian algorithm, it dynamically constructs a structured and sparsely connected layer, taking into account the most contributing matching pairs as hard attention. Our approach performs competitively, and can also improve state-of-the-art methods as a plugin, in terms of matching accuracy on three public benchmarks. \ No newline at end of file diff --git a/data/2020/iclr/Learning the Arrow of Time for Problems in Reinforcement Learning b/data/2020/iclr/Learning the Arrow of Time for Problems in Reinforcement Learning new file mode 100644 index 0000000000..04531f7be3 --- /dev/null +++ b/data/2020/iclr/Learning the Arrow of Time for Problems in Reinforcement Learning @@ -0,0 +1 @@ +We humans have an innate understanding of the asymmetric progression of time, which we use to efficiently and safely perceive and manipulate our environment. Drawing inspiration from that, we approach the problem of learning an arrow of time in a Markov (Decision) Process. We illustrate how a learned arrow of time can capture salient information about the environment, which in turn can be used to measure reachability, detect side-effects, and obtain an intrinsic reward signal. Finally, we propose a simple yet effective algorithm to parameterize the problem at hand and learn an arrow of time with a function approximator (here, a deep neural network). Our empirical results span a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a well-known notion of an arrow of time due to Jordan, Kinderlehrer and Otto (1998). 
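The learned arrow of time described above can be illustrated with a small toy (a hedged, tabular sketch of my own, not the authors' implementation; the 5-state chain, the right-drifting random walk, the learning rate, and the decay term are all illustrative assumptions): a scalar h over states is raised at successor states and lowered at predecessor states, so h tends to grow along the direction of time.

```python
# Hedged tabular sketch: learn a scalar "arrow of time" h(s) that tends to
# increase along trajectories of a toy Markov process (all settings illustrative).
import random

random.seed(0)
n_states = 5
h = [0.0] * n_states  # candidate arrow-of-time function over states

def step(s):
    # biased random walk: drifts toward higher-numbered states
    return min(n_states - 1, s + 1) if random.random() < 0.8 else max(0, s - 1)

lr = 0.05
for _ in range(2000):
    s = random.randrange(n_states)
    s_next = step(s)
    # raise h at the successor and lower it at the source; the small decay
    # terms keep the values bounded
    h[s_next] += lr * (1 - 0.01 * h[s_next])
    h[s] -= lr * (1 + 0.01 * h[s])

print(h)  # the drift pushes h[0] well below h[4]
```

Under the drift, mass leaves state 0 and accumulates at state 4, so h separates the two ends of the chain, which is the sense in which h captures the asymmetric progression of time.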
\ No newline at end of file diff --git a/data/2020/iclr/Learning to Learn by Zeroth-Order Oracle b/data/2020/iclr/Learning to Learn by Zeroth-Order Oracle new file mode 100644 index 0000000000..0194144b9a --- /dev/null +++ b/data/2020/iclr/Learning to Learn by Zeroth-Order Oracle @@ -0,0 +1 @@ +In the learning to learn (L2L) framework, we cast the design of optimization algorithms as a machine learning problem and use deep neural networks to learn the update rules. In this paper, we extend the L2L framework to the zeroth-order (ZO) optimization setting, where no explicit gradient information is available. Our learned optimizer, modeled as a recurrent neural network (RNN), first approximates the gradient with a ZO gradient estimator and then produces a parameter update utilizing the knowledge of previous iterations. To reduce the high-variance effect of the ZO gradient estimator, we further introduce another RNN to learn the Gaussian sampling rule and dynamically guide the query direction sampling. Our learned optimizer outperforms hand-designed algorithms in terms of convergence rate and final solution on both synthetic and practical ZO optimization tasks (in particular, the black-box adversarial attack task, which is one of the most widely used tasks of ZO optimization). We finally conduct extensive analytical experiments to demonstrate the effectiveness of our proposed optimizer. \ No newline at end of file diff --git a/data/2020/iclr/Learning to Link b/data/2020/iclr/Learning to Link new file mode 100644 index 0000000000..ae4e3b8319 --- /dev/null +++ b/data/2020/iclr/Learning to Link @@ -0,0 +1,2 @@ +This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. 
The resulting link detector and disambiguator performs very well, with recall and precision of almost 75%. This performance is constant whether the system is evaluated on Wikipedia articles or "real world" documents. + This work has implications far beyond enriching documents with explanatory links. It can provide structured knowledge about any unstructured fragment of text. Any task that is currently addressed with bags of words - indexing, clustering, retrieval, and summarization to name a few - could use the techniques described here to draw on a vast network of concepts and semantics. \ No newline at end of file diff --git a/data/2020/iclr/Learning to Represent Programs with Property Signatures b/data/2020/iclr/Learning to Represent Programs with Property Signatures new file mode 100644 index 0000000000..4615671cc6 --- /dev/null +++ b/data/2020/iclr/Learning to Represent Programs with Property Signatures @@ -0,0 +1 @@ +We introduce the notion of property signatures, a representation for programs and program specifications meant for consumption by machine learning algorithms. Given a function with input type $\tau_{in}$ and output type $\tau_{out}$, a property is a function of type: $(\tau_{in}, \tau_{out}) \rightarrow \texttt{Bool}$ that (informally) describes some simple property of the function under consideration. For instance, if $\tau_{in}$ and $\tau_{out}$ are both lists of the same type, one property might ask `is the input list the same length as the output list?'. If we have a list of such properties, we can evaluate them all for our function to get a list of outputs that we will call the property signature. Crucially, we can `guess' the property signature for a function given only a set of input/output pairs meant to specify that function. 
We discuss several potential applications of property signatures and show experimentally that they can be used to improve over a baseline synthesizer so that it emits twice as many programs in less than one-tenth of the time. \ No newline at end of file diff --git a/data/2020/iclr/Learning to solve the credit assignment problem b/data/2020/iclr/Learning to solve the credit assignment problem new file mode 100644 index 0000000000..efb21088a2 --- /dev/null +++ b/data/2020/iclr/Learning to solve the credit assignment problem @@ -0,0 +1 @@ +Backpropagation is driving today's artificial neural networks (ANNs). However, despite extensive research, it remains unclear if the brain implements this algorithm. Among neuroscientists, reinforcement learning (RL) algorithms are often seen as a realistic alternative: neurons can randomly introduce change, and use unspecific feedback signals to observe their effect on the cost and thus approximate their gradient. However, the convergence rate of such learning scales poorly with the number of involved neurons. Here we propose a hybrid learning approach. Each neuron uses an RL-type strategy to learn how to approximate the gradients that backpropagation would provide. We provide proof that our approach converges to the true gradient for certain classes of networks. In both feedforward and convolutional networks, we empirically show that our approach learns to approximate the gradient, and can match or exceed the performance of exact gradient-based learning. Learning feedback weights provides a biologically plausible mechanism of achieving good performance, without the need for precise, pre-specified learning rules. 
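The RL-type strategy the abstract alludes to can be shown with a one-parameter toy (entirely my construction, not the paper's algorithm, which applies the idea per neuron to learn feedback weights): perturb the parameter with noise, observe the change in cost, and treat noise * delta_cost / sigma^2 as a stochastic gradient estimate.

```python
# Hedged one-parameter toy of perturbation-based gradient estimation
# (illustrative only; objective, noise scale, and learning rate are made up).
import random

random.seed(1)

def cost(w):
    return (w - 3.0) ** 2  # toy quadratic objective with minimizer at w = 3

w, sigma, lr = 0.0, 0.1, 0.05
for _ in range(500):
    xi = random.gauss(0.0, sigma)  # random perturbation of the parameter
    # unbiased-to-first-order gradient estimate from the observed cost change
    g_hat = xi * (cost(w + xi) - cost(w)) / sigma ** 2
    w -= lr * g_hat
print(w)  # approaches the minimizer 3.0, up to estimator noise
```

The estimator's variance (not its bias) is what makes this scale poorly with the number of parameters, which is the weakness the paper's hybrid approach targets.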
\ No newline at end of file diff --git a/data/2020/iclr/Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware b/data/2020/iclr/Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware new file mode 100644 index 0000000000..6d66052917 --- /dev/null +++ b/data/2020/iclr/Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware @@ -0,0 +1 @@ +With the proliferation of specialized neural network processors that operate on low-precision integers, the performance of Deep Neural Network inference becomes increasingly dependent on the result of quantization. Despite plenty of prior work on the quantization of weights or activations for neural networks, there is still a wide gap between the software quantizers and the low-precision accelerator implementation, which degrades either the efficiency of networks or that of the hardware due to the lack of software-hardware coordination at design time. In this paper, we propose a learned linear symmetric quantizer for integer neural network processors, which not only quantizes neural parameters and activations to low-bit integers but also accelerates hardware inference by using batch normalization fusion and low-precision accumulators (e.g., 16-bit) and multipliers (e.g., 4-bit). We use a unified way to quantize weights and activations, and the results outperform many previous approaches for various networks such as AlexNet, ResNet, and lightweight models like MobileNet while remaining friendly to the accelerator architecture. Additionally, we apply the method to object detection models and witness high performance and accuracy in YOLO-v2. Finally, we deploy the quantized models on our specialized integer-arithmetic-only DNN accelerator to show the effectiveness of the proposed quantizer. We show that even with linear symmetric quantization, the results can be better than asymmetric or non-linear methods in 4-bit networks. 
In evaluation, the proposed quantizer induces less than 0.4\% accuracy drop in ResNet18, ResNet34, and AlexNet when quantizing the whole network as required by the integer processors. \ No newline at end of file diff --git a/data/2020/iclr/Locality and Compositionality in Zero-Shot Learning b/data/2020/iclr/Locality and Compositionality in Zero-Shot Learning new file mode 100644 index 0000000000..530006008d --- /dev/null +++ b/data/2020/iclr/Locality and Compositionality in Zero-Shot Learning @@ -0,0 +1 @@ +In this work we study locality and compositionality in the context of learning representations for Zero Shot Learning (ZSL). In order to well-isolate the importance of these properties in learned representations, we impose the additional constraint that, differently from most recent work in ZSL, no pre-training on different datasets (e.g. ImageNet) is performed. The results of our experiments show how locality, in terms of small parts of the input, and compositionality, i.e. how well can the learned representations be expressed as a function of a smaller vocabulary, are both deeply related to generalization and motivate the focus on more local-aware models in future research directions for representation learning. \ No newline at end of file diff --git a/data/2020/iclr/Logic and the 2-Simplicial Transformer b/data/2020/iclr/Logic and the 2-Simplicial Transformer new file mode 100644 index 0000000000..e31e7d3207 --- /dev/null +++ b/data/2020/iclr/Logic and the 2-Simplicial Transformer @@ -0,0 +1 @@ +We introduce the $2$-simplicial Transformer, an extension of the Transformer which includes a form of higher-dimensional attention generalising the dot-product attention, and uses this attention to update entity representations with tensor products of value vectors. We show that this architecture is a useful inductive bias for logical reasoning in the context of deep reinforcement learning. 
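As a rough illustration of the higher-dimensional attention described above, here is a hedged pure-Python sketch of my own (the trilinear logit and the flattened outer-product values are simplifications; the actual architecture uses learned projections and combines this with standard dot-product attention): attention runs over *pairs* of entities (j, k) rather than single entities.

```python
# Hedged sketch of 2-simplicial-style attention: softmax over entity pairs,
# with values formed from flattened tensor (outer) products of value vectors.
import math

def two_simplicial_attention(q, keys1, keys2, values):
    """q: query vector; keys1, keys2, values: one vector per entity."""
    n, d = len(values), len(q)
    logits, pairs = [], []
    for j in range(n):
        for k in range(n):
            # trilinear logit over the triple (query, key_j, key_k)
            logits.append(sum(q[a] * keys1[j][a] * keys2[k][a] for a in range(d)))
            # value for the pair: flattened outer product v_j (x) v_k
            pairs.append([vj * vk for vj in values[j] for vk in values[k]])
    m = max(logits)  # softmax over all (j, k) pairs, numerically stabilized
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    out = [0.0] * len(pairs[0])
    for wi, p in zip(w, pairs):
        for t, pt in enumerate(p):
            out[t] += (wi / z) * pt
    return out

# Single-entity toy: the output is just v (x) v, flattened
print(two_simplicial_attention([1.0, 0.0], [[1.0, 1.0]], [[1.0, 1.0]], [[2.0, 3.0]]))
```

With one entity the softmax is trivial and the output reduces to the outer product of the single value vector with itself, which makes the "pairwise" value construction easy to see.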
\ No newline at end of file diff --git a/data/2020/iclr/Low-Resource Knowledge-Grounded Dialogue Generation b/data/2020/iclr/Low-Resource Knowledge-Grounded Dialogue Generation new file mode 100644 index 0000000000..f398243abd --- /dev/null +++ b/data/2020/iclr/Low-Resource Knowledge-Grounded Dialogue Generation @@ -0,0 +1 @@ +Responding with knowledge has been recognized as an important capability for an intelligent conversational agent. Yet knowledge-grounded dialogues, as training data for learning such a response generation model, are difficult to obtain. Motivated by the challenge in practice, we consider knowledge-grounded dialogue generation under a natural assumption that only limited training examples are available. In such a low-resource setting, we devise a disentangled response decoder in order to isolate parameters that depend on knowledge-grounded dialogues from the entire generation model. By this means, the major part of the model can be learned from a large number of ungrounded dialogues and unstructured documents, while the remaining small set of parameters can be well fitted using the limited training examples. Evaluation results on two benchmarks indicate that with only $1/8$ of the training data, our model can achieve state-of-the-art performance and generalize well on out-of-domain knowledge. \ No newline at end of file diff --git a/data/2020/iclr/MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius b/data/2020/iclr/MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius new file mode 100644 index 0000000000..d172a0ec08 --- /dev/null +++ b/data/2020/iclr/MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius @@ -0,0 +1 @@ +Adversarial training is one of the most popular ways to learn robust models but is usually attack-dependent and time-costly. 
In this paper, we propose the MACER algorithm, which learns robust models without using adversarial training but performs better than all existing provable l2-defenses. Recent work shows that randomized smoothing can be used to provide a certified l2 radius to smoothed classifiers, and our algorithm trains provably robust smoothed classifiers via MAximizing the CErtified Radius (MACER). The attack-free characteristic makes MACER faster to train and easier to optimize. In our experiments, we show that our method can be applied to modern deep neural networks on a wide range of datasets, including Cifar-10, ImageNet, MNIST, and SVHN. For all tasks, MACER spends less training time than state-of-the-art adversarial training algorithms, and the learned models achieve a larger average certified radius. \ No newline at end of file diff --git a/data/2020/iclr/Maxmin Q-learning: Controlling the Estimation Bias of Q-learning b/data/2020/iclr/Maxmin Q-learning: Controlling the Estimation Bias of Q-learning new file mode 100644 index 0000000000..62aeca0314 --- /dev/null +++ b/data/2020/iclr/Maxmin Q-learning: Controlling the Estimation Bias of Q-learning @@ -0,0 +1 @@ +Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias interacts with performance, and the extent to which existing algorithms mitigate bias. 
In this paper, we 1) highlight that the effect of overestimation bias on learning efficiency is environment-dependent; 2) propose a generalization of Q-learning, called \emph{Maxmin Q-learning}, which provides a parameter to flexibly control bias; 3) show theoretically that there exists a parameter choice for Maxmin Q-learning that leads to unbiased estimation with a lower approximation variance than Q-learning; and 4) prove the convergence of our algorithm in the tabular case, as well as convergence of several previous Q-learning variants, using a novel Generalized Q-learning framework. We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems. \ No newline at end of file diff --git a/data/2020/iclr/Measuring Compositional Generalization: A Comprehensive Method on Realistic Data b/data/2020/iclr/Measuring Compositional Generalization: A Comprehensive Method on Realistic Data new file mode 100644 index 0000000000..8105099b3b --- /dev/null +++ b/data/2020/iclr/Measuring Compositional Generalization: A Comprehensive Method on Realistic Data @@ -0,0 +1 @@ +State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. 
We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings. \ No newline at end of file diff --git a/data/2020/iclr/Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples b/data/2020/iclr/Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples new file mode 100644 index 0000000000..3516d76e17 --- /dev/null +++ b/data/2020/iclr/Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples @@ -0,0 +1 @@ +Few-shot classification refers to learning a classifier for new classes given only a few examples. While a plethora of models have emerged to tackle it, we find the procedure and datasets that are used to assess their progress lacking. To address this limitation, we propose Meta-Dataset: a new benchmark for training and evaluating models that is large-scale, consists of diverse datasets, and presents more realistic tasks. We experiment with popular baselines and meta-learners on Meta-Dataset, along with a competitive method that we propose. We analyze performance as a function of various characteristics of test tasks and examine the models' ability to leverage diverse training sources for improving their generalization. We also propose a new set of baselines for quantifying the benefit of meta-learning in Meta-Dataset. Our extensive experimentation has uncovered important research challenges and we hope to inspire work in these directions. 
\ No newline at end of file diff --git a/data/2020/iclr/MetaPix: Few-Shot Video Retargeting b/data/2020/iclr/MetaPix: Few-Shot Video Retargeting new file mode 100644 index 0000000000..9ed6534f9f --- /dev/null +++ b/data/2020/iclr/MetaPix: Few-Shot Video Retargeting @@ -0,0 +1 @@ +We address the task of unsupervised retargeting of human actions from one video to another. We consider the challenging setting where only a few frames of the target are available. The core of our approach is a conditional generative model that can transcode input skeletal poses (automatically extracted with an off-the-shelf pose estimator) to output target frames. However, it is challenging to build a universal transcoder because humans can appear wildly different due to clothing and background scene geometry. Instead, we learn to adapt - or personalize - a universal generator to the particular human and background in the target. To do so, we make use of meta-learning to discover effective strategies for on-the-fly personalization. One significant benefit of meta-learning is that the personalized transcoder naturally enforces temporal coherence across its generated frames; all frames contain consistent clothing and background geometry of the target. We experiment on in-the-wild internet videos and images and show our approach improves over widely-used baselines for the task. \ No newline at end of file diff --git a/data/2020/iclr/Minimizing FLOPs to Learn Efficient Sparse Representations b/data/2020/iclr/Minimizing FLOPs to Learn Efficient Sparse Representations new file mode 100644 index 0000000000..d4e04b519b --- /dev/null +++ b/data/2020/iclr/Minimizing FLOPs to Learn Efficient Sparse Representations @@ -0,0 +1 @@ +Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is however computationally challenging. 
Approximate methods based on learning compact representations have been widely explored for this problem, such as locality sensitive hashing, product quantization, and PCA. In this work, in contrast to learning compact representations, we propose to learn high-dimensional and sparse representations that have similar representational capacity as dense embeddings while being more efficient due to sparse matrix multiplication operations which can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via the use of a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive with the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets. \ No newline at end of file diff --git a/data/2020/iclr/Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models b/data/2020/iclr/Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models new file mode 100644 index 0000000000..eeb64d4f1f --- /dev/null +++ b/data/2020/iclr/Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models @@ -0,0 +1 @@ +In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. 
In this paper, we introduce a new regularization technique, to which we refer as "mixout", motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE. \ No newline at end of file diff --git a/data/2020/iclr/Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks b/data/2020/iclr/Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks new file mode 100644 index 0000000000..b5b8ef263f --- /dev/null +++ b/data/2020/iclr/Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks @@ -0,0 +1 @@ +It has been widely recognized that adversarial examples can be easily crafted to fool deep networks, which mainly stems from the locally non-linear behavior near input examples. Applying mixup in training provides an effective mechanism to improve generalization performance and model robustness against adversarial perturbations, which introduces the globally linear behavior in-between training examples. However, in previous work, the mixup-trained models only passively defend against adversarial attacks at inference by directly classifying the inputs, where the induced global linearity is not well exploited. Namely, owing to the locality of the adversarial perturbations, it would be more efficient to actively break the locality via the globality of the model predictions. Inspired by simple geometric intuition, we develop an inference principle, named mixup inference (MI), for mixup-trained models. 
MI mixes the input with other random clean samples, which can shrink and transfer the equivalent perturbation if the input is adversarial. Our experiments on CIFAR-10 and CIFAR-100 demonstrate that MI can further improve the adversarial robustness of models trained with mixup and its variants. \ No newline at end of file diff --git a/data/2020/iclr/Multi-agent Reinforcement Learning for Networked System Control b/data/2020/iclr/Multi-agent Reinforcement Learning for Networked System Control new file mode 100644 index 0000000000..3d6d56bce7 --- /dev/null +++ b/data/2020/iclr/Multi-agent Reinforcement Learning for Networked System Control @@ -0,0 +1 @@ +This paper considers multi-agent reinforcement learning (MARL) in networked system control. Specifically, each agent learns a decentralized control policy based on local observations and messages from connected neighbors. We formulate such a networked MARL (NMARL) problem as a spatiotemporal Markov decision process and introduce a spatial discount factor to stabilize the training of each local agent. Further, we propose a new differentiable communication protocol, called NeurComm, to reduce information loss and non-stationarity in NMARL. Experiments in realistic NMARL scenarios of adaptive traffic signal control and cooperative adaptive cruise control show that an appropriate spatial discount factor effectively enhances the learning curves of non-communicative MARL algorithms, and that NeurComm outperforms existing communication protocols in both learning efficiency and control performance. 
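One way to read the spatial discount factor above (my interpretation for illustration; the paper's exact formulation may also weight observations, and the chain topology and alpha value here are made up) is that each agent trains on neighbors' rewards decayed by graph distance:

```python
# Hedged sketch of a spatial discount factor: agent i's training reward
# blends all agents' rewards, weighted by alpha ** hop_distance.

def spatially_discounted_rewards(rewards, distances, alpha):
    """rewards[j]: local reward of agent j; distances[i][j]: hop count i -> j."""
    n = len(rewards)
    return [
        sum(alpha ** distances[i][j] * rewards[j] for j in range(n))
        for i in range(n)
    ]

# Three agents on a toy chain 0 - 1 - 2
distances = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
rewards = [1.0, 0.0, -1.0]
print(spatially_discounted_rewards(rewards, distances, 0.5))  # [0.75, 0.0, -0.75]
```

Setting alpha = 0 recovers fully local (selfish) training rewards and alpha = 1 recovers the fully global sum, so alpha interpolates between decentralized and global objectives, which is what stabilizes each local agent's training.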
\ No newline at end of file diff --git a/data/2020/iclr/Multiplicative Interactions and Where to Find Them b/data/2020/iclr/Multiplicative Interactions and Where to Find Them new file mode 100644 index 0000000000..d3c1abbf53 --- /dev/null +++ b/data/2020/iclr/Multiplicative Interactions and Where to Find Them @@ -0,0 +1 @@ +We explore the role of multiplicative interaction as a unifying framework to describe a range of classical and modern neural network architectural motifs, such as gating, attention layers, hypernetworks, and dynamic convolutions, amongst others. Multiplicative interaction layers as primitive operations have a long-established presence in the literature, though this is often not emphasized and thus under-appreciated. We begin by showing that such layers strictly enrich the representable function classes of neural networks. We conjecture that multiplicative interactions offer a particularly powerful inductive bias when fusing multiple streams of information or when conditional computation is required. We therefore argue that they should be considered in many situations where multiple compute or information paths need to be combined, in place of the simple and oft-used concatenation operation. Finally, we back up our claims and demonstrate the potential of multiplicative interactions by applying them in large-scale complex RL and sequence modelling tasks, where their use allows us to deliver state-of-the-art results, thereby providing new evidence that multiplicative interactions deserve a more prominent role in the design of new neural network architectures.
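The gating-style motif the abstract above unifies can be sketched minimally (a toy dense layer; the projection-then-elementwise-product form is a common instance, not the paper's only formulation):

```python
def matvec(M, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def multiplicative_layer(x, context, W, U, b):
    """Minimal multiplicative-interaction sketch: project the input and
    a context stream separately, then fuse them by elementwise
    multiplication (plus bias) instead of concatenation."""
    h = matvec(W, x)
    g = matvec(U, context)
    return [hi * gi + bi for hi, gi, bi in zip(h, g, b)]
```

Setting `U` and `b` appropriately recovers an ordinary linear layer, which is one way to see that the multiplicative form strictly enriches the representable function class.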
\ No newline at end of file diff --git a/data/2020/iclr/Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification b/data/2020/iclr/Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification new file mode 100644 index 0000000000..4bb242a9bb --- /dev/null +++ b/data/2020/iclr/Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification @@ -0,0 +1 @@ +Person re-identification (re-ID) aims at identifying the same persons' images across different cameras. However, domain diversities between different datasets pose an evident challenge for adapting the re-ID model trained on one dataset to another one. State-of-the-art unsupervised domain adaptation methods for person re-ID transfer the learned knowledge from the source domain by optimizing with pseudo labels created by clustering algorithms on the target domain. Although they achieved state-of-the-art performances, the inevitable label noise caused by the clustering procedure was ignored. Such noisy pseudo labels substantially hinder the model's capability to further improve feature representations on the target domain. To mitigate the effects of noisy pseudo labels, we propose an unsupervised framework, Mutual Mean-Teaching (MMT), which softly refines the pseudo labels in the target domain, learning better features via off-line refined hard pseudo labels and on-line refined soft pseudo labels in an alternating training manner. In addition, the common practice is to adopt both the classification loss and the triplet loss jointly for achieving optimal performances in person re-ID models. However, the conventional triplet loss cannot work with softly refined labels. To solve this problem, a novel soft softmax-triplet loss is proposed to support learning with soft pseudo triplet labels for achieving the optimal domain adaptation performance.
The proposed MMT framework achieves considerable improvements of 14.4%, 18.2%, 13.1% and 16.4% mAP on the Market-to-Duke, Duke-to-Market, Market-to-MSMT and Duke-to-MSMT unsupervised domain adaptation tasks. \ No newline at end of file diff --git a/data/2020/iclr/N-BEATS: Neural basis expansion analysis for interpretable time series forecasting b/data/2020/iclr/N-BEATS: Neural basis expansion analysis for interpretable time series forecasting new file mode 100644 index 0000000000..5fa20c9c2f --- /dev/null +++ b/data/2020/iclr/N-BEATS: Neural basis expansion analysis for interpretable time series forecasting @@ -0,0 +1 @@ +We focus on solving the univariate time series point forecasting problem using deep learning. We propose a deep neural architecture based on backward and forward residual links and a very deep stack of fully-connected layers. The architecture has a number of desirable properties, being interpretable, applicable without modification to a wide array of target domains, and fast to train. We test the proposed architecture on several well-known datasets, including the M3, M4 and TOURISM competition datasets containing time series from diverse domains. We demonstrate state-of-the-art performance for two configurations of N-BEATS on all the datasets, improving forecast accuracy by 11% over a statistical benchmark and by 3% over last year's winner of the M4 competition, a domain-adjusted hand-crafted hybrid between neural network and statistical time series models. The first configuration of our model does not employ any time-series-specific components, and its performance on heterogeneous datasets strongly suggests that, contrary to received wisdom, deep learning primitives such as residual blocks are by themselves sufficient to solve a wide range of forecasting problems. Finally, we demonstrate how the proposed architecture can be augmented to provide outputs that are interpretable without considerable loss in accuracy.
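The backward/forward residual links of N-BEATS above can be sketched as a doubly residual loop (blocks here are arbitrary callables standing in for the paper's deep fully-connected stacks):

```python
def doubly_residual_forecast(x, blocks):
    """Sketch of N-BEATS-style doubly residual stacking: each block maps
    its residual input to a (backcast, forecast) pair; the backcast is
    subtracted from the running residual (backward link) and the
    forecasts are summed into the final prediction (forward link)."""
    residual = list(x)
    total_forecast = None
    for block in blocks:
        backcast, forecast = block(residual)
        residual = [r - b for r, b in zip(residual, backcast)]
        if total_forecast is None:
            total_forecast = list(forecast)
        else:
            total_forecast = [t + f for t, f in zip(total_forecast, forecast)]
    return total_forecast
```

Each block only has to explain the part of the signal its predecessors left behind, which is what makes the per-block basis expansions interpretable.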
\ No newline at end of file diff --git a/data/2020/iclr/NAS evaluation is frustratingly hard b/data/2020/iclr/NAS evaluation is frustratingly hard new file mode 100644 index 0000000000..71f7d2b88f --- /dev/null +++ b/data/2020/iclr/NAS evaluation is frustratingly hard @@ -0,0 +1 @@ +Neural Architecture Search (NAS) is an exciting new field which promises to be as much of a game-changer as Convolutional Neural Networks were in 2012. Despite many great works leading to substantial improvements on a variety of tasks, comparison between different methods is still very much an open issue. While most algorithms are tested on the same datasets, there is no shared experimental protocol followed by all. As such, and due to the under-use of ablation studies, there is a lack of clarity regarding why certain methods are more effective than others. Our first contribution is a benchmark of 8 NAS methods on 5 datasets. To overcome the hurdle of comparing methods with different search spaces, we propose using a method’s relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols. Surprisingly, we find that many NAS techniques struggle to significantly beat the average architecture baseline. We perform further experiments with the commonly used DARTS search space in order to understand the contribution of each component in the NAS pipeline. These experiments highlight that: (i) the use of tricks in the evaluation protocol has a predominant impact on the reported performance of architectures; (ii) the cell-based search space has a very narrow accuracy range, such that the seed has a considerable impact on architecture rankings; (iii) the hand-designed macro-structure (cells) is more important than the searched micro-structure (operations); and (iv) the depth-gap is a real phenomenon, evidenced by the change in rankings between 8 and 20 cell architectures.
To conclude, we suggest best practices that we hope will prove useful for the community and help mitigate current NAS pitfalls, e.g. difficulties in reproducibility and comparison of search methods. We provide the code used for our experiments at link-to-come. \ No newline at end of file diff --git a/data/2020/iclr/Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data b/data/2020/iclr/Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data new file mode 100644 index 0000000000..3a73e5fc7b --- /dev/null +++ b/data/2020/iclr/Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data @@ -0,0 +1 @@ +Nowadays, deep neural networks (DNNs) have become the main instrument for machine learning tasks within a wide range of domains, including vision, NLP, and speech. Meanwhile, in the important case of heterogeneous tabular data, the advantage of DNNs over shallow counterparts remains questionable. In particular, there is insufficient evidence that deep learning machinery allows constructing methods that outperform gradient boosting decision trees (GBDT), which are often the top choice for tabular problems. In this paper, we introduce Neural Oblivious Decision Ensembles (NODE), a new deep learning architecture, designed to work with any tabular data. In a nutshell, the proposed NODE architecture generalizes ensembles of oblivious decision trees, but benefits from both end-to-end gradient-based optimization and the power of multi-layer hierarchical representation learning. With an extensive experimental comparison to the leading GBDT packages on a large number of tabular datasets, we demonstrate the advantage of the proposed NODE architecture, which outperforms the competitors on most of the tasks. We open-source the PyTorch implementation of NODE and believe that it will become a universal framework for machine learning on tabular data.
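The oblivious decision trees that NODE generalizes have a very regular structure, which a hard (non-differentiable) sketch makes concrete (NODE itself learns soft, differentiable versions of the splits and leaf lookup; this shows only the underlying structure):

```python
def oblivious_tree(x, features, thresholds, leaves):
    """Hard oblivious decision tree sketch: every level of the tree
    tests the SAME (feature, threshold) pair in all of its nodes, so a
    depth-d tree reduces to d comparisons whose outcome bits index
    into 2**d leaf values."""
    idx = 0
    for f, t in zip(features, thresholds):
        idx = (idx << 1) | int(x[f] >= t)
    return leaves[idx]
```

This regularity is why oblivious trees evaluate as a handful of vectorizable comparisons, making them a natural building block for an end-to-end differentiable ensemble.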
\ No newline at end of file diff --git a/data/2020/iclr/Neural Stored-program Memory b/data/2020/iclr/Neural Stored-program Memory new file mode 100644 index 0000000000..4c841f29be --- /dev/null +++ b/data/2020/iclr/Neural Stored-program Memory @@ -0,0 +1 @@ +Neural networks powered with external memory simulate computer behaviors. These models, which use the memory to store data for a neural controller, can learn algorithms and other complex tasks. In this paper, we introduce a new memory to store weights for the controller, analogous to the stored-program memory in modern computer architectures. The proposed model, dubbed Neural Stored-program Memory, augments current memory-augmented neural networks, creating differentiable machines that can switch programs through time, adapt to variable contexts and thus resemble the Universal Turing Machine. A wide range of experiments demonstrate that the resulting machines not only excel in classical algorithmic problems, but also have potential for compositional, continual, few-shot learning and question-answering tasks. \ No newline at end of file diff --git a/data/2020/iclr/Neural Text Generation With Unlikelihood Training b/data/2020/iclr/Neural Text Generation With Unlikelihood Training new file mode 100644 index 0000000000..ae84b16d9c --- /dev/null +++ b/data/2020/iclr/Neural Text Generation With Unlikelihood Training @@ -0,0 +1 @@ +Neural text generation is a key tool in natural language applications, but it is well known there are major problems at its core. In particular, standard likelihood training and decoding leads to dull and repetitive outputs. While some post-hoc fixes have been proposed, in particular top-$k$ and nucleus sampling, they do not address the fact that the token-level probabilities predicted by the model are poor. 
In this paper we show that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution. We propose a new objective, unlikelihood training, which forces unlikely generations to be assigned lower probability by the model. We show that both token and sequence level unlikelihood training give less repetitive, less dull text while maintaining perplexity, giving superior generations using standard greedy or beam search. According to human evaluations, our approach with standard beam search also outperforms the currently popular decoding methods of nucleus sampling or beam blocking, thus providing a strong alternative to existing techniques. \ No newline at end of file diff --git a/data/2020/iclr/Novelty Detection Via Blurring b/data/2020/iclr/Novelty Detection Via Blurring new file mode 100644 index 0000000000..500c17222c --- /dev/null +++ b/data/2020/iclr/Novelty Detection Via Blurring @@ -0,0 +1 @@ +Conventional out-of-distribution (OOD) detection schemes based on variational autoencoders or Random Network Distillation (RND) are known to assign lower uncertainty to OOD data than to the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to blurred images. Based on this observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient at test time, and outperforms baseline OOD detectors in various domains. Further results show that SVD-RND learns a better target distribution representation than the baselines. Finally, SVD-RND combined with geometric transforms achieves near-perfect detection accuracy in the CelebA domain.
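The token-level objective of "Neural Text Generation With Unlikelihood Training" above can be sketched in a few lines (a simplified form: real training operates on model logits over a vocabulary, and the negative candidates are typically tokens from the previous context):

```python
import math

def unlikelihood_loss(probs, target, negatives):
    """Token-level sketch: the usual negative log-likelihood of the
    target token, plus an 'unlikelihood' term that penalizes any
    probability mass assigned to negative candidates (e.g. tokens the
    model has already repeated)."""
    nll = -math.log(probs[target])
    ul = -sum(math.log(1.0 - probs[c]) for c in negatives)
    return nll + ul
```

When the candidate set is empty the loss reduces to standard maximum likelihood, so the penalty is a strict add-on that only pushes down repeat-prone tokens.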
\ No newline at end of file diff --git a/data/2020/iclr/Observational Overfitting in Reinforcement Learning b/data/2020/iclr/Observational Overfitting in Reinforcement Learning new file mode 100644 index 0000000000..56d3501926 --- /dev/null +++ b/data/2020/iclr/Observational Overfitting in Reinforcement Learning @@ -0,0 +1 @@ +A major component of overfitting in model-free reinforcement learning (RL) involves the case where the agent may mistakenly correlate reward with certain spurious features from the observations generated by the Markov Decision Process (MDP). We provide a general framework for analyzing this scenario, which we use to design multiple synthetic benchmarks by modifying only the observation space of an MDP. When an agent overfits to different observation spaces even though the underlying MDP dynamics is fixed, we term this observational overfitting. Our experiments expose intriguing properties, especially with regard to implicit regularization, and also corroborate results from previous works in RL generalization and supervised learning (SL). \ No newline at end of file diff --git a/data/2020/iclr/On Computation and Generalization of Generative Adversarial Imitation Learning b/data/2020/iclr/On Computation and Generalization of Generative Adversarial Imitation Learning new file mode 100644 index 0000000000..331fcf3d1c --- /dev/null +++ b/data/2020/iclr/On Computation and Generalization of Generative Adversarial Imitation Learning @@ -0,0 +1 @@ +Generative Adversarial Imitation Learning (GAIL) is a powerful and practical approach for learning sequential decision-making policies. Different from Reinforcement Learning (RL), GAIL takes advantage of demonstration data by experts (e.g., humans), and learns both the policy and reward function of the unknown environment. Despite significant empirical progress, the theory behind GAIL is still largely unknown.
The major difficulty comes from the underlying temporal dependency of the demonstration data and the minimax computational formulation of GAIL without convex-concave structure. To bridge such a gap between theory and practice, this paper investigates the theoretical properties of GAIL. Specifically, we show: (1) For GAIL with general reward parameterization, generalization can be guaranteed as long as the class of the reward functions is properly controlled; (2) When the reward is parameterized as a reproducing kernel function, GAIL can be efficiently solved by stochastic first-order optimization algorithms, which attain sublinear convergence to a stationary solution. To the best of our knowledge, these are the first results on statistical and computational guarantees of imitation learning with reward/policy function approximation. Numerical experiments are provided to support our analysis. \ No newline at end of file diff --git a/data/2020/iclr/On Identifiability in Transformers b/data/2020/iclr/On Identifiability in Transformers new file mode 100644 index 0000000000..d4b1aa6043 --- /dev/null +++ b/data/2020/iclr/On Identifiability in Transformers @@ -0,0 +1 @@ +In this paper we delve deep into the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. We show that, for sequences longer than the attention head dimension, attention weights are not identifiable. We propose effective attention as a complementary tool for improving explanatory interpretations based on attention. Furthermore, we show that input tokens retain their identity across the model to a large degree. We also find evidence suggesting that identity information is mainly encoded in the angle of the embeddings and gradually decreases with depth.
Finally, we demonstrate strong mixing of input information in the generation of contextual embeddings by means of a novel quantification method based on gradient attribution. Overall, we show that self-attention distributions are not directly interpretable and present tools to better understand and further investigate Transformer models. \ No newline at end of file diff --git a/data/2020/iclr/On Mutual Information Maximization for Representation Learning b/data/2020/iclr/On Mutual Information Maximization for Representation Learning new file mode 100644 index 0000000000..25d472fb08 --- /dev/null +++ b/data/2020/iclr/On Mutual Information Maximization for Representation Learning @@ -0,0 +1 @@ +Many recent methods for unsupervised or self-supervised representation learning train feature extractors by maximizing an estimate of the mutual information (MI) between different views of the data. This comes with several immediate problems: For example, MI is notoriously hard to estimate, and using it as an objective for representation learning may lead to highly entangled representations due to its invariance under arbitrary invertible transformations. Nevertheless, these methods have been repeatedly shown to excel in practice. In this paper we argue, and provide empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators. Finally, we establish a connection to deep metric learning and argue that this interpretation may be a plausible explanation for the success of the recently introduced methods. 
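One of the MI estimators the representation-learning abstract above analyzes, InfoNCE, is easy to sketch given a precomputed critic-score matrix; the batch-size cap of log N on the bound, one of the paper's observed limitations, is visible directly in the code:

```python
import math

def info_nce(scores):
    """InfoNCE lower bound on mutual information: scores[i][j] is the
    critic value f(x_i, y_j) for a batch of N paired views, with the
    positives on the diagonal. The bound is the average log-softmax of
    each positive plus log N, and therefore can never exceed log N."""
    n = len(scores)
    total = 0.0
    for i in range(n):
        log_denom = math.log(sum(math.exp(s) for s in scores[i]))
        total += scores[i][i] - log_denom
    return total / n + math.log(n)
```

An uninformative critic (all scores equal) yields a bound of 0, while a perfect critic saturates at log N regardless of the true MI.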
\ No newline at end of file diff --git "a/data/2020/iclr/On the \"steerability\" of generative adversarial networks" "b/data/2020/iclr/On the \"steerability\" of generative adversarial networks" new file mode 100644 index 0000000000..2b33f5ad8c --- /dev/null +++ "b/data/2020/iclr/On the \"steerability\" of generative adversarial networks" @@ -0,0 +1 @@ +An open secret in contemporary machine learning is that many models work beautifully on standard benchmarks but fail to generalize outside the lab. This has been attributed to biased training data, which provide poor coverage over real world events. Generative models are no exception, but recent advances in generative adversarial networks (GANs) suggest otherwise - these models can now synthesize strikingly realistic and diverse images. Is generative modeling of photos a solved problem? We show that although current GANs can fit standard datasets very well, they still fall short of being comprehensive models of the visual manifold. In particular, we study their ability to fit simple transformations such as camera movements and color changes. We find that the models reflect the biases of the datasets on which they are trained (e.g., centered objects), but that they also exhibit some capacity for generalization: by "steering" in latent space, we can shift the distribution while still creating realistic images. We hypothesize that the degree of distributional shift is related to the breadth of the training data distribution. Thus, we conduct experiments to quantify the limits of GAN transformations and introduce techniques to mitigate the problem. 
Code is released on our project page: this https URL \ No newline at end of file diff --git a/data/2020/iclr/On the Variance of the Adaptive Learning Rate and Beyond b/data/2020/iclr/On the Variance of the Adaptive Learning Rate and Beyond new file mode 100644 index 0000000000..875c47dd02 --- /dev/null +++ b/data/2020/iclr/On the Variance of the Adaptive Learning Rate and Beyond @@ -0,0 +1 @@ +The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are available at: this https URL. \ No newline at end of file diff --git a/data/2020/iclr/On the Weaknesses of Reinforcement Learning for Neural Machine Translation b/data/2020/iclr/On the Weaknesses of Reinforcement Learning for Neural Machine Translation new file mode 100644 index 0000000000..7c635e7e71 --- /dev/null +++ b/data/2020/iclr/On the Weaknesses of Reinforcement Learning for Neural Machine Translation @@ -0,0 +1 @@ +Reinforcement learning (RL) is frequently used to increase performance in text generation tasks, including machine translation (MT), notably through the use of Minimum Risk Training (MRT) and Generative Adversarial Networks (GAN).
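The variance-rectification term of RAdam ("On the Variance of the Adaptive Learning Rate and Beyond", above) follows directly from the paper's update rule; this sketch covers only the scheduling logic, not a full optimizer:

```python
import math

def radam_rectifier(t, beta2=0.999):
    """RAdam's rectification term at step t (1-indexed): when the
    approximated SMA length rho_t is at most 4, the adaptive step is
    skipped in favor of a plain momentum update (returned as None);
    otherwise the adaptive step is scaled by r_t < 1 to compensate for
    the large early-stage variance of the adaptive learning rate."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return None  # fall back to the non-adaptive (momentum) update
    return math.sqrt((rho_t - 4.0) * (rho_t - 2.0) * rho_inf /
                     ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
```

Because rho_t grows toward rho_inf with t, the scale r_t rises toward 1, reproducing the "warmup-like" schedule the paper derives rather than hand-tunes.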
However, little is known about what and how these methods learn in the context of MT. We prove that one of the most common RL methods for MT does not optimize the expected reward, and show that other methods take an infeasibly long time to converge. In fact, our results suggest that RL practices in MT are likely to improve performance only where the pre-trained parameters are already close to yielding the correct translation. Our findings further suggest that observed gains may be due not to the training signal but rather to changes in the shape of the distribution curve. \ No newline at end of file diff --git a/data/2020/iclr/One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation b/data/2020/iclr/One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation new file mode 100644 index 0000000000..9ddfe668a6 --- /dev/null +++ b/data/2020/iclr/One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation @@ -0,0 +1 @@ +Recent advances in the sparse neural network literature have made it possible to prune many large feed-forward and convolutional networks with only a small quantity of data. Yet, these same techniques often falter when applied to the problem of recovering sparse recurrent networks. These failures are quantitative: when pruned with recent techniques, RNNs typically obtain worse performance than they do under a simple random pruning scheme. The failures are also qualitative: the distribution of active weights in a pruned LSTM or GRU network tends to be concentrated in specific neurons and gates, and not well dispersed across the entire architecture. We seek to rectify both the quantitative and qualitative issues with recurrent network pruning by introducing a new recurrent pruning objective derived from the spectrum of the recurrent Jacobian.
Our objective is data efficient (requiring only 64 data points to prune the network), easy to implement, and produces 95% sparse GRUs that significantly improve on existing baselines. We evaluate on sequential MNIST, Billion Words, and Wikitext. \ No newline at end of file diff --git a/data/2020/iclr/Optimistic Exploration even with a Pessimistic Initialisation b/data/2020/iclr/Optimistic Exploration even with a Pessimistic Initialisation new file mode 100644 index 0000000000..b60e3a1cd0 --- /dev/null +++ b/data/2020/iclr/Optimistic Exploration even with a Pessimistic Initialisation @@ -0,0 +1 @@ +Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL). In the tabular case, all provably efficient model-free algorithms rely on it. However, model-free deep RL algorithms do not use optimistic initialisation despite taking inspiration from these provably efficient tabular algorithms. In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values due to commonly used network initialisation schemes, a pessimistic initialisation. Merely initialising the network to output optimistic Q-values is not enough, since we cannot ensure that they remain optimistic for novel state-action pairs, which is crucial for exploration. We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network. We show that this scheme is provably efficient in the tabular setting and extend it to the deep RL setting. Our algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure optimism during both action selection and bootstrapping. 
We show that OPIQ outperforms non-optimistic DQN variants that utilise a pseudocount-based intrinsic motivation in hard exploration tasks, and that it predicts optimistic estimates for novel state-action pairs. \ No newline at end of file diff --git a/data/2020/iclr/Option Discovery using Deep Skill Chaining b/data/2020/iclr/Option Discovery using Deep Skill Chaining new file mode 100644 index 0000000000..d47177164e --- /dev/null +++ b/data/2020/iclr/Option Discovery using Deep Skill Chaining @@ -0,0 +1 @@ +Autonomously discovering temporally extended actions, or skills, is a longstanding goal of hierarchical reinforcement learning. We propose a new algorithm that combines skill chaining with deep neural networks to autonomously discover skills in high-dimensional, continuous domains. The resulting algorithm, deep skill chaining, constructs skills with the property that executing one enables the agent to execute another. We demonstrate that deep skill chaining significantly outperforms both non-hierarchical agents and other state-of-the-art skill discovery techniques in challenging continuous control tasks. \ No newline at end of file diff --git a/data/2020/iclr/Order Learning and Its Application to Age Estimation b/data/2020/iclr/Order Learning and Its Application to Age Estimation new file mode 100644 index 0000000000..4b02337e2e --- /dev/null +++ b/data/2020/iclr/Order Learning and Its Application to Age Estimation @@ -0,0 +1 @@ +We propose order learning to determine the order graph of classes, representing ranks or priorities, and classify an object instance into one of the classes. To this end, we design a pairwise comparator to categorize the relationship between two instances into one of three cases: one instance is `greater than,' `similar to,' or `smaller than' the other. Then, by comparing an input instance with reference instances and maximizing the consistency among the comparison results, the class of the input can be estimated reliably. 
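The pairwise comparator and consistency-maximizing estimation of "Order Learning and Its Application to Age Estimation" above can be sketched with scalar scores standing in for the learned comparator network (the threshold `tau` and the scoring scheme are illustrative assumptions):

```python
def compare(a, b, tau=1.0):
    """Three-way ordering: 'greater', 'similar', or 'smaller',
    depending on the score difference and a similarity threshold."""
    d = a - b
    if d > tau:
        return 'greater'
    if d < -tau:
        return 'smaller'
    return 'similar'

def estimate_class(score, references, tau=1.0):
    """references: list of (reference_score, reference_class) pairs.
    Return the candidate class whose ideal comparison pattern agrees
    most often with the observed comparisons against the references,
    i.e. maximize consistency as described in the abstract."""
    candidates = sorted({c for _, c in references})
    def agreement(k):
        return sum(compare(score, s, tau) == compare(k, c, 0)
                   for s, c in references)
    return max(candidates, key=agreement)
```

Even a noisy comparator can yield a reliable class estimate here, because each reference contributes an independent vote.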
We apply order learning to develop a facial age estimator, which achieves state-of-the-art performance. Moreover, the performance is further improved when the order graph is divided into disjoint chains using gender and ethnic group information or even in an unsupervised manner. \ No newline at end of file diff --git a/data/2020/iclr/Overlearning Reveals Sensitive Attributes b/data/2020/iclr/Overlearning Reveals Sensitive Attributes new file mode 100644 index 0000000000..dbdb6e40ed --- /dev/null +++ b/data/2020/iclr/Overlearning Reveals Sensitive Attributes @@ -0,0 +1,3 @@ +"Overlearning" means that a model trained for a seemingly simple objective implicitly learns to recognize attributes and concepts that are (1) not part of the learning objective, and (2) sensitive from a privacy or bias perspective. For example, a binary gender classifier of facial images also learns to recognize races\textemdash even races that are not represented in the training data\textemdash and identities. +We demonstrate overlearning in several vision and NLP models and analyze its harmful consequences. First, inference-time representations of an overlearned model reveal sensitive attributes of the input, breaking privacy protections such as model partitioning. Second, an overlearned model can be "re-purposed" for a different, privacy-violating task even in the absence of the original training data. +We show that overlearning is intrinsic for some tasks and cannot be prevented by censoring unwanted attributes. Finally, we investigate where, when, and why overlearning happens during model training.
\ No newline at end of file diff --git a/data/2020/iclr/Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video b/data/2020/iclr/Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video new file mode 100644 index 0000000000..598183a292 --- /dev/null +++ b/data/2020/iclr/Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video @@ -0,0 +1 @@ +We propose a model that is able to perform physical parameter estimation of systems from video, where the differential equations governing the scene dynamics are known, but labeled states or objects are not available. Existing physical scene understanding methods require either object state supervision, or do not integrate with differentiable physics to learn interpretable system parameters and states. We address this problem through a \textit{physics-as-inverse-graphics} approach that brings together vision-as-inverse-graphics and differentiable physics engines, where objects and explicit state and velocity representations are discovered by the model. This framework allows us to perform long term extrapolative video prediction, as well as vision-based model-predictive control. Our approach significantly outperforms related unsupervised methods in long-term future frame prediction of systems with interacting objects (such as ball-spring or 3-body gravitational systems), due to its ability to build dynamics into the model as an inductive bias. We further show the value of this tight vision-physics integration by demonstrating data-efficient learning of vision-actuated model-based control for a pendulum system. We also show that the controller's interpretability provides unique capabilities in goal-driven control and physical reasoning for zero-data adaptation. 
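The core idea of the physics-as-inverse-graphics abstract above, recovering physical parameters when the governing differential equation is known, can be illustrated on a toy spring system (this sketch fits a scalar from observed states by direct search; the paper instead learns from raw video with a differentiable physics engine):

```python
def simulate(k, x0, v0, dt, steps):
    """Semi-implicit Euler integration of a unit-mass spring x'' = -k*x,
    the kind of known governing equation the method assumes."""
    xs, x, v = [], x0, v0
    for _ in range(steps):
        v -= k * x * dt
        x += v * dt
        xs.append(x)
    return xs

def estimate_k(observed, x0, v0, dt, lo=0.0, hi=5.0, rounds=25):
    """Fit the spring constant by ternary search on the squared
    trajectory error (assumes the error is unimodal in k, which holds
    for this short horizon)."""
    def loss(k):
        sim = simulate(k, x0, v0, dt, len(observed))
        return sum((s - o) ** 2 for s, o in zip(sim, observed))
    for _ in range(rounds):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if loss(a) < loss(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0
```

Because the simulator is the model, the recovered parameter is directly interpretable, which is what enables the goal-driven control and zero-data adaptation the abstract highlights.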
\ No newline at end of file diff --git a/data/2020/iclr/Piecewise linear activations substantially shape the loss surfaces of neural networks b/data/2020/iclr/Piecewise linear activations substantially shape the loss surfaces of neural networks new file mode 100644 index 0000000000..2f0ac5887f --- /dev/null +++ b/data/2020/iclr/Piecewise linear activations substantially shape the loss surfaces of neural networks @@ -0,0 +1 @@ +Understanding the loss surface of a neural network is fundamentally important to the understanding of deep learning. This paper shows how piecewise linear activation functions substantially shape the loss surfaces of neural networks. We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, which are defined as local minima with higher empirical risk than the global minima. Our result holds for any neural network with arbitrary depth and arbitrary piecewise linear activation functions (excluding linear functions) under most practical loss functions, with some mild assumptions. This result demonstrates that networks with piecewise linear activations differ substantially from the well-studied linear neural networks. Essentially, the underlying assumptions for the above result are consistent with most practical circumstances, where the output layer is narrower than any hidden layer. In addition, the loss surface of a neural network with piecewise linear activations is partitioned into multiple smooth and multilinear cells by nondifferentiable boundaries. The constructed spurious local minima are concentrated in one cell as a valley: they are connected with each other by a continuous path, on which the empirical risk is invariant. Further, for one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
\ No newline at end of file diff --git a/data/2020/iclr/Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP b/data/2020/iclr/Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP new file mode 100644 index 0000000000..2c77d95e6e --- /dev/null +++ b/data/2020/iclr/Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP @@ -0,0 +1 @@ +The lottery ticket hypothesis proposes that over-parameterization of deep neural networks (DNNs) aids training by increasing the probability of a "lucky" sub-network initialization being present rather than by helping the optimization process (Frankle & Carbin, 2019). Intriguingly, this phenomenon suggests that initialization strategies for DNNs can be improved substantially, but the lottery ticket hypothesis has only previously been tested in the context of supervised learning for natural image tasks. Here, we evaluate whether "winning ticket" initializations exist in two different domains: natural language processing (NLP) and reinforcement learning (RL). For NLP, we examined both recurrent LSTM models and large-scale Transformer models (Vaswani et al., 2017). For RL, we analyzed a number of discrete-action space tasks, including both classic control and pixel control. Consistent with work in supervised image classification, we confirm that winning ticket initializations generally outperform parameter-matched random initializations, even at extreme pruning rates for both NLP and RL. Notably, we are able to find winning ticket initializations for Transformers which enable models one-third the size to achieve nearly equivalent performance. Together, these results suggest that the lottery ticket hypothesis is not restricted to supervised learning of natural images, but rather represents a broader phenomenon in DNNs.
\ No newline at end of file diff --git a/data/2020/iclr/Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring b/data/2020/iclr/Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring new file mode 100644 index 0000000000..d2f598f47b --- /dev/null +++ b/data/2020/iclr/Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring @@ -0,0 +1 @@ +The use of deep pre-trained transformers has led to remarkable progress in a number of applications (Devlin et al., 2018). For tasks that make pairwise comparisons between sequences, matching a given input with a corresponding label, two approaches are common: Cross-encoders performing full self-attention over the pair and Bi-encoders encoding the pair separately. The former often performs better, but is too slow for practical use. In this work, we develop a new transformer architecture, the Poly-encoder, that learns global rather than token level self-attention features. We perform a detailed comparison of all three approaches, including what pre-training and fine-tuning strategies work best. We show our models achieve state-of-the-art results on four tasks; that Poly-encoders are faster than Cross-encoders and more accurate than Bi-encoders; and that the best results are obtained by pre-training on large datasets similar to the downstream tasks. \ No newline at end of file diff --git a/data/2020/iclr/Population-Guided Parallel Policy Search for Reinforcement Learning b/data/2020/iclr/Population-Guided Parallel Policy Search for Reinforcement Learning new file mode 100644 index 0000000000..0fff32167f --- /dev/null +++ b/data/2020/iclr/Population-Guided Parallel Policy Search for Reinforcement Learning @@ -0,0 +1 @@ +In this paper, a new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL). 
In the proposed scheme, multiple identical learners with their own value-functions and policies share a common experience replay buffer, and collaboratively search for a good policy under the guidance of the best policy's information. The key point is that the best policy's information is fused in a soft manner by constructing an augmented loss function for the policy update, which enlarges the overall region searched by the multiple learners. The guidance by the previous best policy and the enlarged search range enable faster and better policy search. Monotone improvement of the expected cumulative return by the proposed scheme is proved theoretically. Working algorithms are constructed by applying the proposed scheme to the twin delayed deep deterministic (TD3) policy gradient algorithm. Numerical results show that the constructed algorithm outperforms most of the current state-of-the-art RL algorithms, and the gain is significant in sparse-reward environments. \ No newline at end of file diff --git a/data/2020/iclr/Pre-training Tasks for Embedding-based Large-scale Retrieval b/data/2020/iclr/Pre-training Tasks for Embedding-based Large-scale Retrieval new file mode 100644 index 0000000000..db496d3df8 --- /dev/null +++ b/data/2020/iclr/Pre-training Tasks for Embedding-based Large-scale Retrieval @@ -0,0 +1 @@ +We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then re-ranks the documents. Critically, the retrieval algorithm must not only achieve high recall but also be highly efficient, returning candidates in time sublinear in the number of documents.
Unlike the scoring phase, which has recently witnessed significant advances thanks to BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied. Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse handcrafted features and cannot be optimized for different downstream tasks of interest. In this paper, we conduct a comprehensive study of embedding-based retrieval models. We show that the key ingredient for learning a strong embedding-based Transformer model is the set of pre-training tasks. With adequately designed paragraph-level pre-training tasks, Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers. The paragraph-level pre-training tasks we studied are Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of all three. \ No newline at end of file diff --git a/data/2020/iclr/Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model b/data/2020/iclr/Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model new file mode 100644 index 0000000000..191cae57ce --- /dev/null +++ b/data/2020/iclr/Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model @@ -0,0 +1 @@ +Recent breakthroughs in pretrained language models have shown the effectiveness of self-supervised learning for a wide range of natural language processing (NLP) tasks. In addition to standard syntactic and semantic NLP tasks, pretrained models achieve strong improvements on tasks that involve real-world knowledge, suggesting that large-scale language modeling could be an implicit method for capturing knowledge. In this work, we further investigate the extent to which pretrained models such as BERT capture knowledge using a zero-shot fact completion task.
Moreover, we propose a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities. Models trained with our new objective yield significant improvements on the fact completion task. When applied to downstream tasks, our model consistently outperforms BERT on four entity-related question answering datasets (i.e., WebQuestions, TriviaQA, SearchQA and Quasar-T) with an average improvement of 2.7 F1 points, and on a standard fine-grained entity typing dataset (i.e., FIGER) with a 5.7-point accuracy gain. \ No newline at end of file diff --git a/data/2020/iclr/Progressive Memory Banks for Incremental Domain Adaptation b/data/2020/iclr/Progressive Memory Banks for Incremental Domain Adaptation new file mode 100644 index 0000000000..b61318748a --- /dev/null +++ b/data/2020/iclr/Progressive Memory Banks for Incremental Domain Adaptation @@ -0,0 +1 @@ +This paper addresses the problem of incremental domain adaptation (IDA) in natural language processing (NLP). We assume each domain arrives one after another, and that we can only access data in the current domain. The goal of IDA is to build a unified model performing well on all the domains that we have encountered. We adopt the recurrent neural network (RNN) widely used in NLP, but augment it with a directly parameterized memory bank, which is retrieved by an attention mechanism at each step of the RNN transition. The memory bank provides a natural way of IDA: when adapting our model to a new domain, we progressively add new slots to the memory bank, which increases the number of parameters, and thus the model capacity. We learn the new memory slots and fine-tune existing parameters by back-propagation. Experimental results show that our approach achieves significantly better performance than fine-tuning alone. Compared with expanding hidden states, our approach is more robust on old domains, as shown by both empirical and theoretical results.
Our model also outperforms previous IDA methods, including elastic weight consolidation and progressive neural networks, in the experiments. \ No newline at end of file diff --git a/data/2020/iclr/ProxSGD: Training Structured Neural Networks under Regularization and Constraints b/data/2020/iclr/ProxSGD: Training Structured Neural Networks under Regularization and Constraints new file mode 100644 index 0000000000..9d02b56341 --- /dev/null +++ b/data/2020/iclr/ProxSGD: Training Structured Neural Networks under Regularization and Constraints @@ -0,0 +1 @@ +In this paper, we consider the problem of training neural networks (NN). To promote an NN with specific structures, we explicitly take into consideration nonsmooth regularization (such as the L1-norm) and constraints (such as interval constraints). This is formulated as a constrained nonsmooth nonconvex optimization problem, and we propose a convergent proximal-type stochastic gradient descent (Prox-SGD) algorithm. We show that, under properly selected learning rates, momentum eventually resembles the unknown real gradient and is thus crucial in analyzing the convergence. We establish that, with probability 1, every limit point of the sequence generated by the proposed Prox-SGD is a stationary point. Prox-SGD is then tailored to train a sparse neural network and a binary neural network, and the theoretical analysis is also supported by extensive numerical tests. \ No newline at end of file diff --git a/data/2020/iclr/Pruned Graph Scattering Transforms b/data/2020/iclr/Pruned Graph Scattering Transforms new file mode 100644 index 0000000000..670bb7f9c7 --- /dev/null +++ b/data/2020/iclr/Pruned Graph Scattering Transforms @@ -0,0 +1 @@ +Graph convolutional networks (GCNs) have achieved remarkable performance in a variety of network science learning tasks. However, theoretical analysis of such approaches is still in its infancy.
Graph scattering transforms (GSTs) are non-trainable deep GCN models that are amenable to generalization and stability analyses. The present work addresses some limitations of GSTs by introducing a novel so-termed pruned (p)GST approach. The resultant pruning algorithm is guided by a graph-spectrum-inspired criterion, and retains informative scattering features on-the-fly while bypassing the exponential complexity associated with GSTs. It is further established that pGSTs are stable to perturbations of the input graph signals with bounded energy. Experiments showcase that i) pGST performs comparably to the baseline GST that uses all scattering features, while achieving significant computational savings; ii) pGST achieves comparable performance to state-of-the-art GCNs; and iii) Graph data from various domains lead to different scattering patterns, suggesting domain-adaptive pGST network architectures. \ No newline at end of file diff --git a/data/2020/iclr/Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving b/data/2020/iclr/Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving new file mode 100644 index 0000000000..a8049c8f75 --- /dev/null +++ b/data/2020/iclr/Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving @@ -0,0 +1 @@ +Detecting objects such as cars and pedestrians in 3D plays an indispensable role in autonomous driving. Existing approaches largely rely on expensive LiDAR sensors for accurate depth information. While recently pseudo-LiDAR has been introduced as a promising alternative, at a much lower cost based solely on stereo images, there is still a notable performance gap. In this paper we provide substantial advances to the pseudo-LiDAR framework through improvements in stereo depth estimation. 
Concretely, we adapt the stereo network architecture and loss function to be more aligned with accurate depth estimation of faraway objects --- currently the primary weakness of pseudo-LiDAR. Further, we explore the idea of leveraging cheaper but extremely sparse LiDAR sensors, which alone provide insufficient information for 3D detection, to de-bias our depth estimation. We propose a depth-propagation algorithm, guided by the initial depth estimates, to diffuse these few exact measurements across the entire depth map. We show on the KITTI object detection benchmark that our combined approach yields substantial improvements in depth estimation and stereo-based 3D object detection --- outperforming the previous state-of-the-art detection accuracy for faraway objects by 40%. Our code is available at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/Pure and Spurious Critical Points: a Geometric Study of Linear Networks b/data/2020/iclr/Pure and Spurious Critical Points: a Geometric Study of Linear Networks new file mode 100644 index 0000000000..9e885c73c4 --- /dev/null +++ b/data/2020/iclr/Pure and Spurious Critical Points: a Geometric Study of Linear Networks @@ -0,0 +1 @@ +The critical locus of the loss function of a neural network is determined by the geometry of the functional space and by the parameterization of this space by the network's weights. We introduce a natural distinction between pure critical points, which only depend on the functional space, and spurious critical points, which arise from the parameterization. We apply this perspective to revisit and extend the literature on the loss function of linear neural networks. For this type of network, the functional space is either the set of all linear maps from input to output space, or a determinantal variety, i.e., a set of linear maps with bounded rank.
We use geometric properties of determinantal varieties to derive new results on the landscape of linear networks with different loss functions and different parameterizations. Our analysis clearly illustrates that the absence of "bad" local minima in the loss landscape of linear networks is due to two distinct phenomena that apply in different settings: it is true for arbitrary smooth convex losses in the case of architectures that can express all linear maps ("filling architectures") but it holds only for the quadratic loss when the functional space is a determinantal variety ("non-filling architectures"). Without any assumption on the architecture, smooth convex losses may lead to landscapes with many bad minima. \ No newline at end of file diff --git a/data/2020/iclr/Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP b/data/2020/iclr/Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP new file mode 100644 index 0000000000..df834aa6d8 --- /dev/null +++ b/data/2020/iclr/Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP @@ -0,0 +1 @@ +A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with UCB exploration policy, and proved it has nearly optimal regret bound for finite-horizon episodic MDP. In this paper, we adapt Q-learning with UCB-exploration bonus to infinite-horizon MDP with discounted rewards \emph{without} accessing a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by $\tilde{O}({\frac{SA}{\epsilon^2(1-\gamma)^7}})$. This improves the previously best known result of $\tilde{O}({\frac{SA}{\epsilon^4(1-\gamma)^8}})$ in this setting achieved by delayed Q-learning \cite{strehl2006pac}, and matches the lower bound in terms of $\epsilon$ as well as $S$ and $A$ except for logarithmic factors. 
\ No newline at end of file diff --git a/data/2020/iclr/Quantifying the Cost of Reliable Photo Authentication via High-Performance Learned Lossy Representations b/data/2020/iclr/Quantifying the Cost of Reliable Photo Authentication via High-Performance Learned Lossy Representations new file mode 100644 index 0000000000..15830209b9 --- /dev/null +++ b/data/2020/iclr/Quantifying the Cost of Reliable Photo Authentication via High-Performance Learned Lossy Representations @@ -0,0 +1 @@ +Detection of photo manipulation relies on subtle statistical traces, notoriously removed by aggressive lossy compression employed online. We demonstrate that end-to-end modeling of complex photo dissemination channels allows for codec optimization with explicit provenance objectives. We design a lightweight trainable lossy image codec that delivers competitive rate-distortion performance, on par with the best hand-engineered alternatives, but with a lower computational footprint on modern GPU-enabled platforms. Our results show that significant improvements in manipulation detection accuracy are possible at fractional costs in bandwidth/storage. Our codec improved the accuracy from 37% to 86% even at very low bit-rates, well below the practicality of JPEG (QF 20). \ No newline at end of file diff --git a/data/2020/iclr/RTFM: Generalising to New Environment Dynamics via Reading b/data/2020/iclr/RTFM: Generalising to New Environment Dynamics via Reading new file mode 100644 index 0000000000..611b089d75 --- /dev/null +++ b/data/2020/iclr/RTFM: Generalising to New Environment Dynamics via Reading @@ -0,0 +1 @@ +Obtaining policies that can generalise to new environments in reinforcement learning is challenging. In this work, we demonstrate that language understanding via a reading policy learner is a promising vehicle for generalisation to new environments.
We propose a grounded policy learning problem, Read to Fight Monsters (RTFM), in which the agent must jointly reason over a language goal, relevant dynamics described in a document, and environment observations. We procedurally generate environment dynamics and corresponding language descriptions of the dynamics, such that agents must read to understand new environment dynamics instead of memorising any particular information. In addition, we propose txt2π, a model that captures three-way interactions between the goal, document, and observations. On RTFM, txt2π generalises to new environments with dynamics not seen during training via reading. Furthermore, our model outperforms baselines such as FiLM and language-conditioned CNNs on RTFM. Through curriculum learning, txt2π produces policies that excel on complex RTFM tasks requiring several reasoning and coreference steps. \ No newline at end of file diff --git a/data/2020/iclr/RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering b/data/2020/iclr/RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering new file mode 100644 index 0000000000..92f4cf6976 --- /dev/null +++ b/data/2020/iclr/RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering @@ -0,0 +1 @@ +We investigate new methods for training collaborative filtering models based on actor-critic reinforcement learning, to more directly maximize ranking-based objective functions. Specifically, we train a critic network to approximate ranking-based metrics, and then update the actor network to directly optimize against the learned metrics. In contrast to traditional learning-to-rank methods that require re-running the optimization procedure for new lists, our critic-based method amortizes the scoring process with a neural network, and can directly provide the (approximate) ranking scores for new lists. 
We demonstrate the actor-critic's ability to significantly improve the performance of a variety of prediction models, and to achieve better or comparable performance to the state-of-the-art on three large-scale datasets. \ No newline at end of file diff --git a/data/2020/iclr/Ranking Policy Gradient b/data/2020/iclr/Ranking Policy Gradient new file mode 100644 index 0000000000..3d9ff08368 --- /dev/null +++ b/data/2020/iclr/Ranking Policy Gradient @@ -0,0 +1 @@ +Sample inefficiency is a long-standing problem in reinforcement learning (RL). The state-of-the-art estimates the optimal action values, but this usually involves an extensive search over the state-action space and unstable optimization. Towards sample-efficient RL, we propose ranking policy gradient (RPG), a policy gradient method that learns the optimal rank of a set of discrete actions. To accelerate the learning of policy gradient methods, we establish the equivalence between maximizing the lower bound of return and imitating a near-optimal policy without accessing any oracles. These results lead to a general off-policy learning framework, which preserves optimality, reduces variance, and improves sample efficiency. Furthermore, the sample complexity of RPG does not depend on the dimension of the state space, which enables RPG for large-scale problems. We conduct extensive experiments showing that, when combined with the off-policy learning framework, RPG substantially reduces the sample complexity compared to the state-of-the-art. \ No newline at end of file diff --git a/data/2020/iclr/Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML b/data/2020/iclr/Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML new file mode 100644 index 0000000000..a81e55aa07 --- /dev/null +++ b/data/2020/iclr/Rapid Learning or Feature Reuse?
Towards Understanding the Effectiveness of MAML @@ -0,0 +1 @@ +An important research direction in machine learning has centered around developing meta-learning algorithms to tackle few-shot learning. An especially successful algorithm has been Model Agnostic Meta-Learning (MAML), a method that consists of two optimization loops, with the outer loop finding a meta-initialization, from which the inner loop can efficiently learn new tasks. Despite MAML's popularity, a fundamental open question remains -- is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations) or due to feature reuse, with the meta initialization already containing high quality features? We investigate this question, via ablation studies and analysis of the latent representations, finding that feature reuse is the dominant factor. This leads to the ANIL (Almost No Inner Loop) algorithm, a simplification of MAML where we remove the inner loop for all but the (task-specific) head of a MAML-trained network. ANIL matches MAML's performance on benchmark few-shot image classification and RL and offers computational improvements over MAML. We further study the precise contributions of the head and body of the network, showing that performance on the test tasks is entirely determined by the quality of the learned features, and we can remove even the head of the network (the NIL algorithm). We conclude with a discussion of the rapid learning vs feature reuse question for meta-learning algorithms more broadly. 
\ No newline at end of file diff --git a/data/2020/iclr/ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning b/data/2020/iclr/ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning new file mode 100644 index 0000000000..6189b8d06a --- /dev/null +++ b/data/2020/iclr/ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning @@ -0,0 +1 @@ +Recent powerful pre-trained language models have achieved remarkable performance on most of the popular datasets for reading comprehension. It is time to introduce more challenging datasets to push the development of this field towards more comprehensive reasoning over text. In this paper, we introduce a new Reading Comprehension dataset requiring logical reasoning (ReClor) extracted from standardized graduate admission examinations. As earlier studies suggest, human-annotated datasets usually contain biases, which are often exploited by models to achieve high accuracy without truly understanding the text. In order to comprehensively evaluate the logical reasoning ability of models on ReClor, we propose to identify biased data points and separate them into an EASY set, with the rest forming a HARD set. Empirical results show that state-of-the-art models have an outstanding ability to capture the biases contained in the dataset, achieving high accuracy on the EASY set. However, they struggle on the HARD set, with poor performance near that of random guessing, indicating that more research is needed to truly enhance the logical reasoning ability of current models.
\ No newline at end of file diff --git a/data/2020/iclr/ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring b/data/2020/iclr/ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring new file mode 100644 index 0000000000..0d68f6b002 --- /dev/null +++ b/data/2020/iclr/ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring @@ -0,0 +1 @@ +We improve the recently-proposed "MixMatch" semi-supervised learning algorithm by introducing two new techniques: distribution alignment and augmentation anchoring. Distribution alignment encourages the marginal distribution of predictions on unlabeled data to be close to the marginal distribution of ground-truth labels. Augmentation anchoring feeds multiple strongly augmented versions of an input into the model and encourages each output to be close to the prediction for a weakly-augmented version of the same input. To produce strong augmentations, we propose a variant of AutoAugment which learns the augmentation policy while the model is being trained. Our new algorithm, dubbed ReMixMatch, is significantly more data-efficient than prior work, requiring between 5 times and 16 times less data to reach the same accuracy. For example, on CIFAR-10 with 250 labeled examples we reach 93.73% accuracy (compared to MixMatch's accuracy of 93.58% with 4000 examples) and a median accuracy of 84.92% with just four labels per class.
\ No newline at end of file diff --git a/data/2020/iclr/Reanalysis of Variance Reduced Temporal Difference Learning b/data/2020/iclr/Reanalysis of Variance Reduced Temporal Difference Learning new file mode 100644 index 0000000000..c010eee833 --- /dev/null +++ b/data/2020/iclr/Reanalysis of Variance Reduced Temporal Difference Learning @@ -0,0 +1 @@ +Temporal difference (TD) learning is a popular algorithm for policy evaluation in reinforcement learning, but the vanilla TD can substantially suffer from the inherent optimization variance. A variance reduced TD (VRTD) algorithm was proposed by Korda and La (2015), which applies the variance reduction technique directly to the online TD learning with Markovian samples. In this work, we first point out the technical errors in the analysis of VRTD in Korda and La (2015), and then provide a mathematically solid analysis of the non-asymptotic convergence of VRTD and its variance reduction performance. We show that VRTD is guaranteed to converge to a neighborhood of the fixed-point solution of TD at a linear convergence rate. Furthermore, the variance error (for both i.i.d. and Markovian sampling) and the bias error (for Markovian sampling) of VRTD are significantly reduced by the batch size of variance reduction in comparison to those of vanilla TD. \ No newline at end of file diff --git a/data/2020/iclr/Recurrent neural circuits for contour detection b/data/2020/iclr/Recurrent neural circuits for contour detection new file mode 100644 index 0000000000..57c4024970 --- /dev/null +++ b/data/2020/iclr/Recurrent neural circuits for contour detection @@ -0,0 +1 @@ +We introduce a deep recurrent neural network architecture that approximates visual cortical circuits (Mely et al., 2018). 
We show that this architecture, which we refer to as the 𝜸-net, learns to solve contour detection tasks with better sample efficiency than state-of-the-art feedforward networks, while also exhibiting a classic perceptual illusion, known as the orientation-tilt illusion. Correcting this illusion significantly reduces the 𝜸-net's contour detection accuracy by driving it to prefer low-level edges over high-level object boundary contours. Overall, our study suggests that the orientation-tilt illusion is a byproduct of neural circuits that help biological visual systems achieve robust and efficient contour detection, and that incorporating these circuits in artificial neural networks can improve computer vision. \ No newline at end of file diff --git a/data/2020/iclr/Reinforced active learning for image segmentation b/data/2020/iclr/Reinforced active learning for image segmentation new file mode 100644 index 0000000000..9b2676be8f --- /dev/null +++ b/data/2020/iclr/Reinforced active learning for image segmentation @@ -0,0 +1 @@ +Learning-based approaches for semantic segmentation have two inherent challenges. First, acquiring pixel-wise labels is expensive and time-consuming. Second, realistic segmentation datasets are highly unbalanced: some categories are much more abundant than others, biasing the performance to the most represented ones. In this paper, we are interested in focusing human labelling effort on a small subset of a larger pool of data, minimizing this effort while maximizing the performance of a segmentation model on a hold-out set. We present a new active learning strategy for semantic segmentation based on deep reinforcement learning (RL). An agent learns a policy to select a subset of small informative image regions -- as opposed to entire images -- to be labeled, from a pool of unlabeled data. The region selection decision is made based on the predictions and uncertainties of the segmentation model being trained.
We propose a new modification of the deep Q-network (DQN) formulation for active learning, adapting it to the large-scale nature of semantic segmentation problems. We test the proof of concept on CamVid and provide results on the large-scale dataset Cityscapes. On Cityscapes, our deep RL region-based DQN approach requires roughly 30% less additional labeled data than our most competitive baseline to reach the same performance. Moreover, we find that our method asks for more labels of under-represented categories compared to the baselines, improving their performance and helping to mitigate class imbalance. \ No newline at end of file diff --git a/data/2020/iclr/Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation b/data/2020/iclr/Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation new file mode 100644 index 0000000000..e4c9863536 --- /dev/null +++ b/data/2020/iclr/Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation @@ -0,0 +1 @@ +Natural question generation (QG) aims to generate questions from a passage and an answer. Previous works on QG either (i) ignore the rich structure information hidden in text, (ii) solely rely on cross-entropy loss, which leads to issues like exposure bias and inconsistency between train/test measurement, or (iii) fail to fully exploit the answer information. To address these limitations, in this paper, we propose a reinforcement learning (RL) based graph-to-sequence (Graph2Seq) model for QG. Our model consists of a Graph2Seq generator with a novel Bidirectional Gated Graph Neural Network based encoder to embed the passage, and a hybrid evaluator with a mixed objective combining both cross-entropy and RL losses to ensure the generation of syntactically and semantically valid text. We also introduce an effective Deep Alignment Network for incorporating the answer information into the passage at both the word and contextual levels.
Our model is end-to-end trainable and achieves new state-of-the-art scores, outperforming existing methods by a significant margin on the standard SQuAD benchmark. \ No newline at end of file diff --git a/data/2020/iclr/Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives b/data/2020/iclr/Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives new file mode 100644 index 0000000000..e6c407ee8a --- /dev/null +++ b/data/2020/iclr/Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives @@ -0,0 +1 @@ +Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive decides for itself whether it wishes to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision, and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.
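The decentralized competition described in this abstract can be sketched in a few lines of Python. This is a toy sketch: the information costs and action labels below are hypothetical stand-ins for each primitive's state-encoder KL term and policy output.

```python
import numpy as np

def act_and_penalty(info_costs, actions, reg_coef):
    """Toy sketch of competitive, information-constrained primitives.

    Each primitive reports an information cost (in the paper, the KL term
    of its state encoder; the numbers here are hypothetical) together with
    its proposed action. The primitive requesting the most information
    about the state wins and acts; the summed costs enter the loss as a
    regularizer, so primitives learn to request information only in the
    states they specialize in.
    """
    winner = int(np.argmax(info_costs))
    penalty = reg_coef * float(np.sum(info_costs))
    return actions[winner], penalty

# Three primitives; the second is specialized to the current state.
action, penalty = act_and_penalty([0.1, 2.3, 0.4], ["left", "jump", "wait"], 0.01)
assert action == "jump"
```

The selection itself is not differentiable here; the paper's regularized training objective is what shapes which primitive wins where.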
\ No newline at end of file diff --git a/data/2020/iclr/Relational State-Space Model for Stochastic Multi-Object Systems b/data/2020/iclr/Relational State-Space Model for Stochastic Multi-Object Systems new file mode 100644 index 0000000000..df077199b6 --- /dev/null +++ b/data/2020/iclr/Relational State-Space Model for Stochastic Multi-Object Systems @@ -0,0 +1 @@ +Real-world dynamical systems often consist of multiple stochastic subsystems that interact with each other. Modeling and forecasting the behavior of such dynamics are generally not easy, due to the inherent hardness in understanding the complicated interactions and evolutions of their constituents. This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model that makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects. By letting GNNs cooperate with SSM, R-SSM provides a flexible way to incorporate relational information into the modeling of multi-object dynamics. We further suggest augmenting the model with normalizing flows instantiated for vertex-indexed random variables and propose two auxiliary contrastive objectives to facilitate the learning. The utility of R-SSM is empirically evaluated on synthetic and real time series datasets. \ No newline at end of file diff --git a/data/2020/iclr/Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness b/data/2020/iclr/Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness new file mode 100644 index 0000000000..bb16857547 --- /dev/null +++ b/data/2020/iclr/Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness @@ -0,0 +1 @@ +Pang et al. [1] presented the Max-Mahalanobis center (MMC) loss and argued that the MMC loss is more adversarially robust than the SCE loss. The authors argue that the SCE loss conveys inappropriate supervisory signals to the model, leading to sparse sample density in the feature space.
In this reproducibility challenge, we verify the claims that training with the MMC loss produces adversarially robust models while also achieving accuracy comparable to models trained with the SCE loss. \ No newline at end of file diff --git a/data/2020/iclr/Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks b/data/2020/iclr/Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks new file mode 100644 index 0000000000..3600d783bb --- /dev/null +++ b/data/2020/iclr/Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks @@ -0,0 +1 @@ +Deep convolutional networks often append additive constant ("bias") terms to their convolution operations, enabling a richer repertoire of functional mappings. Biases are also used to facilitate training, by subtracting mean response over batches of training images (a component of "batch normalization"). Recent state-of-the-art blind denoising methods (e.g., DnCNN) seem to require these terms for their success. Here, however, we show that these networks systematically overfit the noise levels for which they are trained: when deployed at noise levels outside the training range, performance degrades dramatically. In contrast, a bias-free architecture -- obtained by removing the constant terms in every layer of the network, including those used for batch normalization -- generalizes robustly across noise levels, while preserving state-of-the-art performance within the training range. Locally, the bias-free network acts linearly on the noisy image, enabling direct analysis of network behavior via standard linear-algebraic tools. These analyses provide interpretations of network functionality in terms of nonlinear adaptive filtering, and projection onto a union of low-dimensional subspaces, connecting the learning-based method to more traditional denoising methodology.
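The homogeneity argument behind the bias-free architecture can be illustrated with a toy 1-D convolution. This is a sketch under simplifying assumptions, not the paper's DnCNN-style network: a purely linear, bias-free operator scales with its input, so behavior learned at one noise level transfers to others, while an additive bias breaks this property.

```python
import numpy as np

def conv1d(x, w, b=0.0):
    # "valid" 1-D convolution with an optional additive bias term
    return np.convolve(x, w, mode="valid") + b

rng = np.random.default_rng(0)
x = rng.normal(size=32)   # toy "noisy signal"
w = rng.normal(size=5)    # toy filter

# A bias-free convolution is homogeneous: scaling the input by alpha
# scales the output by alpha.
alpha = 3.0
bias_free = conv1d(alpha * x, w)
assert np.allclose(bias_free, alpha * conv1d(x, w))

# With a nonzero bias, homogeneity fails: the bias does not rescale
# with the input, which is the paper's intuition for why biased
# denoisers overfit their training noise range.
biased = conv1d(alpha * x, w, b=0.5)
assert not np.allclose(biased, alpha * conv1d(x, w, b=0.5))
```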
\ No newline at end of file diff --git a/data/2020/iclr/Robust Local Features for Improving the Generalization of Adversarial Training b/data/2020/iclr/Robust Local Features for Improving the Generalization of Adversarial Training new file mode 100644 index 0000000000..904befe9f0 --- /dev/null +++ b/data/2020/iclr/Robust Local Features for Improving the Generalization of Adversarial Training @@ -0,0 +1 @@ +Adversarial training has been demonstrated as one of the most effective methods for training robust models to defend against adversarial examples. However, adversarially trained models often lack adversarially robust generalization on unseen testing data. Recent works show that adversarially trained models are more biased towards global structure features. Instead, in this work, we would like to investigate the relationship between the generalization of adversarial training and the robust local features, as the robust local features generalize well for unseen shape variation. To learn the robust local features, we develop a Random Block Shuffle (RBS) transformation to break up the global structure features on normal adversarial examples. We further propose a new approach called Robust Local Features for Adversarial Training (RLFAT), which first learns the robust local features by adversarial training on the RBS-transformed adversarial examples, and then transfers the robust local features into the training of normal adversarial examples. To demonstrate the generality of our argument, we implement RLFAT in current state-of-the-art adversarial training frameworks. Extensive experiments on STL-10, CIFAR-10 and CIFAR-100 show that RLFAT significantly improves both the adversarially robust generalization and the standard generalization of adversarial training. Additionally, we demonstrate that our models capture more local features of the objects in the images, aligning better with human perception.
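The Random Block Shuffle (RBS) transformation lends itself to a short sketch (assuming square images whose side is divisible by the grid size; a toy illustration, not the paper's training pipeline): split the image into a grid of blocks and randomly permute them, destroying global structure while leaving local statistics intact.

```python
import numpy as np

def random_block_shuffle(img, k, rng):
    """Split an image into a k x k grid of blocks and randomly permute them.

    Toy sketch of the RBS idea. Assumes H and W are divisible by k.
    """
    h, w = img.shape[:2]
    bh, bw = h // k, w // k
    blocks = [img[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
              for i in range(k) for j in range(k)]
    perm = rng.permutation(len(blocks))
    out = np.empty_like(img)
    for idx, p in enumerate(perm):
        i, j = divmod(idx, k)
        out[i*bh:(i+1)*bh, j*bw:(j+1)*bw] = blocks[p]
    return out

rng = np.random.default_rng(0)
img = np.arange(64, dtype=float).reshape(8, 8)
shuffled = random_block_shuffle(img, k=2, rng=rng)
# Local content is preserved: the multiset of pixel values is unchanged.
assert np.allclose(np.sort(shuffled.ravel()), np.sort(img.ravel()))
```

In RLFAT this transformation is applied to adversarial examples before the first adversarial-training stage, so the model can only rely on local features.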
\ No newline at end of file diff --git a/data/2020/iclr/Robust training with ensemble consensus b/data/2020/iclr/Robust training with ensemble consensus new file mode 100644 index 0000000000..1b8f9797ef --- /dev/null +++ b/data/2020/iclr/Robust training with ensemble consensus @@ -0,0 +1 @@ +Since deep neural networks are over-parametrized, they may memorize noisy examples. We address this memorization issue in the presence of annotation noise. From the fact that deep neural networks cannot generalize neighborhoods of the features acquired via memorization, we find that noisy examples do not consistently incur small losses on the network in the presence of perturbation. Based on this, we propose a novel training method called Learning with Ensemble Consensus (LEC) whose goal is to prevent overfitting noisy examples by eliminating those identified via the consensus of an ensemble of perturbed networks. One of the proposed LECs, LTEC, outperforms the current state-of-the-art methods on MNIST, CIFAR-10, and CIFAR-100 while remaining memory-efficient. \ No newline at end of file diff --git a/data/2020/iclr/SAdam: A Variant of Adam for Strongly Convex Functions b/data/2020/iclr/SAdam: A Variant of Adam for Strongly Convex Functions new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2020/iclr/SELF: Learning to Filter Noisy Labels with Self-Ensembling b/data/2020/iclr/SELF: Learning to Filter Noisy Labels with Self-Ensembling new file mode 100644 index 0000000000..f8301b0726 --- /dev/null +++ b/data/2020/iclr/SELF: Learning to Filter Noisy Labels with Self-Ensembling @@ -0,0 +1 @@ +Deep neural networks (DNNs) have been shown to over-fit a dataset when trained with noisy labels for long enough. To overcome this problem, we present a simple and effective method, self-ensemble label filtering (SELF), to progressively filter out the wrong labels during training.
Our method improves the task performance by gradually allowing supervision only from the potentially non-noisy (clean) labels and stops learning on the filtered noisy labels. For the filtering, we form running averages of predictions over the entire training dataset using the network output at different training epochs. We show that these ensemble estimates yield more accurate identification of inconsistent predictions throughout training than the single estimates of the network at the most recent training epoch. While filtered samples are removed entirely from the supervised training loss, we dynamically leverage them via semi-supervised learning in the unsupervised loss. We demonstrate the positive effect of such an approach on various image classification tasks under both symmetric and asymmetric label noise and at different noise ratios. It substantially outperforms all previous works on noise-aware learning across different datasets and can be applied to a broad set of network architectures. \ No newline at end of file diff --git a/data/2020/iclr/SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards b/data/2020/iclr/SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards new file mode 100644 index 0000000000..ae6c997af1 --- /dev/null +++ b/data/2020/iclr/SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards @@ -0,0 +1 @@ +Learning to imitate expert behavior from demonstrations can be challenging, especially in environments with high-dimensional, continuous observations and unknown dynamics. Supervised learning methods based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates demonstrated actions, it can drift away from demonstrated states due to error accumulation. 
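SELF's epoch-wise ensembling step might look roughly like this. This is a toy sketch with a hypothetical exponential moving average and made-up predictions; the paper's exact averaging scheme may differ, but the idea is the same: average predictions across epochs and keep only examples whose averaged prediction agrees with the given label.

```python
import numpy as np

def filter_noisy(pred_history, labels, momentum=0.9):
    """Toy sketch of SELF-style label filtering.

    `pred_history` holds the network's class probabilities for every
    training example at successive epochs. A running (exponential moving)
    average over epochs is formed, and only examples whose averaged
    prediction agrees with the given label are kept for the supervised
    loss; the rest are treated as potentially noisy.
    """
    avg = pred_history[0]
    for preds in pred_history[1:]:
        avg = momentum * avg + (1 - momentum) * preds
    return np.argmax(avg, axis=1) == labels

# Two examples, two classes, three epochs of (hypothetical) predictions.
history = [np.array([[0.9, 0.1], [0.4, 0.6]]),
           np.array([[0.8, 0.2], [0.3, 0.7]]),
           np.array([[0.9, 0.1], [0.4, 0.6]])]
labels = np.array([0, 0])   # the ensemble consistently disputes the second label
keep = filter_noisy(history, labels)
assert keep.tolist() == [True, False]
```

In the full method the filtered-out examples are not discarded; they re-enter training through the unsupervised (semi-supervised) loss.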
Recent methods based on reinforcement learning (RL), such as inverse RL and generative adversarial imitation learning (GAIL), overcome this issue by training an RL agent to match the demonstrations over a long horizon. Since the true reward function for the task is unknown, these methods learn a reward function from the demonstrations, often using complex and brittle approximation techniques that involve adversarial training. We propose a simple alternative that still uses RL, but does not require learning a reward function. The key idea is to provide the agent with an incentive to match the demonstrations over a long horizon, by encouraging it to return to demonstrated states upon encountering new, out-of-distribution states. We accomplish this by giving the agent a constant reward of r=+1 for matching the demonstrated action in a demonstrated state, and a constant reward of r=0 for all other behavior. Our method, which we call soft Q imitation learning (SQIL), can be implemented with a handful of minor modifications to any standard Q-learning or off-policy actor-critic algorithm. Theoretically, we show that SQIL can be interpreted as a regularized variant of BC that uses a sparsity prior to encourage long-horizon imitation. Empirically, we show that SQIL outperforms BC and achieves competitive results compared to GAIL, on a variety of image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo. \ No newline at end of file diff --git a/data/2020/iclr/Sampling-Free Learning of Bayesian Quantized Neural Networks b/data/2020/iclr/Sampling-Free Learning of Bayesian Quantized Neural Networks new file mode 100644 index 0000000000..ec8a5682e4 --- /dev/null +++ b/data/2020/iclr/Sampling-Free Learning of Bayesian Quantized Neural Networks @@ -0,0 +1 @@ +Bayesian learning of model parameters in neural networks is important in scenarios where estimates with well-calibrated uncertainty are important. 
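SQIL's constant-reward relabeling is concrete enough to sketch with tabular Q-learning. The chain MDP and demonstrations below are hypothetical toys; the paper applies the same relabeling inside deep Q-learning or off-policy actor-critic.

```python
import random
import numpy as np

def sqil_reward(transition, demo_set):
    """SQIL's reward relabeling: r = +1 for a demonstrated (state, action)
    pair and r = 0 for everything else; the true task reward is never used."""
    s, a = transition
    return 1.0 if (s, a) in demo_set else 0.0

# Toy 4-state chain (states 0..3). A hypothetical expert always moves right,
# so the demonstrations are (s, +1) for s = 0, 1, 2. Any standard Q-learning
# loop works; only the reward is replaced.
demos = {(s, +1) for s in range(3)}
Q = np.zeros((4, 2))                      # action index 0 -> step -1, 1 -> step +1
rng = random.Random(0)
for _ in range(2000):
    s = rng.randrange(3)
    a_idx = rng.randrange(2)
    a = -1 if a_idx == 0 else +1
    s_next = min(max(s + a, 0), 3)
    r = sqil_reward((s, a), demos)        # relabeled reward, not the env's
    Q[s, a_idx] += 0.5 * (r + 0.5 * Q[s_next].max() - Q[s, a_idx])

# The greedy policy imitates the expert: move right in every visited state.
assert all(Q[s, 1] > Q[s, 0] for s in range(3))
```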
In this paper, we propose Bayesian quantized networks (BQNs), quantized neural networks (QNNs) for which we learn a posterior distribution over their discrete parameters. We provide a set of efficient algorithms for learning and prediction in BQNs without the need to sample from their parameters or activations, which not only allows for differentiable learning in QNNs, but also reduces the variance in gradients. We evaluate BQNs on the MNIST, Fashion-MNIST, KMNIST and CIFAR10 image classification datasets, compared against a bootstrap ensemble of QNNs (E-QNN). We demonstrate that BQNs achieve both lower predictive errors and better-calibrated uncertainties than E-QNN (with less than 20% of the negative log-likelihood). \ No newline at end of file diff --git a/data/2020/iclr/Scalable Model Compression by Entropy Penalized Reparameterization b/data/2020/iclr/Scalable Model Compression by Entropy Penalized Reparameterization new file mode 100644 index 0000000000..e0c0c0c38f --- /dev/null +++ b/data/2020/iclr/Scalable Model Compression by Entropy Penalized Reparameterization @@ -0,0 +1 @@ +We describe a simple and general neural network weight compression approach, in which the network parameters (weights and biases) are represented in a "latent" space, amounting to a reparameterization. This space is equipped with a learned probability model, which is used to impose an entropy penalty on the parameter representation during training, and to compress the representation using a simple arithmetic coder after training. Classification accuracy and model compressibility are maximized jointly, with the bitrate--accuracy trade-off specified by a hyperparameter. We evaluate the method on the MNIST, CIFAR-10 and ImageNet classification benchmarks using six distinct model architectures. Our results show that state-of-the-art model compression can be achieved in a scalable and general way without requiring complex procedures such as multi-stage training.
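The joint rate-accuracy objective of the entropy-penalized approach can be sketched as follows. This is a toy: the symbol probabilities, latent weights, and loss values are hypothetical, and the real method learns the probability model end-to-end and compresses with an arithmetic coder after training.

```python
import numpy as np

def rate_bits(w_latent, probs):
    """Code length (in bits) of quantized latent weights under a pmf.

    `probs` plays the role of the learned probability model that an
    arithmetic coder would use after training.
    """
    return float(-np.sum(np.log2(probs[w_latent])))

def penalized_loss(task_loss, w_latent, probs, lam):
    # Joint objective: accuracy term plus a rate (entropy) penalty,
    # with `lam` setting the bitrate--accuracy trade-off.
    return task_loss + lam * rate_bits(w_latent, probs)

# Hypothetical example: 5 quantized weights over a 4-symbol alphabet.
probs = np.array([0.7, 0.1, 0.1, 0.1])    # learned pmf over symbols
w_latent = np.array([0, 0, 1, 0, 3])      # quantized latent weights
rate = rate_bits(w_latent, probs)         # about 8.19 bits for these weights
loss = penalized_loss(1.25, w_latent, probs, lam=0.01)
```

Frequent symbols (here symbol 0) are cheap to code, so the penalty pushes the latent representation toward a low-entropy, highly compressible distribution.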
\ No newline at end of file diff --git a/data/2020/iclr/Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base b/data/2020/iclr/Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base new file mode 100644 index 0000000000..e8079dc9f5 --- /dev/null +++ b/data/2020/iclr/Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base @@ -0,0 +1 @@ +We describe a novel way of representing a symbolic knowledge base (KB) called a sparse-matrix reified KB. This representation enables neural modules that are fully differentiable, faithful to the original semantics of the KB, expressive enough to model multi-hop inferences, and scalable enough to use with realistically large KBs. The sparse-matrix reified KB can be distributed across multiple GPUs, can scale to tens of millions of entities and facts, and is orders of magnitude faster than naive sparse-matrix implementations. The reified KB enables very simple end-to-end architectures to obtain competitive performance on several benchmarks representing two families of tasks: KB completion, and learning semantic parsers from denotations. \ No newline at end of file diff --git a/data/2020/iclr/Scalable and Order-robust Continual Learning with Additive Parameter Decomposition b/data/2020/iclr/Scalable and Order-robust Continual Learning with Additive Parameter Decomposition new file mode 100644 index 0000000000..ddf47a38a5 --- /dev/null +++ b/data/2020/iclr/Scalable and Order-robust Continual Learning with Additive Parameter Decomposition @@ -0,0 +1 @@ +While recent continual learning methods largely alleviate the catastrophic forgetting problem on toy-sized datasets, some issues remain to be tackled to apply them to real-world problem domains. First, a continual learning model should effectively handle catastrophic forgetting and be efficient to train even with a large number of tasks.
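Relation-following in a matrix-encoded KB reduces to a matrix-vector product, which a dense toy version makes explicit. The actual reified KB stores all relations in a few large sparse matrices; dense numpy stands in here to keep the sketch dependency-free, and the entities and relation are hypothetical.

```python
import numpy as np

# Toy KB: entities 0..3 and one relation "capital_of", stored as a matrix
# M with M[s, o] = 1 iff the fact (s, capital_of, o) is in the KB.
n = 4
M_capital_of = np.zeros((n, n))
M_capital_of[0, 2] = 1.0   # hypothetical fact: 0 capital_of 2
M_capital_of[1, 3] = 1.0   # hypothetical fact: 1 capital_of 3

def follow(x, M):
    """Differentiable relation-following: a weighted set of subject
    entities `x` maps to the weighted set of objects reachable via M.
    Multi-hop inference is just repeated multiplication."""
    return x @ M

x = np.array([1.0, 0.0, 0.0, 0.0])         # the entity set {0}
y = follow(x, M_capital_of)
assert y.tolist() == [0.0, 0.0, 1.0, 0.0]  # reaches the set {2}
```

Because `follow` is linear in `x`, it is differentiable, faithful to the KB's semantics, and (with sparse matrices) scales to very large KBs, which is the core of the paper's argument.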
Secondly, it needs to tackle the problem of order-sensitivity, where the performance of the tasks largely varies based on the order of the task arrival sequence, as it may cause serious problems where fairness plays a critical role (e.g. medical diagnosis). To tackle these practical challenges, we propose a novel continual learning method that is scalable as well as order-robust, which instead of learning a completely shared set of weights, represents the parameters for each task as a sum of task-shared and sparse task-adaptive parameters. With our Additive Parameter Decomposition (APD), the task-adaptive parameters for earlier tasks remain mostly unaffected, where we update them only to reflect the changes made to the task-shared parameters. This decomposition of parameters effectively prevents catastrophic forgetting and order-sensitivity, while being computation- and memory-efficient. Further, we can achieve even better scalability with APD using hierarchical knowledge consolidation, which clusters the task-adaptive parameters to obtain hierarchically shared parameters. We validate our network with APD, APD-Net, on multiple benchmark datasets against state-of-the-art continual learning methods, which it largely outperforms in accuracy, scalability, and order-robustness. \ No newline at end of file diff --git a/data/2020/iclr/Selection via Proxy: Efficient Data Selection for Deep Learning b/data/2020/iclr/Selection via Proxy: Efficient Data Selection for Deep Learning new file mode 100644 index 0000000000..e54e21f84a --- /dev/null +++ b/data/2020/iclr/Selection via Proxy: Efficient Data Selection for Deep Learning @@ -0,0 +1 @@ +Data selection methods such as active learning and core-set selection are useful tools for machine learning on large datasets, but they can be prohibitively expensive to apply in deep learning. 
Unlike in other areas of machine learning, the feature representations that these techniques depend on are learned in deep learning rather than given, which takes a substantial amount of training time. In this work, we show that we can significantly improve the computational efficiency of data selection in deep learning by using a much smaller proxy model to perform data selection for tasks that will eventually require a large target model (e.g., selecting data points to label for active learning). In deep learning, we can scale down models by removing hidden layers or reducing their dimension to create proxies that are an order of magnitude faster. Although these small proxy models have significantly higher error, we find that they empirically provide useful rankings for data selection that have a high correlation with those of larger models. We evaluate this "selection via proxy" (SVP) approach on several data selection tasks. For active learning, applying SVP to Sener and Savarese [2018]'s recent method for active learning in deep learning gives a 4x improvement in execution time while yielding the same model accuracy. For core-set selection, we show that a proxy model that trains 10x faster than a target ResNet164 model on CIFAR10 can be used to remove 50% of the training data without compromising the accuracy of the target model, making end-to-end training time improvements via core-set selection possible. 
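A minimal version of the "selection via proxy" loop can be sketched with predictive entropy as the uncertainty measure. The acquisition function and pool are assumptions for illustration; the paper also studies core-set selection, and the key point is only that the ranking comes from a small, cheap proxy model rather than the large target model.

```python
import numpy as np

def select_via_proxy(proxy_probs, k):
    """Pick the k most uncertain pool examples using a cheap proxy model.

    `proxy_probs` holds the proxy's predicted class probabilities for each
    unlabeled example; uncertainty is measured by predictive entropy.
    The selected examples are then labeled and used to train the large
    target model.
    """
    entropy = -np.sum(proxy_probs * np.log(proxy_probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:k]

# Hypothetical pool of 4 examples, 2 classes.
probs = np.array([[0.99, 0.01],   # confident: low value for labeling
                  [0.55, 0.45],   # uncertain: high value
                  [0.90, 0.10],
                  [0.50, 0.50]])  # most uncertain
chosen = select_via_proxy(probs, k=2)
assert set(chosen) == {1, 3}
```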
\ No newline at end of file diff --git a/data/2020/iclr/Self-Adversarial Learning with Comparative Discrimination for Text Generation b/data/2020/iclr/Self-Adversarial Learning with Comparative Discrimination for Text Generation new file mode 100644 index 0000000000..52867e6495 --- /dev/null +++ b/data/2020/iclr/Self-Adversarial Learning with Comparative Discrimination for Text Generation @@ -0,0 +1 @@ +Conventional Generative Adversarial Networks (GANs) for text generation tend to have issues of reward sparsity and mode collapse that affect the quality and diversity of generated samples. To address the issues, we propose a novel self-adversarial learning (SAL) paradigm for improving GANs' performance in text generation. In contrast to standard GANs that use a binary classifier as its discriminator to predict whether a sample is real or generated, SAL employs a comparative discriminator which is a pairwise classifier for comparing the text quality between a pair of samples. During training, SAL rewards the generator when its currently generated sentence is found to be better than its previously generated samples. This self-improvement reward mechanism allows the model to receive credits more easily and avoid collapsing towards the limited number of real samples, which not only helps alleviate the reward sparsity issue but also reduces the risk of mode collapse. Experiments on text generation benchmark datasets show that our proposed approach substantially improves both the quality and the diversity, and yields more stable performance compared to the previous GANs for text generation. 
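The comparative-discriminator reward at the heart of SAL can be sketched as follows. The scorer below is a toy stand-in; in SAL the pairwise classifier is learned jointly with the generator, and "previous" samples come from the generator's own earlier outputs.

```python
def self_adversarial_reward(score_fn, current, previous):
    """Reward for the generator under self-adversarial learning (SAL).

    Instead of a real/fake probability, a comparative discriminator scores
    a *pair* of samples; the generator is rewarded when its current sample
    beats its own previously generated one. `score_fn(a, b)` is assumed to
    return the probability that `a` is better text than `b` (a hypothetical
    stand-in for the trained pairwise classifier).
    """
    return score_fn(current, previous) - 0.5   # positive iff current improves

# Toy stand-in scorer: "quality" is just sample length here.
def score(a, b):
    if len(a) > len(b):
        return 1.0
    return 0.5 if len(a) == len(b) else 0.0

assert self_adversarial_reward(score, "a longer sentence", "short") > 0
assert self_adversarial_reward(score, "tiny", "a longer sentence") < 0
```

Because the reference point moves with the generator itself, credit is easier to obtain early in training, which is the mechanism the abstract credits for alleviating reward sparsity.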
\ No newline at end of file diff --git a/data/2020/iclr/Semantically-Guided Representation Learning for Self-Supervised Monocular Depth b/data/2020/iclr/Semantically-Guided Representation Learning for Self-Supervised Monocular Depth new file mode 100644 index 0000000000..dfa53859c4 --- /dev/null +++ b/data/2020/iclr/Semantically-Guided Representation Learning for Self-Supervised Monocular Depth @@ -0,0 +1 @@ +Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage more directly this semantic structure to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions. Furthermore, we propose a two-stage training process to overcome a common semantic bias on dynamic objects via resampling. Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories. \ No newline at end of file diff --git a/data/2020/iclr/Sharing Knowledge in Multi-Task Deep Reinforcement Learning b/data/2020/iclr/Sharing Knowledge in Multi-Task Deep Reinforcement Learning new file mode 100644 index 0000000000..1359ce66b3 --- /dev/null +++ b/data/2020/iclr/Sharing Knowledge in Multi-Task Deep Reinforcement Learning @@ -0,0 +1 @@ +We study the benefit of sharing representations among tasks to enable the effective use of deep neural networks in Multi-Task Reinforcement Learning. 
We leverage the assumption that learning from different tasks that share common properties helps to generalize the knowledge across them, resulting in more effective feature extraction than learning a single task. Intuitively, the resulting set of features offers performance benefits when used by Reinforcement Learning algorithms. We prove this by providing theoretical guarantees that highlight the conditions under which it is convenient to share representations among tasks, extending the well-known finite-time bounds of Approximate Value-Iteration to the multi-task setting. In addition, we complement our analysis by proposing multi-task extensions of three Reinforcement Learning algorithms that we empirically evaluate on widely used Reinforcement Learning benchmarks showing significant improvements over the single-task counterparts in terms of sample efficiency and performance. \ No newline at end of file diff --git a/data/2020/iclr/Short and Sparse Deconvolution - A Geometric Approach b/data/2020/iclr/Short and Sparse Deconvolution - A Geometric Approach new file mode 100644 index 0000000000..975bce4b80 --- /dev/null +++ b/data/2020/iclr/Short and Sparse Deconvolution - A Geometric Approach @@ -0,0 +1 @@ +Short-and-sparse deconvolution (SaSD) is the problem of extracting localized, recurring motifs in signals with spatial or temporal structure. Variants of this problem arise in applications such as image deblurring, microscopy, neural spike sorting, and more. The problem is challenging in both theory and practice, as natural optimization formulations are nonconvex. Moreover, practical deconvolution problems involve smooth motifs (kernels) whose spectra decay rapidly, resulting in poor conditioning and numerical challenges. This paper is motivated by recent theoretical advances, which characterize the optimization landscape of a particular nonconvex formulation of SaSD.
This is used to derive a $provable$ algorithm which exactly solves certain non-practical instances of the SaSD problem. We leverage the key ideas from this theory (sphere constraints, data-driven initialization) to develop a $practical$ algorithm, which performs well on data arising from a range of application areas. We highlight key additional challenges posed by the ill-conditioning of real SaSD problems, and suggest heuristics (acceleration, continuation, reweighting) to mitigate them. Experiments demonstrate both the performance and generality of the proposed method. \ No newline at end of file diff --git a/data/2020/iclr/Sign Bits Are All You Need for Black-Box Attacks b/data/2020/iclr/Sign Bits Are All You Need for Black-Box Attacks new file mode 100644 index 0000000000..58fac77e48 --- /dev/null +++ b/data/2020/iclr/Sign Bits Are All You Need for Black-Box Attacks @@ -0,0 +1 @@ +We present a novel black-box adversarial attack algorithm with state-of-the-art model evasion rates for query efficiency under $\ell_\infty$ and $\ell_2$ metrics. It exploits a \textit{sign-based}, rather than magnitude-based, gradient estimation approach that shifts the gradient estimation from continuous to binary black-box optimization. It adaptively constructs queries to estimate the gradient, one query relying upon the previous, rather than re-estimating the gradient each step with random query construction. Its reliance on sign bits yields a smaller memory footprint and it requires neither hyperparameter tuning nor dimensionality reduction. Further, its theoretical performance is guaranteed and it can characterize adversarial subspaces better than white-box gradient-aligned subspaces. On two public black-box attack challenges and a model robustly trained against transfer attacks, the algorithm's evasion rates surpass all submitted attacks.
For a suite of published models, the algorithm is $3.8\times$ less failure-prone while spending $2.5\times$ fewer queries versus the best combination of state of art algorithms. For example, it evades a standard MNIST model using just $12$ queries on average. Similar performance is observed on a standard IMAGENET model with an average of $579$ queries. \ No newline at end of file diff --git a/data/2020/iclr/Sign-OPT: A Query-Efficient Hard-label Adversarial Attack b/data/2020/iclr/Sign-OPT: A Query-Efficient Hard-label Adversarial Attack new file mode 100644 index 0000000000..9067c6b8f2 --- /dev/null +++ b/data/2020/iclr/Sign-OPT: A Query-Efficient Hard-label Adversarial Attack @@ -0,0 +1 @@ +We study the most practical problem setup for evaluating adversarial robustness of a machine learning system with limited access: the hard-label black-box attack setting for generating adversarial examples, where limited model queries are allowed and only the decision is provided to a queried data input. Several algorithms have been proposed for this problem but they typically require a huge number (>20,000) of queries to attack one example. Among them, one of the state-of-the-art approaches (Cheng et al., 2019) showed that hard-label attack can be modeled as an optimization problem where the objective function can be evaluated by binary search with additional model queries, so that a zeroth-order optimization algorithm can be applied. In this paper, we adopt the same optimization formulation but propose to directly estimate the sign of the gradient at any direction instead of the gradient itself, which enjoys the benefit of requiring only a single query. Using this single-query oracle for retrieving the sign of the directional derivative, we develop a novel query-efficient Sign-OPT approach for hard-label black-box attack. We provide a convergence analysis of the new algorithm and conduct experiments on several models on MNIST, CIFAR-10 and ImageNet.
We find that Sign-OPT attack consistently requires 5X to 10X fewer queries when compared to the current state-of-the-art approaches, and usually converges to an adversarial example with smaller perturbation. \ No newline at end of file diff --git a/data/2020/iclr/SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum b/data/2020/iclr/SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum new file mode 100644 index 0000000000..a09320b33a --- /dev/null +++ b/data/2020/iclr/SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum @@ -0,0 +1 @@ +Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, where workers periodically synchronize and perform a momentum update, after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SlowMo runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses. Since BMUF can be expressed through the SlowMo framework, our results also correspond to the first theoretical convergence guarantees for BMUF. 
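One outer iteration of the slow-momentum scheme can be sketched on a toy quadratic with identical workers. Here `worker_grads_fn` is a hypothetical stand-in for real distributed minibatch gradients, and plain SGD plays the role of the base optimizer.

```python
import numpy as np

def slowmo_round(x_slow, worker_grads_fn, n_workers, inner_steps,
                 lr, slow_lr, slow_mom, u):
    """One outer iteration of the SlowMo framework (toy sketch).

    Each worker runs `inner_steps` of the base optimizer (plain SGD here)
    from the shared slow weights, the workers are averaged, and the slow
    weights take a momentum step toward the average.
    """
    workers = [x_slow.copy() for _ in range(n_workers)]
    for i, w in enumerate(workers):
        for _ in range(inner_steps):
            w -= lr * worker_grads_fn(w, i)
    x_avg = np.mean(workers, axis=0)
    # Slow momentum buffer accumulates the (normalized) averaged displacement.
    u = slow_mom * u + (x_slow - x_avg) / lr
    x_slow = x_slow - slow_lr * lr * u
    return x_slow, u

# Minimize f(w) = ||w||^2 / 2 (gradient w), identical on every worker.
grads = lambda w, i: w
x = np.array([4.0, -2.0])
u = np.zeros_like(x)
for _ in range(30):
    x, u = slowmo_round(x, grads, n_workers=4, inner_steps=5,
                        lr=0.1, slow_lr=1.0, slow_mom=0.5, u=u)
assert np.linalg.norm(x) < 1e-2   # the slow iterate approaches the optimum
```

With `slow_mom = 0` and `slow_lr = 1` the round reduces to local SGD with periodic averaging, which is why the framework also covers methods like BMUF as special cases.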
\ No newline at end of file diff --git a/data/2020/iclr/Stochastic AUC Maximization with Deep Neural Networks b/data/2020/iclr/Stochastic AUC Maximization with Deep Neural Networks new file mode 100644 index 0000000000..da74cbcc65 --- /dev/null +++ b/data/2020/iclr/Stochastic AUC Maximization with Deep Neural Networks @@ -0,0 +1 @@ +Stochastic AUC maximization has garnered increasing interest due to its better fit to imbalanced data classification. However, existing works are limited to stochastic AUC maximization with a linear predictive model, which restricts its predictive power when dealing with extremely complex data. In this paper, we consider the stochastic AUC maximization problem with a deep neural network as the predictive model. Building on the saddle point reformulation of a surrogated loss of AUC, the problem can be cast into a {\it non-convex concave} min-max problem. The main contribution made in this paper is to make stochastic AUC maximization more practical for deep neural networks and big data with theoretical insights as well. In particular, we propose to explore the Polyak-Łojasiewicz (PL) condition, which has been proved and observed in deep learning and enables us to develop new stochastic algorithms with an even faster convergence rate and a more practical step size scheme. An AdaGrad-style algorithm is also analyzed under the PL condition with an adaptive convergence rate. Our experimental results demonstrate the effectiveness of the proposed algorithms.
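The pairwise objective being surrogated can be written down directly. This is a toy sketch of a squared-hinge AUC surrogate: the paper actually optimizes an equivalent min-max saddle-point reformulation of such a surrogate so that it admits stochastic per-example updates, which the pairwise form below does not.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    # Empirical AUC: fraction of (positive, negative) pairs ranked correctly,
    # with ties counted as half.
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def pairwise_sq_loss(scores_pos, scores_neg, margin=1.0):
    # Squared-hinge surrogate of the AUC risk over all positive-negative pairs.
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean(np.maximum(0.0, margin - diff) ** 2))

pos = np.array([2.0, 1.5, 0.2])   # hypothetical scores for positive examples
neg = np.array([0.0, -1.0])       # hypothetical scores for negative examples
assert auc(pos, neg) == 1.0       # every positive already outranks every negative
# Better-separated scores give a lower surrogate loss.
assert pairwise_sq_loss(pos + 1.0, neg) < pairwise_sq_loss(pos, neg)
```

Because the surrogate couples every positive with every negative, naive minibatching is awkward; the saddle-point reformulation is what makes stochastic optimization with a deep network tractable.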
\ No newline at end of file diff --git a/data/2020/iclr/Stochastic Conditional Generative Networks with Basis Decomposition b/data/2020/iclr/Stochastic Conditional Generative Networks with Basis Decomposition new file mode 100644 index 0000000000..679d753453 --- /dev/null +++ b/data/2020/iclr/Stochastic Conditional Generative Networks with Basis Decomposition @@ -0,0 +1 @@ +While generative adversarial networks (GANs) have revolutionized machine learning, a number of open questions remain to fully understand them and exploit their power. One of these questions is how to efficiently achieve proper diversity and sampling of the multi-mode data space. To address this, we introduce BasisGAN, a stochastic conditional multi-mode image generator. By exploiting the observation that a convolutional filter can be well approximated as a linear combination of a small set of basis elements, we learn a plug-and-play basis generator that stochastically generates basis elements, with just a few hundred parameters, to fully embed stochasticity into convolutional filters. By sampling basis elements instead of filters, we dramatically reduce the cost of modeling the parameter space with no sacrifice in either image diversity or fidelity. To illustrate the proposed plug-and-play framework, we construct variants of BasisGAN based on state-of-the-art conditional image generation networks, and train the networks by simply plugging in a basis generator, without additional auxiliary components, hyperparameters, or training objectives. The experimental success is complemented with theoretical results indicating how the perturbations introduced by the proposed sampling of basis elements can propagate to the appearance of generated images. 
\ No newline at end of file diff --git a/data/2020/iclr/Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well b/data/2020/iclr/Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well new file mode 100644 index 0000000000..91e0696677 --- /dev/null +++ b/data/2020/iclr/Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well @@ -0,0 +1 @@ +We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models computed independently and in parallel. The resulting models generalize as well as those trained with small mini-batches but are produced in a substantially shorter time. We demonstrate the reduction in training time and the good generalization performance of the resulting models on the computer vision datasets CIFAR10, CIFAR100, and ImageNet. \ No newline at end of file diff --git a/data/2020/iclr/StructPool: Structured Graph Pooling via Conditional Random Fields b/data/2020/iclr/StructPool: Structured Graph Pooling via Conditional Random Fields new file mode 100644 index 0000000000..55d9c198b3 --- /dev/null +++ b/data/2020/iclr/StructPool: Structured Graph Pooling via Conditional Random Fields @@ -0,0 +1 @@ +Learning high-level representations for graphs is of great importance for graph analysis tasks. In addition to graph convolution, graph pooling is an important but less explored research area. In particular, most existing graph pooling techniques do not consider the graph structural information explicitly. We argue that such information is important and develop a novel graph pooling technique, known as StructPool, in this work. We consider graph pooling as a node clustering problem, which requires the learning of a cluster assignment matrix. 
We propose to formulate it as a structured prediction problem and employ conditional random fields to capture the relationships among assignments of different nodes. We also generalize our method to incorporate graph topological information in designing the Gibbs energy function. Experimental results on multiple datasets demonstrate the effectiveness of our proposed StructPool. \ No newline at end of file diff --git a/data/2020/iclr/TabFact: A Large-scale Dataset for Table-based Fact Verification b/data/2020/iclr/TabFact: A Large-scale Dataset for Table-based Fact Verification new file mode 100644 index 0000000000..effc06e856 --- /dev/null +++ b/data/2020/iclr/TabFact: A Large-scale Dataset for Table-based Fact Verification @@ -0,0 +1 @@ +The problem of verifying whether a textual hypothesis holds based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies are mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc.), while verification under structured evidence, such as tables, graphs, and databases, remains under-explored. This paper specifically aims to study fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. TabFact is challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. 
LPA parses statements into programs and executes them against the tables to obtain the returned binary value for verification. Both methods achieve similar accuracy but still lag far behind human performance. We also perform a comprehensive analysis to demonstrate great future opportunities. The data and code of the dataset are provided in \url{this https URL}. \ No newline at end of file diff --git a/data/2020/iclr/The Implicit Bias of Depth: How Incremental Learning Drives Generalization b/data/2020/iclr/The Implicit Bias of Depth: How Incremental Learning Drives Generalization new file mode 100644 index 0000000000..3beb98778f --- /dev/null +++ b/data/2020/iclr/The Implicit Bias of Depth: How Incremental Learning Drives Generalization @@ -0,0 +1 @@ +A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning. 
\ No newline at end of file diff --git a/data/2020/iclr/The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget b/data/2020/iclr/The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget new file mode 100644 index 0000000000..8e8ab78bd6 --- /dev/null +++ b/data/2020/iclr/The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget @@ -0,0 +1 @@ +In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision about which input features are relevant. The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information), and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a ``privileged'' input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, this compression operation itself needs unrestricted, lossless access. 
In this work, we propose the variational bandwidth bottleneck, which decides for each example on the estimated value of the privileged information before seeing it, i.e., only based on the standard input, and then accordingly chooses stochastically, whether to access the privileged input or not. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information. \ No newline at end of file diff --git a/data/2020/iclr/The asymptotic spectrum of the Hessian of DNN throughout training b/data/2020/iclr/The asymptotic spectrum of the Hessian of DNN throughout training new file mode 100644 index 0000000000..7b085e8c20 --- /dev/null +++ b/data/2020/iclr/The asymptotic spectrum of the Hessian of DNN throughout training @@ -0,0 +1 @@ +The dynamics of DNNs during gradient descent is described by the so-called Neural Tangent Kernel (NTK). In this article, we show that the NTK allows one to gain precise insight into the Hessian of the cost of DNNs. When the NTK is fixed during training, we obtain a full characterization of the asymptotics of the spectrum of the Hessian, at initialization and during training. In the so-called mean-field limit, where the NTK is not fixed during training, we describe the first two moments of the Hessian at initialization. \ No newline at end of file diff --git a/data/2020/iclr/Theory and Evaluation Metrics for Learning Disentangled Representations b/data/2020/iclr/Theory and Evaluation Metrics for Learning Disentangled Representations new file mode 100644 index 0000000000..53ca3700c5 --- /dev/null +++ b/data/2020/iclr/Theory and Evaluation Metrics for Learning Disentangled Representations @@ -0,0 +1 @@ +We make two theoretical contributions to disentanglement learning by (a) defining precise semantics of disentangled representations, and (b) establishing robust metrics for evaluation. 
First, we characterize the concept "disentangled representations" used in supervised and unsupervised methods along three dimensions - informativeness, separability, and interpretability - which can be expressed and quantified explicitly using information-theoretic constructs. This helps explain the behaviors of several well-known disentanglement learning models. We then propose robust metrics for measuring informativeness, separability and interpretability. Through a comprehensive suite of experiments, we show that our metrics correctly characterize the representations learned by different methods and are consistent with qualitative (visual) results. Thus, the metrics allow disentanglement learning methods to be compared on fair ground. We also empirically uncover interesting new properties of VAE-based methods and interpret them with our formulation. These findings are promising and hopefully will encourage the design of more theoretically driven models for learning disentangled representations. \ No newline at end of file diff --git a/data/2020/iclr/Thieves on Sesame Street! Model Extraction of BERT-based APIs b/data/2020/iclr/Thieves on Sesame Street! Model Extraction of BERT-based APIs new file mode 100644 index 0000000000..562fc74515 --- /dev/null +++ b/data/2020/iclr/Thieves on Sesame Street! Model Extraction of BERT-based APIs @@ -0,0 +1 @@ +We study the problem of model extraction in natural language processing, in which an adversary with only query access to a victim model attempts to reconstruct a local copy of that model. Assuming that both the adversary and victim model fine-tune a large pretrained language model such as BERT (Devlin et al., 2019), we show that the adversary does not need any real training data to successfully mount the attack. 
In fact, the attacker need not even use grammatical or semantically meaningful queries: we show that random sequences of words coupled with task-specific heuristics form effective queries for model extraction on a diverse set of NLP tasks, including natural language inference and question answering. Our work thus highlights an exploit only made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim model. Finally, we study two defense strategies against model extraction---membership classification and API watermarking---which while successful against naive adversaries, are ineffective against more sophisticated ones. \ No newline at end of file diff --git a/data/2020/iclr/To Relieve Your Headache of Training an MRF, Take AdVIL b/data/2020/iclr/To Relieve Your Headache of Training an MRF, Take AdVIL new file mode 100644 index 0000000000..e7717bcce8 --- /dev/null +++ b/data/2020/iclr/To Relieve Your Headache of Training an MRF, Take AdVIL @@ -0,0 +1 @@ +We propose a black-box algorithm called {\it Adversarial Variational Inference and Learning} (AdVIL) to perform inference and learning on a general Markov random field (MRF). AdVIL employs two variational distributions to approximately infer the latent variables and estimate the partition function of an MRF, respectively. The two variational distributions provide an estimate of the negative log-likelihood of the MRF as a minimax optimization problem, which is solved by stochastic gradient descent. AdVIL is proven convergent under certain conditions. On one hand, compared with contrastive divergence, AdVIL requires a minimal assumption about the model structure and can deal with a broader family of MRFs. 
On the other hand, compared with existing black-box methods, AdVIL provides a tighter estimate of the log partition function and achieves much better empirical results. \ No newline at end of file diff --git a/data/2020/iclr/Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets b/data/2020/iclr/Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets new file mode 100644 index 0000000000..5191d15cd3 --- /dev/null +++ b/data/2020/iclr/Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets @@ -0,0 +1 @@ +Adaptive gradient algorithms perform gradient-based updates using the history of gradients and are ubiquitous in training deep neural networks. While the theory of adaptive gradient methods is well understood for minimization problems, the underlying factors driving their empirical success in min-max problems such as GANs remain unclear. In this paper, we aim at bridging this gap from both theoretical and empirical perspectives. Theoretically, we develop an algorithm (Optimistic Stochastic Gradient, OSG) for solving a class of non-convex non-concave min-max problems and establish $O(\epsilon^{-4})$ complexity for finding an $\epsilon$-first-order stationary point, in which only one stochastic first-order oracle is invoked in each iteration. An adaptive variant of the proposed algorithm (Optimistic Adagrad, OAdagrad) is also analyzed, revealing an \emph{improved} adaptive complexity $\widetilde{O}\left(\epsilon^{-\frac{2}{1-\alpha}}\right)$~\footnote{Here $\widetilde{O}(\cdot)$ compresses a logarithmic factor of $\epsilon$.}, where $\alpha$ characterizes the growth rate of the cumulative stochastic gradient and $0\leq \alpha\leq 1/2$. To the best of our knowledge, this is the first work establishing adaptive complexity in non-convex non-concave min-max optimization. 
Empirically, our experiments show that indeed adaptive gradient algorithms outperform their non-adaptive counterparts in GAN training. Moreover, this observation can be explained by the slow growth rate of the cumulative stochastic gradient, as observed empirically. \ No newline at end of file diff --git a/data/2020/iclr/Transferable Perturbations of Deep Feature Distributions b/data/2020/iclr/Transferable Perturbations of Deep Feature Distributions new file mode 100644 index 0000000000..8e972f4cdc --- /dev/null +++ b/data/2020/iclr/Transferable Perturbations of Deep Feature Distributions @@ -0,0 +1 @@ +Almost all current adversarial attacks of CNN classifiers rely on information derived from the output layer of the network. This work presents a new adversarial attack based on the modeling and exploitation of class-wise and layer-wise deep feature distributions. We achieve state-of-the-art targeted blackbox transfer-based attack results for undefended ImageNet models. Further, we place a priority on explainability and interpretability of the attacking process. Our methodology affords an analysis of how adversarial attacks change the intermediate feature distributions of CNNs, as well as a measure of layer-wise and class-wise feature distributional separability/entanglement. We also conceptualize a transition from task/data-specific to model-specific features within a CNN architecture that directly impacts the transferability of adversarial examples. \ No newline at end of file diff --git a/data/2020/iclr/Tree-Structured Attention with Hierarchical Accumulation b/data/2020/iclr/Tree-Structured Attention with Hierarchical Accumulation new file mode 100644 index 0000000000..8fdf341f28 --- /dev/null +++ b/data/2020/iclr/Tree-Structured Attention with Hierarchical Accumulation @@ -0,0 +1 @@ +Incorporating hierarchical structures like constituency trees has been shown to be effective for various natural language processing (NLP) tasks. 
However, it is evident that state-of-the-art (SOTA) sequence-based models like the Transformer struggle to encode such structures inherently. On the other hand, dedicated models like the Tree-LSTM, while explicitly modeling hierarchical structures, do not perform as efficiently as the Transformer. In this paper, we attempt to bridge this gap with Hierarchical Accumulation to encode parse tree structures into self-attention at constant time complexity. Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German task. It also yields improvements over Transformer and Tree-LSTM on three text classification tasks. We further demonstrate that using hierarchical priors can compensate for data shortage, and that our model prefers phrase-level attentions over token-level attentions. \ No newline at end of file diff --git a/data/2020/iclr/Understanding Architectures Learnt by Cell-based Neural Architecture Search b/data/2020/iclr/Understanding Architectures Learnt by Cell-based Neural Architecture Search new file mode 100644 index 0000000000..13f8ffd6fd --- /dev/null +++ b/data/2020/iclr/Understanding Architectures Learnt by Cell-based Neural Architecture Search @@ -0,0 +1 @@ +Neural architecture search (NAS) generates architectures automatically for given tasks, e.g., image classification and language modeling. Recently, various NAS algorithms have been proposed to improve search efficiency and effectiveness. However, little attention is paid to understand the generated architectures, including whether they share any commonality. In this paper, we analyze the generated architectures and give our explanations of their superior performance. We firstly uncover that the architectures generated by NAS algorithms share a common connection pattern, which contributes to their fast convergence. Consequently, these architectures are selected during architecture search. 
We further show, both empirically and theoretically, that the fast convergence is a consequence of the smooth loss landscape and accurate gradient information induced by the common connection pattern. Contrary to universal recognition, we finally observe that popular NAS architectures do not always generalize better than the candidate architectures, encouraging us to re-think the state-of-the-art NAS algorithms. \ No newline at end of file diff --git a/data/2020/iclr/Understanding Knowledge Distillation in Non-autoregressive Machine Translation b/data/2020/iclr/Understanding Knowledge Distillation in Non-autoregressive Machine Translation new file mode 100644 index 0000000000..49cdbd0594 --- /dev/null +++ b/data/2020/iclr/Understanding Knowledge Distillation in Non-autoregressive Machine Translation @@ -0,0 +1 @@ +Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models. Existing NAT models usually rely on the technique of knowledge distillation, which creates the training data from a pretrained autoregressive model for better performance. Knowledge distillation is empirically useful, leading to large gains in accuracy for NAT models, but the reason for this success has, as of yet, been unclear. In this paper, we first design systematic experiments to investigate why knowledge distillation is crucial to NAT training. We find that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data. Furthermore, a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the distilled data for the best translation quality. Based on these findings, we further propose several approaches that can alter the complexity of data sets to improve the performance of NAT models. 
We achieve state-of-the-art performance for NAT-based models, and close the gap with the autoregressive baseline on the WMT14 En-De benchmark. \ No newline at end of file diff --git a/data/2020/iclr/Understanding the Limitations of Variational Mutual Information Estimators b/data/2020/iclr/Understanding the Limitations of Variational Mutual Information Estimators new file mode 100644 index 0000000000..53d862d175 --- /dev/null +++ b/data/2020/iclr/Understanding the Limitations of Variational Mutual Information Estimators @@ -0,0 +1 @@ +Variational approaches based on neural networks are showing promise for estimating mutual information (MI) between high dimensional variables. However, they can be difficult to use in practice due to poorly understood bias/variance tradeoffs. We theoretically show that, under some conditions, estimators such as MINE exhibit variance that could grow exponentially with the true amount of underlying MI. We also empirically demonstrate that existing estimators fail to satisfy basic self-consistency properties of MI, such as data processing and additivity under independence. Based on a unified perspective of variational approaches, we develop a new estimator that focuses on variance reduction. Empirical results demonstrate that our proposed estimator exhibits improved bias-variance trade-offs on standard benchmark tasks. 
\ No newline at end of file diff --git a/data/2020/iclr/Unpaired Point Cloud Completion on Real Scans using Adversarial Training b/data/2020/iclr/Unpaired Point Cloud Completion on Real Scans using Adversarial Training new file mode 100644 index 0000000000..18d7e9c2f6 --- /dev/null +++ b/data/2020/iclr/Unpaired Point Cloud Completion on Real Scans using Adversarial Training @@ -0,0 +1 @@ +As 3D scanning solutions become increasingly popular, several deep learning setups have been developed geared towards the task of scan completion, i.e., plausibly filling in regions that were missed in the raw scans. These methods, however, largely rely on supervision in the form of paired training data, i.e., partial scans with corresponding desired completed scans. While these methods have been successfully demonstrated on synthetic data, the approaches cannot be directly used on real scans in the absence of suitable paired training data. We develop a first approach that works directly on input point clouds, does not require paired training data, and hence can directly be applied to real scans for scan completion. We evaluate the approach qualitatively on several real-world datasets (ScanNet, Matterport, KITTI), quantitatively on the 3D-EPN shape completion benchmark dataset, and demonstrate realistic completions under varying levels of incompleteness. \ No newline at end of file diff --git a/data/2020/iclr/Unsupervised Model Selection for Variational Disentangled Representation Learning b/data/2020/iclr/Unsupervised Model Selection for Variational Disentangled Representation Learning new file mode 100644 index 0000000000..4d1e2417d2 --- /dev/null +++ b/data/2020/iclr/Unsupervised Model Selection for Variational Disentangled Representation Learning @@ -0,0 +1 @@ +Disentangled representations have recently been shown to improve fairness, data efficiency and generalisation in simple supervised and reinforcement learning tasks. 
To extend the benefits of disentangled representations to more complex domains and practical applications, it is important to enable hyperparameter tuning and model selection of existing unsupervised approaches without requiring access to ground truth attribute labels, which are not available for most datasets. This paper addresses this problem by introducing a simple yet robust and reliable method for unsupervised disentangled model selection. Our approach, Unsupervised Disentanglement Ranking (UDR), leverages the recent theoretical results that explain why variational autoencoders disentangle (Rolinek et al., 2019), to quantify the quality of disentanglement by performing pairwise comparisons between trained model representations. We show that our approach performs comparably to the existing supervised alternatives across 5,400 models from six state-of-the-art unsupervised disentangled representation learning model classes. Furthermore, we show that the ranking produced by our approach correlates well with the final task performance on two different domains. \ No newline at end of file diff --git a/data/2020/iclr/V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control b/data/2020/iclr/V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control new file mode 100644 index 0000000000..a3e98f6bd9 --- /dev/null +++ b/data/2020/iclr/V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control @@ -0,0 +1 @@ +Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. 
As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function. We show that V-MPO surpasses previously reported scores for both the Atari-57 and DMLab-30 benchmark suites in the multi-task setting, and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters. On individual DMLab and Atari levels, the proposed algorithm can achieve scores that are substantially higher than has previously been reported. V-MPO is also applicable to problems with high-dimensional, continuous action spaces, which we demonstrate in the context of learning to control simulated humanoids with 22 degrees of freedom from full state observations and 56 degrees of freedom from pixel observations, as well as example OpenAI Gym tasks where V-MPO achieves substantially higher asymptotic scores than previously reported. \ No newline at end of file diff --git a/data/2020/iclr/V4D: 4D Convolutional Neural Networks for Video-level Representation Learning b/data/2020/iclr/V4D: 4D Convolutional Neural Networks for Video-level Representation Learning new file mode 100644 index 0000000000..ad76b88b67 --- /dev/null +++ b/data/2020/iclr/V4D: 4D Convolutional Neural Networks for Video-level Representation Learning @@ -0,0 +1 @@ +Most existing 3D CNNs for video representation learning are clip-based methods, and thus do not consider video-level temporal evolution of spatio-temporal features. In this paper, we propose Video-level 4D Convolutional Neural Networks, referred to as V4D, to model the evolution of long-range spatio-temporal representation with 4D convolutions, and at the same time, to preserve strong 3D spatio-temporal representation with residual connections. 
Specifically, we design a new 4D residual block able to capture inter-clip interactions, which could enhance the representation power of the original clip-level 3D CNNs. The 4D residual blocks can be easily integrated into existing 3D CNNs to perform long-range modeling hierarchically. We further introduce the training and inference methods for the proposed V4D. Extensive experiments are conducted on three video recognition benchmarks, where V4D achieves excellent results, surpassing recent 3D CNNs by a large margin. \ No newline at end of file diff --git a/data/2020/iclr/VL-BERT: Pre-training of Generic Visual-Linguistic Representations b/data/2020/iclr/VL-BERT: Pre-training of Generic Visual-Linguistic Representations new file mode 100644 index 0000000000..bd47bea8aa --- /dev/null +++ b/data/2020/iclr/VL-BERT: Pre-training of Generic Visual-Linguistic Representations @@ -0,0 +1 @@ +We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. It is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark. Code is released at \url{this https URL}. 
\ No newline at end of file diff --git a/data/2020/iclr/Variational Recurrent Models for Solving Partially Observable Control Tasks b/data/2020/iclr/Variational Recurrent Models for Solving Partially Observable Control Tasks new file mode 100644 index 0000000000..688812d1c1 --- /dev/null +++ b/data/2020/iclr/Variational Recurrent Models for Solving Partially Observable Control Tasks @@ -0,0 +1 @@ +In partially observable (PO) environments, deep reinforcement learning (RL) agents often suffer from unsatisfactory performance, since two problems need to be tackled together: how to extract information from the raw observations to solve the task, and how to improve the policy. In this study, we propose an RL algorithm for solving PO tasks. Our method comprises two parts: a variational recurrent model (VRM) for modeling the environment, and an RL controller that has access to both the environment and the VRM. The proposed algorithm was tested in two types of PO robotic control tasks, those in which either coordinates or velocities were not observable and those that require long-term memorization. Our experiments show that the proposed algorithm achieved better data efficiency and/or learned more optimal policy than other alternative approaches in tasks in which unobserved states cannot be inferred from raw observations in a simple manner. \ No newline at end of file diff --git a/data/2020/iclr/Vid2Game: Controllable Characters Extracted from Real-World Videos b/data/2020/iclr/Vid2Game: Controllable Characters Extracted from Real-World Videos new file mode 100644 index 0000000000..23c3a02304 --- /dev/null +++ b/data/2020/iclr/Vid2Game: Controllable Characters Extracted from Real-World Videos @@ -0,0 +1,2 @@ +We are given a video of a person performing a certain activity, from which we extract a controllable model. 
The model generates novel image sequences of that person, according to arbitrary user-defined control signals, typically marking the displacement of the moving body. The generated video can have an arbitrary background, and effectively capture both the dynamics and appearance of the person. +The method is based on two networks. The first network maps a current pose, and a single-instance control signal to the next pose. The second network maps the current pose, the new pose, and a given background, to an output frame. Both networks include multiple novelties that enable high-quality performance. This is demonstrated on multiple characters extracted from various videos of dancers and athletes. \ No newline at end of file diff --git a/data/2020/iclr/VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation b/data/2020/iclr/VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation new file mode 100644 index 0000000000..ff42d78975 --- /dev/null +++ b/data/2020/iclr/VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation @@ -0,0 +1 @@ +Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. 
We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modelling of video. \ No newline at end of file diff --git a/data/2020/iclr/Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards b/data/2020/iclr/Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards new file mode 100644 index 0000000000..7990acf360 --- /dev/null +++ b/data/2020/iclr/Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards @@ -0,0 +1 @@ +Imitation learning allows agents to learn complex behaviors from demonstrations. However, learning a complex vision-based task may require an impractical number of demonstrations. Meta-imitation learning is a promising approach towards enabling agents to learn a new task from one or a few demonstrations by leveraging experience from learning similar tasks. In the presence of task ambiguity or unobserved dynamics, demonstrations alone may not provide enough information; an agent must also try the task to successfully infer a policy. In this work, we propose a method that can learn to learn from both demonstrations and trial-and-error experience with sparse reward feedback. In comparison to meta-imitation, this approach enables the agent to effectively and efficiently improve itself autonomously beyond the demonstration data. In comparison to meta-reinforcement learning, we can scale to substantially broader distributions of tasks, as the demonstration reduces the burden of exploration. Our experiments show that our method significantly outperforms prior approaches on a set of challenging, vision-based control tasks. 
\ No newline at end of file diff --git a/data/2020/iclr/Weakly Supervised Clustering by Exploiting Unique Class Count b/data/2020/iclr/Weakly Supervised Clustering by Exploiting Unique Class Count new file mode 100644 index 0000000000..0e6dbac362 --- /dev/null +++ b/data/2020/iclr/Weakly Supervised Clustering by Exploiting Unique Class Count @@ -0,0 +1 @@ +A weakly supervised learning based clustering framework is proposed in this paper. As the core of this framework, we introduce a novel multiple instance learning task based on a bag level label called unique class count (ucc), which is the number of unique classes among all instances inside the bag. In this task, no annotations on individual instances inside the bag are needed during training of the models. We mathematically prove that with a perfect ucc classifier, perfect clustering of individual instances inside the bags is possible even when no annotations on individual instances are given during training. We have constructed a neural network based ucc classifier and experimentally shown that the clustering performance of our framework with our weakly supervised ucc classifier is comparable to that of fully supervised learning models where labels for all instances are known. Furthermore, we have tested the applicability of our framework to a real world task of semantic segmentation of breast cancer metastases in histological lymph node sections and shown that the performance of our weakly supervised framework is comparable to the performance of a fully supervised Unet model. 
\ No newline at end of file diff --git a/data/2020/iclr/What graph neural networks cannot learn: depth vs width b/data/2020/iclr/What graph neural networks cannot learn: depth vs width new file mode 100644 index 0000000000..99759af92c --- /dev/null +++ b/data/2020/iclr/What graph neural networks cannot learn: depth vs width @@ -0,0 +1 @@ +This paper studies the expressive power of graph neural networks falling within the message-passing framework (GNNmp). Two results are presented. First, GNNmp are shown to be Turing universal under sufficient conditions on their depth, width, node attributes, and layer expressiveness. Second, it is discovered that GNNmp can lose a significant portion of their power when their depth and width are restricted. The proposed impossibility statements stem from a new technique that enables the repurposing of seminal results from distributed computing and leads to lower bounds for an array of decision, optimization, and estimation problems involving graphs. Strikingly, several of these problems are deemed impossible unless the product of a GNNmp's depth and width exceeds a polynomial of the graph size; this dependence remains significant even for tasks that appear simple or when considering approximation. \ No newline at end of file diff --git a/data/2021/iclr/A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning b/data/2021/iclr/A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning new file mode 100644 index 0000000000..a32aeaf6f3 --- /dev/null +++ b/data/2021/iclr/A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning @@ -0,0 +1 @@ +Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients.
Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors, which are not unbiased, such as Top-$K$. In this paper, we propose a new alternative to EF for dealing with contractive compressors that is better both theoretically and practically. In particular, we propose a construction which can transform any contractive compressor into an induced unbiased compressor. Following this transformation, existing methods able to work with unbiased compressors can be applied. We show that our approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees, and fewer assumptions. We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes, and demonstrate the benefits thereof. We perform several numerical experiments which validate our theoretical findings. \ No newline at end of file diff --git a/data/2021/iclr/A Block Minifloat Representation for Training Deep Neural Networks b/data/2021/iclr/A Block Minifloat Representation for Training Deep Neural Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/A Critique of Self-Expressive Deep Subspace Clustering b/data/2021/iclr/A Critique of Self-Expressive Deep Subspace Clustering new file mode 100644 index 0000000000..2937fdbd66 --- /dev/null +++ b/data/2021/iclr/A Critique of Self-Expressive Deep Subspace Clustering @@ -0,0 +1 @@ +Subspace clustering is an unsupervised clustering technique designed to cluster data that is supported on a union of linear subspaces, with each subspace defining a cluster with dimension lower than the ambient space.
Many existing formulations for this problem are based on exploiting the self-expressive property of linear subspaces, where any point within a subspace can be represented as a linear combination of other points within the subspace. To extend this approach to data supported on a union of non-linear manifolds, numerous studies have proposed learning an appropriate kernel embedding of the original data using a neural network, which is regularized by a self-expressive loss function to encourage a union-of-linear-subspaces prior on the data in the embedded space. Here we show that there are a number of potential flaws with this approach that have not been adequately addressed in prior work. In particular, we show the model formulation is often ill-posed in multiple ways, which can lead to a degenerate embedding of the data that need not correspond to a union of subspaces at all. We validate our theoretical results experimentally and additionally repeat prior experiments reported in the literature, where we conclude that a significant portion of the previously claimed performance benefits can be attributed to an ad hoc post-processing step rather than the clustering model. \ No newline at end of file diff --git a/data/2021/iclr/A Design Space Study for LISTA and Beyond b/data/2021/iclr/A Design Space Study for LISTA and Beyond new file mode 100644 index 0000000000..faf859ed5a --- /dev/null +++ b/data/2021/iclr/A Design Space Study for LISTA and Beyond @@ -0,0 +1 @@ +In recent years, great success has been witnessed in building problem-specific deep networks from unrolling iterative algorithms, for solving inverse problems and beyond. Unrolling is believed to combine the model-based prior with the learning capacity of deep learning. This paper revisits the role of unrolling as a design approach for deep networks: to what extent is its resulting special architecture superior, and can we find better ones?
Using LISTA for sparse recovery as a representative example, we conduct the first thorough design space study for the unrolled models. Among all possible variations, we focus on extensively varying the connectivity patterns and neuron types, leading to a gigantic design space arising from LISTA. To efficiently explore this space and identify top performers, we leverage the emerging tool of neural architecture search (NAS). We carefully examine the searched top architectures in a number of settings, and are able to discover networks that are consistently better than LISTA. We further present more visualization and analysis to "open the black box", and find that the searched top architectures demonstrate highly consistent and potentially transferable patterns. We hope our study will spark further reflection and exploration on how to better combine model-based optimization priors with data-driven learning. \ No newline at end of file diff --git a/data/2021/iclr/A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima b/data/2021/iclr/A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima new file mode 100644 index 0000000000..51a43d5434 --- /dev/null +++ b/data/2021/iclr/A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima @@ -0,0 +1 @@ +Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory (DDT) to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters.
To the best of our knowledge, we are the first to theoretically and empirically prove that, benefiting from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima in terms of the ratio of the batch size and learning rate. Thus, large-batch training cannot search for flat minima efficiently in a realistic computational time. \ No newline at end of file diff --git a/data/2021/iclr/A Discriminative Gaussian Mixture Model with Sparsity b/data/2021/iclr/A Discriminative Gaussian Mixture Model with Sparsity new file mode 100644 index 0000000000..0956958f41 --- /dev/null +++ b/data/2021/iclr/A Discriminative Gaussian Mixture Model with Sparsity @@ -0,0 +1 @@ +In probabilistic classification, a discriminative model based on the softmax function has a potential limitation in that it assumes unimodality for each class in the feature space. The mixture model can address this issue, although it leads to an increase in the number of parameters. We propose a sparse classifier based on a discriminative GMM, referred to as a sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is trained via sparse Bayesian learning. Using this sparse learning framework, we can simultaneously remove redundant Gaussian components and reduce the number of parameters used in the remaining components during learning; this learning method reduces the model complexity, thereby improving the generalization capability. Furthermore, the SDGM can be embedded into neural networks (NNs), such as convolutional NNs, and can be trained in an end-to-end manner.
Experimental results demonstrated that the proposed method outperformed the existing softmax-based discriminative models. \ No newline at end of file diff --git a/data/2021/iclr/A Distributional Approach to Controlled Text Generation b/data/2021/iclr/A Distributional Approach to Controlled Text Generation new file mode 100644 index 0000000000..4d4b76e0ea --- /dev/null +++ b/data/2021/iclr/A Distributional Approach to Controlled Text Generation @@ -0,0 +1 @@ +We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LMs). This approach permits specifying, in a single formal framework, both “pointwise” and “distributional” constraints over the target LM — to our knowledge, the first model with such generality — while minimizing KL divergence from the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation we then train a target controlled Autoregressive LM through an adaptive distributional variant of Policy Gradient. We conduct a first set of experiments over pointwise constraints showing the advantages of our approach over a set of baselines, in terms of obtaining a controlled LM balancing constraint satisfaction with divergence from the initial LM. We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models.
Through an ablation study, we show the effectiveness of our adaptive technique for obtaining faster convergence. \ No newline at end of file diff --git a/data/2021/iclr/A Geometric Analysis of Deep Generative Image Models and Its Applications b/data/2021/iclr/A Geometric Analysis of Deep Generative Image Models and Its Applications new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/A Good Image Generator Is What You Need for High-Resolution Video Synthesis b/data/2021/iclr/A Good Image Generator Is What You Need for High-Resolution Video Synthesis new file mode 100644 index 0000000000..c2a8a2d334 --- /dev/null +++ b/data/2021/iclr/A Good Image Generator Is What You Need for High-Resolution Video Synthesis @@ -0,0 +1 @@ +Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it is also an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available.
Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD. \ No newline at end of file diff --git a/data/2021/iclr/A Gradient Flow Framework For Analyzing Network Pruning b/data/2021/iclr/A Gradient Flow Framework For Analyzing Network Pruning new file mode 100644 index 0000000000..656bce4ed4 --- /dev/null +++ b/data/2021/iclr/A Gradient Flow Framework For Analyzing Network Pruning @@ -0,0 +1 @@ +Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general gradient flow based framework that unifies state-of-the-art importance measures through the norm of model parameters. We use this framework to determine the relationship between pruning measures and evolution of model parameters, establishing several results related to pruning models early-on in training: (i) magnitude-based pruning removes parameters that contribute least to reduction in loss, resulting in models that converge faster than magnitude-agnostic methods; (ii) loss-preservation based pruning preserves first-order model evolution dynamics and is therefore appropriate for pruning minimally trained models; and (iii) gradient-norm based pruning affects second-order model evolution dynamics, such that increasing gradient norm via pruning can produce poorly performing models. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10 and CIFAR-100. Code available at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/A Hypergradient Approach to Robust Regression without Correspondence b/data/2021/iclr/A Hypergradient Approach to Robust Regression without Correspondence new file mode 100644 index 0000000000..414aee2028 --- /dev/null +++ b/data/2021/iclr/A Hypergradient Approach to Robust Regression without Correspondence @@ -0,0 +1 @@ +We consider a regression problem, where the correspondence between input and output data is not available. Such shuffled data is commonly observed in many real-world problems. Taking flow cytometry as an example, the measuring instruments are unable to preserve the correspondence between the samples and the measurements. Due to the combinatorial nature, most existing methods are only applicable when the sample size is small, and are limited to linear regression models. To overcome such bottlenecks, we propose a new computational framework, ROBOT, for the shuffled regression problem, which is applicable to large data and complex models. Specifically, we propose to formulate the regression without correspondence as a continuous optimization problem. Then, by exploiting the interaction between the regression model and the data correspondence, we propose to develop a hypergradient approach based on differentiable programming techniques. Such a hypergradient approach essentially views the data correspondence as an operator of the regression, and therefore allows us to find a better descent direction for the model parameter by differentiating through the data correspondence. ROBOT is quite general, and can be further extended to the inexact correspondence setting, where the input and output data are not necessarily exactly aligned. Thorough numerical experiments show that ROBOT achieves better performance than existing methods in both linear and nonlinear regression tasks, including real-world applications such as flow cytometry and multi-object tracking.
\ No newline at end of file diff --git a/data/2021/iclr/A Learning Theoretic Perspective on Local Explainability b/data/2021/iclr/A Learning Theoretic Perspective on Local Explainability new file mode 100644 index 0000000000..590bf2d60f --- /dev/null +++ b/data/2021/iclr/A Learning Theoretic Perspective on Local Explainability @@ -0,0 +1 @@ +In this paper, we explore connections between interpretable machine learning and learning theory through the lens of local approximation explanations. First, we tackle the traditional problem of performance generalization and bound the test-time accuracy of a model using a notion of how locally explainable it is. Second, we explore the novel problem of explanation generalization which is an important concern for a growing class of finite sample-based local approximation explanations. Finally, we validate our theoretical results empirically and show that they reflect what can be seen in practice. \ No newline at end of file diff --git a/data/2021/iclr/A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks b/data/2021/iclr/A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks new file mode 100644 index 0000000000..42a4549603 --- /dev/null +++ b/data/2021/iclr/A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks @@ -0,0 +1 @@ +Autoregressive language models pretrained on large corpora have been successful at solving downstream tasks, even with zero-shot usage. However, there is little theoretical justification for their success. This paper considers the following questions: (1) Why should learning the distribution of natural language help with downstream classification tasks? (2) Why do features learned using language modeling help solve downstream tasks with linear classifiers? 
For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as next word prediction tasks, thus making language modeling a meaningful pretraining task. For (2), we analyze properties of the cross-entropy objective to show that $\epsilon$-optimal language models in cross-entropy (log-perplexity) learn features that are $\mathcal{O}(\sqrt{\epsilon})$-good on natural linear classification tasks, thus demonstrating mathematically that doing well on language modeling can be beneficial for downstream tasks. We perform experiments to verify assumptions and validate theoretical results. Our theoretical insights motivate a simple alternative to the cross-entropy objective that performs well on some linear classification tasks. \ No newline at end of file diff --git a/data/2021/iclr/A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks b/data/2021/iclr/A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks new file mode 100644 index 0000000000..f333d81e47 --- /dev/null +++ b/data/2021/iclr/A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks @@ -0,0 +1 @@ +In this paper, we derive generalization bounds for the two primary classes of graph neural networks (GNNs), namely graph convolutional networks (GCNs) and message passing GNNs (MPGNNs), via a PAC-Bayesian approach. Our result reveals that the maximum node degree and spectral norm of the weights govern the generalization bounds of both models. We also show that our bound for GCNs is a natural generalization of the results developed in arXiv:1707.09564v2 [cs.LG] for fully-connected and convolutional neural networks. For message passing GNNs, our PAC-Bayes bound improves over the Rademacher complexity based bound in arXiv:2002.06157v1 [cs.LG], showing a tighter dependency on the maximum node degree and the maximum hidden dimension. 
The key ingredients of our proofs are a perturbation analysis of GNNs and the generalization of PAC-Bayes analysis to non-homogeneous GNNs. We perform an empirical study on several real-world graph datasets and verify that our PAC-Bayes bound is tighter than others. \ No newline at end of file diff --git a/data/2021/iclr/A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference b/data/2021/iclr/A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference new file mode 100644 index 0000000000..9d3c6b01ad --- /dev/null +++ b/data/2021/iclr/A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference @@ -0,0 +1 @@ +Recent increases in the computational demands of deep neural networks (DNNs), combined with the observation that most input samples require only simple models, have sparked interest in $input$-$adaptive$ multi-exit architectures, such as MSDNets or Shallow-Deep Networks. These architectures enable faster inferences and could bring DNNs to low-power devices, e.g. in the Internet of Things (IoT). However, it is unknown if the computational savings provided by this approach are robust against adversarial pressure. In particular, an adversary may aim to slow down adaptive DNNs by increasing their average inference time$-$a threat analogous to the $denial$-$of$-$service$ attacks from the Internet. In this paper, we conduct a systematic evaluation of this threat by experimenting with three generic multi-exit DNNs (based on VGG16, MobileNet, and ResNet56) and a custom multi-exit architecture, on two popular image classification benchmarks (CIFAR-10 and Tiny ImageNet). To this end, we show that adversarial sample-crafting techniques can be modified to cause slowdown, and we propose a metric for comparing their impact on different architectures. 
We show that a slowdown attack reduces the efficacy of multi-exit DNNs by 90%-100%, and it amplifies the latency by 1.5-5$\times$ in a typical IoT deployment. We also show that it is possible to craft universal, reusable perturbations and that the attack can be effective in realistic black-box scenarios, where the attacker has limited knowledge about the victim. Finally, we show that adversarial training provides limited protection against slowdowns. These results suggest that further research is needed for defending multi-exit architectures against this emerging threat. \ No newline at end of file diff --git a/data/2021/iclr/A Temporal Kernel Approach for Deep Learning with Continuous-time Information b/data/2021/iclr/A Temporal Kernel Approach for Deep Learning with Continuous-time Information new file mode 100644 index 0000000000..899e2386f1 --- /dev/null +++ b/data/2021/iclr/A Temporal Kernel Approach for Deep Learning with Continuous-time Information @@ -0,0 +1 @@ +Sequential deep learning models such as RNN, causal CNN and attention mechanism do not readily consume continuous-time information. Discretizing the temporal data, as we show, causes inconsistency even for simple continuous-time processes. Current approaches often handle time in a heuristic manner to be consistent with the existing deep learning architectures and implementations. In this paper, we provide a principled way to characterize continuous-time systems using deep learning tools. Notably, the proposed approach applies to all the major deep learning architectures and requires little modifications to the implementation. The critical insight is to represent the continuous-time system by composing neural networks with a temporal kernel, where we gain our intuition from the recent advancements in understanding deep learning with Gaussian process and neural tangent kernel. 
To represent the temporal kernel, we introduce the random feature approach and convert the kernel learning problem to spectral density estimation under reparameterization. We further prove the convergence and consistency results even when the temporal kernel is non-stationary, and the spectral density is misspecified. The simulations and real-data experiments demonstrate the empirical effectiveness of our temporal kernel approach in a broad range of settings. \ No newline at end of file diff --git a/data/2021/iclr/A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention b/data/2021/iclr/A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention new file mode 100644 index 0000000000..34e6004fdd --- /dev/null +++ b/data/2021/iclr/A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention @@ -0,0 +1 @@ +We address the problem of learning on large sets of features, motivated by the need of performing pooling operations in long biological sequences of varying sizes, with long-range dependencies, and possibly few labeled data. To address this challenging task, we introduce a parametrized embedding that aggregates the features from a given set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost. Our aggregation technique admits two useful interpretations: it may be seen as a mechanism related to attention layers in neural networks, yet that requires less data, or it may be seen as a scalable surrogate of a classical optimal transport-based kernel. 
We experimentally demonstrate the effectiveness of our approach on biological sequences, achieving state-of-the-art results on protein fold recognition and chromatin profile detection tasks, and, as a proof of concept, we show promising results for processing natural language sequences. We provide an open-source implementation of our embedding that can be used alone or as a module in larger learning models. Our code is freely available at \url{https://github.com/claying/OTK}. \ No newline at end of file diff --git a/data/2021/iclr/A Unified Approach to Interpreting and Boosting Adversarial Transferability b/data/2021/iclr/A Unified Approach to Interpreting and Boosting Adversarial Transferability new file mode 100644 index 0000000000..060aecb51a --- /dev/null +++ b/data/2021/iclr/A Unified Approach to Interpreting and Boosting Adversarial Transferability @@ -0,0 +1 @@ +In this paper, we use the interaction inside adversarial perturbations to explain and boost the adversarial transferability. We discover and prove the negative correlation between the adversarial transferability and the interaction inside adversarial perturbations. The negative correlation is further verified through different DNNs with various inputs. Moreover, this negative correlation can be regarded as a unified perspective to understand current transferability-boosting methods. To this end, we prove that some classic methods of enhancing the transferability essentially decrease interactions inside adversarial perturbations. Based on this, we propose to directly penalize interactions during the attacking process, which significantly improves the adversarial transferability.
\ No newline at end of file diff --git a/data/2021/iclr/A Universal Representation Transformer Layer for Few-Shot Image Classification b/data/2021/iclr/A Universal Representation Transformer Layer for Few-Shot Image Classification new file mode 100644 index 0000000000..2d9138c3bf --- /dev/null +++ b/data/2021/iclr/A Universal Representation Transformer Layer for Few-Shot Image Classification @@ -0,0 +1 @@ +Few-shot classification aims to recognize unseen classes when presented with only a small number of samples. We consider the problem of multi-domain few-shot image classification, where unseen classes and examples come from diverse data sources. This problem has seen growing interest and has inspired the development of benchmarks such as Meta-Dataset. A key challenge in this multi-domain setting is to effectively integrate the feature representations from the diverse set of training domains. Here, we propose a Universal Representation Transformer (URT) layer that meta-learns to leverage universal features for few-shot classification by dynamically re-weighting and composing the most appropriate domain-specific representations. In experiments, we show that URT sets a new state-of-the-art result on Meta-Dataset. Specifically, it achieves top performance on the highest number of data sources compared to competing methods. We analyze variants of URT and present a visualization of the attention score heatmaps that sheds light on how the model performs cross-domain generalization. Our code is available at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels b/data/2021/iclr/A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels new file mode 100644 index 0000000000..f889c3fc0a --- /dev/null +++ b/data/2021/iclr/A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels @@ -0,0 +1 @@ +Group equivariant convolutional networks (GCNNs) endow classical convolutional networks with additional symmetry priors, which can lead to a considerably improved performance. Recent advances in the theoretical description of GCNNs revealed that such models can generally be understood as performing convolutions with G-steerable kernels, that is, kernels that satisfy an equivariance constraint themselves. While the G-steerability constraint has been derived, it has to date only been solved for specific use cases - a general characterization of G-steerable kernel spaces is still missing. This work provides such a characterization for the practically relevant case of G being any compact group. Our investigation is motivated by a striking analogy between the constraints underlying steerable kernels on the one hand and spherical tensor operators from quantum mechanics on the other hand. By generalizing the famous Wigner-Eckart theorem for spherical tensor operators, we prove that steerable kernel spaces are fully understood and parameterized in terms of 1) generalized reduced matrix elements, 2) Clebsch-Gordan coefficients, and 3) harmonic basis functions on homogeneous spaces. 
\ No newline at end of file diff --git a/data/2021/iclr/A statistical theory of cold posteriors in deep neural networks b/data/2021/iclr/A statistical theory of cold posteriors in deep neural networks new file mode 100644 index 0000000000..ee99245890 --- /dev/null +++ b/data/2021/iclr/A statistical theory of cold posteriors in deep neural networks @@ -0,0 +1 @@ +To get Bayesian neural networks to perform comparably to standard neural networks it is usually necessary to artificially reduce uncertainty using a "tempered" or "cold" posterior. This is extremely concerning: if the prior is accurate, Bayes inference/decision theory is optimal, and any artificial changes to the posterior should harm performance. While this suggests that the prior may be at fault, here we argue that in fact, BNNs for image classification use the wrong likelihood. In particular, standard image benchmark datasets such as CIFAR-10 are carefully curated. We develop a generative model describing curation which gives a principled Bayesian account of cold posteriors, because the likelihood under this new generative model closely matches the tempered likelihoods used in past work. \ No newline at end of file diff --git a/data/2021/iclr/A teacher-student framework to distill future trajectories b/data/2021/iclr/A teacher-student framework to distill future trajectories new file mode 100644 index 0000000000..e827ed7788 --- /dev/null +++ b/data/2021/iclr/A teacher-student framework to distill future trajectories @@ -0,0 +1 @@ +By learning to predict trajectories of dynamical systems, model-based methods can make extensive use of all observations from past experience. However, due to partial observability, stochasticity, compounding errors, and irrelevant dynamics, training to predict observations explicitly often results in poor models. Model-free techniques try to side-step the problem by learning to predict values directly. 
While breaking the explicit dependency on future observations can result in strong performance, this usually comes at the cost of low sample efficiency, as the abundant information about the dynamics contained in future observations goes unused. Here we take a step back from both approaches: Instead of hand-designing how trajectories should be incorporated, a teacher network learns to extract relevant information from the trajectories and to distill it into target activations which guide a student model that can only observe the present. The teacher is trained with meta-gradients to maximize the student’s performance on a validation set. Our approach performs well on tasks that are difficult for model-free and model-based methods, and we study the role of every component through ablation studies. \ No newline at end of file diff --git a/data/2021/iclr/A unifying view on implicit bias in training linear neural networks b/data/2021/iclr/A unifying view on implicit bias in training linear neural networks new file mode 100644 index 0000000000..5f7939ffe3 --- /dev/null +++ b/data/2021/iclr/A unifying view on implicit bias in training linear neural networks @@ -0,0 +1 @@ +We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network. 
For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems subsume existing results in the literature while removing most of the convergence assumptions. We also provide experiments that corroborate our analysis. \ No newline at end of file diff --git a/data/2021/iclr/ALFWorld: Aligning Text and Embodied Environments for Interactive Learning b/data/2021/iclr/ALFWorld: Aligning Text and Embodied Environments for Interactive Learning new file mode 100644 index 0000000000..64c3cd8315 --- /dev/null +++ b/data/2021/iclr/ALFWorld: Aligning Text and Embodied Environments for Interactive Learning @@ -0,0 +1 @@ +Given a simple request (e.g., Put a washed apple in the kitchen fridge), humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld (Cote et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. 
BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, visual scene understanding, and so forth). \ No newline at end of file diff --git a/data/2021/iclr/ANOCE: Analysis of Causal Effects with Multiple Mediators via Constrained Structural Learning b/data/2021/iclr/ANOCE: Analysis of Causal Effects with Multiple Mediators via Constrained Structural Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity b/data/2021/iclr/ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity new file mode 100644 index 0000000000..70457335bf --- /dev/null +++ b/data/2021/iclr/ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity @@ -0,0 +1 @@ +Adversarial attacks pose a major challenge for modern deep neural networks. Recent advancements show that adversarially robust generalization requires a huge amount of labeled data for training. If annotation becomes a burden, can unlabeled data help bridge the gap? In this paper, we propose ARMOURED, an adversarially robust training method based on semi-supervised learning that consists of two components. The first component applies multi-view learning to simultaneously optimize multiple independent networks and utilizes unlabeled data to enforce labeling consistency. The second component reduces adversarial transferability among the networks via diversity regularizers inspired by determinantal point processes and entropy maximization. Experimental results show that under small perturbation budgets, ARMOURED is robust against strong adaptive adversaries. Notably, ARMOURED does not rely on generating adversarial samples during training. 
When used in combination with adversarial training, ARMOURED achieves state-of-the-art robustness against $\ell_\infty$ and $\ell_2$ attacks for a range of perturbation budgets, while maintaining high accuracy on clean samples. We demonstrate the robustness of ARMOURED on CIFAR-10 and SVHN datasets against state-of-the-art benchmarks in adversarial robust training. \ No newline at end of file diff --git a/data/2021/iclr/Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction b/data/2021/iclr/Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction new file mode 100644 index 0000000000..3fe8695746 --- /dev/null +++ b/data/2021/iclr/Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction @@ -0,0 +1 @@ +Replica exchange stochastic gradient Langevin dynamics (reSGLD) has shown promise in accelerating the convergence in non-convex learning; however, an excessively large correction for avoiding biases from noisy energy estimators has limited the potential of the acceleration. To address this issue, we study the variance reduction for noisy energy estimators, which promotes much more effective swaps. Theoretically, we provide a non-asymptotic analysis on the exponential acceleration for the underlying continuous-time Markov jump process; moreover, we consider a generalized Girsanov theorem which includes the change of Poisson measure to overcome the crude discretization based on Gronwall's inequality and yields a much tighter error in the 2-Wasserstein ($\mathcal{W}_2$) distance. Numerically, we conduct extensive experiments and obtain the state-of-the-art results in optimization and uncertainty estimates for synthetic experiments and image data. 
\ No newline at end of file diff --git a/data/2021/iclr/Accurate Learning of Graph Representations with Graph Multiset Pooling b/data/2021/iclr/Accurate Learning of Graph Representations with Graph Multiset Pooling new file mode 100644 index 0000000000..86072b695f --- /dev/null +++ b/data/2021/iclr/Accurate Learning of Graph Representations with Graph Multiset Pooling @@ -0,0 +1 @@ +Graph neural networks have been widely used on modeling graph data, achieving impressive results on node classification and link prediction tasks. Yet, obtaining an accurate representation for a graph further requires a pooling function that maps a set of node representations into a compact form. A simple sum or average over all node representations considers all node features equally without consideration of their task relevance, and any structural dependencies among them. Recently proposed hierarchical graph pooling methods, on the other hand, may yield the same representation for two different graphs that are distinguished by the Weisfeiler-Lehman test, as they suboptimally preserve information from the node features. To tackle these limitations of existing graph pooling methods, we first formulate the graph pooling problem as a multiset encoding problem with auxiliary information about the graph structure, and propose a Graph Multiset Transformer (GMT) which is a multi-head attention based global pooling layer that captures the interaction between nodes according to their structural dependencies. We show that GMT satisfies both injectiveness and permutation invariance, such that it is at most as powerful as the Weisfeiler-Lehman graph isomorphism test. Moreover, our methods can be easily extended to the previous node clustering approaches for hierarchical graph pooling. 
Our experimental results show that GMT significantly outperforms state-of-the-art graph pooling methods on graph classification benchmarks with high memory and time efficiency, and obtains even larger performance gain on graph reconstruction and generation tasks. \ No newline at end of file diff --git a/data/2021/iclr/Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning b/data/2021/iclr/Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning new file mode 100644 index 0000000000..f4fc53d08e --- /dev/null +++ b/data/2021/iclr/Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning @@ -0,0 +1 @@ +Federated learning (FL) is a distributed machine learning architecture that leverages a large number of workers to jointly learn a model with decentralized data. FL has received increasing attention in recent years thanks to its data privacy protection, communication efficiency and a linear speedup for convergence in training (i.e., convergence performance increases linearly with respect to the number of workers). However, existing studies on linear speedup for convergence are only limited to the assumptions of i.i.d. datasets across workers and/or full worker participation, both of which rarely hold in practice. So far, it remains an open question whether or not the linear speedup for convergence is achievable under non-i.i.d. datasets with partial worker participation in FL. In this paper, we show that the answer is affirmative. Specifically, we show that the federated averaging (FedAvg) algorithm (with two-sided learning rates) on non-i.i.d. 
datasets in non-convex settings achieves a convergence rate $\mathcal{O}(\frac{1}{\sqrt{mKT}} + \frac{1}{T})$ for full worker participation and a convergence rate $\mathcal{O}(\frac{1}{\sqrt{nKT}} + \frac{1}{T})$ for partial worker participation, where $K$ is the number of local steps, $T$ is the number of total communication rounds, $m$ is the total worker number and $n$ is the worker number in one communication round for partial worker participation. Our results also reveal that the local steps in FL could help the convergence and show that the maximum number of local steps can be improved to $T/m$. We conduct extensive experiments on MNIST and CIFAR-10 to verify our theoretical results. \ No newline at end of file diff --git a/data/2021/iclr/Acting in Delayed Environments with Non-Stationary Markov Policies b/data/2021/iclr/Acting in Delayed Environments with Non-Stationary Markov Policies new file mode 100644 index 0000000000..c351fe926d --- /dev/null +++ b/data/2021/iclr/Acting in Delayed Environments with Non-Stationary Markov Policies @@ -0,0 +1 @@ +The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it was chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of m steps. The brute-force state augmentation baseline where the state is concatenated to the last m committed actions suffers from an exponential complexity in m, as we show for policy iteration. We then prove that with execution delay, Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary. As for stationary Markov policies, we show they are sub-optimal in general. 
Consequently, we devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state-augmentation struggle or fail due to divergence. The code will be shared upon publication. \ No newline at end of file diff --git a/data/2021/iclr/Activation-level uncertainty in deep neural networks b/data/2021/iclr/Activation-level uncertainty in deep neural networks new file mode 100644 index 0000000000..41622b4720 --- /dev/null +++ b/data/2021/iclr/Activation-level uncertainty in deep neural networks @@ -0,0 +1 @@ +, \ No newline at end of file diff --git a/data/2021/iclr/Active Contrastive Learning of Audio-Visual Video Representations b/data/2021/iclr/Active Contrastive Learning of Audio-Visual Video Representations new file mode 100644 index 0000000000..cf75ab1e56 --- /dev/null +++ b/data/2021/iclr/Active Contrastive Learning of Audio-Visual Video Representations @@ -0,0 +1 @@ +Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in MI and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvements even with a large number of negative samples. We hypothesize that random negative sampling leads to a highly redundant dictionary that results in suboptimal representations for downstream tasks. 
In this paper, we propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items \ No newline at end of file diff --git a/data/2021/iclr/AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition b/data/2021/iclr/AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition new file mode 100644 index 0000000000..270a96a12b --- /dev/null +++ b/data/2021/iclr/AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition @@ -0,0 +1 @@ +Temporal modelling is key to efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly save computation, leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with current pruned feature maps with the goal of improving both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something-Something V1 & V2, Jester and Mini-Kinetics show that our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods. 
The project page can be found at https://mengyuest.github.io/AdaFuse/ \ No newline at end of file diff --git a/data/2021/iclr/AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models b/data/2021/iclr/AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models new file mode 100644 index 0000000000..e36ac335b9 --- /dev/null +++ b/data/2021/iclr/AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models @@ -0,0 +1 @@ +The design of deep graph models still remains to be investigated and the crucial part is how to explore and exploit the knowledge from different hops of neighbors in an efficient way. In this paper, we propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of the network; the proposed graph convolutional network, called AdaGCN (AdaBoosting Graph Convolutional Network), has the ability to efficiently extract knowledge from high-order neighbors and integrate knowledge from different hops of neighbors into the network in an AdaBoost way. We also present the architectural difference between AdaGCN and existing graph convolutional methods to show the benefits of our proposal. Finally, extensive experiments demonstrate the state-of-the-art prediction performance and the computational advantage of our approach AdaGCN. \ No newline at end of file diff --git a/data/2021/iclr/AdaSpeech: Adaptive Text to Speech for Custom Voice b/data/2021/iclr/AdaSpeech: Adaptive Text to Speech for Custom Voice new file mode 100644 index 0000000000..650b3be955 --- /dev/null +++ b/data/2021/iclr/AdaSpeech: Adaptive Text to Speech for Custom Voice @@ -0,0 +1 @@ +Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize personal voice for a target speaker using limited speech data. 
Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to speaker embedding for adaptation. We pre-train the source TTS model on LibriTTS datasets and fine-tune it on VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with few adaptation data, e.g., 20 sentences, about 1 minute speech. Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/. 
\ No newline at end of file diff --git a/data/2021/iclr/AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights b/data/2021/iclr/AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights new file mode 100644 index 0000000000..2330cd5806 --- /dev/null +++ b/data/2021/iclr/AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights @@ -0,0 +1 @@ +Normalization techniques are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in the current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performances. We propose a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. 
They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in those benchmarks. Source code is available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Adapting to Reward Progressivity via Spectral Reinforcement Learning b/data/2021/iclr/Adapting to Reward Progressivity via Spectral Reinforcement Learning new file mode 100644 index 0000000000..f2b547083e --- /dev/null +++ b/data/2021/iclr/Adapting to Reward Progressivity via Spectral Reinforcement Learning @@ -0,0 +1 @@ +In this paper we consider reinforcement learning tasks with progressive rewards; that is, tasks where the rewards tend to increase in magnitude over time. We hypothesise that this property may be problematic for value-based deep reinforcement learning agents, particularly if the agent must first succeed in relatively unrewarding regions of the task in order to reach more rewarding regions. To address this issue, we propose Spectral DQN, which decomposes the reward into frequencies such that the high frequencies only activate when large rewards are found. This allows the training loss to be balanced so that it gives more even weighting across small and large reward regions. In two domains with extreme reward progressivity, where standard value-based methods struggle significantly, Spectral DQN is able to make much farther progress. Moreover, when evaluated on a set of six standard Atari games that do not overtly favour the approach, Spectral DQN remains more than competitive: While it underperforms one of the benchmarks in a single game, it comfortably surpasses the benchmarks in three games. These results demonstrate that the approach is not overfit to its target problem, and suggest that Spectral DQN may have advantages beyond addressing reward progressivity. 
\ No newline at end of file diff --git a/data/2021/iclr/Adaptive Extra-Gradient Methods for Min-Max Optimization and Games b/data/2021/iclr/Adaptive Extra-Gradient Methods for Min-Max Optimization and Games new file mode 100644 index 0000000000..3b871b9a17 --- /dev/null +++ b/data/2021/iclr/Adaptive Extra-Gradient Methods for Min-Max Optimization and Games @@ -0,0 +1 @@ +We present a new family of min-max optimization algorithms that automatically exploit the geometry of the gradient data observed at earlier iterations to perform more informative extra-gradient steps in later ones. Thanks to this adaptation mechanism, the proposed method automatically detects whether the problem is smooth or not, without requiring any prior tuning by the optimizer. As a result, the algorithm simultaneously achieves order-optimal convergence rates, i.e., it converges to an $\varepsilon$-optimal solution within $\mathcal{O}(1/\varepsilon)$ iterations in smooth problems, and within $\mathcal{O}(1/\varepsilon^2)$ iterations in non-smooth ones. Importantly, these guarantees do not require any of the standard boundedness or Lipschitz continuity conditions that are typically assumed in the literature; in particular, they apply even to problems with singularities (such as resource allocation problems and the like). This adaptation is achieved through the use of a geometric apparatus based on Finsler metrics and a suitably chosen mirror-prox template that allows us to derive sharp convergence rates for the methods at hand. \ No newline at end of file diff --git a/data/2021/iclr/Adaptive Federated Optimization b/data/2021/iclr/Adaptive Federated Optimization new file mode 100644 index 0000000000..dda6844a11 --- /dev/null +++ b/data/2021/iclr/Adaptive Federated Optimization @@ -0,0 +1 @@ +Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. 
Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning. \ No newline at end of file diff --git a/data/2021/iclr/Adaptive Procedural Task Generation for Hard-Exploration Problems b/data/2021/iclr/Adaptive Procedural Task Generation for Hard-Exploration Problems new file mode 100644 index 0000000000..57d7a2fc0c --- /dev/null +++ b/data/2021/iclr/Adaptive Procedural Task Generation for Hard-Exploration Problems @@ -0,0 +1 @@ +We introduce Adaptive Procedural Task Generation (APT-Gen), an approach for progressively generating a sequence of tasks as curricula to facilitate reinforcement learning in hard-exploration problems. At the heart of our approach, a task generator learns to create tasks via a black-box procedural generation module by adaptively sampling from the parameterized task space. To enable curriculum learning in the absence of a direct indicator of learning progress, the task generator is trained by balancing the agent's expected return in the generated tasks and their similarities to the target task. Through adversarial training, the similarity between the generated tasks and the target task is adaptively estimated by a task discriminator defined on the agent's behaviors. 
In this way, our approach can efficiently generate tasks of rich variations for target tasks of unknown parameterization or not covered by the predefined task space. Experiments demonstrate the effectiveness of our approach through quantitative and qualitative analysis in various scenarios. \ No newline at end of file diff --git a/data/2021/iclr/Adaptive Universal Generalized PageRank Graph Neural Network b/data/2021/iclr/Adaptive Universal Generalized PageRank Graph Neural Network new file mode 100644 index 0000000000..b1f4679fe0 --- /dev/null +++ b/data/2021/iclr/Adaptive Universal Generalized PageRank Graph Neural Network @@ -0,0 +1 @@ +In many important graph data processing applications the acquired information includes both node features and observations of the graph topology. Graph neural networks (GNNs) are designed to exploit both sources of evidence, but they do not optimally trade off their utility and integrate them in a manner that is also universal. Here, universality refers to independence of homophily or heterophily graph assumptions. We address these issues by introducing a new Generalized PageRank (GPR) GNN architecture that adaptively learns the GPR weights so as to jointly optimize node feature and topological information extraction, regardless of the extent to which the node labels are homophilic or heterophilic. Learned GPR weights automatically adjust to the node label pattern, regardless of the type of initialization, and thereby guarantee excellent learning performance for label patterns that are usually hard to handle. Furthermore, they allow one to avoid feature over-smoothing, a process which renders feature information nondiscriminative, without requiring the network to be shallow. Our accompanying theoretical analysis of the GPR-GNN method is facilitated by novel synthetic benchmark datasets generated by the so-called contextual stochastic block model. 
We also compare the performance of our GNN architecture with that of several state-of-the-art GNNs on the problem of node-classification, using well-known benchmark homophilic and heterophilic datasets. The results demonstrate that GPR-GNN offers significant performance improvement compared to existing techniques on both synthetic and benchmark data. \ No newline at end of file diff --git a/data/2021/iclr/Adaptive and Generative Zero-Shot Learning b/data/2021/iclr/Adaptive and Generative Zero-Shot Learning new file mode 100644 index 0000000000..7ddefa425d --- /dev/null +++ b/data/2021/iclr/Adaptive and Generative Zero-Shot Learning @@ -0,0 +1 @@ +We address the problem of generalized zero-shot learning (GZSL) where the task is to predict the class label of a target image whether its label belongs to the seen or unseen category. Similar to ZSL, the learning setting assumes that all class-level semantic features are given, while only the images of seen classes are available for training. By exploring the correlation between image features and the corresponding semantic features, the main idea of the proposed approach is to enrich the semantic-to-visual (S2V) embeddings via a seamless fusion of adaptive and generative learning. To this end, we extend the semantic features of each class by supplementing image-adaptive attention so that the learned S2V embedding can account for not only inter-class but also intra-class variations. In addition, to break the limit of training with images only from seen classes, we design a generative scheme to simultaneously generate virtual class labels and their visual features by sampling and interpolating over seen counterparts. In inference, a testing image will give rise to two different S2V embeddings, seen and virtual. The former is used to decide whether the underlying label is of the unseen category or otherwise a specific seen class; the latter is to predict an unseen class label. 
To demonstrate the effectiveness of our method, we report state-of-the-art results on four standard GZSL datasets, including an ablation study of the proposed modules. \ No newline at end of file diff --git a/data/2021/iclr/Adversarial score matching and improved sampling for image generation b/data/2021/iclr/Adversarial score matching and improved sampling for image generation new file mode 100644 index 0000000000..eb1ec6c272 --- /dev/null +++ b/data/2021/iclr/Adversarial score matching and improved sampling for image generation @@ -0,0 +1,2 @@ +Denoising Score Matching with Annealed Langevin Sampling (DSM-ALS) has recently found success in generative modeling. The approach works by first training a neural network to estimate the score of a distribution, and then using Langevin dynamics to sample from the data distribution assumed by the score network. Despite the convincing visual quality of samples, this method appears to perform worse than Generative Adversarial Networks (GANs) under the Frechet Inception Distance, a standard metric for generative models. +We show that this apparent gap vanishes when denoising the final Langevin samples using the score network. In addition, we propose two improvements to DSM-ALS: 1) Consistent Annealed Sampling as a more stable alternative to Annealed Langevin Sampling, and 2) a hybrid training formulation, composed of both Denoising Score Matching and adversarial objectives. By combining these two techniques and exploring different network architectures, we elevate score matching methods and obtain results competitive with state-of-the-art image generation on CIFAR-10. 
\ No newline at end of file diff --git a/data/2021/iclr/Adversarially Guided Actor-Critic b/data/2021/iclr/Adversarially Guided Actor-Critic new file mode 100644 index 0000000000..1eff15f551 --- /dev/null +++ b/data/2021/iclr/Adversarially Guided Actor-Critic @@ -0,0 +1 @@ +Despite definite success in deep reinforcement learning problems, actor-critic algorithms are still confronted with sample inefficiency in complex environments, particularly in tasks where efficient exploration is a bottleneck. These methods consider a policy (the actor) and a value function (the critic) whose respective losses are built using different motivations and approaches. This paper introduces a third protagonist: the adversary. While the adversary mimics the actor by minimizing the KL-divergence between their respective action distributions, the actor, in addition to learning to solve the task, tries to differentiate itself from the adversary predictions. This novel objective stimulates the actor to follow strategies that could not have been correctly predicted from previous trajectories, making its behavior innovative in tasks where the reward is extremely rare. Our experimental analysis shows that the resulting Adversarially Guided Actor-Critic (AGAC) algorithm leads to more exhaustive exploration. Notably, AGAC outperforms current state-of-the-art methods on a set of various hard-exploration and procedurally-generated tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification b/data/2021/iclr/Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification new file mode 100644 index 0000000000..a66726f09f --- /dev/null +++ b/data/2021/iclr/Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification @@ -0,0 +1 @@ +Transfer learning has emerged as a powerful methodology for adapting pre-trained deep neural networks on image recognition tasks to new domains. This process consists of taking a neural network pre-trained on a large feature-rich source dataset, freezing the early layers that encode essential generic image properties, and then fine-tuning the last few layers in order to capture specific information related to the target situation. This approach is particularly useful when only limited or weakly labeled data are available for the new task. In this work, we demonstrate that adversarially-trained models transfer better than non-adversarially-trained models, especially if only limited data are available for the new domain task. Further, we observe that adversarial training biases the learnt representations toward retaining shapes, as opposed to textures, which impacts the transferability of the source models. Finally, through the lens of influence functions, we discover that transferred adversarially-trained models contain more human-identifiable semantic information, which explains – at least partly – why adversarially-trained models transfer better. \ No newline at end of file diff --git a/data/2021/iclr/Aligning AI With Shared Human Values b/data/2021/iclr/Aligning AI With Shared Human Values new file mode 100644 index 0000000000..5949ba66a2 --- /dev/null +++ b/data/2021/iclr/Aligning AI With Shared Human Values @@ -0,0 +1 @@ +We show how to assess a language model's knowledge of basic concepts of morality. 
We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete understanding of basic ethical knowledge. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values. \ No newline at end of file diff --git a/data/2021/iclr/An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale b/data/2021/iclr/An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale new file mode 100644 index 0000000000..b902e22485 --- /dev/null +++ b/data/2021/iclr/An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale @@ -0,0 +1 @@ +While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. 
\ No newline at end of file diff --git a/data/2021/iclr/An Unsupervised Deep Learning Approach for Real-World Image Denoising b/data/2021/iclr/An Unsupervised Deep Learning Approach for Real-World Image Denoising new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective b/data/2021/iclr/Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective new file mode 100644 index 0000000000..f3d443fb02 --- /dev/null +++ b/data/2021/iclr/Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective @@ -0,0 +1 @@ +In the recent literature on Graph Neural Networks (GNNs), the expressive power of models has been studied through their capability to distinguish if two given graphs are isomorphic or not. Since the graph isomorphism problem is NP-intermediate, and the Weisfeiler-Lehman (WL) test can give sufficient but not conclusive evidence in polynomial time, the theoretical power of GNNs is usually evaluated by the equivalence of WL-test order, followed by an empirical analysis of the models on some reference inductive and transductive datasets. However, such analysis does not account for the signal processing pipeline, whose capability is generally evaluated in the spectral domain. In this paper, we argue that a spectral analysis of GNN behavior can provide a complementary point of view to go one step further in the understanding of GNNs. By bridging the gap between the spectral and spatial design of graph convolutions, we theoretically demonstrate some equivalence of the graph convolution process regardless of whether it is designed in the spatial or the spectral domain. Using this connection, we re-formulate most of the state-of-the-art graph neural networks into one common framework. 
This general framework allows us to conduct a spectral analysis of the most popular GNNs, explaining their performance and showing their limits from a spectral point of view. Our theoretical spectral analysis is confirmed by experiments on various graph databases. Furthermore, we demonstrate the necessity of high-pass and/or band-pass filters on a graph dataset, whereas the majority of GNNs are limited to low-pass filtering and inevitably fail. Code available at https://github.com/balcilar/gnn-spectral-expressive-power. \ No newline at end of file diff --git a/data/2021/iclr/Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics b/data/2021/iclr/Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics new file mode 100644 index 0000000000..17f8953e26 --- /dev/null +++ b/data/2021/iclr/Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics @@ -0,0 +1 @@ +A central challenge in developing versatile machine learning systems is catastrophic forgetting: a model trained on tasks in sequence will suffer significant performance drops on earlier tasks. Despite the ubiquity of catastrophic forgetting, there is limited understanding of the underlying process and its causes. In this paper, we address this important knowledge gap, investigating how forgetting affects representations in neural network models. Through representational analysis techniques, we find that deeper layers are disproportionately the source of forgetting. Supporting this, a study of methods to mitigate forgetting illustrates that they act to stabilize deeper layers. These insights enable the development of an analytic argument and empirical picture relating the degree of forgetting to representational similarity between tasks. Consistent with this picture, we observe maximal forgetting occurs for task sequences with intermediate similarity. 
We perform empirical studies on the standard split CIFAR-10 setup and also introduce a novel CIFAR-100 based task approximating realistic input distribution shift. \ No newline at end of file diff --git a/data/2021/iclr/Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies b/data/2021/iclr/Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies new file mode 100644 index 0000000000..7e3c56f66d --- /dev/null +++ b/data/2021/iclr/Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies @@ -0,0 +1 @@ +We present DINO (\textbf{D}ETR with \textbf{I}mproved de\textbf{N}oising anch\textbf{O}r boxes), a state-of-the-art end-to-end object detector. % in this paper. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves $49.4$AP in $12$ epochs and $51.3$AP in $24$ epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of $\textbf{+6.0}$\textbf{AP} and $\textbf{+2.7}$\textbf{AP}, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \texttt{val2017} ($\textbf{63.2}$\textbf{AP}) and \texttt{test-dev} (\textbf{$\textbf{63.3}$AP}). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at \url{https://github.com/IDEACVR/DINO}. 
\ No newline at end of file diff --git a/data/2021/iclr/Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval b/data/2021/iclr/Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval new file mode 100644 index 0000000000..45c72da11d --- /dev/null +++ b/data/2021/iclr/Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval @@ -0,0 +1 @@ +We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER. Contrary to previous work, our method does not require access to any corpus-specific information, such as inter-document hyperlinks or human-annotated entity markers, and can be applied to any unstructured text corpus. Our system also yields a much better efficiency-accuracy trade-off, matching the best published accuracy on HotpotQA while being 10 times faster at inference time. \ No newline at end of file diff --git a/data/2021/iclr/Anytime Sampling for Autoregressive Models via Ordered Autoencoding b/data/2021/iclr/Anytime Sampling for Autoregressive Models via Ordered Autoencoding new file mode 100644 index 0000000000..39b079c55e --- /dev/null +++ b/data/2021/iclr/Anytime Sampling for Autoregressive Models via Ordered Autoencoding @@ -0,0 +1 @@ +Autoregressive models are widely used for tasks such as image and audio generation. The sampling process of these models, however, does not allow interruptions and cannot adapt to real-time computational resources. This challenge impedes the deployment of powerful autoregressive models, which involve a slow sampling process that is sequential in nature and typically scales linearly with respect to the data dimension. To address this difficulty, we propose a new family of autoregressive models that enables anytime sampling. 
Inspired by Principal Component Analysis, we learn a structured representation space where dimensions are ordered based on their importance with respect to reconstruction. Using an autoregressive model in this latent space, we trade off sample quality for computational efficiency by truncating the generation process before decoding into the original data space. Experimentally, we demonstrate in several image and audio generation tasks that sample quality degrades gracefully as we reduce the computational budget for sampling. The approach suffers almost no loss in sample quality (measured by FID) using only 60\% to 80\% of all latent dimensions for image data. Code is available at https://github.com/Newbeeer/Anytime-Auto-Regressive-Model . \ No newline at end of file diff --git a/data/2021/iclr/Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval b/data/2021/iclr/Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval new file mode 100644 index 0000000000..bb8d917219 --- /dev/null +++ b/data/2021/iclr/Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval @@ -0,0 +1 @@ +Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we identify that the main bottleneck is in the training mechanisms, where the negative instances used in training are not representative of the irrelevant documents in testing. This paper presents Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is parallelly updated with the learning process to select more realistic negative training instances. 
This fundamentally resolves the discrepancy between the data distribution used in the training and testing of DR. In our experiments, ANCE boosts the BERT-Siamese DR model to outperform all competitive dense and sparse retrieval baselines. It nearly matches the accuracy of sparse-retrieval-and-BERT-reranking using dot-product in the ANCE-learned representation space and provides almost 100x speed-up. \ No newline at end of file diff --git a/data/2021/iclr/Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks b/data/2021/iclr/Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks new file mode 100644 index 0000000000..65a257fad0 --- /dev/null +++ b/data/2021/iclr/Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks @@ -0,0 +1 @@ +Neural networks (NNs) whose subnetworks implement reusable functions are expected to offer numerous advantages, including compositionality through efficient recombination of functional building blocks, interpretability, preventing catastrophic interference, etc. Understanding if and how NNs are modular could provide insights into how to improve them. Current inspection methods, however, fail to link modules to their functionality. In this paper, we present a novel method based on learning binary weight masks to identify individual weights and subnets responsible for specific functions. Using this powerful tool, we contribute an extensive study of emerging modularity in NNs that covers several standard architectures and datasets. We demonstrate how common NNs fail to reuse submodules and offer new insights into the related issue of systematic generalization on language tasks. \ No newline at end of file diff --git a/data/2021/iclr/Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees? b/data/2021/iclr/Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees? 
new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Are wider nets better given the same number of parameters? b/data/2021/iclr/Are wider nets better given the same number of parameters? new file mode 100644 index 0000000000..2849d23753 --- /dev/null +++ b/data/2021/iclr/Are wider nets better given the same number of parameters? @@ -0,0 +1 @@ +Empirical studies demonstrate that the performance of neural networks improves with an increasing number of parameters. In most of these studies, the number of parameters is increased by increasing the network width. This raises the question: Is the observed improvement due to the larger number of parameters, or is it due to the larger width itself? We compare different ways of increasing model width while keeping the number of parameters constant. We show that for models initialized with a random, static sparsity pattern in the weight tensors, network width is the determining factor for good performance, while the number of weights is secondary, as long as trainability is ensured. As a step towards understanding this effect, we analyze these models in the framework of Gaussian Process kernels. We find that the distance between the sparse finite-width model kernel and the infinite-width kernel at initialization is indicative of model performance. \ No newline at end of file diff --git a/data/2021/iclr/Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning b/data/2021/iclr/Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning new file mode 100644 index 0000000000..960fbe7579 --- /dev/null +++ b/data/2021/iclr/Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning @@ -0,0 +1 @@ +Complex, multi-task problems have proven to be difficult to solve efficiently in a sparse-reward reinforcement learning setting. 
In order to be sample efficient, multi-task learning requires reuse and sharing of low-level policies. To facilitate the automatic decomposition of hierarchical tasks, we propose the use of step-by-step human demonstrations in the form of natural language instructions and action trajectories. We introduce a dataset of such demonstrations in a crafting-based grid world. Our model consists of a high-level language generator and low-level policy, conditioned on language. We find that human demonstrations help solve the most complex tasks. We also find that incorporating natural language allows the model to generalize to unseen tasks in a zero-shot setting and to learn quickly from a few demonstrations. Generalization is not only reflected in the actions of the agent, but also in the generated natural language instructions in unseen tasks. Our approach also gives our trained agent interpretable behaviors because it is able to generate a sequence of high-level descriptions of its actions. \ No newline at end of file diff --git a/data/2021/iclr/Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors b/data/2021/iclr/Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors new file mode 100644 index 0000000000..e3b3fa4eeb --- /dev/null +++ b/data/2021/iclr/Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors @@ -0,0 +1 @@ +Regularization by denoising (RED) is a recently developed framework for solving inverse problems by integrating advanced denoisers as image priors. Recent work has shown its state-of-the-art performance when combined with pre-trained deep denoisers. However, current RED algorithms are inadequate for parallel processing on multicore systems. 
We address this issue by proposing a new asynchronous RED (ASYNC-RED) algorithm that enables asynchronous parallel processing of data, making it significantly faster than its serial counterparts for large-scale inverse problems. The computational complexity of ASYNC-RED is further reduced by using a random subset of measurements at every iteration. We present complete theoretical analysis of the algorithm by establishing its convergence under explicit assumptions on the data-fidelity and the denoiser. We validate ASYNC-RED on image recovery using pre-trained deep denoisers as priors. \ No newline at end of file diff --git a/data/2021/iclr/Attentional Constellation Nets for Few-Shot Learning b/data/2021/iclr/Attentional Constellation Nets for Few-Shot Learning new file mode 100644 index 0000000000..47f0992877 --- /dev/null +++ b/data/2021/iclr/Attentional Constellation Nets for Few-Shot Learning @@ -0,0 +1 @@ +is \ No newline at end of file diff --git a/data/2021/iclr/Auction Learning as a Two-Player Game b/data/2021/iclr/Auction Learning as a Two-Player Game new file mode 100644 index 0000000000..30cd2fc921 --- /dev/null +++ b/data/2021/iclr/Auction Learning as a Two-Player Game @@ -0,0 +1 @@ +Designing an incentive compatible auction that maximizes expected revenue is a central problem in Auction Design. While theoretical approaches to the problem have hit some limits, a recent research direction initiated by Duetting et al. (2019) consists in building neural network architectures to find optimal auctions. We propose two conceptual deviations from their approach which result in enhanced performance. First, we use recent results in theoretical auction design (Rubinstein and Weinberg, 2018) to introduce a time-independent Lagrangian. This not only circumvents the need for an expensive hyper-parameter search (as in prior work), but also provides a principled metric to compare the performance of two auctions (absent from prior work). 
Second, the optimization procedure in previous work uses an inner maximization loop to compute optimal misreports. We amortize this process through the introduction of an additional neural network. We demonstrate the effectiveness of our approach by learning competitive or strictly improved auctions compared to prior work. Both results together further imply a novel formulation of Auction Design as a two-player game with stationary utility functions. \ No newline at end of file diff --git a/data/2021/iclr/Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting b/data/2021/iclr/Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting new file mode 100644 index 0000000000..23ca96629c --- /dev/null +++ b/data/2021/iclr/Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting @@ -0,0 +1 @@ +Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling-based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists of decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model; no more, no less. 
This not only guarantees the existence and uniqueness of the decomposition, but also ensures interpretability and benefits generalization. Experiments on three important use cases, each representative of a different family of phenomena, i.e. reaction–diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters. The code is available at https://github.com/yuan-yin/APHYNITY. \ No newline at end of file diff --git a/data/2021/iclr/Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation b/data/2021/iclr/Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation new file mode 100644 index 0000000000..353664f3ed --- /dev/null +++ b/data/2021/iclr/Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation @@ -0,0 +1 @@ +Designing proper loss functions is essential in training deep networks. Especially in the field of semantic segmentation, various evaluation metrics have been proposed for diverse scenarios. Despite the success of the widely adopted cross-entropy loss and its variants, the misalignment between the loss functions and evaluation metrics degrades the network performance. Meanwhile, manually designing loss functions for each specific metric requires expertise and significant manpower. In this paper, we propose to automate the design of metric-specific loss functions by searching differentiable surrogate losses for each metric. We substitute the non-differentiable operations in the metrics with parameterized functions, and conduct parameter search to optimize the shape of loss surfaces. Two constraints are introduced to regularize the search space and make the search efficient. Extensive experiments on PASCAL VOC and Cityscapes demonstrate that the searched surrogate losses outperform the manually designed loss functions consistently. 
The searched losses can generalize well to other datasets and networks. Code shall be released. \ No newline at end of file diff --git a/data/2021/iclr/AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly b/data/2021/iclr/AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly new file mode 100644 index 0000000000..f2abf722aa --- /dev/null +++ b/data/2021/iclr/AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly @@ -0,0 +1 @@ +The learning rate (LR) schedule is one of the most important hyper-parameters needing careful tuning in training DNNs. However, it is also one of the least automated parts of machine learning systems and usually costs significant manual effort and computing. Though there are pre-defined LR schedules and optimizers with adaptive LR, they introduce new hyperparameters that need to be tuned separately for different tasks/datasets. In this paper, we consider the question: Can we automatically tune the LR over the course of training without human involvement? We propose an efficient method, AutoLRS, which automatically optimizes the LR for each training stage by modeling training dynamics. AutoLRS aims to find an LR applied to every $\tau$ steps that minimizes the resulted validation loss. We solve this black-box optimization on the fly by Bayesian optimization (BO). However, collecting training instances for BO requires a system to evaluate each LR queried by BO's acquisition function for $\tau$ steps, which is prohibitively expensive in practice. Instead, we apply each candidate LR for only $\tau'\ll\tau$ steps and train an exponential model to predict the validation loss after $\tau$ steps. This mutual-training process between BO and the loss-prediction model allows us to limit the training steps invested in the BO search. 
We demonstrate the advantages and the generality of AutoLRS through extensive experiments of training DNNs for tasks from diverse domains using different optimizers. The LR schedules auto-generated by AutoLRS lead to a speedup of $1.22\times$, $1.43\times$, and $1.5\times$ when training ResNet-50, Transformer, and BERT, respectively, compared to the LR schedules in their original papers, and an average speedup of $1.31\times$ over state-of-the-art heavily-tuned LR schedules. \ No newline at end of file diff --git a/data/2021/iclr/Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization b/data/2021/iclr/Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization new file mode 100644 index 0000000000..9ec5b84504 --- /dev/null +++ b/data/2021/iclr/Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization @@ -0,0 +1 @@ +Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail deterministic transition dynamics. In this paper, we challenge this conditional independence assumption and propose a family of expressive autoregressive dynamics models that generate different dimensions of the next state and reward sequentially conditioned on previous dimensions. We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on heldout transitions. 
Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state-of-the-art. Finally, we show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer through data augmentation and improving performance using model-based planning. \ No newline at end of file diff --git a/data/2021/iclr/Autoregressive Entity Retrieval b/data/2021/iclr/Autoregressive Entity Retrieval new file mode 100644 index 0000000000..e334930b23 --- /dev/null +++ b/data/2021/iclr/Autoregressive Entity Retrieval @@ -0,0 +1 @@ +Entities are at the center of how we represent and aggregate knowledge. For instance, Encyclopedias such as Wikipedia are structured by entities (e.g., one per article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. One way to understand current approaches is as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity information such as descriptions. This approach leads to several shortcomings: i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions between the two; ii) a large memory footprint is needed to store dense representations when considering large entity sets; iii) an appropriately hard set of negative data has to be subsampled at training time. We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion, and conditioned on the context. 
This enables us to mitigate the aforementioned technical issues: i) the autoregressive formulation allows us to directly capture relations between context and entity name, effectively cross-encoding both; ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; iii) the exact softmax loss can be efficiently computed without the need to subsample negative data. We show the efficacy of the approach with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their unambiguous name. \ No newline at end of file diff --git a/data/2021/iclr/Auxiliary Learning by Implicit Differentiation b/data/2021/iclr/Auxiliary Learning by Implicit Differentiation new file mode 100644 index 0000000000..01712b5739 --- /dev/null +++ b/data/2021/iclr/Auxiliary Learning by Implicit Differentiation @@ -0,0 +1 @@ +Training with multiple auxiliary tasks is a common practice used in deep learning for improving the performance on the main task of interest. Two main challenges arise in this multi-task learning setting: (i) Designing useful auxiliary tasks; and (ii) Combining auxiliary tasks into a single coherent loss. We propose a novel framework, \textit{AuxiLearn}, that targets both challenges, based on implicit differentiation. First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function. This network can learn \textit{non-linear} interactions between auxiliary tasks. Second, when no useful auxiliary task is known, we describe how to learn a network that generates a meaningful, novel auxiliary task.
We evaluate AuxiLearn in a series of tasks and domains, including image segmentation and learning with attributes. We find that AuxiLearn consistently improves accuracy compared with competing methods. \ No newline at end of file diff --git a/data/2021/iclr/Auxiliary Task Update Decomposition: the Good, the Bad and the neutral b/data/2021/iclr/Auxiliary Task Update Decomposition: the Good, the Bad and the neutral new file mode 100644 index 0000000000..aa1f7e95da --- /dev/null +++ b/data/2021/iclr/Auxiliary Task Update Decomposition: the Good, the Bad and the neutral @@ -0,0 +1 @@ +While deep learning has been very beneficial in data-rich settings, tasks with smaller training sets often resort to pre-training or multitask learning to leverage data from other tasks. In this case, careful consideration is needed to select tasks and model parameterizations such that updates from the auxiliary tasks actually help the primary task. We seek to alleviate this burden by formulating a model-agnostic framework that performs fine-grained manipulation of the auxiliary task gradients. We propose to decompose auxiliary updates into directions which help, damage or leave the primary task loss unchanged. This allows weighting the update directions differently depending on their impact on the problem of interest. We present a novel and efficient algorithm for that purpose and show its advantage in practice. Our method leverages efficient automatic differentiation procedures and randomized singular value decomposition for scalability. We show that our framework is generic and encompasses some prior work as particular cases. Our approach consistently outperforms strong and widely used baselines when leveraging out-of-distribution data for text and image classification tasks.
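The good/bad/neutral decomposition described in the abstract above can be illustrated with a single gradient projection. The paper itself operates on batches of per-example gradients via randomized SVD, so the sketch below is a deliberately simplified single-vector toy; the function name and the good/bad/neutral labels are hypothetical stand-ins:

```python
def decompose_auxiliary(g_aux, g_main):
    # Split an auxiliary-task gradient into a component parallel to the
    # main-task gradient (helpful if aligned, harmful if opposed, by the
    # sign of the dot product) and an orthogonal ("neutral") component.
    # Assumes g_main is nonzero.
    dot = sum(a * m for a, m in zip(g_aux, g_main))
    norm2 = sum(m * m for m in g_main)
    coef = dot / norm2
    parallel = [coef * m for m in g_main]
    orthogonal = [a - p for a, p in zip(g_aux, parallel)]
    label = "good" if dot > 0 else ("bad" if dot < 0 else "neutral")
    return parallel, orthogonal, label

par, orth, label = decompose_auxiliary([2.0, 1.0], [1.0, 0.0])
print(label, par, orth)  # good [2.0, 0.0] [0.0, 1.0]
```

Reweighting the three components differently, e.g. keeping the helpful and neutral parts while discarding the harmful one, recovers the kind of fine-grained manipulation the abstract describes.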
\ No newline at end of file diff --git a/data/2021/iclr/Average-case Acceleration for Bilinear Games and Normal Matrices b/data/2021/iclr/Average-case Acceleration for Bilinear Games and Normal Matrices new file mode 100644 index 0000000000..c3117db967 --- /dev/null +++ b/data/2021/iclr/Average-case Acceleration for Bilinear Games and Normal Matrices @@ -0,0 +1 @@ +Advances in generative modeling and adversarial learning have given rise to renewed interest in smooth games. However, the absence of symmetry in the matrix of second derivatives poses challenges that are not present in the classical minimization framework. While a rich theory of average-case analysis has been developed for minimization problems, little is known in the context of smooth games. In this work we take a first step towards closing this gap by developing average-case optimal first-order methods for a subset of smooth games. We make the following three main contributions. First, we show that for zero-sum bilinear games the average-case optimal method is the optimal method for the minimization of the Hamiltonian. Second, we provide an explicit expression for the optimal method corresponding to normal matrices, potentially non-symmetric. Finally, we specialize it to matrices with eigenvalues located in a disk and show a provable speed-up compared to worst-case optimal algorithms. We illustrate our findings through benchmarks with a varying degree of mismatch with our assumptions. \ No newline at end of file diff --git a/data/2021/iclr/BERTology Meets Biology: Interpreting Attention in Protein Language Models b/data/2021/iclr/BERTology Meets Biology: Interpreting Attention in Protein Language Models new file mode 100644 index 0000000000..73bb9f329f --- /dev/null +++ b/data/2021/iclr/BERTology Meets Biology: Interpreting Attention in Protein Language Models @@ -0,0 +1 @@ +Transformer architectures have proven to learn useful representations for protein classification and generation tasks. 
However, these representations present challenges in interpretability. Through the lens of attention, we analyze the inner workings of the Transformer and explore how the model discerns structural and functional properties of proteins. We show that attention (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We also present a three-dimensional visualization of the interaction between attention and protein structure. Our findings align with known biological processes and provide a tool to aid discovery in protein engineering and synthetic biology. The code for visualization and analysis is available at https://github.com/salesforce/provis. \ No newline at end of file diff --git a/data/2021/iclr/BOIL: Towards Representation Change for Few-shot Learning b/data/2021/iclr/BOIL: Towards Representation Change for Few-shot Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction b/data/2021/iclr/BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction new file mode 100644 index 0000000000..547992b42f --- /dev/null +++ b/data/2021/iclr/BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction @@ -0,0 +1 @@ +We study the challenging task of neural network quantization without end-to-end retraining, called Post-training Quantization (PTQ). PTQ usually requires a small subset of training data but produces less powerful quantized models than Quantization-Aware Training (QAT). In this work, we propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time. 
BRECQ leverages the basic building blocks in neural networks and reconstructs them one-by-one. In a comprehensive theoretical study of the second-order error, we show that BRECQ achieves a good balance between cross-layer dependency and generalization error. To further exploit the power of quantization, the mixed-precision technique is incorporated in our framework by approximating the inter-layer and intra-layer sensitivity. Extensive experiments on various handcrafted and searched neural architectures are conducted for both image classification and object detection tasks. For the first time, we prove that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 models comparable with QAT and enjoy 240 times faster production of quantized models. Code is available at https://github.com/yhhhli/BRECQ. \ No newline at end of file diff --git a/data/2021/iclr/BREEDS: Benchmarks for Subpopulation Shift b/data/2021/iclr/BREEDS: Benchmarks for Subpopulation Shift new file mode 100644 index 0000000000..97867e272b --- /dev/null +++ b/data/2021/iclr/BREEDS: Benchmarks for Subpopulation Shift @@ -0,0 +1 @@ +We develop a methodology for assessing the robustness of models to subpopulation shift---specifically, their ability to generalize to novel data subpopulations that were not observed during training. Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts whose sources can be precisely controlled and characterized, within existing large-scale datasets. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity. We then validate that the corresponding shifts are tractable by obtaining human baselines for them.
Finally, we utilize these benchmarks to measure the sensitivity of standard model architectures as well as the effectiveness of off-the-shelf train-time robustness interventions. Code and data are available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization b/data/2021/iclr/BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization new file mode 100644 index 0000000000..8587c6f8a7 --- /dev/null +++ b/data/2021/iclr/BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization @@ -0,0 +1 @@ +Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and has thus been widely investigated. However, it lacks a systematic method to determine the exact quantization scheme. Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space. These approaches cannot lead to an optimal quantization scheme efficiently. This work proposes bit-level sparsity quantization (BSQ) to tackle mixed-precision quantization from a new angle of inducing bit-level sparsity. We consider each bit of quantized weights as an independent trainable variable and introduce a differentiable bit-sparsity regularizer. BSQ can induce all-zero bits across a group of weight elements and realize dynamic precision reduction, leading to a mixed-precision quantization scheme of the original model. Our method enables the exploration of the full mixed-precision space with a single gradient-based optimization process, with only one hyperparameter to trade off performance and compression. BSQ achieves both higher accuracy and higher bit reduction on various model architectures on the CIFAR-10 and ImageNet datasets compared to previous methods.
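The bit-level view in the BSQ abstract above is easy to make concrete: treating each bit of a quantized weight group as its own variable, an all-zero high-order bit plane means the group's precision can be reduced without changing any value. The differentiable regularizer that drives bits toward zero during training is omitted here; this toy sketch (helper names are mine) only shows how all-zero bit planes translate into fewer effective bits:

```python
def bit_planes(weights, n_bits):
    # weights: non-negative ints < 2**n_bits.
    # Plane b collects bit b of every weight in the group.
    return [[(w >> b) & 1 for w in weights] for b in range(n_bits)]

def effective_bits(weights, n_bits):
    # BSQ-style dynamic precision reduction: high-order bit planes that are
    # all-zero across the whole group can simply be dropped.
    planes = bit_planes(weights, n_bits)
    bits = n_bits
    for plane in reversed(planes):  # from MSB down
        if any(plane):
            break
        bits -= 1
    return bits

group = [3, 1, 2, 0]  # stored as 8-bit, but only 2 bit planes are non-zero
print(effective_bits(group, 8))  # 2
```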
\ No newline at end of file diff --git a/data/2021/iclr/BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration b/data/2021/iclr/BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration new file mode 100644 index 0000000000..1e383f99fe --- /dev/null +++ b/data/2021/iclr/BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration @@ -0,0 +1 @@ +Program synthesis is challenging largely because of the difficulty of search in a large space of programs. Human programmers routinely tackle the task of writing complex programs by writing sub-programs and then analysing their intermediate results to compose them in appropriate ways. Motivated by this intuition, we present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a given set of input-output examples. This is a powerful combination because of several emergent properties: First, in bottom-up search, intermediate programs can be executed, providing semantic information to the neural network. Second, given the concrete values from those executions, we can exploit rich features based on recent work on property signatures. Finally, bottom-up search allows the system substantial flexibility in what order to generate the solution, allowing the synthesizer to build up a program from multiple smaller sub-programs. Overall, our empirical evaluation finds that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches. We demonstrate the effectiveness of our technique on a new data set for synthesis of string transformation programs. 
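The bottom-up search the BUSTLE abstract describes can be sketched in a few lines for a toy string DSL. The operations (`upper`, `concat`) and the space constant are my stand-ins, and the learned model that prioritizes compositions is omitted, leaving plain enumeration by program size; but the key property survives: every stored entry is a concrete *value* obtained by executing an intermediate program, so larger programs are composed from already-executed sub-programs:

```python
def bottom_up_synthesize(x, target, max_size=6):
    # Map each distinct concrete value to (smallest program producing it, size),
    # then grow the table by composing stored values with the DSL operations.
    values = {x: ("x", 1), " ": ("' '", 1)}  # terminals: input + constant
    for size in range(2, max_size + 1):
        new = {}
        for v, (e, s) in values.items():          # unary op: upper
            if s + 1 == size and v.upper() not in values:
                new[v.upper()] = (f"upper({e})", size)
        for v1, (e1, s1) in values.items():       # binary op: concat
            for v2, (e2, s2) in values.items():
                if s1 + s2 + 1 == size and v1 + v2 not in values:
                    new.setdefault(v1 + v2, (f"concat({e1},{e2})", size))
        values.update(new)
        if target in values:                      # check against the I/O example
            return values[target][0]
    return None

print(bottom_up_synthesize("abc", "ABC abc"))
```

In BUSTLE, a learned model conditioned on the input-output examples scores which stored values to compose first, replacing this exhaustive size-ordered sweep.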
\ No newline at end of file diff --git a/data/2021/iclr/Bag of Tricks for Adversarial Training b/data/2021/iclr/Bag of Tricks for Adversarial Training new file mode 100644 index 0000000000..207c95148a --- /dev/null +++ b/data/2021/iclr/Bag of Tricks for Adversarial Training @@ -0,0 +1 @@ +Adversarial training (AT) is one of the most effective strategies for promoting model robustness. However, recent benchmarks show that most of the proposed improvements on AT are less effective than simply early stopping the training procedure. This counter-intuitive fact motivates us to investigate the implementation details of tens of AT methods. Surprisingly, we find that the basic training settings (e.g., weight decay, learning rate schedule, etc.) used in these methods are highly inconsistent, which could largely affect the model performance as shown in our experiments. For example, a slightly different value of weight decay can reduce the model robust accuracy by more than 7%, which is likely to override the potential improvement induced by the proposed methods. In this work, we provide comprehensive evaluations on the effects of basic training tricks and hyperparameter settings for adversarially trained models. We provide a reasonable baseline setting and re-implement previous defenses to achieve new state-of-the-art results. \ No newline at end of file diff --git a/data/2021/iclr/Balancing Constraints and Rewards with Meta-Gradient D4PG b/data/2021/iclr/Balancing Constraints and Rewards with Meta-Gradient D4PG new file mode 100644 index 0000000000..2839d6aea6 --- /dev/null +++ b/data/2021/iclr/Balancing Constraints and Rewards with Meta-Gradient D4PG @@ -0,0 +1 @@ +Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints.
Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., no simulator or reasonable offline evaluation procedure exists). This results in solutions where a task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable yet they are not catastrophic, motivating the need for soft-constrained RL approaches. We present two soft-constrained RL approaches that utilize meta-gradients to find a good trade-off between expected return and minimizing constraint violations. We demonstrate the effectiveness of these approaches by showing that they consistently outperform the baselines across four different Mujoco domains. \ No newline at end of file diff --git a/data/2021/iclr/Batch Reinforcement Learning Through Continuation Method b/data/2021/iclr/Batch Reinforcement Learning Through Continuation Method new file mode 100644 index 0000000000..fd0cb58e08 --- /dev/null +++ b/data/2021/iclr/Batch Reinforcement Learning Through Continuation Method @@ -0,0 +1 @@ +Many real-world applications of reinforcement learning (RL) require the agent to learn from a fixed set of trajectories, without collecting new interactions. Policy optimization under this setting is extremely challenging as: 1) the geometry of the objective function is hard to optimize efficiently; 2) the shift of data distributions causes high noise in the value estimation. In this work, we propose a simple yet effective policy iteration approach to batch RL using global optimization techniques known as continuation.
By constraining the difference between the learned policy and the behavior policy that generates the fixed trajectories \ No newline at end of file diff --git "a/data/2021/iclr/Bayesian Few-Shot Classification with One-vs-Each P\303\263lya-Gamma Augmented Gaussian Processes" "b/data/2021/iclr/Bayesian Few-Shot Classification with One-vs-Each P\303\263lya-Gamma Augmented Gaussian Processes" new file mode 100644 index 0000000000..21cdb59086 --- /dev/null +++ "b/data/2021/iclr/Bayesian Few-Shot Classification with One-vs-Each P\303\263lya-Gamma Augmented Gaussian Processes" @@ -0,0 +1 @@ +Few-shot classification (FSC), the task of adapting a classifier to unseen classes given a small labeled dataset, is an important step on the path toward human-like machine learning. Bayesian methods are well-suited to tackling the fundamental issue of overfitting in the few-shot scenario because they allow practitioners to specify prior beliefs and update those beliefs in light of observed data. Contemporary approaches to Bayesian few-shot classification maintain a posterior distribution over model parameters, which is slow and requires storage that scales with model size. Instead, we propose a Gaussian process classifier based on a novel combination of Polya-gamma augmentation and the one-vs-each softmax approximation that allows us to efficiently marginalize over functions rather than model parameters. We demonstrate improved accuracy and uncertainty quantification on both standard few-shot classification benchmarks and few-shot domain transfer tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Behavioral Cloning from Noisy Demonstrations b/data/2021/iclr/Behavioral Cloning from Noisy Demonstrations new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods b/data/2021/iclr/Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods new file mode 100644 index 0000000000..0185c6f6f2 --- /dev/null +++ b/data/2021/iclr/Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods @@ -0,0 +1 @@ +Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. Towards answering this question, we evaluate excess risk of a deep learning estimator trained by a noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority to a class of linear estimators that includes neural tangent kernel approach, random feature model, other kernel methods, $k$-NN estimator and so on. We consider a teacher-student regression model, and eventually show that any linear estimator can be outperformed by deep learning in a sense of the minimax optimal rate especially for a high dimension setting. The obtained excess bounds are so-called fast learning rate which is faster than $O(1/\sqrt{n})$ that is obtained by usual Rademacher complexity analysis. This discrepancy is induced by the non-convex geometry of the model and the noisy gradient descent used for neural network training provably reaches a near global optimal solution even though the loss landscape is highly non-convex. 
Although the noisy gradient descent does not employ any explicit or implicit sparsity-inducing regularization, it achieves generalization performance that dominates linear estimators. \ No newline at end of file diff --git a/data/2021/iclr/Better Fine-Tuning by Reducing Representational Collapse b/data/2021/iclr/Better Fine-Tuning by Reducing Representational Collapse new file mode 100644 index 0000000000..c4fb9a5010 --- /dev/null +++ b/data/2021/iclr/Better Fine-Tuning by Reducing Representational Collapse @@ -0,0 +1 @@ +Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representational collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.
\ No newline at end of file diff --git a/data/2021/iclr/Beyond Categorical Label Representations for Image Classification b/data/2021/iclr/Beyond Categorical Label Representations for Image Classification new file mode 100644 index 0000000000..7fa22452c6 --- /dev/null +++ b/data/2021/iclr/Beyond Categorical Label Representations for Image Classification @@ -0,0 +1 @@ +We find that the way we choose to represent data labels can have a profound effect on the quality of trained models. For example, training an image classifier to regress audio labels rather than traditional categorical probabilities produces a more reliable classification. This result is surprising, considering that audio labels are more complex than simpler numerical probabilities or text. We hypothesize that high dimensional, high entropy label representations are generally more useful because they provide a stronger error signal. We support this hypothesis with evidence from various label representations including constant matrices, spectrograms, shuffled spectrograms, Gaussian mixtures, and uniform random matrices of various dimensionalities. Our experiments reveal that high dimensional, high entropy labels achieve comparable accuracy to text (categorical) labels on the standard image classification task, but features learned through our label representations exhibit more robustness under various adversarial attacks and better effectiveness with a limited amount of training data. These results suggest that label representation may play a more important role than previously thought. The project website is at \url{https://www.creativemachineslab.com/label-representation.html}. 
\ No newline at end of file diff --git a/data/2021/iclr/Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1 n Parameters b/data/2021/iclr/Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1 n Parameters new file mode 100644 index 0000000000..9429065f70 --- /dev/null +++ b/data/2021/iclr/Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1 n Parameters @@ -0,0 +1 @@ +Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, "fully-connected layers with Quaternions" (4D hypercomplex numbers), which replace real-valued matrix multiplications in fully-connected layers with Hamilton products of Quaternions, enjoy parameter savings, using only 1/4 of the learnable parameters, and achieve comparable performance in various applications. However, one key caveat is that hypercomplex space only exists at very few predefined dimensions (4D, 8D, and 16D). This restricts the flexibility of models that leverage hypercomplex multiplications. To this end, we propose parameterizing hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined. As a result, our method not only subsumes the Hamilton product, but also learns to operate on any arbitrary nD hypercomplex space, providing more architectural flexibility using only $1/n$ of the learnable parameters of the fully-connected layer counterpart. Experiments applying the proposed approach to LSTM and Transformer models on natural language inference, machine translation, text style transfer, and subject-verb agreement demonstrate the architectural flexibility and effectiveness of the proposed approach.
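The parameterization the abstract above describes can be sketched as a sum of Kronecker products: n learned n-by-n "rule" matrices A_i combined with n parameter blocks S_i give a weight matrix W = sum_i A_i (x) S_i, which is roughly a 1/n reduction in learned parameters. This is a minimal pure-Python sketch (function names are mine, not the paper's); with n = 2 and fixed rules it recovers complex-number multiplication, just as fixed 4D rules would recover the Hamilton product:

```python
def kron(A, B):
    # Kronecker product of two matrices given as nested lists.
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def add(X, Y):
    # Elementwise matrix sum.
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def phm_weight(rules, blocks):
    # W = sum_i  rules[i] (x) blocks[i] : "multiplication rule" matrices
    # combined with small parameter blocks, as in parameterized
    # hypercomplex layers.
    W = kron(rules[0], blocks[0])
    for A, S in zip(rules[1:], blocks[1:]):
        W = add(W, kron(A, S))
    return W

# n = 2 with these fixed rules and scalar blocks a, b yields the matrix of
# the complex number a + bi, i.e. complex multiplication as a special case.
A1, A2 = [[1, 0], [0, 1]], [[0, -1], [1, 0]]
a, b = 3.0, 4.0
W = phm_weight([A1, A2], [[[a]], [[b]]])
print(W)  # [[3.0, -4.0], [4.0, 3.0]]
```

In the paper the rule matrices are themselves learned from data, which is what lets the layer operate on arbitrary nD hypercomplex spaces rather than only the predefined 4D/8D/16D ones.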
\ No newline at end of file diff --git a/data/2021/iclr/BiPointNet: Binary Neural Network for Point Clouds b/data/2021/iclr/BiPointNet: Binary Neural Network for Point Clouds new file mode 100644 index 0000000000..31b4aad1fe --- /dev/null +++ b/data/2021/iclr/BiPointNet: Binary Neural Network for Point Clouds @@ -0,0 +1 @@ +To alleviate the resource constraint for real-time point cloud applications that run on edge devices, we present BiPointNet, the first model binarization approach for efficient deep learning on point clouds. In this work, we discover that the immense performance drop of binarized models for point clouds is caused by two main challenges: aggregation-induced feature homogenization that leads to a degradation of information entropy, and scale distortion that hinders optimization and invalidates scale-sensitive structures. With theoretical justifications and in-depth analysis, we propose Entropy-Maximizing Aggregation (EMA) to modulate the distribution before aggregation for the maximum information entropy, and Layer-wise Scale Recovery (LSR) to efficiently restore feature scales. Extensive experiments show that our BiPointNet outperforms existing binarization methods by convincing margins, at a level even comparable with the full-precision counterpart. We highlight that our techniques are generic and show significant improvements on various fundamental tasks and mainstream backbones. BiPointNet gives an impressive 14.7 times speedup and 18.9 times storage saving on real-world resource-constrained devices.
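The aggregation-induced homogenization the BiPointNet abstract mentions is easy to reproduce numerically: max-pooling the sign-binarized features of many points almost surely outputs +1, so the pooled feature carries no information (zero entropy). EMA shifts the pre-aggregation distribution to restore entropy; in the paper the shift is derived analytically, whereas the offset of -2.5 below is hand-picked for this toy simulation:

```python
import math
import random

random.seed(0)

def pooled_entropy(offset, n_points=64, trials=2000):
    # Sign-binarize Gaussian pre-activations (shifted by `offset`),
    # max-pool across the point set, and measure the entropy of the
    # pooled binary output over many trials.
    ones = 0
    for _ in range(trials):
        pooled = max(1 if random.gauss(0, 1) + offset > 0 else -1
                     for _ in range(n_points))
        ones += pooled == 1
    p = ones / trials
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

h0 = pooled_entropy(0.0)        # plain max-pool: output is (almost) always +1
h_shift = pooled_entropy(-2.5)  # shifted distribution: pooled output varies
print(h0, h_shift)
```

With no shift the pooled bit is constant (entropy 0); shifting the distribution before aggregation makes the pooled output informative again, which is the effect EMA formalizes.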
\ No newline at end of file diff --git a/data/2021/iclr/Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech b/data/2021/iclr/Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Blending MPC & Value Function Approximation for Efficient Reinforcement Learning b/data/2021/iclr/Blending MPC & Value Function Approximation for Efficient Reinforcement Learning new file mode 100644 index 0000000000..7ab1a4bf48 --- /dev/null +++ b/data/2021/iclr/Blending MPC & Value Function Approximation for Efficient Reinforcement Learning @@ -0,0 +1 @@ +Model-Predictive Control (MPC) is a powerful tool for controlling complex, real-world systems that uses a model to make predictions about future behavior. For each state encountered, MPC solves an online optimization problem to choose a control action that will minimize future cost. This is a surprisingly effective strategy, but real-time performance requirements warrant the use of simple models. If the model is not sufficiently accurate, then the resulting controller can be biased, limiting performance. We present a framework for improving on MPC with model-free reinforcement learning (RL). The key insight is to view MPC as constructing a series of local Q-function approximations. We show that by using a parameter $\lambda$, similar to the trace decay parameter in TD($\lambda$), we can systematically trade-off learned value estimates against the local Q-function approximations. We present a theoretical analysis that shows how error from inaccurate models in MPC and value function estimation in RL can be balanced. We further propose an algorithm that changes $\lambda$ over time to reduce the dependence on MPC as our estimates of the value function improve, and test the efficacy of our approach on challenging high-dimensional manipulation tasks with biased models in simulation.
We demonstrate that our approach can obtain performance comparable with MPC with access to true dynamics even under severe model bias and is more sample efficient as compared to model-free RL. \ No newline at end of file diff --git a/data/2021/iclr/Boost then Convolve: Gradient Boosting Meets Graph Neural Networks b/data/2021/iclr/Boost then Convolve: Gradient Boosting Meets Graph Neural Networks new file mode 100644 index 0000000000..9f7e1627c1 --- /dev/null +++ b/data/2021/iclr/Boost then Convolve: Gradient Boosting Meets Graph Neural Networks @@ -0,0 +1 @@ +Graph neural networks (GNNs) are powerful models that have been successful in various graph representation learning tasks. Meanwhile, gradient boosted decision trees (GBDT) often outperform other machine learning methods when faced with heterogeneous tabular data. But what approach should be used for graphs with tabular node features? Previous GNN models have mostly focused on networks with homogeneous sparse features and, as we show, are suboptimal in the heterogeneous setting. In this work, we propose a novel architecture that trains GBDT and GNN jointly to get the best of both worlds: the GBDT model deals with heterogeneous features, while GNN accounts for the graph structure. Our model benefits from end-to-end optimization by allowing new trees to fit the gradient updates of GNN. With an extensive experimental comparison to the leading GBDT and GNN models, we demonstrate a significant increase in performance on a variety of graphs with tabular features. The code is available: https://github.com/nd7141/bgnn. 
\ No newline at end of file diff --git a/data/2021/iclr/Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis b/data/2021/iclr/Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis new file mode 100644 index 0000000000..9f7f4541bc --- /dev/null +++ b/data/2021/iclr/Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis @@ -0,0 +1 @@ +Generative modeling has recently shown great promise in computer vision, but its success is often limited to separate tasks. In this paper, motivated by multi-task learning of shareable feature representations, we consider a novel problem of learning a shared generative model across various tasks. We instantiate it on the illustrative dual-task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of the object from new viewpoints. To this end, we propose bowtie networks that jointly learn 3D geometric and semantic representations with feedback in the loop. Experimental evaluation on challenging fine-grained recognition datasets demonstrates that our synthesized images are realistic from multiple viewpoints and significantly improve recognition performance as a form of data augmentation, especially in the low-data regime. We further show that our approach is flexible and can be easily extended to incorporate other tasks, such as style guided synthesis. 
\ No newline at end of file diff --git a/data/2021/iclr/Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification b/data/2021/iclr/Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification new file mode 100644 index 0000000000..833fd7ab2f --- /dev/null +++ b/data/2021/iclr/Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification @@ -0,0 +1 @@ +Differentially private SGD (DP-SGD) is one of the most popular methods for solving differentially private empirical risk minimization (ERM). Due to its noisy perturbation on each gradient update, the error rate of DP-SGD scales with the ambient dimension $p$, the number of parameters in the model. Such dependence can be problematic for over-parameterized models where $p \gg n$, the number of training samples. Existing lower bounds on private ERM show that such dependence on $p$ is inevitable in the worst case. In this paper, we circumvent the dependence on the ambient dimension by leveraging a low-dimensional structure of gradient space in deep networks---that is, the stochastic gradients for deep nets usually stay in a low dimensional subspace in the training process. We propose Projected DP-SGD that performs noise reduction by projecting the noisy gradients to a low-dimensional subspace, which is given by the top gradient eigenspace on a small public dataset. We provide a general sample complexity analysis on the public dataset for the gradient subspace identification problem and demonstrate that under certain low-dimensional assumptions the public sample complexity only grows logarithmically in $p$. Finally, we provide a theoretical analysis and empirical evaluations to show that our method can substantially improve the accuracy of DP-SGD. 
\ No newline at end of file diff --git a/data/2021/iclr/Byzantine-Resilient Non-Convex Stochastic Gradient Descent b/data/2021/iclr/Byzantine-Resilient Non-Convex Stochastic Gradient Descent new file mode 100644 index 0000000000..4782fa88dc --- /dev/null +++ b/data/2021/iclr/Byzantine-Resilient Non-Convex Stochastic Gradient Descent @@ -0,0 +1 @@ +We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions. However, an $\alpha$-fraction of the machines are $\textit{Byzantine}$, in that they may behave in arbitrary, adversarial ways. We consider a variant of this procedure in the challenging $\textit{non-convex}$ case. Our main result is a new algorithm SafeguardSGD which can provably escape saddle points and find approximate local minima of the non-convex objective. The algorithm is based on a new concentration filtering technique, and its sample and time complexity bounds match the best known theoretical bounds in the stochastic, distributed setting when no Byzantine machines are present. Our algorithm is practical: it improves upon the performance of prior methods when training deep neural networks, it is relatively lightweight, and is the first method to withstand two recently-proposed Byzantine attacks. \ No newline at end of file diff --git a/data/2021/iclr/C-Learning: Horizon-Aware Cumulative Accessibility Estimation b/data/2021/iclr/C-Learning: Horizon-Aware Cumulative Accessibility Estimation new file mode 100644 index 0000000000..0ed065d40d --- /dev/null +++ b/data/2021/iclr/C-Learning: Horizon-Aware Cumulative Accessibility Estimation @@ -0,0 +1 @@ +Multi-goal reaching is an important problem in reinforcement learning needed to achieve algorithmic generalization. 
Despite recent advances in this field, current algorithms suffer from three major challenges: high sample complexity, learning only a single way of reaching the goals, and difficulties in solving complex motion planning tasks. In order to address these limitations, we introduce the concept of cumulative accessibility functions, which measure the reachability of a goal from a given state within a specified horizon. We show that these functions obey a recurrence relation, which enables learning from offline interactions. We also prove that optimal cumulative accessibility functions are monotonic in the planning horizon. Additionally, our method can trade off speed and reliability in goal-reaching by suggesting multiple paths to a single goal depending on the provided horizon. We evaluate our approach on a set of multi-goal discrete and continuous control tasks. We show that our method outperforms state-of-the-art goal-reaching algorithms in success rate, sample complexity, and path optimality. Our code is available at this https URL, and additional visualizations can be found at this https URL . \ No newline at end of file diff --git a/data/2021/iclr/C-Learning: Learning to Achieve Goals via Recursive Classification b/data/2021/iclr/C-Learning: Learning to Achieve Goals via Recursive Classification new file mode 100644 index 0000000000..91251b1cc8 --- /dev/null +++ b/data/2021/iclr/C-Learning: Learning to Achieve Goals via Recursive Classification @@ -0,0 +1 @@ +We study the problem of predicting and controlling the future state distribution of an autonomous agent. This problem, which can be viewed as a reframing of goal-conditioned reinforcement learning (RL), is centered around learning a conditional probability density function over future states. Instead of directly estimating this density function, we indirectly estimate this density function by training a classifier to predict whether an observation comes from the future. 
Via Bayes' rule, predictions from our classifier can be transformed into predictions over future states. Importantly, an off-policy variant of our algorithm allows us to predict the future state distribution of a new policy, without collecting new experience. This variant allows us to optimize functionals of a policy's future state distribution, such as the density of reaching a particular goal state. While conceptually similar to Q-learning, our work lays a principled foundation for goal-conditioned RL as density estimation, providing justification for goal-conditioned methods used in prior work. This foundation makes hypotheses about Q-learning, including the optimal goal-sampling ratio, which we confirm experimentally. Moreover, our proposed method is competitive with prior goal-conditioned RL methods. \ No newline at end of file diff --git a/data/2021/iclr/CO2: Consistent Contrast for Unsupervised Visual Representation Learning b/data/2021/iclr/CO2: Consistent Contrast for Unsupervised Visual Representation Learning new file mode 100644 index 0000000000..e59b3b1b7d --- /dev/null +++ b/data/2021/iclr/CO2: Consistent Contrast for Unsupervised Visual Representation Learning @@ -0,0 +1 @@ +Contrastive learning has been adopted as a core method for unsupervised visual representation learning. Without human annotation, the common practice is to perform an instance discrimination task: Given a query image crop, this task labels crops from the same image as positives, and crops from other randomly sampled images as negatives. An important limitation of this label assignment strategy is that it cannot reflect the heterogeneous similarity between the query crop and each crop from other images, taking them as equally negative, while some of them may even belong to the same semantic class as the query. 
To address this issue, inspired by consistency regularization in semi-supervised learning on unlabeled data, we propose Consistent Contrast (CO2), which introduces a consistency regularization term into the current contrastive learning framework. Regarding the similarity of the query crop to each crop from other images as "unlabeled", the consistency term takes the corresponding similarity of a positive crop as a pseudo label, and encourages consistency between these two similarities. Empirically, CO2 improves Momentum Contrast (MoCo) by 2.9% top-1 accuracy on ImageNet linear protocol, 3.8% and 1.1% top-5 accuracy on 1% and 10% labeled semi-supervised settings. It also transfers to image classification, object detection, and semantic segmentation on PASCAL VOC. This shows that CO2 learns better visual representations for these downstream tasks. \ No newline at end of file diff --git a/data/2021/iclr/CPR: Classifier-Projection Regularization for Continual Learning b/data/2021/iclr/CPR: Classifier-Projection Regularization for Continual Learning new file mode 100644 index 0000000000..ee73909378 --- /dev/null +++ b/data/2021/iclr/CPR: Classifier-Projection Regularization for Continual Learning @@ -0,0 +1 @@ +We propose a general, yet simple patch that can be applied to existing regularization-based continual learning methods called classifier-projection regularization (CPR). Inspired by both recent results on neural networks with wide local minima and information theory, CPR adds an additional regularization term that maximizes the entropy of a classifier's output probability. We demonstrate that this additional term can be interpreted as a projection of the conditional probability given by a classifier's output to the uniform distribution. By applying the Pythagorean theorem for KL divergence, we then prove that this projection may (in theory) improve the performance of continual learning methods. 
In our extensive experimental results, we apply CPR to several state-of-the-art regularization-based continual learning methods and benchmark performance on popular image recognition datasets. Our results demonstrate that CPR indeed promotes wide local minima and significantly improves both accuracy and plasticity while simultaneously mitigating the catastrophic forgetting of baseline continual learning methods. \ No newline at end of file diff --git a/data/2021/iclr/CPT: Efficient Deep Neural Network Training via Cyclic Precision b/data/2021/iclr/CPT: Efficient Deep Neural Network Training via Cyclic Precision new file mode 100644 index 0000000000..899325c088 --- /dev/null +++ b/data/2021/iclr/CPT: Efficient Deep Neural Network Training via Cyclic Precision @@ -0,0 +1 @@ +Low-precision deep neural network (DNN) training has gained tremendous attention as reducing precision is one of the most effective knobs for boosting DNNs' training time/energy efficiency. In this paper, we attempt to explore low-precision training from a new perspective as inspired by recent findings in understanding DNN training: we conjecture that DNNs' precision might have a similar effect as the learning rate during DNN training, and advocate dynamic precision along the training trajectory for further boosting the time/energy efficiency of DNN training. Specifically, we propose Cyclic Precision Training (CPT) to cyclically vary the precision between two boundary values which can be identified using a simple precision range test within the first few training epochs. Extensive simulations and ablation studies on five datasets and eleven models demonstrate that CPT's effectiveness is consistent across various models/tasks (including classification and language modeling). 
Furthermore, through experiments and visualization we show that CPT helps to (1) converge to wider minima with lower generalization error and (2) reduce training variance, which we believe opens up a new design knob for simultaneously improving the optimization and efficiency of DNN training. Our codes are available at: this https URL \ No newline at end of file diff --git a/data/2021/iclr/CT-Net: Channel Tensorization Network for Video Classification b/data/2021/iclr/CT-Net: Channel Tensorization Network for Video Classification new file mode 100644 index 0000000000..30306fbca8 --- /dev/null +++ b/data/2021/iclr/CT-Net: Channel Tensorization Network for Video Classification @@ -0,0 +1 @@ +3D convolution is powerful for video classification but often computationally expensive; recent studies mainly focus on decomposing it along spatial-temporal and/or channel dimensions. Unfortunately, most approaches fail to achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency. For this reason, we propose a concise and novel Channel Tensorization Network (CT-Net), by treating the channel dimension of the input feature as a multiplication of K sub-dimensions. On one hand, it naturally factorizes convolution in a multi-dimensional way, leading to a light computation burden. On the other hand, it can effectively enhance feature interaction from different channels, and progressively enlarge the 3D receptive field of such interaction to boost classification accuracy. Furthermore, we equip our CT-Module with a Tensor Excitation (TE) mechanism. It can learn to exploit spatial, temporal and channel attention in a high-dimensional manner, to improve the cooperative power of all the feature dimensions in our CT-Module. Finally, we flexibly adapt ResNet as our CT-Net. Extensive experiments are conducted on several challenging video benchmarks, e.g., Kinetics-400, Something-Something V1 and V2. 
Our CT-Net outperforms a number of recent SOTA approaches, in terms of accuracy and/or efficiency. The codes and models will be available on https://github.com/Andy1621/CT-Net. \ No newline at end of file diff --git a/data/2021/iclr/CaPC Learning: Confidential and Private Collaborative Learning b/data/2021/iclr/CaPC Learning: Confidential and Private Collaborative Learning new file mode 100644 index 0000000000..cd4057f864 --- /dev/null +++ b/data/2021/iclr/CaPC Learning: Confidential and Private Collaborative Learning @@ -0,0 +1 @@ +Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other's data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. 
We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties. \ No newline at end of file diff --git a/data/2021/iclr/Calibration of Neural Networks using Splines b/data/2021/iclr/Calibration of Neural Networks using Splines new file mode 100644 index 0000000000..abd7e2d0bc --- /dev/null +++ b/data/2021/iclr/Calibration of Neural Networks using Splines @@ -0,0 +1 @@ +Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision making depends on the predicted probabilities. Measuring calibration error amounts to comparing two empirical distributions. In this work, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test in which the main idea is to compare the respective cumulative probability distributions. From this, by approximating the empirical cumulative distribution using a differentiable function via splines, we obtain a recalibration function, which maps the network outputs to actual (calibrated) class assignment probabilities. The spline-fitting is performed using a held-out calibration set and the obtained recalibration function is evaluated on an unseen test set. We tested our method against existing calibration approaches on various image classification datasets and our spline-based recalibration approach consistently outperforms existing methods on KS error as well as other commonly used calibration measures. 
\ No newline at end of file diff --git a/data/2021/iclr/Calibration tests beyond classification b/data/2021/iclr/Calibration tests beyond classification new file mode 100644 index 0000000000..1986939999 --- /dev/null +++ b/data/2021/iclr/Calibration tests beyond classification @@ -0,0 +1 @@ +Most supervised machine learning tasks are subject to irreducible prediction errors. Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets, rather than point estimates. Such models can be a valuable tool in decision-making under uncertainty, provided that the model output is meaningful and interpretable. Calibrated models guarantee that the probabilistic predictions are neither over- nor under-confident. In the machine learning literature, different measures and statistical tests have been proposed and studied for evaluating the calibration of classification models. For regression problems, however, research has been focused on a weaker condition of calibration based on predicted quantiles for real-valued targets. In this paper, we propose the first framework that unifies calibration evaluation and tests for probabilistic predictive models. It applies to any such model, including classification and regression models of arbitrary dimension. Furthermore, the framework generalizes existing measures and provides a more intuitive reformulation of a recently proposed framework for calibration in multi-class classification. \ No newline at end of file diff --git a/data/2021/iclr/Can a Fruit Fly Learn Word Embeddings? b/data/2021/iclr/Can a Fruit Fly Learn Word Embeddings? new file mode 100644 index 0000000000..cb72aee6cf --- /dev/null +++ b/data/2021/iclr/Can a Fruit Fly Learn Word Embeddings? @@ -0,0 +1 @@ +The mushroom body of the fruit fly brain is one of the best studied systems in neuroscience. 
At its core it consists of a population of Kenyon cells, which receive inputs from multiple sensory modalities. These cells are inhibited by the anterior paired lateral neuron, thus creating a sparse high dimensional representation of the inputs. In this work we study a mathematical formalization of this network motif and apply it to learning the correlational structure between words and their context in a corpus of unstructured text, a common natural language processing (NLP) task. We show that this network can learn semantic representations of words and can generate both static and context-dependent word embeddings. Unlike conventional methods (e.g., BERT, GloVe) that use dense representations for word embedding, our algorithm encodes semantic meaning of words and their context in the form of sparse binary hash codes. The quality of the learned representations is evaluated on word similarity analysis, word-sense disambiguation, and document classification. It is shown that not only can the fruit fly network motif achieve performance comparable to existing methods in NLP, but, additionally, it uses only a fraction of the computational resources (shorter training time and smaller memory footprint). \ No newline at end of file diff --git a/data/2021/iclr/Capturing Label Characteristics in VAEs b/data/2021/iclr/Capturing Label Characteristics in VAEs new file mode 100644 index 0000000000..ebf81bcd77 --- /dev/null +++ b/data/2021/iclr/Capturing Label Characteristics in VAEs @@ -0,0 +1 @@ +We present a principled approach to incorporating labels in VAEs that captures the rich characteristic information associated with those labels. While prior work has typically conflated these by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs-capturing rich label characteristics with the latents. 
For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the person. To this end, we develop the CCVAE, a novel VAE model and concomitant variational objective which captures label characteristics explicitly in the latent space, eschewing direct correspondences between label values and latents. Through judicious structuring of mappings between such characteristic latents and labels, we show that the CCVAE can effectively learn meaningful representations of the characteristics of interest across a variety of supervision schemes. In particular, we show that the CCVAE allows for more effective and more general interventions to be performed, such as smooth traversals within the characteristics for a given label, diverse conditional generation, and transferring characteristics across datapoints. \ No newline at end of file diff --git a/data/2021/iclr/Categorical Normalizing Flows via Continuous Transformations b/data/2021/iclr/Categorical Normalizing Flows via Continuous Transformations new file mode 100644 index 0000000000..0fed0bf9ad --- /dev/null +++ b/data/2021/iclr/Categorical Normalizing Flows via Continuous Transformations @@ -0,0 +1 @@ +Despite their popularity, to date, the application of normalizing flows to categorical data remains limited. The current practice of using dequantization to map discrete data to a continuous space is inapplicable as categorical data has no intrinsic order. Instead, categorical data have complex and latent relations that must be inferred, like the synonymy between words. In this paper, we investigate Categorical Normalizing Flows, that is, normalizing flows for categorical data. By casting the encoding of categorical data in continuous space as a variational inference problem, we jointly optimize the continuous representation and the model likelihood. To maintain unique decoding, we learn a partitioning of the latent space by factorizing the posterior. 
Meanwhile, the complex relations between the categorical variables are learned by the ensuing normalizing flow, thus maintaining a close-to-exact likelihood estimate and making it possible to scale up to a large number of categories. Based on Categorical Normalizing Flows, we propose GraphCNF, a permutation-invariant generative model on graphs, outperforming both one-shot and autoregressive flow-based state-of-the-art on molecule generation. \ No newline at end of file diff --git a/data/2021/iclr/CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning b/data/2021/iclr/CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning new file mode 100644 index 0000000000..e0c608d2b6 --- /dev/null +++ b/data/2021/iclr/CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning @@ -0,0 +1 @@ +Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this problem, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures. The key strength of CausalWorld is that it provides a combinatorial family of such tasks with common causal structure and underlying factors (including, e.g., robot and object masses, colors, sizes). The user (or the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. 
One can thus easily define training and evaluation distributions of a desired difficulty level, targeting a specific form of generalization (e.g., only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to very challenging, all of which require long-horizon planning as well as precise low-level motor control. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark. \ No newline at end of file diff --git a/data/2021/iclr/CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation b/data/2021/iclr/CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation new file mode 100644 index 0000000000..0db0b6a81b --- /dev/null +++ b/data/2021/iclr/CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation @@ -0,0 +1 @@ +This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels is mathematically distinct and raises two fundamental problems: (P1) Since there may be very few (even zero) real images for some regression labels, minimizing existing empirical versions of cGAN losses (a.k.a. empirical cGAN losses) often fails in practice; (P2) Since regression labels are scalar and infinitely many, conventional label input methods are not applicable. 
The proposed CcGAN solves the above problems, respectively, by (S1) reformulating existing empirical cGAN losses to be appropriate for the continuous scenario; and (S2) proposing a naive label input (NLI) method and an improved label input (ILI) method to incorporate regression labels into the generator and the discriminator. The reformulation in (S1) leads to two novel empirical discriminator losses, termed the hard vicinal discriminator loss (HVDL) and the soft vicinal discriminator loss (SVDL) respectively, and a novel empirical generator loss. The error bounds of a discriminator trained with HVDL and SVDL are derived under mild assumptions in this work. Two new benchmark datasets (RC-49 and Cell-200) and a novel evaluation metric (Sliding Frechet Inception Distance) are also proposed for this continuous scenario. Our experiments on the Circular 2-D Gaussians, RC-49, UTKFace, Cell-200, and Steering Angle datasets show that CcGAN can generate diverse, high-quality samples from the image distribution conditional on a given regression label. Moreover, in these experiments, CcGAN substantially outperforms cGAN both visually and quantitatively. \ No newline at end of file diff --git a/data/2021/iclr/Certify or Predict: Boosting Certified Robustness with Compositional Architectures b/data/2021/iclr/Certify or Predict: Boosting Certified Robustness with Compositional Architectures new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions b/data/2021/iclr/Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions new file mode 100644 index 0000000000..7a0beb2ece --- /dev/null +++ b/data/2021/iclr/Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions @@ -0,0 +1 @@ +Machine learning processes, e.g. ''learning in games'', can be viewed as non-linear dynamical systems. 
In general, such systems exhibit a wide spectrum of behaviors, ranging from stability/recurrence to the undesirable phenomenon of chaos (or ''butterfly effect''). Chaos captures sensitivity to round-off errors and can severely affect the predictability and reproducibility of ML systems, but the AI/ML community's understanding of it remains rudimentary, and much awaits exploration. Recently, Cheung and Piliouras employed a volume-expansion argument to show that Lyapunov chaos occurs in the cumulative payoff space, when some popular learning algorithms, including Multiplicative Weights Update (MWU), Follow-the-Regularized-Leader (FTRL) and Optimistic MWU (OMWU), are used in several subspaces of games, e.g. zero-sum, coordination or graphical constant-sum games. It is natural to ask: can these results generalize to much broader families of games? We take a game decomposition approach and answer the question affirmatively. Among other results, we propose a notion of ''matrix domination'' and design a linear program, and use them to characterize bimatrix games where MWU is Lyapunov chaotic almost everywhere. This family of games has positive Lebesgue measure in the bimatrix game space, indicating that chaos is a substantial issue of learning in games. For multi-player games, we present a local equivalence of volume change between general games and graphical games, which is used to perform volume and chaos analyses of MWU and OMWU in potential games. 
\ No newline at end of file diff --git a/data/2021/iclr/Characterizing signal propagation to close the performance gap in unnormalized ResNets b/data/2021/iclr/Characterizing signal propagation to close the performance gap in unnormalized ResNets new file mode 100644 index 0000000000..bcb1538010 --- /dev/null +++ b/data/2021/iclr/Characterizing signal propagation to close the performance gap in unnormalized ResNets @@ -0,0 +1 @@ +Batch Normalization is a key component in almost all state-of-the-art image classifiers, but it also introduces practical challenges: it breaks the independence between training examples within a batch, can incur compute and memory overhead, and often results in unexpected bugs. Building on recent theoretical analyses of deep ResNets at initialization, we propose a simple set of analysis tools to characterize signal propagation on the forward pass, and leverage these tools to design highly performant ResNets without activation normalization layers. Crucial to our success is an adapted version of the recently proposed Weight Standardization. Our analysis tools show how this technique preserves the signal in networks with ReLU or Swish activation functions by ensuring that the per-channel activation means do not grow with depth. Across a range of FLOP budgets, our networks attain performance competitive with the state-of-the-art EfficientNets on ImageNet. \ No newline at end of file diff --git a/data/2021/iclr/ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations b/data/2021/iclr/ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations new file mode 100644 index 0000000000..bebe33ecaa --- /dev/null +++ b/data/2021/iclr/ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations @@ -0,0 +1 @@ +Structured pruning methods are among the effective strategies for extracting small resource-efficient convolutional neural networks from their dense counterparts with minimal loss in accuracy. 
However, most existing methods still suffer from one or more limitations, including 1) the need for training the dense model from scratch with pruning-related parameters embedded in the architecture, 2) requiring model-specific hyperparameter settings, 3) inability to include budget-related constraints in the training process, and 4) instability under scenarios of extreme pruning. In this paper, we present ChipNet, a deterministic pruning strategy that employs a continuous Heaviside function and a novel crispness loss to identify a highly sparse network out of an existing dense network. Our choice of a continuous Heaviside function is inspired by the field of design optimization, where the material distribution task is posed as a continuous optimization problem, but only discrete values (0 or 1) are practically feasible and expected as final outcomes. Our approach's flexible design facilitates its use with different choices of budget constraints while maintaining stability for very low target budgets. Experimental results show that ChipNet outperforms state-of-the-art structured pruning methods by remarkable margins of up to 16.1% in terms of accuracy. Further, we show that the masks obtained with ChipNet are transferable across datasets. In certain cases, we observed that masks transferred from a model trained on a feature-rich teacher dataset provide better performance on the student dataset than those obtained by directly pruning on the student data itself.
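The core trick is that a continuous approximation of the Heaviside step lets gradients flow through what is ultimately a binary keep/drop mask. A minimal sketch (a common sigmoid-based approximation; the sharpness parameter and score values are illustrative, not ChipNet's exact formulation):

```python
import numpy as np

def soft_heaviside(x, beta=10.0):
    """Continuous approximation of the Heaviside step: sigmoid(beta * x).
    As beta grows, outputs sharpen toward the discrete 0/1 mask values."""
    return 1.0 / (1.0 + np.exp(-beta * x))

scores = np.array([-0.5, -0.01, 0.01, 0.5])    # learnable per-channel pruning scores
soft_mask = soft_heaviside(scores, beta=2.0)    # smooth: gradients flow during training
hard_mask = soft_heaviside(scores, beta=200.0)  # near-binary at high beta
assert np.allclose(np.round(hard_mask), [0.0, 0.0, 1.0, 1.0])
```

Annealing beta (or adding a crispness-style loss, as the paper does) pushes the soft mask toward the discrete values that are actually feasible at deployment.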
\ No newline at end of file diff --git a/data/2021/iclr/Clairvoyance: A Pipeline Toolkit for Medical Time Series b/data/2021/iclr/Clairvoyance: A Pipeline Toolkit for Medical Time Series new file mode 100644 index 0000000000..43d6abeeb7 --- /dev/null +++ b/data/2021/iclr/Clairvoyance: A Pipeline Toolkit for Medical Time Series @@ -0,0 +1 @@ +Time-series learning is the bread and butter of data-driven *clinical decision support*, and the recent explosion in ML research has demonstrated great potential in various healthcare settings. At the same time, medical time-series problems in the wild are challenging due to their highly *composite* nature: They entail design choices and interactions among components that preprocess data, impute missing values, select features, issue predictions, estimate uncertainty, and interpret models. Despite exponential growth in electronic patient data, there is a remarkable gap between the potential and realized utilization of ML for clinical research and decision support. In particular, orchestrating a real-world project lifecycle poses challenges in engineering (i.e. hard to build), evaluation (i.e. hard to assess), and efficiency (i.e. hard to optimize). Designed to address these issues simultaneously, Clairvoyance proposes a unified, end-to-end, autoML-friendly pipeline that serves as a (i) software toolkit, (ii) empirical standard, and (iii) interface for optimization. Our ultimate goal lies in facilitating transparent and reproducible experimentation with complex inference workflows, providing integrated pathways for (1) personalized prediction, (2) treatment-effect estimation, and (3) information acquisition. Through illustrative examples on real-world data in outpatient, general-ward, and intensive-care settings, we demonstrate the applicability of the pipeline paradigm on core tasks in the healthcare journey.
To the best of our knowledge, Clairvoyance is the first to demonstrate the viability of a comprehensive and automatable pipeline for clinical time-series ML. \ No newline at end of file diff --git a/data/2021/iclr/Class Normalization for (Continual)? Generalized Zero-Shot Learning b/data/2021/iclr/Class Normalization for (Continual)? Generalized Zero-Shot Learning new file mode 100644 index 0000000000..33ad44472c --- /dev/null +++ b/data/2021/iclr/Class Normalization for (Continual)? Generalized Zero-Shot Learning @@ -0,0 +1 @@ +Normalization techniques have proved to be a crucial ingredient of successful training in a traditional supervised learning regime. However, in the zero-shot learning (ZSL) world, these ideas have received only marginal attention. This work studies normalization in the ZSL scenario from both theoretical and practical perspectives. First, we give a theoretical explanation for two popular tricks used in zero-shot learning, normalize+scale and attributes normalization, and show that they help training by preserving variance during a forward pass. Next, we demonstrate that they are insufficient to normalize a deep ZSL model and propose Class Normalization (CN): a normalization scheme, which alleviates this issue both provably and in practice. Third, we show that ZSL models typically have a more irregular loss surface compared to traditional classifiers and that the proposed method partially remedies this problem. Then, we test our approach on 4 standard ZSL datasets and outperform sophisticated modern SotA methods with a simple MLP that is optimized without any bells and whistles and trains ≈50 times faster. Finally, we generalize ZSL to a broader problem, continual ZSL, and introduce some principled metrics and rigorous baselines for this new setup. The source code is available at https://github.com/universome/class-norm.
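The normalize+scale trick mentioned above is easy to state concretely. A minimal sketch (illustrative shapes and scale value, not the paper's exact setup): L2-normalize features and class attribute vectors, then rescale, so logits are bounded cosine similarities regardless of raw embedding norms:

```python
import numpy as np

def normalize_scale(v, scale=5.0):
    """normalize+scale: project rows onto the unit sphere, then rescale.
    This keeps logit magnitudes controlled irrespective of raw norms."""
    return scale * v / np.linalg.norm(v, axis=-1, keepdims=True)

feats = np.random.default_rng(0).standard_normal((4, 16))    # image features (toy)
attrs = np.random.default_rng(1).standard_normal((10, 16))   # class attribute vectors (toy)
logits = normalize_scale(feats) @ normalize_scale(attrs, scale=1.0).T
assert logits.shape == (4, 10)
assert np.all(np.abs(logits) <= 5.0 + 1e-9)  # cosine in [-1, 1] times the scale
```

Bounding the logits this way is what preserves variance through the forward pass, which is the property the paper's analysis builds on.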
\ No newline at end of file diff --git a/data/2021/iclr/Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation b/data/2021/iclr/Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation new file mode 100644 index 0000000000..b6d1611d96 --- /dev/null +++ b/data/2021/iclr/Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation @@ -0,0 +1 @@ +Clustering is one of the most fundamental tasks in machine learning. Recently, deep clustering has become a major trend in clustering techniques. Representation learning often plays an important role in the effectiveness of deep clustering, and thus can be a principal cause of performance degradation. In this paper, we propose a clustering-friendly representation learning method using instance discrimination and feature decorrelation. Our deep-learning-based representation learning method is motivated by the properties of classical spectral clustering. Instance discrimination learns similarities among data and feature decorrelation removes redundant correlation among features. We utilize an instance discrimination method in which learning individual instance classes leads to learning similarity among instances. Through detailed experiments and examination, we show that the approach can be adapted to learning a latent space for clustering. We design novel softmax-formulated decorrelation constraints for learning. In evaluations of image clustering using CIFAR-10 and ImageNet-10, our method achieves accuracy of 81.5% and 95.4%, respectively. We also show that the softmax-formulated constraints are compatible with various neural networks. 
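The feature-decorrelation idea above can be made concrete with a simple penalty. This is a plain off-diagonal correlation loss for illustration only (the paper uses softmax-formulated constraints, which are not reproduced here):

```python
import numpy as np

def decorrelation_loss(features):
    """Penalize redundant correlation among feature dimensions:
    sum of squared off-diagonal entries of the feature correlation matrix."""
    z = features - features.mean(axis=0)
    z = z / (z.std(axis=0) + 1e-8)          # standardize each dimension
    corr = (z.T @ z) / len(z)               # empirical correlation matrix
    off_diag = corr - np.diag(np.diag(corr))
    return np.sum(off_diag ** 2)

rng = np.random.default_rng(0)
independent = rng.standard_normal((1024, 8))               # decorrelated features
redundant = np.repeat(rng.standard_normal((1024, 1)), 8, axis=1)  # fully redundant
assert decorrelation_loss(independent) < decorrelation_loss(redundant)
```

Minimizing such a term alongside an instance-discrimination objective is what pushes the learned space toward the spectral-clustering-like structure the paper targets.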
\ No newline at end of file diff --git a/data/2021/iclr/Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity b/data/2021/iclr/Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity new file mode 100644 index 0000000000..1f0b6d55f1 --- /dev/null +++ b/data/2021/iclr/Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity @@ -0,0 +1 @@ +While deep neural networks show great performance at fitting the training distribution, improving their generalization to the test distribution and their robustness to input perturbations still remains a challenge. Although a number of mixup-based augmentation strategies have been proposed to partially address them, it remains unclear how best to utilize the supervisory signal within each input data for mixup from the optimization perspective. We propose a new perspective on batch mixup and formulate the optimal construction of a batch of mixup data maximizing the data saliency measure of each individual mixup data and encouraging the supermodular diversity among the constructed mixup data. This leads to a novel discrete optimization problem minimizing the difference between submodular functions. We also propose an efficient modular-approximation-based iterative submodular minimization algorithm for efficient mixup computation per minibatch, suitable for minibatch-based neural network training. Our experiments show the proposed method achieves state-of-the-art generalization, calibration, and weakly supervised localization results compared to other mixup methods. The source code is available at https://github.com/snu-mllab/Co-Mixup.
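Co-Mixup's batch-level submodular optimization is well beyond a snippet, but the underlying saliency-guided mixing of a single pair can be sketched. This is a heavily simplified, hypothetical two-input version for intuition only, not the paper's algorithm:

```python
import numpy as np

def saliency_mixup(x1, y1, x2, y2, s1, s2):
    """Mix two inputs with per-location weights proportional to saliency,
    and mix labels by the average contribution of each input."""
    w = s1 / (s1 + s2 + 1e-8)       # per-pixel weight favoring salient regions of x1
    x_mix = w * x1 + (1 - w) * x2
    lam = float(w.mean())           # label weight = average pixel contribution of x1
    y_mix = lam * y1 + (1 - lam) * y2
    return x_mix, y_mix

rng = np.random.default_rng(0)
x1, x2 = rng.random((8, 8)), rng.random((8, 8))   # toy "images"
s1, s2 = rng.random((8, 8)), rng.random((8, 8))   # toy saliency maps
y1, y2 = np.eye(10)[3], np.eye(10)[7]             # one-hot labels
x_mix, y_mix = saliency_mixup(x1, y1, x2, y2, s1, s2)
assert abs(y_mix.sum() - 1.0) < 1e-9              # mixed label is still a distribution
```

The paper generalizes this pairwise idea to jointly choosing the whole batch of mixups, trading saliency against diversity via submodular minimization.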
\ No newline at end of file diff --git a/data/2021/iclr/CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers b/data/2021/iclr/CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers new file mode 100644 index 0000000000..e927a2017c --- /dev/null +++ b/data/2021/iclr/CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers @@ -0,0 +1 @@ +Dialogue state trackers have made significant progress on benchmark datasets, but their generalization capability to novel and realistic scenarios beyond the held-out conversations is less understood. We propose controllable counterfactuals (CoCo) to bridge this gap and evaluate dialogue state tracking (DST) models on novel scenarios, i.e., would the system successfully tackle the request if the user responded differently but still consistently with the dialogue flow? CoCo leverages turn-level belief states as counterfactual conditionals to produce novel conversation scenarios in two steps: (i) counterfactual goal generation at turn-level by dropping and adding slots followed by replacing slot values, (ii) counterfactual conversation generation that is conditioned on (i) and consistent with the dialogue flow. Evaluating state-of-the-art DST models on MultiWOZ dataset with CoCo-generated counterfactuals results in a significant performance drop of up to 30.8% (from 49.4% to 18.6%) in absolute joint goal accuracy. In comparison, widely used techniques like paraphrasing only affect the accuracy by at most 2%. Human evaluations show that CoCo-generated conversations perfectly reflect the underlying user goal with more than 95% accuracy and are as human-like as the original conversations, further strengthening its reliability and promise to be adopted as part of the robustness evaluation of DST models. 
\ No newline at end of file diff --git a/data/2021/iclr/CoCon: A Self-Supervised Approach for Controlled Text Generation b/data/2021/iclr/CoCon: A Self-Supervised Approach for Controlled Text Generation new file mode 100644 index 0000000000..71b323b407 --- /dev/null +++ b/data/2021/iclr/CoCon: A Self-Supervised Approach for Controlled Text Generation @@ -0,0 +1 @@ +Pretrained Transformer-based language models (LMs) display remarkable natural language generation capabilities. Given their immense potential, controlling the text generation of such LMs has attracted growing attention. While there are studies that seek to control high-level attributes (such as sentiment and topic) of generated text, there is still a lack of more precise control over its content at the word- and phrase-level. Here, we propose Content-Conditioner (CoCon) to control an LM's output text with a target content, at a fine-grained level. In our self-supervised approach, the CoCon block learns to help the LM complete a partially-observed text sequence by conditioning with content inputs that are withheld from the LM. Through experiments, we show that CoCon can naturally incorporate target content into generated texts and control high-level text attributes in a zero-shot manner. \ No newline at end of file diff --git a/data/2021/iclr/CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding b/data/2021/iclr/CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding new file mode 100644 index 0000000000..5afc873b25 --- /dev/null +++ b/data/2021/iclr/CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding @@ -0,0 +1 @@ +Data augmentation has been demonstrated as an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends to be more challenging.
In this paper, we propose a novel data augmentation framework dubbed CoDA, which synthesizes diverse and informative augmented examples by integrating multiple transformations organically. Moreover, a contrastive regularization objective is introduced to capture the global relationship among all the data samples. A momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss. To verify the effectiveness of the proposed framework, we apply CoDA to Transformer-based models on a wide range of natural language understanding tasks. On the GLUE benchmark, CoDA gives rise to an average improvement of 2.2% when applied to the RoBERTa-large model. More importantly, it consistently exhibits stronger results relative to several competitive data augmentation and adversarial training baselines (including in low-resource settings). Extensive experiments show that the proposed contrastive objective can be flexibly combined with various data augmentation approaches to further boost their performance, highlighting the wide applicability of the CoDA framework. \ No newline at end of file diff --git a/data/2021/iclr/Collective Robustness Certificates: Exploiting Interdependence in Graph Neural Networks b/data/2021/iclr/Collective Robustness Certificates: Exploiting Interdependence in Graph Neural Networks new file mode 100644 index 0000000000..058105fd2b --- /dev/null +++ b/data/2021/iclr/Collective Robustness Certificates: Exploiting Interdependence in Graph Neural Networks @@ -0,0 +1 @@ +In tasks like node classification, image segmentation, and named-entity recognition we have a classifier that simultaneously outputs multiple predictions (a vector of labels) based on a single input, i.e. a single graph, image, or document respectively. Existing adversarial robustness certificates consider each prediction independently and are thus overly pessimistic for such tasks.
They implicitly assume that an adversary can use different perturbed inputs to attack different predictions, ignoring the fact that we have a single shared input. We propose the first collective robustness certificate which computes the number of predictions that are simultaneously guaranteed to remain stable under perturbation, i.e. cannot be attacked. We focus on Graph Neural Networks and leverage their locality property (perturbations only affect the predictions in a close neighborhood) to fuse multiple single-node certificates into a drastically stronger collective certificate. For example, on the Citeseer dataset our collective certificate for node classification increases the average number of certifiable feature perturbations from $7$ to $351$. \ No newline at end of file diff --git a/data/2021/iclr/Colorization Transformer b/data/2021/iclr/Colorization Transformer new file mode 100644 index 0000000000..f5cc09fae5 --- /dev/null +++ b/data/2021/iclr/Colorization Transformer @@ -0,0 +1 @@ +We present the Colorization Transformer, a novel approach for diverse high-fidelity image colorization based on self-attention. Given a grayscale image, the colorization proceeds in three steps. We first use a conditional autoregressive transformer to produce a low-resolution coarse coloring of the grayscale image. Our architecture adopts conditional transformer layers to effectively condition on the grayscale input. Two subsequent fully parallel networks upsample the coarsely colored low-resolution image into a finely colored high-resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state-of-the-art on colorizing ImageNet, based on FID results and on a human evaluation in a Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest-rated among three generated colorings over the ground truth.
The code and pre-trained checkpoints for Colorization Transformer are publicly available at https://github.com/google-research/google-research/tree/master/coltran \ No newline at end of file diff --git a/data/2021/iclr/Combining Ensembles and Data Augmentation Can Harm Your Calibration b/data/2021/iclr/Combining Ensembles and Data Augmentation Can Harm Your Calibration new file mode 100644 index 0000000000..b9113d285e --- /dev/null +++ b/data/2021/iclr/Combining Ensembles and Data Augmentation Can Harm Your Calibration @@ -0,0 +1 @@ +Ensemble methods, which average over multiple neural network predictions, are a simple approach to improve a model's calibration and robustness. Similarly, data augmentation techniques, which encode prior information in the form of invariant feature transformations, are effective for improving calibration and robustness. In this paper, we show a surprising pathology: combining ensembles and data augmentation can harm model calibration. This leads to a trade-off in practice, whereby the improved accuracy from combining the two techniques comes at the expense of calibration. On the other hand, selecting only one of the techniques ensures good uncertainty estimates at the expense of accuracy. We investigate this pathology and identify a compounding under-confidence among methods which marginalize over sets of weights and data augmentation techniques which soften labels. Finally, we propose a simple correction, achieving the best of both worlds with significant accuracy and calibration gains over using only ensembles or data augmentation individually. Applying the correction produces new state-of-the-art results in uncertainty calibration across CIFAR-10, CIFAR-100, and ImageNet.
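Calibration claims like the ones above are typically measured with expected calibration error (ECE). A minimal binned-ECE sketch (standard equal-width binning; the toy predictor below is made up for illustration):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| over equal-width confidence
    bins, weighted by the fraction of samples falling in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# A perfectly calibrated toy predictor (75% confident, right ~75% of the time)
# should have ECE near zero; under-confident models score much higher.
conf = np.full(1000, 0.75)
rng = np.random.default_rng(0)
correct = rng.random(1000) < 0.75
assert expected_calibration_error(conf, correct) < 0.1
```

The compounding under-confidence the paper describes would show up here as confidences systematically below the bin accuracies.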
\ No newline at end of file diff --git a/data/2021/iclr/Combining Label Propagation and Simple Models out-performs Graph Neural Networks b/data/2021/iclr/Combining Label Propagation and Simple Models out-performs Graph Neural Networks new file mode 100644 index 0000000000..afea870d86 --- /dev/null +++ b/data/2021/iclr/Combining Label Propagation and Simple Models out-performs Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) are the predominant technique for learning over graphs. However, there is relatively little understanding of why GNNs are successful in practice and whether they are necessary for good performance. Here, we show that for many standard transductive node classification benchmarks, we can exceed or match the performance of state-of-the-art GNNs by combining shallow models that ignore the graph structure with two simple post-processing steps that exploit correlation in the label structure: (i) an "error correlation" that spreads residual errors in training data to correct errors in test data and (ii) a "prediction correlation" that smooths the predictions on the test data. We call this overall procedure Correct and Smooth (C&S), and the post-processing steps are implemented via simple modifications to standard label propagation techniques from early graph-based semi-supervised learning methods. Our approach exceeds or nearly matches the performance of state-of-the-art GNNs on a wide variety of benchmarks, with just a small fraction of the parameters and orders of magnitude faster runtime. For instance, we exceed the best known GNN performance on the OGB-Products dataset with 137 times fewer parameters and greater than 100 times less training time. The performance of our methods highlights how directly incorporating label information into the learning algorithm (as was done in traditional techniques) yields easy and substantial performance gains. We can also incorporate our techniques into big GNN models, providing modest gains. 
Our code for the OGB results is at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Combining Physics and Machine Learning for Network Flow Estimation b/data/2021/iclr/Combining Physics and Machine Learning for Network Flow Estimation new file mode 100644 index 0000000000..945c9b46d6 --- /dev/null +++ b/data/2021/iclr/Combining Physics and Machine Learning for Network Flow Estimation @@ -0,0 +1 @@ +. \ No newline at end of file diff --git a/data/2021/iclr/Communication in Multi-Agent Reinforcement Learning: Intention Sharing b/data/2021/iclr/Communication in Multi-Agent Reinforcement Learning: Intention Sharing new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment b/data/2021/iclr/CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment new file mode 100644 index 0000000000..18080675e0 --- /dev/null +++ b/data/2021/iclr/CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment @@ -0,0 +1 @@ +The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware and latency constraints. To scale these resource-intensive tasks with an increasing number of deployment targets, Once-For-All (OFA) proposed an approach to jointly train several models at once with a constant training cost. However, this cost remains as high as 40-50 GPU days and also suffers from a combinatorial explosion of sub-optimal model configurations. We seek to reduce this search space -- and hence the training budget -- by constraining search to models close to the accuracy-latency Pareto frontier. We incorporate insights about compound relationships between model dimensions to build CompOFA, a design space smaller by several orders of magnitude.
Through experiments on ImageNet, we demonstrate that even with simple heuristics we can achieve a 2x reduction in training time and a 216x speedup in model search/extraction time compared to the state of the art, without loss of Pareto optimality! We also show that this smaller design space is dense enough to support equally accurate models for a similar diversity of hardware and latency targets, while also reducing the complexity of the training and subsequent extraction algorithms. \ No newline at end of file diff --git a/data/2021/iclr/Complex Query Answering with Neural Link Predictors b/data/2021/iclr/Complex Query Answering with Neural Link Predictors new file mode 100644 index 0000000000..7f96b02837 --- /dev/null +++ b/data/2021/iclr/Complex Query Answering with Neural Link Predictors @@ -0,0 +1 @@ +Neural link predictors are useful for identifying missing edges in large-scale Knowledge Graphs. However, it is still not clear how to use these models for answering more complex queries containing logical conjunctions (∧), disjunctions (∨), and existential quantifiers (∃). We propose a framework for efficiently answering complex queries on incomplete Knowledge Graphs. We translate each query into an end-to-end differentiable objective, where the truth value of each atom is computed by a pre-trained neural link predictor. We then analyse two solutions to the optimisation problem, including gradient-based and combinatorial search. In our experiments, the proposed approach produces more accurate results than state-of-the-art methods (black-box models trained on millions of generated queries) without the need for training on a large and diverse set of complex queries. Using orders of magnitude less training data, we obtain relative improvements ranging from 8% up to 40% in Hits@3 across multiple knowledge graphs. We find that it is possible to explain the outcome of our model in terms of the intermediate solutions identified for each of the complex query atoms.
All our source code and datasets are available online (https://github.com/uclnlp/cqd). \ No newline at end of file diff --git a/data/2021/iclr/Computational Separation Between Convolutional and Fully-Connected Networks b/data/2021/iclr/Computational Separation Between Convolutional and Fully-Connected Networks new file mode 100644 index 0000000000..e064caf691 --- /dev/null +++ b/data/2021/iclr/Computational Separation Between Convolutional and Fully-Connected Networks @@ -0,0 +1 @@ +Convolutional neural networks (CNN) exhibit unmatched performance in a multitude of computer vision tasks. However, the advantage of using convolutional networks over fully-connected networks is not understood from a theoretical perspective. In this work, we show how convolutional networks can leverage locality in the data, and thus achieve a computational advantage over fully-connected networks. Specifically, we show a class of problems that can be efficiently solved using convolutional networks trained with gradient-descent, but at the same time is hard to learn using a polynomial-size fully-connected network. \ No newline at end of file diff --git a/data/2021/iclr/Concept Learners for Few-Shot Learning b/data/2021/iclr/Concept Learners for Few-Shot Learning new file mode 100644 index 0000000000..2b52109c22 --- /dev/null +++ b/data/2021/iclr/Concept Learners for Few-Shot Learning @@ -0,0 +1 @@ +Developing algorithms that are able to generalize to a novel task given only a few labeled examples represents a fundamental challenge in closing the gap between machine- and human-level performance. The core of human cognition lies in the structured, reusable concepts that help us to rapidly adapt to new tasks and provide reasoning behind our decisions. However, existing meta-learning methods learn complex representations across prior labeled tasks without imposing any structure on the learned representations. 
Here we propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions. Instead of learning a joint unstructured metric space, COMET learns mappings of high-level concepts into semi-structured metric spaces, and effectively combines the outputs of independent concept learners. We evaluate our model on few-shot tasks from diverse domains, including fine-grained image classification, document categorization and cell type annotation on a novel dataset from a biological domain developed in our work. COMET significantly outperforms strong meta-learning baselines, achieving a 6-15% relative improvement on the most challenging 1-shot learning tasks, while, unlike existing methods, providing interpretations of the model's predictions. \ No newline at end of file diff --git a/data/2021/iclr/Conditional Generative Modeling via Learning the Latent Space b/data/2021/iclr/Conditional Generative Modeling via Learning the Latent Space new file mode 100644 index 0000000000..f9d2c8003d --- /dev/null +++ b/data/2021/iclr/Conditional Generative Modeling via Learning the Latent Space @@ -0,0 +1 @@ +Although deep learning has achieved appealing results on several machine learning tasks, most of the models are deterministic at inference, limiting their application to single-modal settings. We propose a novel general-purpose framework for conditional generation in multimodal spaces that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions. At inference, the latent variables are optimized to find optimal solutions corresponding to multiple output modes. Compared to existing generative solutions, in multimodal spaces, our approach demonstrates faster and more stable convergence, and can learn better representations for downstream tasks.
Importantly, it provides a simple generic model that can beat highly engineered pipelines tailored using domain expertise on a variety of tasks, while generating diverse outputs. Our code will be released. \ No newline at end of file diff --git a/data/2021/iclr/Conditional Negative Sampling for Contrastive Learning of Visual Representations b/data/2021/iclr/Conditional Negative Sampling for Contrastive Learning of Visual Representations new file mode 100644 index 0000000000..c0f7e004ce --- /dev/null +++ b/data/2021/iclr/Conditional Negative Sampling for Contrastive Learning of Visual Representations @@ -0,0 +1 @@ +Recent methods for learning unsupervised visual representations, dubbed contrastive learning, optimize the noise-contrastive estimation (NCE) bound on mutual information between two views of an image. NCE uses randomly sampled negative examples to normalize the objective. In this paper, we show that choosing difficult negatives, or those more similar to the current instance, can yield stronger representations. To do this, we introduce a family of mutual information estimators that sample negatives conditionally -- in a "ring" around each positive. We prove that these estimators lower-bound mutual information, with higher bias but lower variance than NCE. Experimentally, we find that our approach, applied on top of existing models (IR, CMC, and MoCo), improves accuracy by 2-5 percentage points in each case, measured by linear evaluation on four standard image datasets. Moreover, we find continued benefits when transferring features to a variety of new image distributions from the Meta-Dataset collection and to a variety of downstream tasks such as object detection, instance segmentation, and keypoint detection.
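The "ring" of conditional negatives described above can be sketched directly: keep only candidates whose similarity to the anchor falls inside a percentile band (hard, but not the very hardest), then sample from that band. A toy illustration with made-up percentile thresholds, not the paper's exact estimator:

```python
import numpy as np

def ring_negatives(anchor, candidates, k, lower=0.5, upper=0.9):
    """Conditional negative sampling: restrict to candidates whose cosine
    similarity to the anchor lies in a percentile 'ring', then draw k of them."""
    sims = candidates @ anchor / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(anchor) + 1e-8)
    lo, hi = np.quantile(sims, [lower, upper])
    ring = np.where((sims >= lo) & (sims <= hi))[0]   # indices inside the ring
    return np.random.default_rng(0).choice(ring, size=k, replace=False)

rng = np.random.default_rng(1)
anchor = rng.standard_normal(32)        # embedding of the current instance (toy)
cands = rng.standard_normal((500, 32))  # candidate negative embeddings (toy)
idx = ring_negatives(anchor, cands, k=16)
assert len(idx) == 16
```

Excluding the top percentile avoids negatives so similar they are likely semantic duplicates of the anchor, which is the intuition behind the ring rather than a simple top-k selection.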
\ No newline at end of file diff --git a/data/2021/iclr/Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data b/data/2021/iclr/Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data new file mode 100644 index 0000000000..c0c880b675 --- /dev/null +++ b/data/2021/iclr/Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data @@ -0,0 +1 @@ +Multi-Task Learning (MTL) has emerged as a promising approach for transferring learned knowledge across different tasks. However, multi-task learning must deal with challenges such as: overfitting to low resource tasks, catastrophic forgetting, and negative task transfer, or learning interference. Additionally, in Natural Language Processing (NLP), MTL alone has typically not reached the performance level possible through per-task fine-tuning of pretrained models. However, many fine-tuning approaches are both parameter inefficient, e.g. potentially involving one new model per task, and highly susceptible to losing knowledge acquired during pretraining. We propose a novel transformer based architecture consisting of a new conditional attention mechanism as well as a set of task conditioned modules that facilitate weight sharing. Through this construction we achieve more efficient parameter sharing and mitigate forgetting by keeping half of the weights of a pretrained model fixed. We also use a new multi-task data sampling strategy to mitigate the negative effects of data imbalance across tasks. Using this approach we are able to surpass single-task fine-tuning methods while being parameter and data efficient. 
With our base model, we attain 2.2% higher performance compared to a fully fine-tuned BERT-large model on the GLUE benchmark, adding only 5.6% more trained parameters per task (whereas naive fine-tuning potentially adds 100% of the trained parameters per task) and needing only 64.6% of the data. We show that a larger variant of our single multi-task model approach performs competitively across 26 NLP tasks and yields state-of-the-art results on a number of test and development sets. \ No newline at end of file diff --git a/data/2021/iclr/Conformation-Guided Molecular Representation with Hamiltonian Neural Networks b/data/2021/iclr/Conformation-Guided Molecular Representation with Hamiltonian Neural Networks new file mode 100644 index 0000000000..83df078b7d --- /dev/null +++ b/data/2021/iclr/Conformation-Guided Molecular Representation with Hamiltonian Neural Networks @@ -0,0 +1 @@ +Well-designed molecular representations (fingerprints) are vital for combining medicinal chemistry and deep learning. While incorporating the 3D geometry of molecules (i.e., conformations) in their representations seems beneficial, current 3D algorithms are still in their infancy. In this paper, we propose a novel molecular representation algorithm which preserves 3D conformations of molecules with a Molecular Hamiltonian Network (HamNet). In HamNet, implicit positions and momenta of atoms in a molecule interact in the Hamiltonian Engine following the discretized Hamiltonian equations. These implicit coordinates are supervised with real conformations via translation- and rotation-invariant losses, and are further used as inputs to the Fingerprint Generator, a message-passing neural network. Experiments show that the Hamiltonian Engine can well preserve molecular conformations, and that the fingerprints generated by HamNet achieve state-of-the-art performance on MoleculeNet, a standard molecular machine learning benchmark. 
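The discretized Hamiltonian dynamics the HamNet abstract describes can be illustrated with a standard leapfrog integrator for Hamilton's equations dq/dt = ∂H/∂p, dp/dt = -∂H/∂q. The harmonic potential below is a toy stand-in for the learned Hamiltonian, not the paper's model:

```python
import numpy as np

def leapfrog_step(q, p, grad_potential, dt=0.1, mass=1.0):
    """One leapfrog (kick-drift-kick) step for positions q and momenta p."""
    p_half = p - 0.5 * dt * grad_potential(q)          # half kick
    q_new = q + dt * p_half / mass                     # drift
    p_new = p_half - 0.5 * dt * grad_potential(q_new)  # half kick
    return q_new, p_new

# toy potential U(q) = 0.5*||q||^2: harmonic attraction of "atoms" to the origin
grad_U = lambda q: q

q = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # two atoms in 3D
p = np.zeros_like(q)
for _ in range(100):
    q, p = leapfrog_step(q, p, grad_U)
# the energy H = 0.5*||p||^2 + 0.5*||q||^2 stays approximately conserved
```

Leapfrog is symplectic, so the total energy oscillates within a small band instead of drifting, which is the property that makes Hamiltonian dynamics attractive for modeling conformations.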
\ No newline at end of file diff --git a/data/2021/iclr/Conservative Safety Critics for Exploration b/data/2021/iclr/Conservative Safety Critics for Exploration new file mode 100644 index 0000000000..abe26a3656 --- /dev/null +++ b/data/2021/iclr/Conservative Safety Critics for Exploration @@ -0,0 +1 @@ +Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions, while still enabling trial-and-error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are likely to be satisfied with high probability during training, derive provable convergence guarantees for our approach, which is no worse asymptotically than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are available at this https URL \ No newline at end of file diff --git a/data/2021/iclr/Contemplating Real-World Object Classification b/data/2021/iclr/Contemplating Real-World Object Classification new file mode 100644 index 0000000000..1c37baaaf1 --- /dev/null +++ b/data/2021/iclr/Contemplating Real-World Object Classification @@ -0,0 +1 @@ +Deep object recognition models have been very successful on benchmark datasets such as ImageNet. How accurate and robust are they under distribution shifts arising from natural and synthetic variations in datasets? 
Prior research on this problem has primarily focused on ImageNet variations (e.g., ImageNetV2, ImageNet-A). To avoid potential inherited biases in these studies, we take a different approach. Specifically, we reanalyze the ObjectNet dataset recently proposed by Barbu et al. containing objects in daily life situations. They showed a dramatic performance drop of state-of-the-art object recognition models on this dataset. Due to the importance and implications of their results regarding the generalization ability of deep models, we take a second look at their analysis. We find that applying deep models to the isolated objects, rather than the entire scene as is done in the original paper, results in around 20-30% performance improvement. Relative to the numbers reported in Barbu et al., around 10-15% of the performance loss is recovered, without any test time data augmentation. Despite this gain, however, we conclude that deep models still suffer drastically on the ObjectNet dataset. We also investigate the robustness of models against synthetic image perturbations such as geometric transformations (e.g., scale, rotation, translation), natural image distortions (e.g., impulse noise, blur) as well as adversarial attacks (e.g., FGSM and PGD-5). Our results indicate that limiting the object area as much as possible (i.e., from the entire image to the bounding box to the segmentation mask) leads to consistent improvement in accuracy and robustness. 
\ No newline at end of file diff --git a/data/2021/iclr/Contextual Dropout: An Efficient Sample-Dependent Dropout Module b/data/2021/iclr/Contextual Dropout: An Efficient Sample-Dependent Dropout Module new file mode 100644 index 0000000000..8684ca1b94 --- /dev/null +++ b/data/2021/iclr/Contextual Dropout: An Efficient Sample-Dependent Dropout Module @@ -0,0 +1 @@ +Dropout has been demonstrated as a simple and effective module to not only regularize the training process of deep neural networks, but also provide uncertainty estimation for prediction. However, the quality of uncertainty estimation is highly dependent on the dropout probabilities. Most current models use the same dropout distributions across all data samples for simplicity. Sample-dependent dropout, despite its potential gains in the flexibility of modeling uncertainty, is less explored, as it often encounters scalability issues or involves non-trivial model changes. In this paper, we propose contextual dropout with an efficient structural design as a simple and scalable sample-dependent dropout module, which can be applied to a wide range of models at the expense of only slightly increased memory and computational cost. We learn the dropout probabilities with a variational objective, compatible with both Bernoulli dropout and Gaussian dropout. We apply the contextual dropout module to various models with applications to image classification and visual question answering and demonstrate the scalability of the method with large-scale datasets, such as ImageNet and VQA 2.0. Our experimental results show that the proposed method outperforms baseline methods in terms of both accuracy and quality of uncertainty estimation. 
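The core idea of sample-dependent dropout can be conveyed with a toy NumPy sketch: a small "context" mapping turns each input into its own per-unit keep probabilities, which then gate the hidden layer. The sigmoid parameterization and shapes here are illustrative assumptions; the paper learns these probabilities with a variational objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contextual_dropout_layer(x, W_hidden, W_context, training=True):
    h = np.maximum(0.0, x @ W_hidden)        # ordinary hidden layer (ReLU)
    keep_prob = sigmoid(x @ W_context)       # per-sample, per-unit keep probabilities
    if not training:
        return h * keep_prob                 # use the expected mask at test time
    mask = rng.random(h.shape) < keep_prob   # Bernoulli dropout, sample-dependent
    return h * mask / np.clip(keep_prob, 1e-6, None)  # inverted-dropout rescaling

x = rng.normal(size=(4, 8))
W_h = 0.1 * rng.normal(size=(8, 16))
W_c = 0.1 * rng.normal(size=(8, 16))
out = contextual_dropout_layer(x, W_h, W_c)
```

Because `keep_prob` depends on `x`, each sample gets its own dropout distribution, in contrast to standard dropout's single global rate.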
\ No newline at end of file diff --git a/data/2021/iclr/Contextual Transformation Networks for Online Continual Learning b/data/2021/iclr/Contextual Transformation Networks for Online Continual Learning new file mode 100644 index 0000000000..8034aa81ca --- /dev/null +++ b/data/2021/iclr/Contextual Transformation Networks for Online Continual Learning @@ -0,0 +1 @@ +The results show that the behavioural cloning strategy is more suitable for alleviating forgetting in ER, while incurring lower memory overhead or faster running time than other alternatives. \ No newline at end of file diff --git a/data/2021/iclr/Continual learning in recurrent neural networks b/data/2021/iclr/Continual learning in recurrent neural networks new file mode 100644 index 0000000000..6fedf7f463 --- /dev/null +++ b/data/2021/iclr/Continual learning in recurrent neural networks @@ -0,0 +1 @@ +While a diverse collection of continual learning (CL) methods has been proposed to prevent catastrophic forgetting, a thorough investigation of their effectiveness for processing sequential data with recurrent neural networks (RNNs) is lacking. Here, we provide the first comprehensive evaluation of established CL methods on a variety of sequential data benchmarks. Specifically, we shed light on the particularities that arise when applying weight-importance methods, such as elastic weight consolidation, to RNNs. In contrast to feedforward networks, RNNs iteratively reuse a shared set of weights and require working memory to process input samples. We show that the performance of weight-importance methods is not directly affected by the length of the processed sequences, but rather by high working memory requirements, which lead to an increased need for stability at the cost of decreased plasticity for learning subsequent tasks. We additionally provide theoretical arguments supporting this interpretation by studying linear RNNs. 
Our study shows that established CL methods can be successfully ported to the recurrent case, and that a recent regularization approach based on hypernetworks outperforms weight-importance methods, thus emerging as a promising candidate for CL in RNNs. Overall, we provide insights on the differences between CL in feedforward networks and RNNs, while guiding towards effective solutions to tackle CL on sequential data. \ No newline at end of file diff --git a/data/2021/iclr/Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization b/data/2021/iclr/Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization new file mode 100644 index 0000000000..07f7364105 --- /dev/null +++ b/data/2021/iclr/Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization @@ -0,0 +1 @@ +Wasserstein barycenters provide a geometric notion of the weighted average of probability measures based on optimal transport. In this paper, we present a scalable algorithm to compute Wasserstein-2 barycenters given sample access to the input measures, which are not restricted to being discrete. While past approaches rely on entropic or quadratic regularization, we employ input convex neural networks and cycle-consistency regularization to avoid introducing bias. As a result, our approach does not resort to minimax optimization. We provide theoretical analysis on error bounds as well as empirical evidence of the effectiveness of the proposed approach in low-dimensional qualitative scenarios and high-dimensional quantitative experiments. 
\ No newline at end of file diff --git a/data/2021/iclr/Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning b/data/2021/iclr/Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning new file mode 100644 index 0000000000..5f67035f41 --- /dev/null +++ b/data/2021/iclr/Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning @@ -0,0 +1 @@ +Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and Distracting DM Control Suite. \ No newline at end of file diff --git a/data/2021/iclr/Contrastive Divergence Learning is a Time Reversal Adversarial Game b/data/2021/iclr/Contrastive Divergence Learning is a Time Reversal Adversarial Game new file mode 100644 index 0000000000..713734420c --- /dev/null +++ b/data/2021/iclr/Contrastive Divergence Learning is a Time Reversal Adversarial Game @@ -0,0 +1 @@ +Contrastive divergence (CD) learning is a classical method for fitting unnormalized statistical models to data samples. 
Despite its widespread use, the convergence properties of this algorithm are still not well understood. The main source of difficulty is an unjustified approximation which has been used to derive the gradient of the loss. In this paper, we present an alternative derivation of CD that does not require any approximation and sheds new light on the objective that is actually being optimized by the algorithm. Specifically, we show that CD is an adversarial learning procedure, where a discriminator attempts to classify whether a Markov chain generated from the model has been time-reversed. Thus, although predating generative adversarial networks (GANs) by more than a decade, CD is, in fact, closely related to these techniques. Our derivation is consistent with previous observations, which have concluded that CD's update steps cannot be expressed as the gradients of any fixed objective function. In addition, as a byproduct, our derivation reveals a simple correction that can be used as an alternative to Metropolis-Hastings rejection, which is required when the underlying Markov chain is inexact (e.g., when using Langevin dynamics with a large step size). \ No newline at end of file diff --git a/data/2021/iclr/Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions b/data/2021/iclr/Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions new file mode 100644 index 0000000000..cacad5a583 --- /dev/null +++ b/data/2021/iclr/Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions @@ -0,0 +1 @@ +We investigate a deep reinforcement learning (RL) architecture that supports explaining why a learned agent prefers one action over another. The key idea is to learn action-values that are directly represented via human-understandable properties of expected futures. This is realized via the embedded self-prediction (ESP) model, which learns said properties in terms of human-provided features. 
Action preferences can then be explained by contrasting the future properties predicted for each action. To address cases where there are a large number of features, we develop a novel method for computing minimal sufficient explanations from an ESP model. Our case studies in three domains, including a complex strategy game, show that ESP models can be effectively learned and support insightful explanations. \ No newline at end of file diff --git a/data/2021/iclr/Contrastive Learning with Adversarial Perturbations for Conditional Text Generation b/data/2021/iclr/Contrastive Learning with Adversarial Perturbations for Conditional Text Generation new file mode 100644 index 0000000000..08778e8263 --- /dev/null +++ b/data/2021/iclr/Contrastive Learning with Adversarial Perturbations for Conditional Text Generation @@ -0,0 +1 @@ +Recently, sequence-to-sequence (seq2seq) models with the Transformer architecture have achieved remarkable performance on various conditional text generation tasks, such as machine translation. However, most of them are trained with teacher forcing with the ground truth label given at each time step, without being exposed to incorrectly generated tokens during training, which hurts their generalization to unseen inputs; this is known as the ``exposure bias'' problem. In this work, we propose to mitigate the exposure bias problem in conditional text generation by contrasting positive pairs with negative pairs, such that the model is exposed to various valid or incorrect perturbations of the inputs, for improved generalization. However, training the model with a naive contrastive learning framework using random non-target sequences as negative examples is suboptimal, since they are easily distinguishable from the correct output, especially so with models pretrained with large text corpora. Also, generating positive examples requires domain-specific augmentation heuristics which may not generalize over diverse domains. 
To tackle this problem, we propose a principled method to generate positive and negative samples for contrastive learning of seq2seq models. Specifically, we generate negative examples by adding small perturbations to the input sequence to minimize its conditional likelihood, and positive examples by adding large perturbations while enforcing it to have a high conditional likelihood. Such ``hard'' positive and negative pairs generated using our method guide the model to better distinguish correct outputs from incorrect ones. We empirically show that our proposed method significantly improves the generalization of seq2seq models on three text generation tasks: machine translation, text summarization, and question generation. \ No newline at end of file diff --git a/data/2021/iclr/Contrastive Learning with Hard Negative Samples b/data/2021/iclr/Contrastive Learning with Hard Negative Samples new file mode 100644 index 0000000000..6b8601a0b9 --- /dev/null +++ b/data/2021/iclr/Contrastive Learning with Hard Negative Samples @@ -0,0 +1 @@ +How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use true similarity information. In response, we develop a new family of unsupervised sampling methods for selecting hard negative samples where the user can control the hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead. 
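One way to realize the "user can control the hardness" idea from the hard-negatives abstract is to importance-weight negatives in proportion to exp(beta * similarity), so anchor-like negatives contribute more to the contrastive loss. This NumPy sketch conveys that mechanism under assumed parameter names; the paper's actual estimator includes debiasing details omitted here.

```python
import numpy as np

def hard_negative_nce(anchor, positive, negatives, beta=1.0, temperature=0.1):
    """NCE-style loss with hardness-weighted negatives; beta=0 recovers plain NCE."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos = np.exp((a @ p) / temperature)
    neg = np.exp((n @ a) / temperature)
    w = np.exp(beta * (n @ a))        # hardness weights grow with anchor similarity
    w = w / w.sum() * len(neg)        # normalize to mean one
    return -np.log(pos / (pos + (w * neg).sum()))

rng = np.random.default_rng(0)
a, p = rng.normal(size=16), rng.normal(size=16)
negs = rng.normal(size=(128, 16))
plain = hard_negative_nce(a, p, negs, beta=0.0)
hard = hard_negative_nce(a, p, negs, beta=2.0)  # harder negatives -> larger loss
```

Increasing `beta` tilts the effective negative distribution toward hard examples while the sampling itself stays unsupervised.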
\ No newline at end of file diff --git a/data/2021/iclr/Contrastive Syn-to-Real Generalization b/data/2021/iclr/Contrastive Syn-to-Real Generalization new file mode 100644 index 0000000000..9c4f3ce223 --- /dev/null +++ b/data/2021/iclr/Contrastive Syn-to-Real Generalization @@ -0,0 +1 @@ +Training on synthetic data can be beneficial for label- or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that the diversity of the learned feature embeddings plays an important role in the generalization performance. To this end, we propose contrastive synthetic-to-real generalization (CSG), a novel framework that leverages the pre-trained ImageNet knowledge to prevent overfitting to the synthetic domain, while promoting the diversity of feature embeddings as an inductive bias to improve generalization. In addition, we enhance the proposed CSG framework with attentional pooling (A-pool) to let the model focus on semantically important regions and further improve its generalization. We demonstrate the effectiveness of CSG on various synthetic training tasks, exhibiting state-of-the-art performance on zero-shot domain generalization. \ No newline at end of file diff --git a/data/2021/iclr/Control-Aware Representations for Model-based Reinforcement Learning b/data/2021/iclr/Control-Aware Representations for Model-based Reinforcement Learning new file mode 100644 index 0000000000..3674287b86 --- /dev/null +++ b/data/2021/iclr/Control-Aware Representations for Model-based Reinforcement Learning @@ -0,0 +1 @@ +A major challenge in modern reinforcement learning (RL) is efficient control of dynamical systems from high-dimensional sensory observations. 
Learning controllable embedding (LCE) is a promising approach that addresses this challenge by embedding the observations into a lower-dimensional latent space, estimating the latent dynamics, and utilizing it to perform control in the latent space. Two important questions in this area are how to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control. In this paper, we take a few steps towards addressing these questions. We first formulate an LCE model to learn representations that are suitable for use by a policy-iteration-style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function for CARL that has a close connection to the prediction, consistency, and curvature (PCC) principle for representation learning. We derive three implementations of CARL. In the offline implementation, we replace the locally-linear control algorithm (e.g., iLQR) used by the existing LCE methods with an RL algorithm, namely model-based soft actor-critic, and show that it results in significant improvement. In online CARL, we interleave representation learning and control, and demonstrate a further gain in performance. Finally, we propose value-guided CARL, a variation in which we optimize a weighted version of the CARL loss function, where the weights depend on the TD-error of the current policy. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines. 
\ No newline at end of file diff --git a/data/2021/iclr/Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization b/data/2021/iclr/Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization new file mode 100644 index 0000000000..8c99b336e9 --- /dev/null +++ b/data/2021/iclr/Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization @@ -0,0 +1 @@ +Flow-based models are powerful tools for designing probabilistic models with tractable density. This paper introduces Convex Potential Flows (CP-Flow), a natural and efficient parameterization of invertible models inspired by the optimal transport (OT) theory. CP-Flows are the gradient map of a strongly convex neural potential function. The convexity implies invertibility and allows us to resort to convex optimization to solve the convex conjugate for efficient inversion. To enable maximum likelihood training, we derive a new gradient estimator of the log-determinant of the Jacobian, which involves solving an inverse-Hessian vector product using the conjugate gradient method. The gradient estimator has constant-memory cost, and can be made effectively unbiased by reducing the error tolerance level of the convex optimization routine. Theoretically, we prove that CP-Flows are universal density approximators and are optimal in the OT sense. Our empirical results show that CP-Flow performs competitively on standard benchmarks of density estimation and variational inference. 
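The core CP-Flow mechanism, an invertible map given by the gradient of a strictly convex potential and inverted by solving a convex problem, can be demonstrated with a hand-written potential standing in for the paper's input-convex neural network. The potential and optimizer settings below are illustrative assumptions.

```python
import numpy as np

def grad_potential(x, a=0.5):
    # gradient of f(x) = 0.5*||x||^2 + a*sum(softplus(x)), a strictly convex potential,
    # so T(x) = grad f(x) is an invertible map
    return x + a / (1.0 + np.exp(-x))

def invert(y, steps=500, lr=0.2):
    # invert the gradient map by convex optimization:
    # minimize f(x) - <y, x>, whose unique minimizer satisfies grad f(x) = y
    x = np.zeros_like(y)
    for _ in range(steps):
        x -= lr * (grad_potential(x) - y)
    return x

x = np.array([0.3, -1.2, 2.0])
y = grad_potential(x)   # forward map T(x) = grad f(x)
x_rec = invert(y)       # inverse recovered without any explicit inverse formula
```

Strong convexity makes the inversion problem well-conditioned, which is why CP-Flow can rely on standard convex solvers (conjugate gradient in the paper) rather than architectural invertibility constraints.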
\ No newline at end of file diff --git a/data/2021/iclr/Convex Regularization behind Neural Reconstruction b/data/2021/iclr/Convex Regularization behind Neural Reconstruction new file mode 100644 index 0000000000..727272a14d --- /dev/null +++ b/data/2021/iclr/Convex Regularization behind Neural Reconstruction @@ -0,0 +1 @@ +Neural networks have shown tremendous potential for reconstructing high-resolution images in inverse problems. The non-convex and opaque nature of neural networks, however, hinders their utility in sensitive applications such as medical imaging. To cope with this challenge, this paper advocates a convex duality framework that makes a two-layer fully-convolutional ReLU denoising network amenable to convex optimization. The convex dual network not only offers the optimum training with convex solvers, but also facilitates interpreting training and prediction. In particular, it implies that training neural networks with weight-decay regularization induces path sparsity, while prediction amounts to piecewise linear filtering. A range of experiments with MNIST and fastMRI datasets confirm the efficacy of the dual network optimization problem. \ No newline at end of file diff --git a/data/2021/iclr/Coping with Label Shift via Distributionally Robust Optimisation b/data/2021/iclr/Coping with Label Shift via Distributionally Robust Optimisation new file mode 100644 index 0000000000..6781008af9 --- /dev/null +++ b/data/2021/iclr/Coping with Label Shift via Distributionally Robust Optimisation @@ -0,0 +1 @@ +The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an \emph{unlabelled} test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. 
While approaches using this idea have proven effective, their scope is limited, as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in \emph{multiple} test environments. Can one instead learn a \emph{single} classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present. \ No newline at end of file diff --git a/data/2021/iclr/CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks b/data/2021/iclr/CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks new file mode 100644 index 0000000000..12b3d629e9 --- /dev/null +++ b/data/2021/iclr/CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks @@ -0,0 +1 @@ +Graph-structured data are ubiquitous. However, graphs encode diverse types of information and thus play different roles in data representation. In this paper, we distinguish the \textit{representational} and the \textit{correlational} roles played by the graphs in node-level prediction tasks, and we investigate how Graph Neural Network (GNN) models can effectively leverage both types of information. Conceptually, the representational information provides guidance for the model to construct better node features; while the correlational information indicates the correlation between node outcomes conditional on node features. 
Through a simulation study, we find that many popular GNN models are incapable of effectively utilizing the correlational information. By leveraging the idea of the copula, a principled way to describe the dependence among multivariate random variables, we offer a general solution. The proposed Copula Graph Neural Network (CopulaGNN) can take a wide range of GNN models as base models and utilize both representational and correlational information stored in the graphs. Experimental results on two types of regression tasks verify the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2021/iclr/Correcting experience replay for multi-agent communication b/data/2021/iclr/Correcting experience replay for multi-agent communication new file mode 100644 index 0000000000..38c2eb717f --- /dev/null +++ b/data/2021/iclr/Correcting experience replay for multi-agent communication @@ -0,0 +1 @@ +We consider the problem of learning to communicate using multi-agent reinforcement learning (MARL). A common approach is to learn off-policy, using data sampled from a replay buffer. However, messages received in the past may not accurately reflect the current communication policy of each agent, and this complicates learning. We therefore introduce a 'communication correction' which accounts for the non-stationarity of observed communication induced by multi-agent learning. It works by relabelling the received message to make it likely under the communicator's current policy, and thus be a better reflection of the receiver's current environment. To account for cases in which agents are both senders and receivers, we introduce an ordered relabelling scheme. Our correction is computationally efficient and can be integrated with a range of off-policy algorithms. It substantially improves the ability of communicating MARL systems to learn across a variety of cooperative and competitive tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Counterfactual Generative Networks b/data/2021/iclr/Counterfactual Generative Networks new file mode 100644 index 0000000000..b6cd7022e9 --- /dev/null +++ b/data/2021/iclr/Counterfactual Generative Networks @@ -0,0 +1 @@ +Neural networks are prone to learning shortcuts -- they often model simple correlations, ignoring more complex ones that potentially generalize better. Prior works on image classification show that instead of learning a connection to object shape, deep classifiers tend to exploit spurious correlations with low-level texture or the background for solving the classification task. In this work, we take a step towards more robust and interpretable classifiers that explicitly expose the task's causal structure. Building on current advances in deep generative modeling, we propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision. By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background; hence, they allow for generating counterfactual images. We demonstrate the ability of our model to generate such images on MNIST and ImageNet. Further, we show that the counterfactual images can improve out-of-distribution robustness with a marginal drop in performance on the original classification task, despite being synthetic. Lastly, our generative model can be trained efficiently on a single GPU, exploiting common pre-trained models as inductive biases. 
\ No newline at end of file diff --git a/data/2021/iclr/Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies b/data/2021/iclr/Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies new file mode 100644 index 0000000000..dc5ad8d507 --- /dev/null +++ b/data/2021/iclr/Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies @@ -0,0 +1 @@ +Circuits of biological neurons, such as in the functional parts of the brain, can be modeled as networks of coupled oscillators. Inspired by the ability of these systems to express a rich set of outputs while keeping (gradients of) state variables bounded, we propose a novel architecture for recurrent neural networks. Our proposed RNN is based on a time-discretization of a system of second-order ordinary differential equations, modeling networks of controlled nonlinear oscillators. We prove precise bounds on the gradients of the hidden states, leading to the mitigation of the exploding and vanishing gradient problem for this RNN. Experiments show that the proposed RNN is comparable in performance to the state of the art on a variety of benchmarks, demonstrating the potential of this architecture to provide stable and accurate RNNs for processing complex sequential data. \ No newline at end of file diff --git a/data/2021/iclr/Creative Sketch Generation b/data/2021/iclr/Creative Sketch Generation new file mode 100644 index 0000000000..23402410f8 --- /dev/null +++ b/data/2021/iclr/Creative Sketch Generation @@ -0,0 +1 @@ +Sketching or doodling is a popular creative activity that people engage in. However, most existing work in automatic sketch understanding or generation has focused on sketches that are quite mundane. 
In this work, we introduce two datasets of creative sketches -- Creative Birds and Creative Creatures -- containing 10k sketches each along with part annotations. We propose DoodlerGAN -- a part-based Generative Adversarial Network (GAN) -- to generate unseen compositions of novel part appearances. Quantitative evaluations as well as human studies demonstrate that sketches generated by our approach are more creative and of higher quality than existing approaches. In fact, in Creative Birds, subjects prefer sketches generated by DoodlerGAN over those drawn by humans! Our code can be found at this https URL and a demo can be found at this http URL. \ No newline at end of file diff --git a/data/2021/iclr/Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization b/data/2021/iclr/Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization new file mode 100644 index 0000000000..2853167736 --- /dev/null +++ b/data/2021/iclr/Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization @@ -0,0 +1 @@ +Temporally localizing actions in videos is one of the key components for video understanding. Learning from weakly-labeled data is seen as a potential solution towards avoiding expensive frame-level annotations. Different from other works which only depend on the visual modality, we propose to learn a richer audio-visual representation for weakly-supervised action localization. First, we propose a multi-stage cross-attention mechanism to collaboratively fuse audio and visual features, which preserves the intra-modal characteristics. Second, to model both foreground and background frames, we construct an open-max classifier which treats the background class as an open-set. Third, for precise action localization, we design consistency losses to enforce temporal continuity for the action-class prediction, and also help with foreground-prediction reliability.
Extensive experiments on two publicly available video datasets (AVE and ActivityNet1.2) show that the proposed method effectively fuses audio and visual modalities, and achieves state-of-the-art results for weakly-supervised action localization. \ No newline at end of file diff --git a/data/2021/iclr/Cut out the annotator, keep the cutout: better segmentation with weak supervision b/data/2021/iclr/Cut out the annotator, keep the cutout: better segmentation with weak supervision new file mode 100644 index 0000000000..945c9b46d6 --- /dev/null +++ b/data/2021/iclr/Cut out the annotator, keep the cutout: better segmentation with weak supervision @@ -0,0 +1 @@ +. \ No newline at end of file diff --git a/data/2021/iclr/DARTS-: Robustly Stepping out of Performance Collapse Without Indicators b/data/2021/iclr/DARTS-: Robustly Stepping out of Performance Collapse Without Indicators new file mode 100644 index 0000000000..fd3de30396 --- /dev/null +++ b/data/2021/iclr/DARTS-: Robustly Stepping out of Performance Collapse Without Indicators @@ -0,0 +1 @@ +Despite the fast development of differentiable architecture search (DARTS), it suffers from a long-standing instability issue in search performance, which severely limits its applicability. Existing robustifying methods draw clues from the outcome instead of identifying the underlying cause. Various indicators such as Hessian eigenvalues are proposed as a signal of performance collapse, and the search is stopped once an indicator reaches a preset threshold. However, these methods tend to easily reject good architectures if thresholds are inappropriately set, especially since the search is intrinsically noisy. In this paper, we undertake a more subtle and direct approach to resolve the collapse. We first demonstrate that skip connections with a learnable architectural coefficient can easily recover from a disadvantageous state and become dominant.
We conjecture that skip connections profit too much from this privilege, hence causing the collapse of the derived model. Therefore, we propose to factor out this benefit with an auxiliary skip connection, ensuring a fairer competition for all operations. Extensive experiments on various datasets verify that our approach can substantially improve the robustness of DARTS. \ No newline at end of file diff --git a/data/2021/iclr/DC3: A learning method for optimization with hard constraints b/data/2021/iclr/DC3: A learning method for optimization with hard constraints new file mode 100644 index 0000000000..722d650fda --- /dev/null +++ b/data/2021/iclr/DC3: A learning method for optimization with hard constraints @@ -0,0 +1 @@ +Large optimization problems with hard constraints arise in many settings, yet classical solvers are often prohibitively slow, motivating the use of deep networks as cheap "approximate solvers." Unfortunately, naive deep learning approaches typically cannot enforce the hard constraints of such problems, leading to infeasible solutions. In this work, we present Deep Constraint Completion and Correction (DC3), an algorithm to address this challenge. Specifically, this method enforces feasibility via a differentiable procedure, which implicitly completes partial solutions to satisfy equality constraints and unrolls gradient-based corrections to satisfy inequality constraints. We demonstrate the effectiveness of DC3 in both synthetic optimization tasks and the real-world setting of AC optimal power flow, where hard constraints encode the physics of the electrical grid. In both cases, DC3 achieves near-optimal objective values while preserving feasibility.
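The inequality-correction step can be illustrated with a minimal sketch (ours, not the authors' code; it assumes linear constraints A x <= b and plain NumPy, whereas DC3 differentiates through the unrolled procedure and also completes equality constraints):

```python
import numpy as np

def correct_inequalities(x, A, b, lr=0.1, steps=200):
    """Drive x toward feasibility of A x <= b by descending the
    squared constraint violation 0.5 * ||max(A x - b, 0)||^2."""
    for _ in range(steps):
        viol = np.maximum(A @ x - b, 0.0)   # componentwise violation
        if not viol.any():
            break                           # already feasible
        x = x - lr * (A.T @ viol)           # gradient step on the violation
    return x

# One constraint x1 + x2 <= 1, starting from an infeasible point.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x = correct_inequalities(np.array([2.0, 2.0]), A, b)
```

Starting at (2, 2), the iterates move along the constraint normal and converge to the nearest feasible point (0.5, 0.5); in DC3 the same unrolled steps are applied to a network's output, so gradients flow through them at training time.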
\ No newline at end of file diff --git a/data/2021/iclr/DDPNOpt: Differential Dynamic Programming Neural Optimizer b/data/2021/iclr/DDPNOpt: Differential Dynamic Programming Neural Optimizer new file mode 100644 index 0000000000..f0e9aea20e --- /dev/null +++ b/data/2021/iclr/DDPNOpt: Differential Dynamic Programming Neural Optimizer @@ -0,0 +1 @@ +The interpretation of Deep Neural Network (DNN) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by reformulating the training procedure from the trajectory optimization perspective. We first show that most widely-used algorithms for training DNNs can be linked to Differential Dynamic Programming (DDP), a celebrated second-order trajectory optimization algorithm rooted in Approximate Dynamic Programming. In this vein, we propose a new variant of DDP that can accept batch optimization for training feedforward networks, while integrating naturally with the recent progress in curvature approximation. The resulting algorithm features layer-wise feedback policies which improve convergence rate and reduce sensitivity to hyper-parameters relative to existing methods. We show that the algorithm is competitive against state-of-the-art first- and second-order methods. Our work opens up new avenues for principled algorithmic design built upon optimal control theory.
\ No newline at end of file diff --git a/data/2021/iclr/DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation b/data/2021/iclr/DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation new file mode 100644 index 0000000000..fe1defa2c3 --- /dev/null +++ b/data/2021/iclr/DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation @@ -0,0 +1 @@ +Deep ensembles perform better than a single network thanks to the diversity among their members. Recent approaches regularize predictions to increase diversity; however, they also drastically decrease individual members' performances. In this paper, we argue that learning strategies for deep ensembles need to tackle the trade-off between ensemble diversity and individual accuracies. Motivated by arguments from information theory and leveraging recent advances in neural estimation of conditional mutual information, we introduce a novel training criterion called DICE: it increases diversity by reducing spurious correlations among features. The main idea is that features extracted from pairs of members should only share information useful for target class prediction without being conditionally redundant. Therefore, besides the classification loss with information bottleneck, we adversarially prevent features from being conditionally predictable from each other. We manage to reduce simultaneous errors while protecting class information. We obtain state-of-the-art accuracy results on CIFAR-10/100: for example, an ensemble of 5 networks trained with DICE matches an ensemble of 7 networks trained independently. We further analyze the consequences on calibration, uncertainty estimation, out-of-distribution detection and online co-distillation. 
\ No newline at end of file diff --git a/data/2021/iclr/DINO: A Conditional Energy-Based GAN for Domain Translation b/data/2021/iclr/DINO: A Conditional Energy-Based GAN for Domain Translation new file mode 100644 index 0000000000..e2b4dc7f0d --- /dev/null +++ b/data/2021/iclr/DINO: A Conditional Energy-Based GAN for Domain Translation @@ -0,0 +1 @@ +Domain translation is the process of transforming data from one domain to another while preserving the common semantics. Some of the most popular domain translation systems are based on conditional generative adversarial networks, which use source domain data to drive the generator and as an input to the discriminator. However, this approach does not enforce the preservation of shared semantics since the conditional input can often be ignored by the discriminator. We propose an alternative method for conditioning and present a new framework, where two networks are simultaneously trained, in a supervised manner, to perform domain translation in opposite directions. Our method is not only better at capturing the shared information between two domains but is more generic and can be applied to a broader range of problems. The proposed framework performs well even in challenging cross-modal translations, such as video-driven speech reconstruction, for which other systems struggle to maintain correspondence. \ No newline at end of file diff --git a/data/2021/iclr/DOP: Off-Policy Multi-Agent Decomposed Policy Gradients b/data/2021/iclr/DOP: Off-Policy Multi-Agent Decomposed Policy Gradients new file mode 100644 index 0000000000..31b7049e3f --- /dev/null +++ b/data/2021/iclr/DOP: Off-Policy Multi-Agent Decomposed Policy Gradients @@ -0,0 +1 @@ +Multi-agent policy gradient (MAPG) methods have recently witnessed vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches.
In this paper, we investigate causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces. We formally show that DOP critics have sufficient representational capability to guarantee convergence. In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. Demonstrative videos are available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning b/data/2021/iclr/Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning new file mode 100644 index 0000000000..864767c7ca --- /dev/null +++ b/data/2021/iclr/Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning @@ -0,0 +1 @@ +Dancing to music has been one of humans' innate abilities since ancient times. In machine learning research, however, synthesizing dance movements from music is a challenging problem. Recently, researchers have synthesized human motion sequences through autoregressive models like recurrent neural networks (RNNs). Such an approach often generates short sequences due to an accumulation of prediction errors that are fed back into the neural network. This problem becomes even more severe in long motion sequence generation. Moreover, the consistency between dance and music in terms of style, rhythm and beat is yet to be taken into account during modeling.
In this paper, we formalize music-conditioned dance generation as a sequence-to-sequence learning problem and devise a novel seq2seq architecture to efficiently process long sequences of music features and capture the fine-grained correspondence between music and dance. Furthermore, we propose a novel curriculum learning strategy to alleviate error accumulation of autoregressive models in long motion sequence generation, which gently changes the training process from a fully guided teacher-forcing scheme using the previous ground-truth movements, towards a less guided autoregressive scheme mostly using the generated movements instead. Extensive experiments show that our approach significantly outperforms the existing state of the art on both automatic metrics and human evaluation. We also make a demo video to demonstrate the superior performance of our proposed approach at https://www.youtube.com/watch?v=lmE20MEheZ8. \ No newline at end of file diff --git a/data/2021/iclr/Data-Efficient Reinforcement Learning with Self-Predictive Representations b/data/2021/iclr/Data-Efficient Reinforcement Learning with Self-Predictive Representations new file mode 100644 index 0000000000..ffc4c39fdd --- /dev/null +++ b/data/2021/iclr/Data-Efficient Reinforcement Learning with Self-Predictive Representations @@ -0,0 +1 @@ +While deep reinforcement learning excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, learning from limited interaction remains a key challenge. We posit that an agent can learn more efficiently if we augment reward maximization with self-supervised objectives based on structure in its visual input and sequential interaction with the environment. Our method, Self-Predictive Representations (SPR), trains an agent to predict its own latent state representations multiple steps into the future.
We compute target representations for future states using an encoder which is an exponential moving average of the agent's parameters, and we make predictions using a learned transition model. On its own, this future prediction objective outperforms prior methods for sample-efficient deep RL from pixels. We further improve performance by adding data augmentation to the future prediction loss, which forces the agent's representations to be consistent across multiple views of an observation. Our full self-supervised objective, which combines future prediction and data augmentation, achieves a median human-normalized score of 0.415 on Atari in a setting limited to 100k steps of environment interaction, which represents a 55% relative improvement over the previous state-of-the-art. Notably, even in this limited data regime, SPR exceeds expert human scores on 7 out of 26 games. The code associated with this work is available at https://github.com/mila-iqia/spr \ No newline at end of file diff --git a/data/2021/iclr/Dataset Condensation with Gradient Matching b/data/2021/iclr/Dataset Condensation with Gradient Matching new file mode 100644 index 0000000000..020419beec --- /dev/null +++ b/data/2021/iclr/Dataset Condensation with Gradient Matching @@ -0,0 +1 @@ +Efficient training of deep neural networks is an increasingly important problem in the era of sophisticated architectures and large-scale datasets. This paper proposes a training set synthesis technique, called Dataset Condensation, that learns to produce a small set of informative samples for training deep neural networks from scratch in a small fraction of the required computational cost on the original data while achieving comparable results. We rigorously evaluate its performance on several computer vision benchmarks and show that it significantly outperforms the state-of-the-art methods. Finally, we show promising applications of our method in continual learning and domain adaptation.
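A hedged toy rendition of the gradient-matching idea behind Dataset Condensation (ours, not the paper's code: a linear least-squares "network" with fixed weights and fixed synthetic labels, so the matching objective has a closed-form gradient): adjust the synthetic points until the training gradient they induce matches the one induced by the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: 100 points from a linear model y = X @ w_true.
w_true = np.array([1.0, -2.0])
X_real = rng.normal(size=(100, 2))
y_real = X_real @ w_true

# Condensed set: only 2 learnable points (labels kept fixed).
X_syn = rng.normal(size=(2, 2))
y_syn = np.array([1.0, -1.0])

w = np.zeros(2)  # weights at which the two gradients are compared

def grad_w(X, y, w):
    """Gradient of the loss 0.5 * ||X w - y||^2 with respect to w."""
    return X.T @ (X @ w - y)

g_real = grad_w(X_real, y_real, w)
for _ in range(1000):
    r = X_syn @ w - y_syn
    d = grad_w(X_syn, y_syn, w) - g_real      # gradient mismatch
    # analytic gradient of ||d||^2 w.r.t. X_syn: 2 (r d^T + (X_syn d) w^T)
    X_syn -= 1e-2 * 2.0 * (np.outer(r, d) + np.outer(X_syn @ d, w))
```

After the loop, training on the two synthetic points produces (at w) essentially the same gradient as training on all 100 real points; the paper applies the same principle with deep networks, iterating over many weight configurations along a training trajectory.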
\ No newline at end of file diff --git a/data/2021/iclr/Dataset Inference: Ownership Resolution in Machine Learning b/data/2021/iclr/Dataset Inference: Ownership Resolution in Machine Learning new file mode 100644 index 0000000000..b594e6910c --- /dev/null +++ b/data/2021/iclr/Dataset Inference: Ownership Resolution in Machine Learning @@ -0,0 +1 @@ +With increasingly more data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks in a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key observation that knowledge contained in the stolen model's training set is what is common to all stolen copies. The adversary's goal, irrespective of the attack employed, is always to extract this knowledge or its by-products. This gives the original model's owner a strong advantage over the adversary: model owners have access to the original training data. We thus introduce $dataset$ $inference$, the process of identifying whether a suspected model copy has private knowledge from the original model's dataset, as a defense against model stealing. We develop an approach for dataset inference that combines statistical testing with the ability to estimate the distance of multiple data points to the decision boundary. Our experiments on CIFAR10, SVHN, CIFAR100 and ImageNet show that model owners can claim with confidence greater than 99% that their model (or dataset as a matter of fact) was stolen, despite only exposing 50 of the stolen model's training points. Dataset inference defends against state-of-the-art attacks even when the adversary is adaptive. 
Unlike prior work, it does not require retraining or overfitting the defended model. \ No newline at end of file diff --git a/data/2021/iclr/Dataset Meta-Learning from Kernel Ridge-Regression b/data/2021/iclr/Dataset Meta-Learning from Kernel Ridge-Regression new file mode 100644 index 0000000000..650b6fdc3f --- /dev/null +++ b/data/2021/iclr/Dataset Meta-Learning from Kernel Ridge-Regression @@ -0,0 +1 @@ +One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. We introduce the novel concept of $\epsilon$-approximation of datasets, obtaining datasets which are much smaller than or are significant corruptions of the original training data while maintaining similar model performance. We introduce a meta-learning algorithm called Kernel Inducing Points (KIP) for obtaining such remarkable datasets, inspired by the recent developments in the correspondence between infinitely-wide neural networks and kernel ridge-regression (KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving previous dataset distillation and subset selection methods while obtaining state-of-the-art results for MNIST and CIFAR-10 classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime, which leads to state-of-the-art results for neural network dataset distillation with potential applications to privacy-preservation.
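The inner solver that KIP meta-learns against is plain kernel ridge-regression, which has a closed form; a self-contained sketch (with a generic RBF kernel standing in for the paper's infinite-width network kernels):

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    """RBF kernel matrix: k(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_predict(X_support, y_support, X_query, reg=1e-6):
    """Closed-form kernel ridge-regression: alpha = (K + reg I)^{-1} y,
    then predictions are K_query_support @ alpha."""
    K = rbf(X_support, X_support)
    alpha = np.linalg.solve(K + reg * np.eye(len(X_support)), y_support)
    return rbf(X_query, X_support) @ alpha

# Two well-separated support points; KRR should interpolate their labels.
X_support = np.array([[0.0], [5.0]])
y_support = np.array([0.0, 1.0])
preds = krr_predict(X_support, y_support, X_support)
```

Roughly speaking, KIP treats the support set itself as the learnable object and optimizes it so that this closed-form predictor performs well on real data; the closed form is what makes the meta-objective differentiable end to end.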
\ No newline at end of file diff --git a/data/2021/iclr/DeLighT: Deep and Light-weight Transformer b/data/2021/iclr/DeLighT: Deep and Light-weight Transformer new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Deberta: decoding-Enhanced Bert with Disentangled Attention b/data/2021/iclr/Deberta: decoding-Enhanced Bert with Disentangled Attention new file mode 100644 index 0000000000..8ad1f6ef02 --- /dev/null +++ b/data/2021/iclr/Deberta: decoding-Enhanced Bert with Disentangled Attention @@ -0,0 +1 @@ +Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained models will be made publicly available at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/Debiasing Concept-based Explanations with Causal Analysis b/data/2021/iclr/Debiasing Concept-based Explanations with Causal Analysis new file mode 100644 index 0000000000..907f4213c0 --- /dev/null +++ b/data/2021/iclr/Debiasing Concept-based Explanations with Causal Analysis @@ -0,0 +1 @@ +The concept-based explanation approach is a popular model-interpretability tool because it expresses the reasons for a model's predictions in terms of concepts that are meaningful to domain experts. In this work, we study the problem of the concepts being correlated with confounding information in the features. We propose a new causal prior graph for modeling the impacts of unobserved variables and a method to remove the impact of confounding information and noise using a two-stage regression technique borrowed from the instrumental variable literature. We also model the completeness of the concept set and show that our debiasing method works when the concepts are not complete. Our synthetic and real-world experiments demonstrate the success of our method in removing biases and improving the ranking of the concepts in terms of their contribution to the explanation of the predictions. \ No newline at end of file diff --git a/data/2021/iclr/Decentralized Attribution of Generative Models b/data/2021/iclr/Decentralized Attribution of Generative Models new file mode 100644 index 0000000000..ca2ba82bd4 --- /dev/null +++ b/data/2021/iclr/Decentralized Attribution of Generative Models @@ -0,0 +1 @@ +There have been growing concerns regarding the fabrication of content through generative models. This paper investigates the feasibility of decentralized attribution of such models. Given a set of generative models learned from the same dataset, attributability is achieved when a public verification service exists to correctly identify the source models for generated content.
Attribution allows tracing of machine-generated content back to its source model, thus facilitating IP protection and content regulation. Existing attribution methods are non-scalable with respect to the number of models and lack theoretical bounds on attributability. This paper studies decentralized attribution, where provable attributability can be achieved by only requiring each model to be distinguishable from the authentic data. Our major contributions are the derivation of the sufficient conditions for decentralized attribution and the design of keys following these conditions. Specifically, we show that decentralized attribution can be achieved when keys (1) are orthogonal to each other, and (2) belong to a subspace determined by the data distribution. This result is validated on MNIST and CelebA. Lastly, we use these datasets to examine the trade-off between generation quality and robust attributability against adversarial post-processing. \ No newline at end of file diff --git a/data/2021/iclr/Deciphering and Optimizing Multi-Task Learning: a Random Matrix Approach b/data/2021/iclr/Deciphering and Optimizing Multi-Task Learning: a Random Matrix Approach new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Deconstructing the Regularization of BatchNorm b/data/2021/iclr/Deconstructing the Regularization of BatchNorm new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Decoupling Global and Local Representations via Invertible Generative Flows b/data/2021/iclr/Decoupling Global and Local Representations via Invertible Generative Flows new file mode 100644 index 0000000000..15bbf0b6cf --- /dev/null +++ b/data/2021/iclr/Decoupling Global and Local Representations via Invertible Generative Flows @@ -0,0 +1 @@ +In this work, we propose a new generative model that is capable of automatically decoupling global and local representations of images in an entirely unsupervised setting, by embedding a generative flow
in the VAE framework to model the decoder. Specifically, the proposed model utilizes the variational auto-encoding framework to learn a (low-dimensional) vector of latent variables to capture the global information of an image, which is fed as a conditional input to a flow-based invertible decoder with architecture borrowed from style transfer literature. Experimental results on standard image benchmarks demonstrate the effectiveness of our model in terms of density estimation, image generation and unsupervised representation learning. Importantly, this work demonstrates that with only architectural inductive biases, a generative model with a likelihood-based objective is capable of learning decoupled representations, requiring no explicit supervision. The code for our model is available at https://github.com/XuezheMax/wolf . \ No newline at end of file diff --git a/data/2021/iclr/Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation b/data/2021/iclr/Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation new file mode 100644 index 0000000000..74a3cfb7b3 --- /dev/null +++ b/data/2021/iclr/Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation @@ -0,0 +1 @@ +Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where generation is sequential, the former allows generation to be parallelized across target token positions. Some of the latest non-autoregressive models have achieved impressive translation quality-speed tradeoffs compared to autoregressive baselines. In this work, we reexamine this tradeoff and argue that autoregressive baselines can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. 
Our extensive experiments show that given a sufficiently deep encoder, a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed. We show that the speed disadvantage for autoregressive baselines compared to non-autoregressive methods has been overestimated in three aspects: suboptimal layer allocation, insufficient speed measurement, and lack of knowledge distillation. Our results establish a new protocol for future research toward fast, accurate machine translation. Our code is available at https://github.com/jungokasai/deep-shallow. \ No newline at end of file diff --git a/data/2021/iclr/Deep Equals Shallow for ReLU Networks in Kernel Regimes b/data/2021/iclr/Deep Equals Shallow for ReLU Networks in Kernel Regimes new file mode 100644 index 0000000000..93f0dec64c --- /dev/null +++ b/data/2021/iclr/Deep Equals Shallow for ReLU Networks in Kernel Regimes @@ -0,0 +1 @@ +Deep networks are often considered to be more expressive than shallow ones in terms of approximation. Indeed, certain functions can be approximated by deep networks provably more efficiently than by shallow ones, however, no tractable algorithms are known for learning such deep models. Separately, a recent line of work has shown that deep networks trained with gradient descent may behave like (tractable) kernel methods in a certain over-parameterized regime, where the kernel is determined by the architecture and initialization, and this paper focuses on approximation for such kernels. We show that for ReLU activations, the kernels derived from deep fully-connected networks have essentially the same approximation properties as their "shallow" two-layer counterpart, namely the same eigenvalue decay for the corresponding integral operator. This highlights the limitations of the kernel framework for understanding the benefits of such deep architectures. 
Our main theoretical result relies on characterizing such eigenvalue decays through differentiability properties of the kernel function, which also easily applies to the study of other kernels defined on the sphere. \ No newline at end of file diff --git a/data/2021/iclr/Deep Learning meets Projective Clustering b/data/2021/iclr/Deep Learning meets Projective Clustering new file mode 100644 index 0000000000..596685d7b1 --- /dev/null +++ b/data/2021/iclr/Deep Learning meets Projective Clustering @@ -0,0 +1,3 @@ +A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$, compute its rank-$j$ approximation $A_j$ via SVD, and then factor $A_j$ into a pair of matrices that correspond to smaller fully-connected layers to replace the original embedding layer. Geometrically, the rows of $A$ represent points in $\mathbb{R}^d$, and the rows of $A_j$ represent their projections onto the $j$-dimensional subspace that minimizes the sum of squared distances ("errors") to the points. In practice, these rows of $A$ may be spread around $k>1$ subspaces, so factoring $A$ based on a single subspace may lead to large errors that turn into large drops in accuracy. +Inspired by \emph{projective clustering} from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces, each of dimension $j$, that minimizes the sum of squared distances over every point (row in $A$) to its \emph{closest} subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer by a set of $k$ small layers that operate in parallel and are then recombined with a single fully-connected layer. +Extensive experimental results on the GLUE benchmark yield networks that are both more accurate and smaller compared to the standard matrix factorization (SVD). 
For example, we further compress DistilBERT by reducing the size of the embedding layer by $40\%$ while incurring only a $0.5\%$ average drop in accuracy over all nine GLUE tasks, compared to a $2.8\%$ drop using the existing SVD approach. On RoBERTa we achieve $43\%$ compression of the embedding layer with less than a $0.8\%$ average drop in accuracy as compared to a $3\%$ drop previously. Open code for reproducing and extending our results is provided. \ No newline at end of file diff --git a/data/2021/iclr/Deep Networks and the Multiple Manifold Problem b/data/2021/iclr/Deep Networks and the Multiple Manifold Problem new file mode 100644 index 0000000000..7e02c55b25 --- /dev/null +++ b/data/2021/iclr/Deep Networks and the Multiple Manifold Problem @@ -0,0 +1 @@ +We study the multiple manifold problem, a binary classification task modeled on applications in machine vision, in which a deep fully-connected neural network is trained to separate two low-dimensional submanifolds of the unit sphere. We provide an analysis of the one-dimensional case, proving for a simple manifold configuration that when the network depth $L$ is large relative to certain geometric and statistical properties of the data, the network width $n$ grows as a sufficiently large polynomial in $L$, and the number of i.i.d. samples from the manifolds is polynomial in $L$, randomly-initialized gradient descent rapidly learns to classify the two manifolds perfectly with high probability. Our analysis demonstrates concrete benefits of depth and width in the context of a practically-motivated model problem: the depth acts as a fitting resource, with larger depths corresponding to smoother networks that can more readily separate the class manifolds, and the width acts as a statistical resource, enabling concentration of the randomly-initialized network and its gradients. 
The argument centers around the neural tangent kernel and its role in the nonasymptotic analysis of training overparameterized neural networks; to this literature, we contribute essentially optimal rates of concentration for the neural tangent kernel of deep fully-connected networks, requiring width $n \gtrsim L\,\mathrm{poly}(d_0)$ to achieve uniform concentration of the initial kernel over a $d_0$-dimensional submanifold of the unit sphere $\mathbb{S}^{n_0-1}$, and a nonasymptotic framework for establishing generalization of networks trained in the NTK regime with structured data. The proof makes heavy use of martingale concentration to optimally treat statistical dependencies across layers of the initial random network. This approach should be of use in establishing similar results for other network architectures. \ No newline at end of file diff --git a/data/2021/iclr/Deep Neural Network Fingerprinting by Conferrable Adversarial Examples b/data/2021/iclr/Deep Neural Network Fingerprinting by Conferrable Adversarial Examples new file mode 100644 index 0000000000..af0629cfc4 --- /dev/null +++ b/data/2021/iclr/Deep Neural Network Fingerprinting by Conferrable Adversarial Examples @@ -0,0 +1 @@ +In Machine Learning as a Service, a provider trains a deep neural network and provides many users access. The hosted (source) model is susceptible to model stealing attacks, where an adversary derives a \emph{surrogate model} from API access to the source model. For post hoc detection of such attacks, the provider needs a robust method to determine whether a suspect model is a surrogate of their model. We propose a fingerprinting method for deep neural network classifiers that extracts a set of inputs from the source model so that only surrogates agree with the source model on the classification of such inputs. 
These inputs are a subclass of transferable adversarial examples which we call \emph{conferrable} adversarial examples that exclusively transfer with a target label from a source model to its surrogates. We propose a new method to generate these conferrable adversarial examples. We present an extensive study on the unremovability of our fingerprint against fine-tuning, weight pruning, retraining, retraining with different architectures, three model extraction attacks from related work, transfer learning, adversarial training, and two new adaptive attacks. Our fingerprint is robust against distillation, related model extraction attacks, and even transfer learning when the attacker has no access to the model provider's dataset. Our fingerprint is the first method that reaches an AUC of 1.0 in verifying surrogates, compared to an AUC of 0.63 by previous fingerprints. \ No newline at end of file diff --git a/data/2021/iclr/Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS b/data/2021/iclr/Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS new file mode 100644 index 0000000000..610cc2dc61 --- /dev/null +++ b/data/2021/iclr/Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS @@ -0,0 +1 @@ +We prove that the reproducing kernel Hilbert spaces (RKHS) of a deep neural tangent kernel and the Laplace kernel include the same set of functions, when both kernels are restricted to the sphere $\mathbb{S}^{d-1}$. Additionally, we prove that the exponential power kernel with a smaller power (making the kernel more non-smooth) leads to a larger RKHS, when it is restricted to the sphere $\mathbb{S}^{d-1}$ and when it is defined on the entire $\mathbb{R}^d$. 
\ No newline at end of file diff --git a/data/2021/iclr/Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks b/data/2021/iclr/Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks new file mode 100644 index 0000000000..37cce9a7ed --- /dev/null +++ b/data/2021/iclr/Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks @@ -0,0 +1 @@ +Adversarial poisoning attacks distort training data in order to corrupt the test-time behavior of a classifier. A provable defense provides a certificate for each test sample, which is a lower bound on the magnitude of any adversarial distortion of the training set that can corrupt the test sample's classification. We propose two provable defenses against poisoning attacks: (i) Deep Partition Aggregation (DPA), a certified defense against a general poisoning threat model, defined as the insertion or deletion of a bounded number of samples to the training set -- by implication, this threat model also includes arbitrary distortions to a bounded number of images and/or labels; and (ii) Semi-Supervised DPA (SS-DPA), a certified defense against label-flipping poisoning attacks. DPA is an ensemble method where base models are trained on partitions of the training set determined by a hash function. DPA is related to subset aggregation, a well-studied ensemble method in classical machine learning. DPA can also be viewed as an extension of randomized ablation (Levine & Feizi, 2020a), a certified defense against sparse evasion attacks, to the poisoning domain. Our label-flipping defense, SS-DPA, uses a semi-supervised learning algorithm as its base classifier model: we train each base classifier using the entire unlabeled training set in addition to the labels for a partition. SS-DPA outperforms the existing certified defense for label-flipping attacks (Rosenfeld et al., 2020). SS-DPA certifies >= 50% of test images against 675 label flips (vs. 
fewer label flips with the existing defense). Against general poisoning attacks, DPA certifies >= 50% of test images against > 500 poison image insertions on MNIST, and nine insertions on CIFAR-10. These results establish new state-of-the-art provable defenses against poisoning attacks. \ No newline at end of file diff --git a/data/2021/iclr/Deep Repulsive Clustering of Ordered Data Based on Order-Identity Decomposition b/data/2021/iclr/Deep Repulsive Clustering of Ordered Data Based on Order-Identity Decomposition new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients b/data/2021/iclr/Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients new file mode 100644 index 0000000000..7b24b2210a --- /dev/null +++ b/data/2021/iclr/Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients @@ -0,0 +1 @@ +Discovering the underlying mathematical expressions describing a dataset is a core challenge for artificial intelligence. This is the problem of $\textit{symbolic}$ $\textit{regression.}$ Despite recent advances in training neural networks to solve complex tasks, deep learning approaches to symbolic regression are underexplored. We propose a framework that combines deep learning with symbolic regression via a simple idea: use a large model to search the space of small models. More specifically, we use a recurrent neural network to emit a distribution over tractable mathematical expressions, and employ reinforcement learning to train the network to generate better-fitting expressions. Our algorithm significantly outperforms standard genetic programming-based symbolic regression in its ability to exactly recover symbolic expressions on a series of benchmark problems, both with and without added noise.
More broadly, our contributions include a framework that can be applied to optimize hierarchical, variable-length objects under a black-box performance metric, with the ability to incorporate a priori constraints in situ, and a risk-seeking policy gradient formulation that optimizes for best-case performance instead of expected performance. \ No newline at end of file diff --git a/data/2021/iclr/DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs b/data/2021/iclr/DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs new file mode 100644 index 0000000000..b4b595b127 --- /dev/null +++ b/data/2021/iclr/DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs @@ -0,0 +1 @@ +We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience. This approach can be applied on top of any learned representation and has the potential to easily support multiple solution objectives as well as zero-shot adjustment to changing environments and goals. Our main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL. DAC-MDPs are a non-parametric model that can leverage deep representations and account for limited data by introducing costs for exploiting under-represented parts of the model. In theory, we show conditions that allow for lower-bounding the performance of DAC-MDP solutions. We also investigate the empirical behavior in a number of environments, including those with image-based observations. Overall, the experiments demonstrate that the framework can work in practice and scale to large complex offline RL problems. 
\ No newline at end of file diff --git a/data/2021/iclr/Deformable DETR: Deformable Transformers for End-to-End Object Detection b/data/2021/iclr/Deformable DETR: Deformable Transformers for End-to-End Object Detection new file mode 100644 index 0000000000..0a62c06a4c --- /dev/null +++ b/data/2021/iclr/Deformable DETR: Deformable Transformers for End-to-End Object Detection @@ -0,0 +1 @@ +DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10$\times$ fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released. \ No newline at end of file diff --git a/data/2021/iclr/Degree-Quant: Quantization-Aware Training for Graph Neural Networks b/data/2021/iclr/Degree-Quant: Quantization-Aware Training for Graph Neural Networks new file mode 100644 index 0000000000..fa8de3c60e --- /dev/null +++ b/data/2021/iclr/Degree-Quant: Quantization-Aware Training for Graph Neural Networks @@ -0,0 +1 @@ +Graph neural networks (GNNs) have demonstrated strong performance on a wide variety of tasks due to their ability to model non-uniform structured data. Despite their promise, there exists little research exploring methods to make them more efficient at inference time. In this work, we explore the viability of training quantized GNNs, enabling the usage of low precision integer arithmetic during inference.
We identify the sources of error that uniquely arise when attempting to quantize GNNs, and propose an architecturally-agnostic method, Degree-Quant, to improve performance over existing quantization-aware training baselines commonly used on other architectures, such as CNNs. We validate our method on six datasets and show, unlike previous attempts, that models generalize to unseen graphs. Models trained with Degree-Quant for INT8 quantization perform as well as FP32 models in most cases; for INT4 models, we obtain up to 26% gains over the baselines. Our work enables up to 4.7x speedups on CPU when using INT8 arithmetic. \ No newline at end of file diff --git a/data/2021/iclr/Denoising Diffusion Implicit Models b/data/2021/iclr/Denoising Diffusion Implicit Models new file mode 100644 index 0000000000..2798c70a91 --- /dev/null +++ b/data/2021/iclr/Denoising Diffusion Implicit Models @@ -0,0 +1 @@ +Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space. 
\ No newline at end of file diff --git a/data/2021/iclr/Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization b/data/2021/iclr/Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization new file mode 100644 index 0000000000..a949bb61bb --- /dev/null +++ b/data/2021/iclr/Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization @@ -0,0 +1 @@ +Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naively applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN) that can effectively optimize a policy offline using 10-20 times fewer data than prior works. Furthermore, the recursive application of BREMEN is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines. Codes and pre-trained models are available at this https URL . 
\ No newline at end of file diff --git a/data/2021/iclr/DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues b/data/2021/iclr/DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues new file mode 100644 index 0000000000..fc172dea3d --- /dev/null +++ b/data/2021/iclr/DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues @@ -0,0 +1 @@ +To successfully negotiate a deal, it is not enough to communicate fluently: pragmatic planning of persuasive negotiation strategies is essential. While modern dialogue agents excel at generating fluent sentences, they still lack pragmatic grounding and cannot reason strategically. We present DialoGraph, a negotiation system that incorporates pragmatic strategies in a negotiation dialogue using graph neural networks. DialoGraph explicitly incorporates dependencies between sequences of strategies to enable improved and interpretable prediction of next optimal strategies, given the dialogue context. Our graph-based method outperforms prior state-of-the-art negotiation models both in the accuracy of strategy/dialogue act prediction and in the quality of downstream dialogue response generation. We qualitatively show further benefits of learned strategy-graphs in providing explicit associations between effective negotiation strategies over the course of the dialogue, leading to interpretable and strategic dialogues. \ No newline at end of file diff --git a/data/2021/iclr/DiffWave: A Versatile Diffusion Model for Audio Synthesis b/data/2021/iclr/DiffWave: A Versatile Diffusion Model for Audio Synthesis new file mode 100644 index 0000000000..4013c626d3 --- /dev/null +++ b/data/2021/iclr/DiffWave: A Versatile Diffusion Model for Audio Synthesis @@ -0,0 +1 @@ +In this work, we propose DiffWave, a versatile Diffusion probabilistic model for conditional and unconditional Waveform generation. 
The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in Different Waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality~(MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations. \ No newline at end of file diff --git a/data/2021/iclr/Differentiable Segmentation of Sequences b/data/2021/iclr/Differentiable Segmentation of Sequences new file mode 100644 index 0000000000..75ff5185c3 --- /dev/null +++ b/data/2021/iclr/Differentiable Segmentation of Sequences @@ -0,0 +1 @@ +Segmented models are widely used to describe non-stationary sequential data with discrete change points. Their estimation usually requires solving a mixed discrete-continuous optimization problem, where the segmentation is the discrete part and all other model parameters are continuous. A number of estimation algorithms have been developed that are highly specialized for their specific model assumptions. The dependence on non-standard algorithms makes it hard to integrate segmented models in state-of-the-art deep learning architectures that critically depend on gradient-based optimization techniques. In this work, we formulate a relaxed variant of segmented models that enables joint estimation of all model parameters, including the segmentation, with gradient descent. 
We build on recent advances in learning continuous warping functions and propose a novel family of warping functions based on the two-sided power (TSP) distribution. TSP-based warping functions are differentiable, have simple closed-form expressions, and can represent segmentation functions exactly. Our formulation includes the important class of segmented generalized linear models as a special case, which makes it highly versatile. We use our approach to model the spread of COVID-19 by segmented Poisson regression, perform logistic regression on Fashion-MNIST with artificial concept drift, and demonstrate its capacities for phoneme segmentation. \ No newline at end of file diff --git a/data/2021/iclr/Differentiable Trust Region Layers for Deep Reinforcement Learning b/data/2021/iclr/Differentiable Trust Region Layers for Deep Reinforcement Learning new file mode 100644 index 0000000000..7f99893960 --- /dev/null +++ b/data/2021/iclr/Differentiable Trust Region Layers for Deep Reinforcement Learning @@ -0,0 +1 @@ +Trust region methods are a popular tool in reinforcement learning as they yield robust policy updates in continuous and discrete action spaces. However, enforcing such trust regions in deep reinforcement learning is difficult. Hence, many approaches, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), are based on approximations. Due to those approximations, they violate the constraints or fail to find the optimal solution within the trust region. Moreover, they are difficult to implement, lack sufficient exploration, and have been shown to depend on seemingly unrelated implementation choices. In this work, we propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections. Unlike existing methods, those layers formalize trust regions for each state individually and can complement existing reinforcement learning algorithms. 
We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions. We empirically demonstrate that those projection layers achieve similar or better results than existing methods while being almost agnostic to specific implementation choices. (Code: https://git.io/Jt3go) \ No newline at end of file diff --git a/data/2021/iclr/Differentially Private Learning Needs Better Features (or Much More Data) b/data/2021/iclr/Differentially Private Learning Needs Better Features (or Much More Data) new file mode 100644 index 0000000000..7d3ad9e49f --- /dev/null +++ b/data/2021/iclr/Differentially Private Learning Needs Better Features (or Much More Data) @@ -0,0 +1 @@ +We demonstrate that differentially private machine learning has not yet reached its "AlexNet moment" on many canonical vision tasks: linear models trained on handcrafted features significantly outperform end-to-end deep neural networks for moderate privacy budgets. To exceed the performance of handcrafted features, we show that private learning requires either much more private data, or access to features learned on public data from a similar domain. Our work introduces simple yet strong baselines for differentially private learning that can inform the evaluation of future progress in this area. \ No newline at end of file diff --git a/data/2021/iclr/Directed Acyclic Graph Neural Networks b/data/2021/iclr/Directed Acyclic Graph Neural Networks new file mode 100644 index 0000000000..a2498ce3b5 --- /dev/null +++ b/data/2021/iclr/Directed Acyclic Graph Neural Networks @@ -0,0 +1 @@ +Graph-structured data ubiquitously appears in science and engineering. Graph neural networks (GNNs) are designed to exploit the relational inductive bias exhibited in graphs; they have been shown to outperform other forms of neural networks in scenarios where structure information supplements node features.
The most common GNN architecture aggregates information from neighborhoods based on message passing. Its generality has made it broadly applicable. In this paper, we focus on a special, yet widely used, type of graphs -- DAGs -- and inject a stronger inductive bias -- partial ordering -- into the neural network design. We propose the \emph{directed acyclic graph neural network}, DAGNN, an architecture that processes information according to the flow defined by the partial order. DAGNN can be considered a framework that entails earlier works as special cases (e.g., models for trees and models updating node representations recurrently), but we identify several crucial components that prior architectures lack. We perform comprehensive experiments, including ablation studies, on representative DAG datasets (i.e., source code, neural architectures, and probabilistic graphical models) and demonstrate the superiority of DAGNN over simpler DAG architectures as well as general graph architectures. \ No newline at end of file diff --git a/data/2021/iclr/Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate b/data/2021/iclr/Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate new file mode 100644 index 0000000000..a98ed26947 --- /dev/null +++ b/data/2021/iclr/Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate @@ -0,0 +1 @@ +Understanding the algorithmic regularization effect of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most of the existing works, however, focus on very small or even infinitesimal learning rate regime, and fail to cover practical scenarios where the learning rate is moderate and annealing. 
In this paper, we make an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior for optimizing an overparameterized linear regression problem. In this case, SGD and GD are known to converge to the unique minimum-norm solution; however, with the moderate and annealing learning rate, we show that they exhibit different directional bias: SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions. Furthermore, we show that such directional bias does matter when early stopping is adopted, where the SGD output is nearly optimal but the GD output is suboptimal. Finally, our theory explains several folk arts in practice used for SGD hyperparameter tuning, such as (1) linearly scaling the initial learning rate with batch size; and (2) overrunning SGD with high learning rate even when the loss stops decreasing. \ No newline at end of file diff --git a/data/2021/iclr/Disambiguating Symbolic Expressions in Informal Documents b/data/2021/iclr/Disambiguating Symbolic Expressions in Informal Documents new file mode 100644 index 0000000000..35b863eda0 --- /dev/null +++ b/data/2021/iclr/Disambiguating Symbolic Expressions in Informal Documents @@ -0,0 +1 @@ +We propose the task of disambiguating symbolic expressions in informal STEM documents in the form of LaTeX files - that is, determining their precise semantics and abstract syntax tree - as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. We evaluated several baseline models on this dataset, which failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org, which yields promising results despite the small size of the dataset. 
We evaluate our model using a plurality of dedicated techniques, taking the syntax and semantics of symbolic expressions into account. \ No newline at end of file diff --git a/data/2021/iclr/Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization b/data/2021/iclr/Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization new file mode 100644 index 0000000000..1f3f7e0ea8 --- /dev/null +++ b/data/2021/iclr/Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization @@ -0,0 +1 @@ +We propose a simple, general and effective technique, Reward Randomization for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). RPG is able to discover multiple distinctive human-interpretable strategies in challenging temporal trust dilemmas, including grid-world games and a real-world game Agar.io, where multiple equilibria exist but standard multi-agent policy gradient algorithms always converge to a fixed one with a sub-optimal payoff for every player even using state-of-the-art exploration techniques. Furthermore, with the set of diverse strategies from RPG, we can (1) achieve higher payoffs by fine-tuning the best policy from the set; and (2) obtain an adaptive agent by using this set of strategies as its training opponents. The source code and example videos can be found in our website: https://sites.google.com/view/staghuntrpg. 
\ No newline at end of file diff --git a/data/2021/iclr/Discovering Non-monotonic Autoregressive Orderings with Variational Inference b/data/2021/iclr/Discovering Non-monotonic Autoregressive Orderings with Variational Inference new file mode 100644 index 0000000000..e5169e2cbd --- /dev/null +++ b/data/2021/iclr/Discovering Non-monotonic Autoregressive Orderings with Variational Inference @@ -0,0 +1 @@ +The predominant approach for language modeling is to process sequences from left to right, but this eliminates a source of information: the order by which the sequence was generated. One strategy to recover this information is to decode both the content and ordering of tokens. Existing approaches supervise content and ordering by designing problem-specific loss functions and pre-training with an ordering pre-selected. Other recent works use iterative search to discover problem-specific orderings for training, but suffer from high time complexity and cannot be efficiently parallelized. We address these limitations with an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data -- no domain knowledge required. The learner contains an encoder network and decoder language model that perform variational inference with autoregressive orders (represented as permutation matrices) as latent variables. The corresponding ELBO is not differentiable, so we develop a practical algorithm for end-to-end optimization using policy gradients. We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass. Permutations then serve as target generation orders for training an insertion-based Transformer language model. Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders. 
\ No newline at end of file diff --git a/data/2021/iclr/Discovering a set of policies for the worst case reward b/data/2021/iclr/Discovering a set of policies for the worst case reward new file mode 100644 index 0000000000..86afc9a2ba --- /dev/null +++ b/data/2021/iclr/Discovering a set of policies for the worst case reward @@ -0,0 +1 @@ +We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so our analysis extends to any SIP. This includes known policy-composition operators like generalized policy improvement. Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks. The algorithm works by successively adding new policies to the set. We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and the algorithm only stops when there does not exist a policy that leads to improved performance. We empirically evaluate our algorithm on a grid world and also on a set of domains from the DeepMind control suite. We confirm our theoretical results regarding the monotonically improving performance of our algorithm. Interestingly, we also show empirically that the sets of policies computed by the algorithm are diverse, leading to different trajectories in the grid world and very distinct locomotion skills in the control suite. 
\ No newline at end of file diff --git a/data/2021/iclr/Discrete Graph Structure Learning for Forecasting Multiple Time Series b/data/2021/iclr/Discrete Graph Structure Learning for Forecasting Multiple Time Series new file mode 100644 index 0000000000..122af341ee --- /dev/null +++ b/data/2021/iclr/Discrete Graph Structure Learning for Forecasting Multiple Time Series @@ -0,0 +1 @@ +Time series forecasting is an extensively studied subject in statistics, economics, and computer science. Exploration of the correlation and causation among the variables in a multivariate time series shows promise in enhancing the performance of a time series model. When using deep neural networks as forecasting models, we hypothesize that exploiting the pairwise information among multiple (multivariate) time series also improves their forecast. If an explicit graph structure is known, graph neural networks (GNNs) have been demonstrated as powerful tools to exploit the structure. In this work, we propose learning the structure simultaneously with the GNN if the graph is unknown. We cast the problem as learning a probabilistic graph model through optimizing the mean performance over the graph distribution. The distribution is parameterized by a neural network so that discrete graphs can be sampled differentiably through reparameterization. Empirical evaluations show that our method is simpler, more efficient, and better performing than a recently proposed bilevel learning approach for graph structure learning, as well as a broad array of forecasting models, either deep or non-deep learning based, and graph or non-graph based. 
\ No newline at end of file diff --git a/data/2021/iclr/Disentangled Recurrent Wasserstein Autoencoder b/data/2021/iclr/Disentangled Recurrent Wasserstein Autoencoder new file mode 100644 index 0000000000..c5d946f9a0 --- /dev/null +++ b/data/2021/iclr/Disentangled Recurrent Wasserstein Autoencoder @@ -0,0 +1 @@ +Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning due to challenges of generating sequential data. In this paper, we propose the Recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation of an input sequence into static and dynamic factors (i.e., time-invariant and time-varying parts). Our theoretical analysis shows that R-WAE minimizes an upper bound of a penalized form of the Wasserstein distance between the model distribution and the sequential data distribution, and simultaneously maximizes the mutual information between the input data and the different disentangled latent factors. This is superior to a (recurrent) VAE, which does not explicitly enforce mutual information maximization between input data and disentangled latent representations. When the number of actions in sequential data is available as weak supervision information, R-WAE is extended to learn a categorical latent representation of actions to improve its disentanglement. Experiments on a variety of datasets show that our models outperform other baselines with the same settings in terms of disentanglement and unconditional video generation, both quantitatively and qualitatively. 
\ No newline at end of file diff --git a/data/2021/iclr/Disentangling 3D Prototypical Networks for Few-Shot Concept Learning b/data/2021/iclr/Disentangling 3D Prototypical Networks for Few-Shot Concept Learning new file mode 100644 index 0000000000..239c012375 --- /dev/null +++ b/data/2021/iclr/Disentangling 3D Prototypical Networks for Few-Shot Concept Learning @@ -0,0 +1 @@ +We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification. Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay. They are trained end-to-end self-supervised by predicting views in static scenes, alongside a small number of 3D object boxes. Objects and scenes are represented in terms of 3D feature grids in the bottleneck of the network. We show that the proposed 3D neural representations are compositional: they can generate novel 3D scene feature maps by mixing object shapes and styles, resizing and adding the resulting object 3D feature maps over background scene feature maps. We show that classifiers for object categories, color, materials, and spatial relationships trained over the disentangled 3D feature sub-spaces generalize better with dramatically fewer examples than the current state-of-the-art, and enable a visual question answering system that uses them as its modules to generalize one-shot to novel objects in the scene. 
\ No newline at end of file diff --git a/data/2021/iclr/Distance-Based Regularisation of Deep Networks for Fine-Tuning b/data/2021/iclr/Distance-Based Regularisation of Deep Networks for Fine-Tuning new file mode 100644 index 0000000000..0081981f55 --- /dev/null +++ b/data/2021/iclr/Distance-Based Regularisation of Deep Networks for Fine-Tuning @@ -0,0 +1 @@ +We investigate approaches to regularisation during fine-tuning of deep neural networks. First we provide a neural network generalisation bound based on Rademacher complexity that uses the distance the weights have moved from their initial values. This bound has no direct dependence on the number of weights and compares favourably to other bounds when applied to convolutional networks. Our bound is highly relevant for fine-tuning, because providing a network with a good initialisation based on transfer learning means that learning can modify the weights less, and hence achieve tighter generalisation. Inspired by this, we develop a simple yet effective fine-tuning algorithm that constrains the hypothesis class to a small sphere centred on the initial pre-trained weights, thus obtaining provably better generalisation performance than conventional transfer learning. Empirical evaluation shows that our algorithm works well, corroborating our theoretical results. It outperforms both state of the art fine-tuning competitors, and penalty-based alternatives that we show do not directly constrain the radius of the search space. 
\ No newline at end of file diff --git a/data/2021/iclr/Distilling Knowledge from Reader to Retriever for Question Answering b/data/2021/iclr/Distilling Knowledge from Reader to Retriever for Question Answering new file mode 100644 index 0000000000..3a16de031e --- /dev/null +++ b/data/2021/iclr/Distilling Knowledge from Reader to Retriever for Question Answering @@ -0,0 +1 @@ +The task of information retrieval is an important component of many natural language processing systems, such as open domain question answering. While traditional methods were based on hand-crafted features, continuous representations based on neural networks recently obtained competitive results. A challenge of using such methods is to obtain supervised data to train the retriever model, corresponding to pairs of query and support documents. In this paper, we propose a technique to learn retriever models for downstream tasks, inspired by knowledge distillation, and which does not require annotated pairs of query and documents. Our approach leverages attention scores of a reader model, used to solve the task based on retrieved documents, to obtain synthetic labels for the retriever. We evaluate our method on question answering, obtaining state-of-the-art results. \ No newline at end of file diff --git a/data/2021/iclr/Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent b/data/2021/iclr/Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent new file mode 100644 index 0000000000..9faeca14b3 --- /dev/null +++ b/data/2021/iclr/Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent @@ -0,0 +1 @@ +Byzantine-resilient Stochastic Gradient Descent (SGD) aims at shielding model training from Byzantine faults, be they ill-labeled training datapoints, exploited software/hardware vulnerabilities, or malicious worker nodes in a distributed setting. 
Two recent attacks, however, have been challenging state-of-the-art defenses, often successfully precluding the model from even fitting the training set. The main identified weakness in current defenses is their requirement of a sufficiently low variance-norm ratio for the stochastic gradients. We propose a practical method which, despite increasing the variance, reduces the variance-norm ratio, mitigating the identified weakness. We assess the effectiveness of our method over 736 different training configurations, comprising the 2 state-of-the-art attacks and 6 defenses. For confidence and reproducibility purposes, each configuration is run 5 times with specified seeds (1 to 5), totalling 3680 runs. In our experiments, when the attack is effective enough to decrease the highest observed top-1 cross-accuracy by at least 20% compared to the unattacked run, our technique systematically increases the highest observed accuracy back, and is able to recover at least 20% in more than 60% of the cases. \ No newline at end of file diff --git a/data/2021/iclr/Distributional Sliced-Wasserstein and Applications to Generative Modeling b/data/2021/iclr/Distributional Sliced-Wasserstein and Applications to Generative Modeling new file mode 100644 index 0000000000..5f7aca8e01 --- /dev/null +++ b/data/2021/iclr/Distributional Sliced-Wasserstein and Applications to Generative Modeling @@ -0,0 +1 @@ +Sliced-Wasserstein distance (SWD) and its variation, Max Sliced-Wasserstein distance (Max-SWD), have been widely used in recent years due to their fast computation and scalability when the probability measures lie in very high dimension. However, these distances still have weaknesses: SWD requires a large number of projection samples because it uses the uniform distribution to sample projecting directions, while Max-SWD uses only one projection, causing it to lose a large amount of information. 
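The Byzantine-resilient SGD abstract above proposes smoothing each worker's stochastic gradient before it reaches the robust aggregator, trading extra variance for a lower variance-norm ratio. A toy sketch of that pipeline; the coordinate-wise median aggregator and all names here are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: workers apply momentum locally and send the smoothed vector;
# the server combines submissions with a robust rule (here, a
# coordinate-wise median) instead of a plain average.

def worker_momentum(state, grad, beta=0.9):
    """Update one worker's momentum buffer; the result is what it sends."""
    return [beta * s + (1 - beta) * g for s, g in zip(state, grad)]

def coordinate_median(vectors):
    """Robust aggregation: per-coordinate median across workers."""
    agg = []
    for coords in zip(*vectors):
        s = sorted(coords)
        n, mid = len(coords), len(coords) // 2
        agg.append(s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid]))
    return agg

honest = [[1.0, 1.0], [1.1, 0.9]]
byzantine = [[100.0, -100.0]]                    # one malicious submission
update = coordinate_median(honest + byzantine)   # outlier is ignored
```

With an honest majority, the median tracks the honest workers' (momentum-smoothed) gradients even when a Byzantine worker submits arbitrary values.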
In this paper, we propose a novel distance that finds an optimal penalized probability measure over the slices, named the Distributional Sliced-Wasserstein distance (DSWD). We show that the DSWD is a generalization of both SWD and Max-SWD, and that the proposed distance can be found by searching for the push-forward measure over a set of measures satisfying certain constraints. Moreover, similar to SWD, we can extend the Generalized Sliced-Wasserstein distance (GSWD) to a Distributional Generalized Sliced-Wasserstein distance (DGSWD). Finally, we carry out extensive experiments to demonstrate the favorable generative modeling performance of our distances over the previous sliced-based distances on large-scale real datasets. \ No newline at end of file diff --git a/data/2021/iclr/Diverse Video Generation using a Gaussian Process Trigger b/data/2021/iclr/Diverse Video Generation using a Gaussian Process Trigger new file mode 100644 index 0000000000..fb92f04ad7 --- /dev/null +++ b/data/2021/iclr/Diverse Video Generation using a Gaussian Process Trigger @@ -0,0 +1 @@ +Generating future frames given a few context (or past) frames is a challenging task. It requires modeling the temporal coherence of videos and multi-modality in terms of diversity in the potential future states. Current variational approaches for video generation tend to marginalize over multi-modal future outcomes. Instead, we propose to explicitly model the multi-modality in the future outcomes and leverage it to sample diverse futures. Our approach, Diverse Video Generator, uses a Gaussian Process (GP) to learn priors on future states given the past and maintains a probability distribution over possible futures given a particular sample. In addition, we leverage the changes in this distribution over time to control the sampling of diverse future states by estimating the end of ongoing sequences. That is, we use the variance of the GP over the output function space to trigger a change in an action sequence. 
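The sliced-Wasserstein building block that DSWD generalizes can be illustrated in a few lines: project both sample sets onto a direction, where the 1-D Wasserstein-1 distance reduces to comparing sorted projections, then average over directions. This is a plain-SWD Monte Carlo sketch under simplifying assumptions (equal sample counts, Gaussian-sampled directions), not the paper's implementation.

```python
# Sketch of sliced Wasserstein-1 between two finite sample sets.
import math
import random

def w1_1d(xs, ys):
    """1-D W1 between equal-size samples: mean gap of sorted values."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def sliced_w1(X, Y, n_dirs=64, rng=random):
    """Average 1-D W1 over random unit projection directions."""
    total = 0.0
    for _ in range(n_dirs):
        theta = [rng.gauss(0.0, 1.0) for _ in X[0]]
        norm = math.sqrt(sum(t * t for t in theta)) or 1.0
        theta = [t / norm for t in theta]
        px = [sum(a * t for a, t in zip(x, theta)) for x in X]
        py = [sum(a * t for a, t in zip(y, theta)) for y in Y]
        total += w1_1d(px, py)
    return total / n_dirs

random.seed(0)
X = [[0.0, 0.0], [1.0, 1.0]]
d_same = sliced_w1(X, X)                           # identical sets -> 0
d_shift = sliced_w1(X, [[5.0, 5.0], [6.0, 6.0]])   # clearly positive
```

Max-SWD would replace the average over directions with a maximum; DSWD instead learns a distribution over directions.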
We achieve state-of-the-art results on diverse future frame generation in terms of reconstruction quality and diversity of the generated sequences. \ No newline at end of file diff --git a/data/2021/iclr/Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs b/data/2021/iclr/Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs new file mode 100644 index 0000000000..0d22759cad --- /dev/null +++ b/data/2021/iclr/Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs @@ -0,0 +1 @@ +Natural images are projections of 3D objects on a 2D image plane. While state-of-the-art 2D generative models like GANs show unprecedented quality in modeling the natural image manifold, it is unclear whether they implicitly capture the underlying 3D object structures. And if so, how could we exploit such knowledge to recover the 3D shapes of objects in the images? To answer these questions, in this work, we present the first attempt to directly mine 3D geometric clues from an off-the-shelf 2D GAN that is trained on RGB images only. Through our investigation, we found that such a pre-trained GAN indeed contains rich 3D knowledge and thus can be used to recover 3D shape from a single 2D image in an unsupervised manner. The core of our framework is an iterative strategy that explores and exploits diverse viewpoint and lighting variations in the GAN image manifold. The framework does not require 2D keypoint or 3D annotations, or strong assumptions on object shapes (e.g. shapes are symmetric), yet it successfully recovers 3D shapes with high precision for human faces, cats, cars, and buildings. The recovered 3D shapes immediately allow high-quality image editing like relighting and object rotation. We quantitatively demonstrate the effectiveness of our approach compared to previous methods in both 3D shape reconstruction and face rotation. Our code and models will be released at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth b/data/2021/iclr/Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth new file mode 100644 index 0000000000..8a6762cacf --- /dev/null +++ b/data/2021/iclr/Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth @@ -0,0 +1 @@ +A key factor in the success of deep neural networks is the ability to scale models to improve performance by varying the architecture depth and width. This simple property of neural network design has resulted in highly effective architectures for a variety of tasks. Nevertheless, there is limited understanding of effects of depth and width on the learned representations. In this paper, we study this fundamental question. We begin by investigating how varying depth and width affects model hidden representations, finding a characteristic block structure in the hidden representations of larger capacity (wider or deeper) models. We demonstrate that this block structure arises when model capacity is large relative to the size of the training set, and is indicative of the underlying layers preserving and propagating the dominant principal component of their representations. This discovery has important ramifications for features learned by different models, namely, representations outside the block structure are often similar across architectures with varying widths and depths, but the block structure is unique to each model. We analyze the output predictions of different model architectures, finding that even when the overall accuracy is similar, wide and deep models exhibit distinctive error patterns and variations across classes. 
\ No newline at end of file diff --git a/data/2021/iclr/Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning b/data/2021/iclr/Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning new file mode 100644 index 0000000000..faced88ad9 --- /dev/null +++ b/data/2021/iclr/Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning @@ -0,0 +1 @@ +The privacy leakage of the model about the training data can be bounded in the differential privacy mechanism. However, for meaningful privacy parameters, a differentially private model degrades the utility drastically when the model comprises a large number of trainable parameters. In this paper, we propose an algorithm \emph{Gradient Embedding Perturbation (GEP)} towards training differentially private deep models with decent accuracy. Specifically, in each gradient descent step, GEP first projects individual private gradient into a non-sensitive anchor subspace, producing a low-dimensional gradient embedding and a small-norm residual gradient. Then, GEP perturbs the low-dimensional embedding and the residual gradient separately according to the privacy budget. Such a decomposition permits a small perturbation variance, which greatly helps to break the dimensional barrier of private learning. With GEP, we achieve decent accuracy with reasonable computational cost and modest privacy guarantee for deep models. Especially, with privacy bound $\epsilon=8$, we achieve $74.9\%$ test accuracy on CIFAR10 and $95.1\%$ test accuracy on SVHN, significantly improving over existing results. \ No newline at end of file diff --git a/data/2021/iclr/Does enhanced shape bias improve neural network robustness to common corruptions? b/data/2021/iclr/Does enhanced shape bias improve neural network robustness to common corruptions? 
new file mode 100644 index 0000000000..d77bd88bfa --- /dev/null +++ b/data/2021/iclr/Does enhanced shape bias improve neural network robustness to common corruptions? @@ -0,0 +1 @@ +Convolutional neural networks (CNNs) learn to extract representations of complex features, such as object shapes and textures to solve image recognition tasks. Recent work indicates that CNNs trained on ImageNet are biased towards features that encode textures and that these alone are sufficient to generalize to unseen test data from the same distribution as the training data but often fail to generalize to out-of-distribution data. It has been shown that augmenting the training data with different image styles decreases this texture bias in favor of increased shape bias while at the same time improving robustness to common corruptions, such as noise and blur. Commonly, this is interpreted as shape bias increasing corruption robustness. However, this relationship is only hypothesized. We perform a systematic study of different ways of composing inputs based on natural images, explicit edge information, and stylization. While stylization is essential for achieving high corruption robustness, we do not find a clear correlation between shape bias and robustness. We conclude that the data augmentation caused by style-variation accounts for the improved corruption robustness and increased shape bias is only a byproduct. \ No newline at end of file diff --git a/data/2021/iclr/Domain Generalization with MixStyle b/data/2021/iclr/Domain Generalization with MixStyle new file mode 100644 index 0000000000..8a88a4625a --- /dev/null +++ b/data/2021/iclr/Domain Generalization with MixStyle @@ -0,0 +1 @@ +Though convolutional neural networks (CNNs) have demonstrated remarkable ability in learning discriminative features, they often generalize poorly to unseen domains. 
Domain generalization aims to address this problem by learning from a set of source domains a model that is generalizable to any unseen domain. In this paper, a novel approach is proposed based on probabilistically mixing instance-level feature statistics of training samples across source domains. Our method, termed MixStyle, is motivated by the observation that visual domain is closely related to image style (e.g., photo vs.~sketch images). Such style information is captured by the bottom layers of a CNN where our proposed style-mixing takes place. Mixing styles of training instances results in novel domains being synthesized implicitly, which increase the domain diversity of the source domains, and hence the generalizability of the trained model. MixStyle fits into mini-batch training perfectly and is extremely easy to implement. The effectiveness of MixStyle is demonstrated on a wide range of tasks including category classification, instance retrieval and reinforcement learning. \ No newline at end of file diff --git a/data/2021/iclr/Domain-Robust Visual Imitation Learning with Mutual Information Constraints b/data/2021/iclr/Domain-Robust Visual Imitation Learning with Mutual Information Constraints new file mode 100644 index 0000000000..bd1bd60fbf --- /dev/null +++ b/data/2021/iclr/Domain-Robust Visual Imitation Learning with Mutual Information Constraints @@ -0,0 +1 @@ +Human beings are able to understand objectives and learn by simply observing others perform a task. Imitation learning methods aim to replicate such capabilities, however, they generally depend on access to a full set of optimal states and actions taken with the agent's actuators and from the agent's point of view. In this paper, we introduce a new algorithm - called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL) - with the purpose of bypassing such constraints. 
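The style-mixing operation at the core of MixStyle, described above, can be sketched in one dimension. This toy version is illustrative only: the real method operates on per-channel mean/std of CNN feature maps for instances in a mini-batch, with the mixing weight drawn randomly during training.

```python
# Sketch: normalise one instance's features, then re-scale/shift them
# with statistics interpolated between that instance and another one.
import math

def stats(x, eps=1e-6):
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / len(x)
    return m, math.sqrt(v + eps)

def mixstyle(x, y, lam):
    """Content of x, restyled with a lam-blend of x's and y's statistics."""
    mx, sx = stats(x)
    my, sy = stats(y)
    m_mix = lam * mx + (1 - lam) * my
    s_mix = lam * sx + (1 - lam) * sy
    return [((xi - mx) / sx) * s_mix + m_mix for xi in x]

x = [0.0, 2.0]               # mean 1, std 1
y = [10.0, 14.0]             # mean 12, std 2
z = mixstyle(x, y, lam=0.5)  # mixed mean 6.5, mixed std 1.5
```

Because only the statistics change, the instance's content is preserved while its "style" is shifted toward the other instance, implicitly synthesizing a novel domain.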
Our algorithm enables autonomous agents to learn directly from high-dimensional observations of an expert performing a task, by making use of adversarial learning with a latent representation inside the discriminator network. This latent representation is regularized through mutual information constraints to incentivize learning only features that encode information about the completion levels of the task being demonstrated. This makes it possible to obtain a shared feature space in which to successfully perform imitation while disregarding the differences between the expert's and the agent's domains. Empirically, our algorithm is able to efficiently imitate in a diverse range of control problems including balancing, manipulation and locomotion tasks, while being robust to various domain differences in terms of both environment appearance and agent embodiment. \ No newline at end of file diff --git a/data/2021/iclr/DrNAS: Dirichlet Neural Architecture Search b/data/2021/iclr/DrNAS: Dirichlet Neural Architecture Search new file mode 100644 index 0000000000..97d458e55c --- /dev/null +++ b/data/2021/iclr/DrNAS: Dirichlet Neural Architecture Search @@ -0,0 +1 @@ +This paper proposes a novel differentiable architecture search method by formulating it as a distribution learning problem. We treat the continuously relaxed architecture mixing weights as random variables modeled by a Dirichlet distribution. With recently developed pathwise derivatives, the Dirichlet parameters can be easily optimized with a gradient-based optimizer in an end-to-end manner. This formulation improves the generalization ability and induces stochasticity that naturally encourages exploration in the search space. Furthermore, to alleviate the large memory consumption of differentiable NAS, we propose a simple yet effective progressive learning scheme that enables searching directly on large-scale tasks, eliminating the gap between the search and evaluation phases. 
Extensive experiments demonstrate the effectiveness of our method. Specifically, we obtain a test error of 2.46% for CIFAR-10, 23.7% for ImageNet under the mobile setting. On NAS-Bench-201, we also achieve state-of-the-art results on all three datasets and provide insights for the effective design of neural architecture search algorithms. \ No newline at end of file diff --git a/data/2021/iclr/Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration b/data/2021/iclr/Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration new file mode 100644 index 0000000000..4a2c76fb20 --- /dev/null +++ b/data/2021/iclr/Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration @@ -0,0 +1 @@ +We propose a novel information bottleneck (IB) method named Drop-Bottleneck, which discretely drops features that are irrelevant to the target variable. Drop-Bottleneck not only enjoys a simple and tractable compression objective but also additionally provides a deterministic compressed representation of the input variable, which is useful for inference tasks that require consistent representation. Moreover, it can jointly learn a feature extractor and select features considering each feature dimension's relevance to the target task, which is unattainable by most neural network-based IB methods. We propose an exploration method based on Drop-Bottleneck for reinforcement learning tasks. In a multitude of noisy and reward sparse maze navigation tasks in VizDoom (Kempka et al., 2016) and DMLab (Beattie et al., 2016), our exploration method achieves state-of-the-art performance. As a new IB framework, we demonstrate that Drop-Bottleneck outperforms Variational Information Bottleneck (VIB) (Alemi et al., 2017) in multiple aspects including adversarial robustness and dimensionality reduction. 
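The deterministic compressed representation that Drop-Bottleneck provides at inference time can be sketched as follows. The drop probabilities, threshold, and data here are hypothetical; in the actual method the per-dimension drop probabilities are learned jointly with the feature extractor.

```python
# Sketch: each feature dimension has a drop probability. Training uses a
# sampled binary mask; inference deterministically keeps only dimensions
# with a high keep-probability, giving a consistent representation.
import random

def train_mask(drop_probs, rng=random):
    """Stochastic training-time mask: 0 drops a dimension, 1 keeps it."""
    return [0.0 if rng.random() < p else 1.0 for p in drop_probs]

def deterministic_compress(features, drop_probs, threshold=0.5):
    """Keep dimension i iff its keep-probability 1 - p_i exceeds threshold."""
    return [f for f, p in zip(features, drop_probs) if 1.0 - p > threshold]

feats = [0.3, 1.2, -0.7, 0.9]
p_drop = [0.9, 0.1, 0.8, 0.2]   # dims 0 and 2 judged irrelevant
z = deterministic_compress(feats, p_drop)   # -> [1.2, 0.9]
```

Dropping whole dimensions (rather than adding noise, as in VIB) is what makes the inference-time representation both discrete and deterministic.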
\ No newline at end of file diff --git a/data/2021/iclr/Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling b/data/2021/iclr/Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling new file mode 100644 index 0000000000..28b1aed5b6 --- /dev/null +++ b/data/2021/iclr/Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling @@ -0,0 +1 @@ +Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation during the training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and a large-scale dataset MultiDomain. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both emission latency and recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency. 
\ No newline at end of file diff --git a/data/2021/iclr/DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation b/data/2021/iclr/DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation new file mode 100644 index 0000000000..9485ab27b9 --- /dev/null +++ b/data/2021/iclr/DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation @@ -0,0 +1 @@ +Recently, the DL compiler, together with Learning to Compile has proven to be a powerful technique for optimizing deep learning models. However, existing methods focus on accelerating the convergence speed of the individual tensor operator rather than the convergence speed of the entire model, which results in long optimization time to obtain a desired latency. In this paper, we present a new method called DynaTune, which provides significantly faster convergence speed to optimize a DNN model. In particular, we consider a Multi-Armed Bandit (MAB) model for the tensor program optimization problem. We use UCB to handle the decision-making of time-slot-based optimization, and we devise a Bayesian belief model that allows predicting the potential performance gain of each operator with uncertainty quantification, which guides the optimization process. We evaluate and compare DynaTune with the state-of-the-art DL compiler. The experiment results show that DynaTune is 1.2–2.4 times faster to achieve the same optimization quality for a range of models across different hardware architectures. \ No newline at end of file diff --git a/data/2021/iclr/Dynamic Tensor Rematerialization b/data/2021/iclr/Dynamic Tensor Rematerialization new file mode 100644 index 0000000000..bfa6feeb55 --- /dev/null +++ b/data/2021/iclr/Dynamic Tensor Rematerialization @@ -0,0 +1 @@ +Checkpointing enables training larger models by freeing intermediate activations and recomputing them on demand. 
Previous checkpointing techniques are difficult to generalize to dynamic models because they statically plan recomputations offline. We present Dynamic Tensor Rematerialization (DTR), a greedy online algorithm for heuristically checkpointing arbitrary models. DTR is extensible and general: it is parameterized by an eviction policy and only collects lightweight metadata on tensors and operators. Though DTR has no advance knowledge of the model or training task, we prove it can train an $N$-layer feedforward network on an $\Omega(\sqrt{N})$ memory budget with only $\mathcal{O}(N)$ tensor operations. Moreover, we identify a general eviction heuristic and show how it allows DTR to automatically provide favorable checkpointing performance across a variety of models and memory budgets. \ No newline at end of file diff --git a/data/2021/iclr/EEC: Learning to Encode and Regenerate Images for Continual Learning b/data/2021/iclr/EEC: Learning to Encode and Regenerate Images for Continual Learning new file mode 100644 index 0000000000..3e35970436 --- /dev/null +++ b/data/2021/iclr/EEC: Learning to Encode and Regenerate Images for Continual Learning @@ -0,0 +1 @@ +The two main impediments to continual learning are catastrophic forgetting and memory limitations on the storage of data. To cope with these challenges, we propose a novel, cognitively-inspired approach which trains autoencoders with Neural Style Transfer to encode and store images. During training on a new task, reconstructed images from encoded episodes are replayed in order to avoid catastrophic forgetting. The loss function for the reconstructed images is weighted to reduce its effect during classifier training to cope with image degradation. When the system runs out of memory the encoded episodes are converted into centroids and covariance matrices, which are used to generate pseudo-images during classifier training, keeping classifier performance stable while using less memory. 
Our approach increases classification accuracy by 13-17% over state-of-the-art methods on benchmark datasets, while requiring 78% less storage space. \ No newline at end of file diff --git a/data/2021/iclr/Early Stopping in Deep Networks: Double Descent and How to Eliminate it b/data/2021/iclr/Early Stopping in Deep Networks: Double Descent and How to Eliminate it new file mode 100644 index 0000000000..e206e17363 --- /dev/null +++ b/data/2021/iclr/Early Stopping in Deep Networks: Double Descent and How to Eliminate it @@ -0,0 +1 @@ +Over-parameterized models, such as large deep networks, often exhibit a double descent phenomenon, where, as a function of model size, the error first decreases, then increases, and then decreases again. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because training epochs control the model complexity. In this paper, we show that such epoch-wise double descent arises for a different reason: it is caused by a superposition of two or more bias-variance tradeoffs that arise because different parts of the network are learned at different epochs, and eliminating this by proper scaling of stepsizes can significantly improve the early stopping performance. We show this analytically for i) linear regression, where differently scaled features give rise to a superposition of bias-variance tradeoffs, and for ii) a two-layer neural network, where the first and second layer each govern a bias-variance tradeoff. Inspired by this theory, we study two standard convolutional networks empirically and show that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance significantly. 
\ No newline at end of file diff --git a/data/2021/iclr/Economic Hyperparameter Optimization with Blended Search Strategy b/data/2021/iclr/Economic Hyperparameter Optimization with Blended Search Strategy new file mode 100644 index 0000000000..eb6d78e95e --- /dev/null +++ b/data/2021/iclr/Economic Hyperparameter Optimization with Blended Search Strategy @@ -0,0 +1 @@ +This article presents a new approach to modeling and optimizing individual decision-making strategies in multi-agent socio-economic systems (MSES). This approach is based on the synthesis of agent-based modeling methods, machine learning and genetic optimization algorithms. A procedure for the synthesis and training of artificial neural networks (ANNs) that simulate the functionality of MSES and provide an approximation of the values of its objective characteristics has been developed. The feature of the two-step procedure is the combined use of particle swarm optimization methods (to determine the optimal values of hyperparameters) and the Adam machine learning algorithm (to compute weight coefficients of the ANN). The use of such ANN-based surrogate models in parallel multi-agent real-coded genetic algorithms (MA-RCGA) makes it possible to raise substantially the time-efficiency of the evolutionary search for optimal solutions. We have conducted numerical experiments that confirm a significant improvement in the performance of MA-RCGA, which periodically uses the ANN-based surrogate-model to approximate the values of the objective and fitness functions. A software framework has been designed that consists of the original (reference) agent-based model of trade interactions, the ANN-based surrogate model and the MA-RCGA genetic algorithm. At the same time, the software libraries FLAME GPU, OpenNN (Open Neural Networks Library), etc., agent-based modeling and machine learning methods are used. The system we developed can be used by responsible managers. 
\ No newline at end of file diff --git a/data/2021/iclr/Effective Abstract Reasoning with Dual-Contrast Network b/data/2021/iclr/Effective Abstract Reasoning with Dual-Contrast Network new file mode 100644 index 0000000000..4146e43061 --- /dev/null +++ b/data/2021/iclr/Effective Abstract Reasoning with Dual-Contrast Network @@ -0,0 +1 @@ +As a step towards improving the abstract reasoning capability of machines, we aim to solve Raven's Progressive Matrices (RPM) with neural networks, since solving RPM puzzles is highly correlated with human intelligence. Unlike previous methods that use auxiliary annotations or assume hidden rules to produce appropriate feature representation, we only use the ground truth answer of each question for model learning, aiming for an intelligent agent to have a strong learning capability with a small amount of supervision. Based on the RPM problem formulation, the correct answer filled into the missing entry of the third row/column has to best satisfy the same rules shared between the first two rows/columns. Thus we design a simple yet effective Dual-Contrast Network (DCNet) to exploit the inherent structure of RPM puzzles. Specifically, a rule contrast module is designed to compare the latent rules between the filled row/column and the first two rows/columns; a choice contrast module is designed to increase the relative differences between candidate choices. Experimental results on the RAVEN and PGM datasets show that DCNet outperforms the state-of-the-art methods by a large margin of 5.77%. Further experiments on few training samples and model generalization also show the effectiveness of DCNet. Code is available at https://github.com/visiontao/dcnet. 
\ No newline at end of file diff --git a/data/2021/iclr/Effective Distributed Learning with Random Features: Improved Bounds and Algorithms b/data/2021/iclr/Effective Distributed Learning with Random Features: Improved Bounds and Algorithms new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Effective and Efficient Vote Attack on Capsule Networks b/data/2021/iclr/Effective and Efficient Vote Attack on Capsule Networks new file mode 100644 index 0000000000..968d549025 --- /dev/null +++ b/data/2021/iclr/Effective and Efficient Vote Attack on Capsule Networks @@ -0,0 +1 @@ +Standard Convolutional Neural Networks (CNNs) can be easily fooled by images with small quasi-imperceptible artificial perturbations. As alternatives to CNNs, the recently proposed Capsule Networks (CapsNets) are shown to be more robust to white-box attacks than CNNs under popular attack protocols. Besides, the class-conditional reconstruction part of CapsNets is also used to detect adversarial examples. In this work, we investigate the adversarial robustness of CapsNets, especially how the inner workings of CapsNets change when the output capsules are attacked. The first observation is that adversarial examples mislead CapsNets by manipulating the votes from primary capsules. Another observation is the high computational cost when we directly apply multi-step attack methods designed for CNNs to CapsNets, due to the computationally expensive routing mechanism. Motivated by these two observations, we propose a novel vote attack where we attack the votes of CapsNets directly. Our vote attack is not only effective but also efficient by circumventing the routing process. Furthermore, we integrate our vote attack into the detection-aware attack paradigm, which can successfully bypass the class-conditional reconstruction-based detection method. Extensive experiments demonstrate the superior attack performance of our vote attack on CapsNets.
\ No newline at end of file diff --git a/data/2021/iclr/Efficient Certified Defenses Against Patch Attacks on Image Classifiers b/data/2021/iclr/Efficient Certified Defenses Against Patch Attacks on Image Classifiers new file mode 100644 index 0000000000..3a6a823d48 --- /dev/null +++ b/data/2021/iclr/Efficient Certified Defenses Against Patch Attacks on Image Classifiers @@ -0,0 +1 @@ +Adversarial patches pose a realistic threat model for physical world attacks on autonomous systems via their perception component. Autonomous systems in safety-critical domains such as automated driving should thus contain a fail-safe fallback component that combines certifiable robustness against patches with efficient inference while maintaining high performance on clean inputs. We propose BagCert, a novel combination of model architecture and certification procedure that allows efficient certification. We derive a loss that enables end-to-end optimization of certified robustness against patches of different sizes and locations. On CIFAR10, BagCert certifies 10,000 examples in 43 seconds on a single GPU and obtains 86% clean and 60% certified accuracy against 5x5 patches. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Conformal Prediction via Cascaded Inference with Expanded Admission b/data/2021/iclr/Efficient Conformal Prediction via Cascaded Inference with Expanded Admission new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Efficient Continual Learning with Modular Networks and Task-Driven Priors b/data/2021/iclr/Efficient Continual Learning with Modular Networks and Task-Driven Priors new file mode 100644 index 0000000000..6c3ad3e213 --- /dev/null +++ b/data/2021/iclr/Efficient Continual Learning with Modular Networks and Task-Driven Priors @@ -0,0 +1 @@ +Existing literature in Continual Learning (CL) has focused on overcoming catastrophic forgetting, the inability of the learner to recall how to perform tasks observed in the past.
There are however other desirable properties of a CL system, such as the ability to transfer knowledge from previous tasks and to scale memory and compute sub-linearly with the number of tasks. Since most current benchmarks focus only on forgetting using short streams of tasks, we first propose a new suite of benchmarks to probe CL algorithms across these new axes. Finally, we introduce a new modular architecture, whose modules represent atomic skills that can be composed to perform a certain task. Learning a task reduces to figuring out which past modules to re-use, and which new modules to instantiate to solve the current task. Our learning algorithm leverages a task-driven prior over the exponential search space of all possible ways to combine modules, enabling efficient learning on long streams of tasks. Our experiments show that this modular architecture and learning algorithm perform competitively on widely used CL benchmarks while yielding superior performance on the more challenging benchmarks we introduce in this work. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Empowerment Estimation for Unsupervised Stabilization b/data/2021/iclr/Efficient Empowerment Estimation for Unsupervised Stabilization new file mode 100644 index 0000000000..55d99139e8 --- /dev/null +++ b/data/2021/iclr/Efficient Empowerment Estimation for Unsupervised Stabilization @@ -0,0 +1 @@ +Intrinsically motivated artificial agents learn advantageous behavior without externally-provided rewards. Previously, it was shown that maximizing mutual information between agent actuators and future states, known as the empowerment principle, enables unsupervised stabilization of dynamical systems at upright positions, which is a prototypical intrinsically motivated behavior for upright standing and walking. This follows from the coincidence between the objective of stabilization and the objective of empowerment. 
Unfortunately, sample-based estimation of this kind of mutual information is challenging. Recently, various variational lower bounds (VLBs) on empowerment have been proposed as solutions; however, they are often biased, unstable in training, and have high sample complexity. In this work, we propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel, which allows us to efficiently calculate an unbiased estimator of empowerment by convex optimization. We demonstrate our solution for sample-based unsupervised stabilization on different dynamical control systems and show the advantages of our method by comparing it to the existing VLB approaches. Specifically, we show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images. Consequently, our method opens a path to wider and easier adoption of empowerment for various applications. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Generalized Spherical CNNs b/data/2021/iclr/Efficient Generalized Spherical CNNs new file mode 100644 index 0000000000..67aeba3a32 --- /dev/null +++ b/data/2021/iclr/Efficient Generalized Spherical CNNs @@ -0,0 +1 @@ +Many problems across computer vision and the natural sciences require the analysis of spherical data, for which representations may be learned efficiently by encoding equivariance to rotational symmetries. We present a generalized spherical CNN framework that encompasses various existing approaches and allows them to be leveraged alongside each other. The only existing non-linear spherical CNN layer that is strictly equivariant has complexity $\mathcal{O}(C^2L^5)$, where $C$ is a measure of representational capacity and $L$ the spherical harmonic bandlimit. Such a high computational cost often prohibits the use of strictly equivariant spherical CNNs.
We develop two new strictly equivariant layers with reduced complexity $\mathcal{O}(CL^4)$ and $\mathcal{O}(CL^3 \log L)$, making larger, more expressive models computationally feasible. Moreover, we adopt efficient sampling theory to achieve further computational savings. We show that these developments allow the construction of more expressive hybrid models that achieve state-of-the-art accuracy and parameter efficiency on spherical benchmark problems. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Inference of Flexible Interaction in Spiking-neuron Networks b/data/2021/iclr/Efficient Inference of Flexible Interaction in Spiking-neuron Networks new file mode 100644 index 0000000000..0e717bfce9 --- /dev/null +++ b/data/2021/iclr/Efficient Inference of Flexible Interaction in Spiking-neuron Networks @@ -0,0 +1 @@ +Hawkes process provides an effective statistical framework for analyzing the time-dependent interaction of neuronal spiking activities. Although utilized in many real applications, the classic Hawkes process is incapable of modelling inhibitory interactions among neurons. Instead, the nonlinear Hawkes process allows for a more flexible influence pattern with excitatory or inhibitory interactions. In this paper, three sets of auxiliary latent variables (Polya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented to make functional connection weights in a Gaussian form, which allows for a simple iterative algorithm with analytical updates. As a result, an efficient expectation-maximization (EM) algorithm is derived to obtain the maximum a posteriori (MAP) estimate. We demonstrate the accuracy and efficiency performance of our algorithm on synthetic and real data. For real neural recordings, we show our algorithm can estimate the temporal dynamics of interaction and reveal the interpretable functional connectivity underlying neural spike trains. 
\ No newline at end of file diff --git a/data/2021/iclr/Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL b/data/2021/iclr/Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL new file mode 100644 index 0000000000..99f96c9624 --- /dev/null +++ b/data/2021/iclr/Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL @@ -0,0 +1 @@ +Reinforcement learning (RL) in episodic, factored Markov decision processes (FMDPs) is studied. We propose an algorithm called FMDP-BF, which leverages the factorization structure of FMDP. The regret of FMDP-BF is shown to be exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and improves on the best previous result for FMDPs~\citep{osband2014near} by a factor of $\sqrt{H|\mathcal{S}_i|}$, where $|\mathcal{S}_i|$ is the cardinality of the factored state subspace and $H$ is the planning horizon. To show the optimality of our bounds, we also provide a lower bound for FMDP, which indicates that our algorithm is near-optimal w.r.t. timestep $T$, horizon $H$ and factored state-action subspace cardinality. Finally, as an application, we study a new formulation of constrained RL, known as RL with knapsack constraints (RLwK), and provide the first sample-efficient algorithm based on FMDP-BF. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation b/data/2021/iclr/Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation new file mode 100644 index 0000000000..aefb8f804f --- /dev/null +++ b/data/2021/iclr/Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation @@ -0,0 +1 @@ +Many real-world applications such as robotics provide hard constraints on power and compute that limit the viable model complexity of Reinforcement Learning (RL) agents.
Similarly, in many distributed RL settings, acting is done on un-accelerated hardware such as CPUs, which likewise restricts model size to prevent intractable experiment run times. These "actor-latency" constrained settings present a major obstruction to the scaling up of model complexity that has recently been extremely successful in supervised learning. To be able to utilize large model capacity while still operating within the limits imposed by the system during acting, we develop an "Actor-Learner Distillation" (ALD) procedure that leverages a continual form of distillation that transfers learning progress from a large capacity learner model to a small capacity actor model. As a case study, we develop this procedure in the context of partially-observable environments, where transformer models have had large improvements over LSTMs recently, at the cost of significantly higher computational complexity. With transformer models as the learner and LSTMs as the actor, we demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model while maintaining the fast inference and reduced total training time of the LSTM actor model. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Wasserstein Natural Gradients for Reinforcement Learning b/data/2021/iclr/Efficient Wasserstein Natural Gradients for Reinforcement Learning new file mode 100644 index 0000000000..b014c80db4 --- /dev/null +++ b/data/2021/iclr/Efficient Wasserstein Natural Gradients for Reinforcement Learning @@ -0,0 +1 @@ +A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient Wasserstein natural gradient (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization.
This method follows the recent theme in RL of including a divergence penalty in the objective to establish a trust region. Experiments on challenging tasks demonstrate improvements in both computational cost and performance over advanced baselines. \ No newline at end of file diff --git a/data/2021/iclr/EigenGame: PCA as a Nash Equilibrium b/data/2021/iclr/EigenGame: PCA as a Nash Equilibrium new file mode 100644 index 0000000000..f0e89b0223 --- /dev/null +++ b/data/2021/iclr/EigenGame: PCA as a Nash Equilibrium @@ -0,0 +1 @@ +We present a novel view on principal component analysis (PCA) as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient based updates. The resulting algorithm which combines elements from Oja's rule with a generalized Gram-Schmidt orthogonalization is naturally decentralized and hence parallelizable through message passing. We demonstrate the scalability of the algorithm with experiments on large image datasets and neural network activations. We discuss how this new view of PCA as a differentiable game can lead to further algorithmic developments and insights. \ No newline at end of file diff --git a/data/2021/iclr/Emergent Road Rules In Multi-Agent Driving Environments b/data/2021/iclr/Emergent Road Rules In Multi-Agent Driving Environments new file mode 100644 index 0000000000..642e16c637 --- /dev/null +++ b/data/2021/iclr/Emergent Road Rules In Multi-Agent Driving Environments @@ -0,0 +1 @@ +For autonomous vehicles to safely share the road with human drivers, autonomous vehicles must abide by specific "road rules" that human drivers have agreed to follow. 
"Road rules" include rules that drivers are required to follow by law -- such as the requirement that vehicles stop at red lights -- as well as more subtle social rules -- such as the implicit designation of fast lanes on the highway. In this paper, we provide empirical evidence that suggests that -- instead of hard-coding road rules into self-driving algorithms -- a scalable alternative may be to design multi-agent environments in which road rules emerge as optimal solutions to the problem of maximizing traffic flow. We analyze what ingredients in driving environments cause the emergence of these road rules and find that two crucial factors are noisy perception and agents' spatial density. We provide qualitative and quantitative evidence of the emergence of seven social driving behaviors, ranging from obeying traffic signals to following lanes, all of which emerge from training agents to drive quickly to destinations without colliding. Our results add empirical support for the social road rules that countries worldwide have agreed on for safe, efficient driving. \ No newline at end of file diff --git a/data/2021/iclr/Emergent Symbols through Binding in External Memory b/data/2021/iclr/Emergent Symbols through Binding in External Memory new file mode 100644 index 0000000000..7638d764e2 --- /dev/null +++ b/data/2021/iclr/Emergent Symbols through Binding in External Memory @@ -0,0 +1 @@ +A key aspect of human intelligence is the ability to infer abstract rules directly from high-dimensional sensory data, and to do so given only a limited amount of training experience. Deep neural network algorithms have proven to be a powerful tool for learning directly from high-dimensional data, but currently lack this capacity for data-efficient induction of abstract rules, leading some to argue that symbol-processing mechanisms will be necessary to account for this capacity. 
In this work, we take a step toward bridging this gap by introducing the Emergent Symbol Binding Network (ESBN), a recurrent network augmented with an external memory that enables a form of variable-binding and indirection. This binding mechanism allows symbol-like representations to emerge through the learning process without the need to explicitly incorporate symbol-processing machinery, enabling the ESBN to learn rules in a manner that is abstracted away from the particular entities to which those rules apply. Across a series of tasks, we show that this architecture displays nearly perfect generalization of learned rules to novel entities given only a limited number of training examples, and outperforms a number of other competitive neural network architectures. \ No newline at end of file diff --git a/data/2021/iclr/Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition b/data/2021/iclr/Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition new file mode 100644 index 0000000000..f12d336156 --- /dev/null +++ b/data/2021/iclr/Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition @@ -0,0 +1 @@ +In many scenarios, named entity recognition (NER) models severely suffer from the unlabeled entity problem, where the entities of a sentence may not be fully annotated. Through empirical studies performed on synthetic datasets, we find two causes of the performance degradation. One is the reduction of annotated entities and the other is treating unlabeled entities as negative instances. The first cause has less impact than the second one and can be mitigated by adopting pretrained language models. The second cause seriously misguides a model in training and greatly affects its performance. Based on the above observations, we propose a general approach that is capable of eliminating the misguidance brought by unlabeled entities.
The core idea is using negative sampling to keep the probability of training with unlabeled entities at a very low level. Experiments on synthetic datasets and real-world datasets show that our model is robust to the unlabeled entity problem and surpasses prior baselines. On well-annotated datasets, our model is competitive with state-of-the-art methods. \ No newline at end of file diff --git a/data/2021/iclr/Empirical or Invariant Risk Minimization? A Sample Complexity Perspective b/data/2021/iclr/Empirical or Invariant Risk Minimization? A Sample Complexity Perspective new file mode 100644 index 0000000000..96ee975bb0 --- /dev/null +++ b/data/2021/iclr/Empirical or Invariant Risk Minimization? A Sample Complexity Perspective @@ -0,0 +1 @@ +Recently, invariant risk minimization (IRM) was proposed as a promising solution to address out-of-distribution (OOD) generalization. However, it is unclear when IRM should be preferred over the widely-employed empirical risk minimization (ERM) framework. In this work, we analyze both these frameworks from the perspective of sample complexity, thus taking a firm step towards answering this important question. We find that depending on the type of data generation mechanism, the two approaches might have very different finite sample and asymptotic behavior. For example, in the covariate shift setting we see that the two approaches not only arrive at the same asymptotic solution, but also have similar finite sample behavior with no clear winner. For other distribution shifts such as those involving confounders or anti-causal variables, however, the two approaches arrive at different asymptotic solutions where IRM is guaranteed to be close to the desired OOD solutions in the finite sample regime, while ERM is biased even asymptotically.
We further investigate how different factors -- the number of environments, complexity of the model, and IRM penalty weight -- impact the sample complexity of IRM in relation to its distance from the OOD solutions. \ No newline at end of file diff --git a/data/2021/iclr/End-to-End Egospheric Spatial Memory b/data/2021/iclr/End-to-End Egospheric Spatial Memory new file mode 100644 index 0000000000..8edfa4bd38 --- /dev/null +++ b/data/2021/iclr/End-to-End Egospheric Spatial Memory @@ -0,0 +1 @@ +Spatial memory, or the ability to remember and recall specific locations and objects, is central to autonomous agents' ability to carry out tasks in real environments. However, most existing artificial memory modules are not very adept at storing spatial information. We propose a parameter-free module, Egospheric Spatial Memory (ESM), which encodes the memory in an ego-sphere around the agent, enabling expressive 3D representations. ESM can be trained end-to-end via either imitation or reinforcement learning, and improves both training efficiency and final performance against other memory baselines on both drone and manipulator visuomotor control tasks. The explicit egocentric geometry also enables us to seamlessly combine the learned controller with other non-learned modalities, such as local obstacle avoidance. We further show applications to semantic segmentation on the ScanNet dataset, where ESM naturally combines image-level and map-level inference modalities. Through our broad set of experiments, we show that ESM provides a general computation graph for embodied spatial reasoning, and the module forms a bridge between real-time mapping systems and differentiable memory architectures. Implementation at: https://github.com/ivy-dl/memory.
\ No newline at end of file diff --git a/data/2021/iclr/End-to-end Adversarial Text-to-Speech b/data/2021/iclr/End-to-end Adversarial Text-to-Speech new file mode 100644 index 0000000000..6741cc1187 --- /dev/null +++ b/data/2021/iclr/End-to-end Adversarial Text-to-Speech @@ -0,0 +1 @@ +Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5 point scale, which is comparable to the state-of-the-art models relying on multi-stage training and additional supervision. \ No newline at end of file diff --git a/data/2021/iclr/Enforcing robust control guarantees within neural network policies b/data/2021/iclr/Enforcing robust control guarantees within neural network policies new file mode 100644 index 0000000000..84a09e24d3 --- /dev/null +++ b/data/2021/iclr/Enforcing robust control guarantees within neural network policies @@ -0,0 +1 @@ +When designing controllers for safety-critical systems, practitioners often face a challenging tradeoff between robustness and performance. 
While robust control methods provide rigorous guarantees on system stability under certain worst-case disturbances, they often result in simple controllers that perform poorly in the average (non-worst) case. In contrast, nonlinear control methods trained using deep learning have achieved state-of-the-art performance on many control tasks, but often lack robustness guarantees. We propose a technique that combines the strengths of these two approaches: a generic nonlinear control policy class, parameterized by neural networks, that nonetheless enforces the same provable robustness criteria as robust control. Specifically, we show that by integrating custom convex-optimization-based projection layers into a nonlinear policy, we can construct a provably robust neural network policy class that outperforms robust control methods in the average (non-adversarial) setting. We demonstrate the power of this approach on several domains, improving in performance over existing robust control methods and in stability over (non-robust) RL methods. \ No newline at end of file diff --git a/data/2021/iclr/Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation b/data/2021/iclr/Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation new file mode 100644 index 0000000000..8a35b213ca --- /dev/null +++ b/data/2021/iclr/Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation @@ -0,0 +1 @@ +Controllable semantic image editing enables a user to change entire image attributes with few clicks, e.g., gradually making a summer scene look like it was taken in winter. Classic approaches for this task use a Generative Adversarial Net (GAN) to learn a latent space and suitable latent-space transformations. However, current approaches often suffer from attribute edits that are entangled, global image identity changes, and diminished photo-realism. 
To address these concerns, we learn multiple attribute transformations simultaneously, integrate attribute regression into the training of the transformation functions, and apply a content loss and an adversarial loss that encourage the maintenance of image identity and photo-realism. We propose quantitative evaluation strategies for measuring controllable editing performance, unlike prior work which primarily focuses on qualitative evaluation. Our model permits better control for both single- and multiple-attribute editing, while also preserving image identity and realism during transformation. We provide empirical results for both real and synthetic images, highlighting that our model achieves state-of-the-art performance for targeted image manipulation. \ No newline at end of file diff --git a/data/2021/iclr/Entropic gradient descent algorithms and wide flat minima b/data/2021/iclr/Entropic gradient descent algorithms and wide flat minima new file mode 100644 index 0000000000..5bb25bf6e5 --- /dev/null +++ b/data/2021/iclr/Entropic gradient descent algorithms and wide flat minima @@ -0,0 +1 @@ +The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests they possess better generalization capabilities with respect to sharp ones. In this work we first discuss the relationship between alternative measures of flatness: the local entropy, which is useful for analysis and algorithm development, and the local energy, which is easier to compute and was shown empirically in extensive tests on state-of-the-art networks to be the best predictor of generalization capabilities. We show semi-analytically in simple controlled scenarios that these two measures correlate strongly with each other and with generalization. Then, we extend the analysis to the deep learning scenario by extensive numerical validations.
We study two algorithms, entropy-stochastic gradient descent and replicated-stochastic gradient descent, that explicitly include the local entropy in the optimization objective. We devise a training schedule by which we consistently find flatter minima (using both flatness measures), and improve the generalization error for common architectures (e.g. ResNet, EfficientNet). \ No newline at end of file diff --git a/data/2021/iclr/Estimating Lipschitz constants of monotone deep equilibrium models b/data/2021/iclr/Estimating Lipschitz constants of monotone deep equilibrium models new file mode 100644 index 0000000000..595d9f7557 --- /dev/null +++ b/data/2021/iclr/Estimating Lipschitz constants of monotone deep equilibrium models @@ -0,0 +1 @@ +Several methods have been proposed in recent years to provide bounds on the Lipschitz constants of deep networks, which can be used to provide robustness guarantees, generalization bounds, and characterize the smoothness of decision boundaries. However, existing bounds get substantially weaker with increasing depth of the network, which makes it unclear how to apply such bounds to recently proposed models such as the deep equilibrium (DEQ) model, which can be viewed as representing an infinitely-deep network. In this paper, we show that monotone DEQs, a recently-proposed subclass of DEQs, have Lipschitz constants that can be bounded as a simple function of the strong monotonicity parameter of the network. We derive simple-yet-tight bounds on both the input-output mapping and the weight-output mapping defined by these networks, and demonstrate that they are small relative to those for comparable standard DNNs. We show that one can use these bounds to design monotone DEQ models, even with e.g. multi-scale convolutional structure, that still have constraints on the Lipschitz constant. 
We also highlight how to use these bounds to develop PAC-Bayes generalization bounds that do not depend on the depth of the network, and which avoid the exponential depth-dependence of comparable DNN bounds. \ No newline at end of file diff --git a/data/2021/iclr/Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors b/data/2021/iclr/Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors new file mode 100644 index 0000000000..c1e20089db --- /dev/null +++ b/data/2021/iclr/Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors @@ -0,0 +1 @@ +Predictive uncertainty estimation is an essential next step for the reliable deployment of deep object detectors in safety-critical tasks. In this work, we focus on estimating predictive distributions for bounding box regression output with variance networks. We show that in the context of object detection, training variance networks with negative log likelihood (NLL) can lead to high entropy predictive distributions regardless of the correctness of the output mean. We propose to use the energy score as a non-local proper scoring rule and find that when used for training, the energy score leads to better calibrated and lower entropy predictive distributions than NLL. We also address the widespread use of non-proper scoring metrics for evaluating predictive distributions from deep object detectors by proposing an alternate evaluation approach founded on proper scoring rules. Using the proposed evaluation tools, we show that although variance networks can be used to produce high quality predictive distributions, ad-hoc approaches used by seminal object detectors for choosing regression targets during training do not provide wide enough data support for reliable variance learning.
We hope that our work helps shift evaluation in probabilistic object detection to better align with predictive uncertainty evaluation in other machine learning domains. Code for all models, evaluation, and datasets is available at: https://github.com/asharakeh/probdet.git. \ No newline at end of file diff --git a/data/2021/iclr/Estimating informativeness of samples with Smooth Unique Information b/data/2021/iclr/Estimating informativeness of samples with Smooth Unique Information new file mode 100644 index 0000000000..5602cb385b --- /dev/null +++ b/data/2021/iclr/Estimating informativeness of samples with Smooth Unique Information @@ -0,0 +1 @@ +We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights. Though related, we show that these quantities have a qualitatively different behavior. We give efficient approximations of these quantities using a linearized network and demonstrate empirically that the approximation is accurate for real-world architectures, such as pre-trained ResNets. We apply these measures to several problems, such as dataset summarization, analysis of under-sampled classes, comparison of informativeness of different data sources, and detection of adversarial and corrupted examples. Our work generalizes existing frameworks but enjoys better computational properties for heavily over-parametrized models, which makes it possible to apply it to real-world networks. 
\ No newline at end of file diff --git a/data/2021/iclr/Evaluating the Disentanglement of Deep Generative Models through Manifold Topology b/data/2021/iclr/Evaluating the Disentanglement of Deep Generative Models through Manifold Topology new file mode 100644 index 0000000000..da2f3889f4 --- /dev/null +++ b/data/2021/iclr/Evaluating the Disentanglement of Deep Generative Models through Manifold Topology @@ -0,0 +1 @@ +Learning disentangled representations is regarded as a fundamental task for improving the generalization, robustness, and interpretability of generative models. However, measuring disentanglement has been challenging and inconsistent, often dependent on an ad-hoc external model or specific to a certain dataset. To address this, we present a method for quantifying disentanglement that only uses the generative model, by measuring the topological similarity of conditional submanifolds in the learned representation. This method showcases both unsupervised and supervised variants. To illustrate the effectiveness and applicability of our method, we empirically evaluate several state-of-the-art models across multiple datasets. We find that our method ranks models similarly to existing methods. \ No newline at end of file diff --git a/data/2021/iclr/Evaluation of Neural Architectures trained with square Loss vs Cross-Entropy in Classification Tasks b/data/2021/iclr/Evaluation of Neural Architectures trained with square Loss vs Cross-Entropy in Classification Tasks new file mode 100644 index 0000000000..226873180f --- /dev/null +++ b/data/2021/iclr/Evaluation of Neural Architectures trained with square Loss vs Cross-Entropy in Classification Tasks @@ -0,0 +1,2 @@ +Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. 
We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. +We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. Furthermore, training with square loss appears to be less sensitive to the randomness in initialization. We posit that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy. \ No newline at end of file diff --git a/data/2021/iclr/Evaluation of Similarity-based Explanations b/data/2021/iclr/Evaluation of Similarity-based Explanations new file mode 100644 index 0000000000..092d918be9 --- /dev/null +++ b/data/2021/iclr/Evaluation of Similarity-based Explanations @@ -0,0 +1 @@ +Explaining the predictions made by complex machine learning models helps users to understand and accept the predicted outputs with confidence. One promising way is to use similarity-based explanation that provides similar instances as evidence to support model predictions. Several relevance metrics are used for this purpose. In this study, we investigated relevance metrics that can provide reasonable explanations to users. Specifically, we adopted three tests to evaluate whether the relevance metrics satisfy the minimal requirements for similarity-based explanation. 
Our experiments revealed that the cosine similarity of the gradients of the loss performs best, which would be a recommended choice in practice. In addition, we showed that some metrics perform poorly in our tests and analyzed the reasons for their failure. We expect our insights to help practitioners in selecting appropriate relevance metrics and also aid further research on designing better relevance metrics for explanations. \ No newline at end of file diff --git a/data/2021/iclr/Evaluations and Methods for Explanation through Robustness Analysis b/data/2021/iclr/Evaluations and Methods for Explanation through Robustness Analysis new file mode 100644 index 0000000000..10beeea41c --- /dev/null +++ b/data/2021/iclr/Evaluations and Methods for Explanation through Robustness Analysis @@ -0,0 +1 @@ +Among multiple ways of interpreting a machine learning model, measuring the importance of a set of features tied to a prediction is probably one of the most intuitive ways to explain a model. In this paper, we establish the link between a set of features and a prediction with a new evaluation criterion, robustness analysis, which measures the minimum distortion distance of adversarial perturbation. By measuring the tolerance level for an adversarial attack, we can extract a set of features that provides the most robust support for a prediction, and also can extract a set of features that contrasts the current prediction to a target class by setting a targeted adversarial attack. By applying this methodology to various prediction tasks across multiple domains, we observe that the derived explanations indeed capture the significant feature set qualitatively and quantitatively.
\ No newline at end of file diff --git a/data/2021/iclr/Evolving Reinforcement Learning Algorithms b/data/2021/iclr/Evolving Reinforcement Learning Algorithms new file mode 100644 index 0000000000..bb4153c983 --- /dev/null +++ b/data/2021/iclr/Evolving Reinforcement Learning Algorithms @@ -0,0 +1 @@ +We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games. The analysis of the learned algorithm behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods. \ No newline at end of file diff --git a/data/2021/iclr/Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization b/data/2021/iclr/Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Explainable Deep One-Class Classification b/data/2021/iclr/Explainable Deep One-Class Classification new file mode 100644 index 0000000000..37ec205b4c --- /dev/null +++ b/data/2021/iclr/Explainable Deep One-Class Classification @@ -0,0 +1 @@ +Deep one-class classification variants for anomaly detection learn a mapping that concentrates nominal samples in feature space causing anomalies to be mapped away. 
Because this transformation is highly non-linear, finding interpretations poses a significant challenge. In this paper we present an explainable deep one-class classification method, Fully Convolutional Data Description (FCDD), where the mapped samples are themselves also an explanation heatmap. FCDD yields competitive detection performance and provides reasonable explanations on common anomaly detection benchmarks with CIFAR-10 and ImageNet. On MVTec-AD, a recent manufacturing dataset offering ground-truth anomaly maps, FCDD meets the state of the art in an unsupervised setting, and outperforms its competitors in a semi-supervised setting. Finally, using FCDD's explanations we demonstrate the vulnerability of deep one-class classification models to spurious image features such as image watermarks. \ No newline at end of file diff --git a/data/2021/iclr/Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs b/data/2021/iclr/Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs new file mode 100644 index 0000000000..866acf4f54 --- /dev/null +++ b/data/2021/iclr/Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs @@ -0,0 +1 @@ +Modeling time-evolving knowledge graphs (KGs) has recently gained increasing interest. Here, graph representation learning has become the dominant paradigm for link prediction on temporal KGs. However, the embedding-based approaches largely operate in a black-box fashion, lacking the ability to interpret their predictions. This paper provides a link forecasting framework that reasons over query-relevant subgraphs of temporal KGs and jointly models the structural dependencies and the temporal dynamics. In particular, we propose a temporal relational attention mechanism and a novel reverse representation update scheme to guide the extraction of an enclosing subgraph around the query.
The subgraph is expanded by an iterative sampling of temporal neighbors and by attention propagation. Our approach provides human-understandable evidence explaining the forecast. We evaluate our model on four benchmark temporal knowledge graphs for the link forecasting task. While being more explainable, our model obtains a relative improvement of up to 20 % on Hits@1 compared to the previous best temporal KG forecasting method. We also conduct a survey with 53 respondents, and the results show that the evidence extracted by the model for link forecasting is aligned with human understanding. \ No newline at end of file diff --git a/data/2021/iclr/Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning b/data/2021/iclr/Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning new file mode 100644 index 0000000000..6fcd45874e --- /dev/null +++ b/data/2021/iclr/Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning @@ -0,0 +1 @@ +Understanding human behavior from observed data is critical for transparency and accountability in decision-making. Consider real-world settings such as healthcare, in which modeling a decision-maker's policy is challenging -- with no access to underlying states, no knowledge of environment dynamics, and no allowance for live experimentation. We desire learning a data-driven representation of decision-making behavior that (1) inheres transparency by design, (2) accommodates partial observability, and (3) operates completely offline. To satisfy these key criteria, we propose a novel model-based Bayesian method for interpretable policy learning ("Interpole") that jointly estimates an agent's (possibly biased) belief-update process together with their (possibly suboptimal) belief-action mapping. 
Through experiments on both simulated and real-world data for the problem of Alzheimer's disease diagnosis, we illustrate the potential of our approach as an investigative device for auditing, quantifying, and understanding human decision-making behavior. \ No newline at end of file diff --git a/data/2021/iclr/Explaining the Efficacy of Counterfactually Augmented Data b/data/2021/iclr/Explaining the Efficacy of Counterfactually Augmented Data new file mode 100644 index 0000000000..6ebb12d459 --- /dev/null +++ b/data/2021/iclr/Explaining the Efficacy of Counterfactually Augmented Data @@ -0,0 +1 @@ +In attempts to produce machine learning models less reliant on spurious patterns in training data, researchers have recently proposed a human-in-the-loop process for generating counterfactually augmented datasets. As applied in NLP, given some documents and their (initial) labels, humans are tasked with revising the text to make a (given) counterfactual label applicable. Importantly, the instructions prohibit edits that are not necessary to flip the applicable label. Models trained on the augmented (original and revised) data have been shown to rely less on semantically irrelevant words and to generalize better out-of-domain. While this work draws on causal thinking, casting edits as interventions and relying on human understanding to assess outcomes, the underlying causal model is not clear nor are the principles underlying the observed improvements in out-of-domain evaluation. In this paper, we explore a toy analog, using linear Gaussian models. Our analysis reveals interesting relationships between causal models, measurement noise, out-of-domain generalization, and reliance on spurious signals. Interestingly our analysis suggests that data corrupted by adding noise to causal features will degrade out-of-domain performance, while noise added to non-causal features may make models more robust out-of-domain. 
This analysis yields interesting insights that help to explain the efficacy of counterfactually augmented data. Finally, we present a large-scale empirical study that supports this hypothesis. \ No newline at end of file diff --git a/data/2021/iclr/Exploring Balanced Feature Spaces for Representation Learning b/data/2021/iclr/Exploring Balanced Feature Spaces for Representation Learning new file mode 100644 index 0000000000..109f8cc83d --- /dev/null +++ b/data/2021/iclr/Exploring Balanced Feature Spaces for Representation Learning @@ -0,0 +1 @@ +Existing self-supervised learning (SSL) methods are mostly applied for training representation models from artificially balanced datasets ( e.g . ImageNet). It is unclear how well they will perform in the practical scenarios where datasets are often imbalanced w.r.t. the classes. Motivated by this question, we conduct a series of studies on the performance of self-supervised contrastive learning and supervised learning methods over multiple datasets where training instance distributions vary from a balanced one to a long-tailed one. Our findings are quite intriguing. Different from supervised methods with large performance drop, the self-supervised contrastive learning methods perform stably well even when the datasets are heavily imbalanced. This motivates us to explore the balanced feature spaces learned by contrastive learning, where the feature representations present similar linear separability w.r.t. all the classes. Our further experiments reveal that a representation model generating a balanced feature space can generalize better than that yielding an imbalanced one across multiple settings. Inspired by these insights, we develop a novel representation learning method, called k -positive contrastive learning. It effectively combines strengths of the supervised method and the contrastive learning method to learn representations that are both discriminative and balanced. 
Extensive experiments demonstrate its superiority on multiple recognition tasks, including both long-tailed ones and normal balanced ones. Code is available at https://github.com/bingykang/BalFeat . \ No newline at end of file diff --git a/data/2021/iclr/Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit b/data/2021/iclr/Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit new file mode 100644 index 0000000000..f25260fd4a --- /dev/null +++ b/data/2021/iclr/Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit @@ -0,0 +1 @@ +Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failing under distributional shift. Previous benchmarks have found that ensembles of neural networks (NNs) are typically the best calibrated models on OOD data. Inspired by this, we leverage recent theoretical advances that characterize the function-space prior of an ensemble of infinitely-wide NNs as a Gaussian process, termed the neural network Gaussian process (NNGP). We use the NNGP with a softmax link function to build a probabilistic model for multi-class classification and marginalize over the latent Gaussian outputs to sample from the posterior. This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue. We also examine the calibration of previous approaches to classification with the NNGP, which treat classification problems as regression to the one-hot labels. In this case the Bayesian posterior is exact, and we compare several heuristics to generate a categorical distribution over classes. 
We find that these methods are well calibrated under distributional shift. Finally, we consider an infinite-width final layer in conjunction with a pre-trained embedding. This replicates the important practical use case of transfer learning and allows scaling to significantly larger datasets. As well as achieving competitive predictive accuracy, this approach is better calibrated than its finite-width analogue. \ No newline at end of file diff --git a/data/2021/iclr/Expressive Power of Invariant and Equivariant Graph Neural Networks b/data/2021/iclr/Expressive Power of Invariant and Equivariant Graph Neural Networks new file mode 100644 index 0000000000..3136aee157 --- /dev/null +++ b/data/2021/iclr/Expressive Power of Invariant and Equivariant Graph Neural Networks @@ -0,0 +1 @@ +Various classes of Graph Neural Networks (GNN) have been proposed and shown to be successful in a wide range of applications with graph-structured data. In this paper, we propose a theoretical framework able to compare the expressive power of these GNN architectures. The current universality theorems only apply to intractable classes of GNNs. Here, we prove the first approximation guarantees for practical GNNs, paving the way for a better understanding of their generalization. Our theoretical results are proved for invariant GNNs computing a graph embedding (permutation of the nodes of the input graph does not affect the output) and equivariant GNNs computing an embedding of the nodes (permutation of the input permutes the output). We show that Folklore Graph Neural Networks (FGNN), which are tensor-based GNNs augmented with matrix multiplication, are the most expressive architectures proposed so far for a given tensor order.
We illustrate our results on the Quadratic Assignment Problem (an NP-hard combinatorial problem) by showing that FGNNs are able to learn how to solve the problem, leading to much better average performance than existing algorithms (based on spectral, SDP or other GNN architectures). On the practical side, we also implement masked tensors to handle batches of graphs of varying sizes. \ No newline at end of file diff --git a/data/2021/iclr/Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers b/data/2021/iclr/Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers new file mode 100644 index 0000000000..63e2afa809 --- /dev/null +++ b/data/2021/iclr/Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers @@ -0,0 +1 @@ +Solving high-dimensional, continuous robotic tasks is a challenging optimization problem. Model-based methods that rely on zero-order optimizers like the cross-entropy method (CEM) have so far shown strong performance and are considered state-of-the-art in the model-based reinforcement learning community. However, this success comes at the cost of high computational complexity, making it unsuitable for real-time control. In this paper, we propose a technique to jointly optimize the trajectory and distill a policy, which is essential for fast execution in real robotic systems. Our method builds upon standard approaches, like guidance cost and dataset aggregation, and introduces a novel adaptive factor which prevents the optimizer from collapsing to the learner’s behavior at the beginning of the training. The extracted policies reach unprecedented performance on challenging tasks like making a humanoid stand up and opening a door without reward shaping. Figure 1: Environments and exemplary behaviors of the learned policy using APEX. From left to right: FETCH PICK&PLACE (sparse reward), DOOR (sparse reward), and HUMANOID STANDUP.
\ No newline at end of file diff --git a/data/2021/iclr/Extreme Memorization via Scale of Initialization b/data/2021/iclr/Extreme Memorization via Scale of Initialization new file mode 100644 index 0000000000..7d01471856 --- /dev/null +++ b/data/2021/iclr/Extreme Memorization via Scale of Initialization @@ -0,0 +1 @@ +We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and loss function used, with $\sin$ activation being the most extreme. In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function. Our empirical investigation reveals that increasing the scale of initialization could cause the representations and gradients to be increasingly misaligned across examples in the same class. We further demonstrate that a similar misalignment phenomenon occurs in other scenarios affecting generalization performance, such as changes to the architecture or data distribution. 
\ No newline at end of file diff --git a/data/2021/iclr/FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization b/data/2021/iclr/FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments b/data/2021/iclr/Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Fair Mixup: Fairness via Interpolation b/data/2021/iclr/Fair Mixup: Fairness via Interpolation new file mode 100644 index 0000000000..e8db8c4c02 --- /dev/null +++ b/data/2021/iclr/Fair Mixup: Fairness via Interpolation @@ -0,0 +1 @@ +Training classifiers under fairness constraints, such as group fairness, regularizes the disparities of predictions between the groups. Nevertheless, even though the constraints are satisfied during training, they might not generalize at evaluation time. To improve the generalizability of fair classifiers, we propose fair mixup, a new data augmentation strategy for imposing the fairness constraint. In particular, we show that fairness can be achieved by regularizing the models on paths of interpolated samples between the groups. We use mixup, a powerful data augmentation strategy, to generate these interpolates. We analyze fair mixup and empirically show that it ensures better generalization for both accuracy and fairness measures in tabular, vision, and language benchmarks.
\ No newline at end of file diff --git a/data/2021/iclr/FairBatch: Batch Selection for Model Fairness b/data/2021/iclr/FairBatch: Batch Selection for Model Fairness new file mode 100644 index 0000000000..c8d1f7431c --- /dev/null +++ b/data/2021/iclr/FairBatch: Batch Selection for Model Fairness @@ -0,0 +1 @@ +Training a fair machine learning model is essential to prevent demographic disparity. Existing techniques for improving model fairness require broad changes in either data preprocessing or model training, rendering them difficult to adopt for potentially already complex machine learning systems. We address this problem through the lens of bilevel optimization. While keeping the standard training algorithm as an inner optimizer, we incorporate an outer optimizer so as to equip the inner problem with an additional functionality: adaptively selecting minibatch sizes for the purpose of improving model fairness. Our batch selection algorithm, which we call FairBatch, implements this optimization and supports prominent fairness measures: equal opportunity, equalized odds, and demographic parity. FairBatch comes with a significant implementation benefit -- it does not require any modification to data preprocessing or model training. For instance, a single-line change of PyTorch code replacing the batch selection part of model training suffices to employ FairBatch. Our experiments conducted both on synthetic and benchmark real data demonstrate that FairBatch can provide such functionalities while achieving comparable (or even better) performance relative to the state of the art. Furthermore, FairBatch can readily improve the fairness of any pre-trained model simply via fine-tuning. It is also compatible with existing batch selection techniques intended for different purposes, such as faster convergence, thus gracefully achieving multiple purposes.
\ No newline at end of file diff --git a/data/2021/iclr/FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders b/data/2021/iclr/FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders new file mode 100644 index 0000000000..1a6eaa1b84 --- /dev/null +++ b/data/2021/iclr/FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders @@ -0,0 +1 @@ +Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have demonstrated the existence of social bias in these pretrained NLP models. Although prior works have made progress on word-level debiasing, sentence-level fairness of pretrained encoders remains underexplored. In this paper, we propose the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder outputs into debiased representations via a fair filter (FairFil) network. To learn the FairFil, we introduce a contrastive learning framework that not only minimizes the correlation between filtered embeddings and bias words but also preserves rich semantic information of the original sentences. On real-world datasets, our FairFil effectively reduces the bias degree of pretrained text encoders, while consistently maintaining desirable performance on downstream tasks. Moreover, our post-hoc method does not require any retraining of the text encoders, further enlarging FairFil's application space.
\ No newline at end of file diff --git a/data/2021/iclr/Fantastic Four: Differentiable and Efficient Bounds on Singular Values of Convolution Layers b/data/2021/iclr/Fantastic Four: Differentiable and Efficient Bounds on Singular Values of Convolution Layers new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Fast And Slow Learning Of Recurrent Independent Mechanisms b/data/2021/iclr/Fast And Slow Learning Of Recurrent Independent Mechanisms new file mode 100644 index 0000000000..59c53f357b --- /dev/null +++ b/data/2021/iclr/Fast And Slow Learning Of Recurrent Independent Mechanisms @@ -0,0 +1 @@ +Decomposing knowledge into interchangeable pieces promises a generalization advantage when there are changes in distribution. A learning agent interacting with its environment is likely to be faced with situations requiring novel combinations of existing pieces of knowledge. We hypothesize that such a decomposition of knowledge is particularly relevant for being able to generalize in a systematic manner to out-of-distribution changes. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs and its reward function are stationary and can be re-used across tasks. An attention mechanism dynamically selects which modules can be adapted to the current task, and the parameters of the selected modules are allowed to change quickly as the learner is confronted with variations in what it experiences, while the parameters of the attention mechanisms act as stable, slowly changing, meta-parameters. We focus on pieces of knowledge captured by an ensemble of modules sparsely communicating with each other via a bottleneck of attention. We find that meta-learning the modular aspects of the proposed system greatly helps in achieving faster adaptation in a reinforcement learning setup involving navigation in a partially observed grid world with image-level input. 
We also find that reversing the role of parameters and meta-parameters does not work nearly as well, suggesting a particular role for fast adaptation of the dynamically selected modules. \ No newline at end of file diff --git a/data/2021/iclr/Fast Geometric Projections for Local Robustness Certification b/data/2021/iclr/Fast Geometric Projections for Local Robustness Certification new file mode 100644 index 0000000000..edf81cf345 --- /dev/null +++ b/data/2021/iclr/Fast Geometric Projections for Local Robustness Certification @@ -0,0 +1 @@ +Local robustness ensures that a model classifies all inputs within an $\epsilon$-ball consistently, which precludes various forms of adversarial inputs. In this paper, we present a fast procedure for checking local robustness in feed-forward neural networks with piecewise linear activation functions. The key insight is that such networks partition the input space into a polyhedral complex such that the network is linear inside each polyhedral region; hence, a systematic search for decision boundaries within the regions around a given input is sufficient for assessing robustness. Crucially, we show how these regions can be analyzed using geometric projections instead of expensive constraint solving, thus admitting an efficient, highly-parallel GPU implementation at the price of incompleteness, which can be addressed by falling back on prior approaches. Empirically, we find that incompleteness is not often an issue, and that our method performs one to two orders of magnitude faster than existing robustness-certification techniques based on constraint solving. 
\ No newline at end of file diff --git a/data/2021/iclr/Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers b/data/2021/iclr/Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers new file mode 100644 index 0000000000..b78e4950c3 --- /dev/null +++ b/data/2021/iclr/Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers @@ -0,0 +1 @@ +Formal verification of neural networks (NNs) is a challenging and important problem. Existing efficient complete solvers typically require the branch-and-bound (BaB) process, which splits the problem domain into sub-domains and solves each sub-domain using faster but weaker incomplete verifiers, such as Linear Programming (LP) on linearly relaxed sub-domains. In this paper, we propose to use the backward mode linear relaxation based perturbation analysis (LiRPA) to replace LP during the BaB process, which can be efficiently implemented on typical machine learning accelerators such as GPUs and TPUs. However, unlike LP, LiRPA, when applied naively, can produce much weaker bounds and cannot even check certain conflicts of sub-domains during splitting, making the entire procedure incomplete after BaB. To address these challenges, we apply a fast gradient-based bound tightening procedure combined with batch splits and a design that minimizes usage of the LP bound procedure, enabling us to effectively use LiRPA on the accelerator hardware for the challenging complete NN verification problem and significantly outperform LP-based approaches. On a single GPU, we demonstrate an order of magnitude speedup compared to existing LP-based approaches.
\ No newline at end of file diff --git a/data/2021/iclr/Fast convergence of stochastic subgradient method under interpolation b/data/2021/iclr/Fast convergence of stochastic subgradient method under interpolation new file mode 100644 index 0000000000..83c06c059d --- /dev/null +++ b/data/2021/iclr/Fast convergence of stochastic subgradient method under interpolation @@ -0,0 +1 @@ +This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to over-parameterized nonsmooth optimization problems that satisfy an interpolation condition. By leveraging the composite structure of the empirical risk minimization problems, we prove that SSGD converges, respectively, with rates $O(1/\epsilon)$ and $O(\log(1/\epsilon))$ for convex and strongly-convex objectives when interpolation holds. These rates coincide with established rates for the stochastic gradient descent (SGD) method applied to smooth problems that also satisfy an interpolation condition. Our analysis provides a partial explanation for the empirical observation that sometimes SGD and SSGD behave similarly for training smooth and nonsmooth machine learning models. We also prove that the rate $O(1/\epsilon)$ is optimal for the subgradient method in the convex and interpolation setting. \ No newline at end of file diff --git a/data/2021/iclr/FastSpeech 2: Fast and High-Quality End-to-End Text to Speech b/data/2021/iclr/FastSpeech 2: Fast and High-Quality End-to-End Text to Speech new file mode 100644 index 0000000000..518ce67e92 --- /dev/null +++ b/data/2021/iclr/FastSpeech 2: Fast and High-Quality End-to-End Text to Speech @@ -0,0 +1 @@ +Advanced text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality.
The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from the speech waveform and directly take them as conditional inputs during training and use predicted values during inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of full end-to-end training and even faster inference than FastSpeech. Experimental results show that 1) FastSpeech 2 and 2s outperform FastSpeech in voice quality with a much simplified training pipeline and reduced training time; 2) FastSpeech 2 and 2s can match the voice quality of autoregressive models while enjoying much faster inference speed.
\ No newline at end of file diff --git a/data/2021/iclr/Faster Binary Embeddings for Preserving Euclidean Distances b/data/2021/iclr/Faster Binary Embeddings for Preserving Euclidean Distances new file mode 100644 index 0000000000..588520121e --- /dev/null +++ b/data/2021/iclr/Faster Binary Embeddings for Preserving Euclidean Distances @@ -0,0 +1 @@ +We propose a fast, distance-preserving, binary embedding algorithm to transform a high-dimensional dataset $\mathcal{T}\subseteq\mathbb{R}^n$ into binary sequences in the cube $\{\pm 1\}^m$. When $\mathcal{T}$ consists of well-spread (i.e., non-sparse) vectors, our embedding method applies a stable noise-shaping quantization scheme to $A x$ where $A\in\mathbb{R}^{m\times n}$ is a sparse Gaussian random matrix. This contrasts with most binary embedding methods, which usually use $x\mapsto \mathrm{sign}(Ax)$ for the embedding. Moreover, we show that Euclidean distances among the elements of $\mathcal{T}$ are approximated by the $\ell_1$ norm on the images of $\{\pm 1\}^m$ under a fast linear transformation. This again contrasts with standard methods, where the Hamming distance is used instead. Our method is both fast and memory efficient, with time complexity $O(m)$ and space complexity $O(m)$. Further, we prove that the method is accurate and its associated error is comparable to that of a continuous valued Johnson-Lindenstrauss embedding plus a quantization error that admits a polynomial decay as the embedding dimension $m$ increases. Thus the length of the binary codes required to achieve a desired accuracy is quite small, and we show it can even be compressed further without compromising the accuracy. To illustrate our results, we test the proposed method on natural images and show that it achieves strong performance. 
\ No newline at end of file diff --git a/data/2021/iclr/FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning b/data/2021/iclr/FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning new file mode 100644 index 0000000000..7a8aafc7d0 --- /dev/null +++ b/data/2021/iclr/FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning @@ -0,0 +1 @@ +Federated learning aims to collaboratively train a strong global model by accessing users' locally trained models but not their own data. A crucial step is therefore to aggregate local models into a global model, which has been shown to be challenging when users have non-i.i.d. data. In this paper, we propose a novel aggregation algorithm named FedBE, which takes a Bayesian inference perspective by sampling higher-quality global models and combining them via Bayesian model Ensemble, leading to much more robust aggregation. We show that an effective model distribution can be constructed by simply fitting a Gaussian or Dirichlet distribution to the local models. Our empirical studies validate FedBE's superior performance, especially when users' data are not i.i.d. and when the neural networks go deeper. Moreover, FedBE is compatible with recent efforts in regularizing users' model training, making it an easily applicable module: you only need to replace the aggregation method but leave other parts of your federated learning algorithm intact. Our code is publicly available at https://github.com/hongyouc/FedBE.
\ No newline at end of file diff --git a/data/2021/iclr/FedBN: Federated Learning on Non-IID Features via Local Batch Normalization b/data/2021/iclr/FedBN: Federated Learning on Non-IID Features via Local Batch Normalization new file mode 100644 index 0000000000..22ce36bc75 --- /dev/null +++ b/data/2021/iclr/FedBN: Federated Learning on Non-IID Features via Local Batch Normalization @@ -0,0 +1 @@ +The emerging paradigm of federated learning (FL) strives to enable collaborative training of deep models on the network edge without centrally aggregating raw data, thereby improving data privacy. In most cases, the assumption of independent and identically distributed samples across local clients does not hold for federated learning setups. Under this setting, neural network training performance may vary significantly according to the data distribution and even hurt training convergence. Most of the previous work has focused on a difference in the distribution of labels or client shifts. Unlike those settings, we address an important problem of FL in which local clients store examples whose distributions differ from those of other clients (e.g., different scanners/sensors in medical imaging, or different scenery distributions in autonomous driving, highway vs. city), which we denote as feature shift non-iid. In this work, we propose an effective method that uses local batch normalization to alleviate the feature shift before averaging models. The resulting scheme, called FedBN, outperforms both classical FedAvg and the state-of-the-art for non-iid data (FedProx) in our extensive experiments. These empirical results are supported by a convergence analysis that shows, in a simplified setting, that FedBN has a faster convergence rate than FedAvg. Code is available at https://github.com/med-air/FedBN.
\ No newline at end of file diff --git a/data/2021/iclr/FedMix: Approximation of Mixup under Mean Augmented Federated Learning b/data/2021/iclr/FedMix: Approximation of Mixup under Mean Augmented Federated Learning new file mode 100644 index 0000000000..c72ec84bc9 --- /dev/null +++ b/data/2021/iclr/FedMix: Approximation of Mixup under Mean Augmented Federated Learning @@ -0,0 +1 @@ +Federated learning (FL) allows edge devices to collectively learn a model without directly sharing data within each device, thus preserving privacy and eliminating the need to store data globally. While there are promising results under the assumption of independent and identically distributed (iid) local data, current state-of-the-art algorithms suffer from performance degradation as the heterogeneity of local data across clients increases. To resolve this issue, we propose a simple framework, Mean Augmented Federated Learning (MAFL), where clients send and receive averaged local data, subject to the privacy requirements of target applications. Under our framework, we propose a new augmentation algorithm, named FedMix, which is inspired by a phenomenal yet simple data augmentation method, Mixup, but does not require local raw data to be directly shared among devices. Our method shows greatly improved performance in the standard benchmark datasets of FL, under highly non-iid federated settings, compared to conventional algorithms. \ No newline at end of file diff --git a/data/2021/iclr/Federated Learning Based on Dynamic Regularization b/data/2021/iclr/Federated Learning Based on Dynamic Regularization new file mode 100644 index 0000000000..3d13c97ca3 --- /dev/null +++ b/data/2021/iclr/Federated Learning Based on Dynamic Regularization @@ -0,0 +1 @@ +We propose a novel federated learning method for distributively training neural network models, where the server orchestrates cooperation between a subset of randomly chosen devices in each round. 
We view the federated learning problem primarily from a communication perspective and allow more device-level computation to save transmission costs. We point out a fundamental dilemma, in that the minima of the local-device level empirical loss are inconsistent with those of the global empirical loss. Different from recent prior works that either attempt inexact minimization or utilize devices for parallelizing gradient computation, we propose a dynamic regularizer for each device at each round, so that in the limit the global and device solutions are aligned. We demonstrate, both through empirical results on real and synthetic data as well as analytical results, that our scheme leads to efficient training, in both convex and non-convex settings, while being fully agnostic to device heterogeneity and robust to a large number of devices, partial participation, and unbalanced data. \ No newline at end of file diff --git a/data/2021/iclr/Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms b/data/2021/iclr/Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms new file mode 100644 index 0000000000..5efcaa495f --- /dev/null +++ b/data/2021/iclr/Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms @@ -0,0 +1 @@ +Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective. We present an alternative perspective and formulate federated learning as a posterior inference problem, where the goal is to infer a global posterior distribution by having client devices each infer the posterior of their local data. While exact inference is often intractable, this perspective provides a principled way to search for global optima in federated settings.
Further, starting with the analysis of federated quadratic objectives, we develop a computation- and communication-efficient approximate posterior inference algorithm -- federated posterior averaging (FedPA). Our algorithm uses MCMC for approximate inference of local posteriors on the clients and efficiently communicates their statistics to the server, where the latter uses them to refine a global estimate of the posterior mode. Finally, we show that FedPA generalizes federated averaging (FedAvg), can similarly benefit from adaptive optimizers, and yields state-of-the-art results on four realistic and challenging benchmarks, converging faster to better optima. \ No newline at end of file diff --git a/data/2021/iclr/Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning b/data/2021/iclr/Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning new file mode 100644 index 0000000000..94c57ca265 --- /dev/null +++ b/data/2021/iclr/Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning @@ -0,0 +1 @@ +While existing federated learning approaches mostly require that clients have fully-labeled data to train on, in realistic settings, data obtained at the client side often comes without any accompanying labels. Such a deficiency of labels may result from either high labeling cost or the difficulty of annotation due to the required expert knowledge. Thus the private data at each client may be only partly labeled, or completely unlabeled with labeled data being available only at the server, which leads us to a new problem of Federated Semi-Supervised Learning (FSSL). In this work, we study this new problem of semi-supervised learning under the federated learning framework, and propose a novel method to tackle it, which we refer to as Federated Matching (FedMatch).
FedMatch improves upon naive federated semi-supervised learning approaches with a new inter-client consistency loss and decomposition of the parameters into parameters for labeled and unlabeled data. Through extensive experimental validation of our method in two different scenarios, we show that our method outperforms both local semi-supervised learning and baselines which naively combine federated learning with semi-supervised learning. \ No newline at end of file diff --git a/data/2021/iclr/Few-Shot Bayesian Optimization with Deep Kernel Surrogates b/data/2021/iclr/Few-Shot Bayesian Optimization with Deep Kernel Surrogates new file mode 100644 index 0000000000..d67a619ddd --- /dev/null +++ b/data/2021/iclr/Few-Shot Bayesian Optimization with Deep Kernel Surrogates @@ -0,0 +1 @@ +Hyperparameter optimization (HPO) is a central pillar in the automation of machine learning solutions and is mainly performed via Bayesian optimization, where a parametric surrogate is learned to approximate the black box response function (e.g. validation error). Unfortunately, evaluating the response function is computationally intensive. As a remedy, earlier work emphasizes the need for transfer learning surrogates which learn to optimize hyperparameters for an algorithm from other tasks. In contrast to previous work, we propose to rethink HPO as a few-shot learning problem in which we train a shared deep surrogate model to quickly adapt (with few response evaluations) to the response function of a new task. We propose the use of a deep kernel network for a Gaussian process surrogate that is meta-learned in an end-to-end fashion in order to jointly approximate the response functions of a collection of training data sets. As a result, the novel few-shot optimization of our deep kernel surrogate leads to new state-of-the-art results at HPO compared to several recent methods on diverse metadata sets. 
\ No newline at end of file diff --git a/data/2021/iclr/Few-Shot Learning via Learning the Representation, Provably b/data/2021/iclr/Few-Shot Learning via Learning the Representation, Provably new file mode 100644 index 0000000000..df6b20c87b --- /dev/null +++ b/data/2021/iclr/Few-Shot Learning via Learning the Representation, Provably @@ -0,0 +1 @@ +This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task for which there is only $n_2 (\ll n_1)$ data. Specifically, we focus on the setting where there exists a good \emph{common representation} between source and target, and our goal is to understand how much of a sample size reduction is possible. First, we study the setting where this common representation is low-dimensional and provide a fast rate of $O\left(\frac{\mathcal{C}\left(\Phi\right)}{n_1T} + \frac{k}{n_2}\right)$; here, $\Phi$ is the representation function class, $\mathcal{C}\left(\Phi\right)$ is its complexity measure, and $k$ is the dimension of the representation. When specialized to linear representation functions, this rate becomes $O\left(\frac{dk}{n_1T} + \frac{k}{n_2}\right)$ where $d (\gg k)$ is the ambient input dimension, which is a substantial improvement over the rate without using representation learning, i.e. over the rate of $O\left(\frac{d}{n_2}\right)$. Second, we consider the setting where the common representation may be high-dimensional but is capacity-constrained (say in norm); here, we again demonstrate the advantage of representation learning in both high-dimensional linear regression and neural network learning. Our results demonstrate representation learning can fully utilize all $n_1T$ samples from source tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Fidelity-based Deep Adiabatic Scheduling b/data/2021/iclr/Fidelity-based Deep Adiabatic Scheduling new file mode 100644 index 0000000000..6dd519af42 --- /dev/null +++ b/data/2021/iclr/Fidelity-based Deep Adiabatic Scheduling @@ -0,0 +1 @@ +Adiabatic quantum computation is a form of computation that acts by slowly interpolating a quantum system between an easy-to-prepare initial state and a final state that represents a solution to a given computational problem. The choice of the interpolation schedule is critical to the performance: if, at a certain time point, the evolution is too rapid, the system has a high probability of transferring to a higher energy state, which does not represent a solution to the problem. On the other hand, an evolution that is too slow leads to a loss of computation time and increases the probability of failure due to decoherence. In this work, we train deep neural models to produce optimal schedules that are conditioned on the problem at hand. We consider two types of problem representation: the Hamiltonian form \ No newline at end of file diff --git a/data/2021/iclr/Filtered Inner Product Projection for Crosslingual Embedding Alignment b/data/2021/iclr/Filtered Inner Product Projection for Crosslingual Embedding Alignment new file mode 100644 index 0000000000..cd44469eff --- /dev/null +++ b/data/2021/iclr/Filtered Inner Product Projection for Crosslingual Embedding Alignment @@ -0,0 +1 @@ +Due to widespread interest in machine translation and transfer learning, there are numerous algorithms for mapping multiple embeddings to a shared representation space. Recently, these algorithms have been studied in the setting of bilingual lexicon induction where one seeks to align the embeddings of a source and a target language such that translated word pairs lie close to one another in a common representation space.
In this paper, we propose a method, Filtered Inner Product Projection (FIPP), for mapping embeddings to a common representation space. As semantic shifts are pervasive across languages and domains, FIPP first identifies the common geometric structure in both embeddings and then, only on the common structure, aligns the Gram matrices of these embeddings. FIPP aligns embeddings to isomorphic vector spaces even when the source and target embeddings are of differing dimensionalities. Additionally, FIPP provides computational benefits in ease of implementation and is faster to compute than current approaches. Following the baselines in Glavaš et al. (2019), we evaluate FIPP in the context of bilingual lexicon induction and downstream language tasks. We show that FIPP outperforms existing methods on the XLING (5K) BLI dataset and the XLING (1K) BLI dataset, when using a self-learning approach, while also providing robust performance across downstream tasks. \ No newline at end of file diff --git a/data/2021/iclr/Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis b/data/2021/iclr/Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis new file mode 100644 index 0000000000..5a2bc4966c --- /dev/null +++ b/data/2021/iclr/Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis @@ -0,0 +1 @@ +In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent). 
Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training. Code and pre-trained models will be made publicly available at this https URL \ No newline at end of file diff --git a/data/2021/iclr/Fooling a Complete Neural Network Verifier b/data/2021/iclr/Fooling a Complete Neural Network Verifier new file mode 100644 index 0000000000..068b52175e --- /dev/null +++ b/data/2021/iclr/Fooling a Complete Neural Network Verifier @@ -0,0 +1 @@ +The efficient and accurate characterization of the robustness of neural networks to input perturbation is an important open problem. Many approaches exist including heuristic and exact (or complete) methods. Complete methods are expensive but their mathematical formulation guarantees that they provide exact robustness metrics. However, this guarantee is valid only if we assume that the verified network applies arbitrary-precision arithmetic and the verifier is reliable. In practice, however, both the networks and the verifiers apply limited-precision floating point arithmetic. In this paper, we show that numerical roundoff errors can be exploited to craft adversarial networks, in which the actual robustness and the robustness computed by a state-of-the-art complete verifier radically differ. We also show that such adversarial networks can be used to insert a backdoor into any network in such a way that the backdoor is completely missed by the verifier. The attack is easy to detect in its naive form but, as we show, the adversarial network can be transformed to make its detection less trivial. We offer a simple defense against our particular attack based on adding a very small perturbation to the network weights. 
However, our conjecture is that other numerical attacks are possible, and exact verification has to take into account all the details of the computation executed by the verified networks, which makes the problem significantly harder. \ No newline at end of file diff --git a/data/2021/iclr/For self-supervised learning, Rationality implies generalization, provably b/data/2021/iclr/For self-supervised learning, Rationality implies generalization, provably new file mode 100644 index 0000000000..8605259506 --- /dev/null +++ b/data/2021/iclr/For self-supervised learning, Rationality implies generalization, provably @@ -0,0 +1 @@ +We prove a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a representation $r$ of the training data, and then fitting a simple (e.g., linear) classifier $g$ to the labels. Specifically, we show that (under the assumptions described below) the generalization gap of such classifiers tends to zero if $\mathsf{C}(g) \ll n$, where $\mathsf{C}(g)$ is an appropriately-defined measure of the simple classifier $g$'s complexity, and $n$ is the number of training samples. We stress that our bound is independent of the complexity of the representation $r$. We do not make any structural or conditional-independence assumptions on the representation-learning task, which can use the same training dataset that is later used for classification. Rather, we assume that the training procedure satisfies certain natural noise-robustness (adding small amount of label noise causes small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) conditions that widely hold across many standard architectures. We show that our bound is non-vacuous for many popular representation-learning based classifiers on CIFAR-10 and ImageNet, including SimCLR, AMDIM and MoCo. 
\ No newline at end of file diff --git a/data/2021/iclr/Fourier Neural Operator for Parametric Partial Differential Equations b/data/2021/iclr/Fourier Neural Operator for Parametric Partial Differential Equations new file mode 100644 index 0000000000..65cecc1b44 --- /dev/null +++ b/data/2021/iclr/Fourier Neural Operator for Parametric Partial Differential Equations @@ -0,0 +1 @@ +The classical development of neural networks has primarily focused on learning mappings between finite-dimensional Euclidean spaces. Recently, this has been generalized to neural operators that learn mappings between function spaces. For partial differential equations (PDEs), neural operators directly learn the mapping from any functional parametric dependence to the solution. Thus, they learn an entire family of PDEs, in contrast to classical methods which solve one instance of the equation. In this work, we formulate a new neural operator by parameterizing the integral kernel directly in Fourier space, allowing for an expressive and efficient architecture. We perform experiments on Burgers' equation, Darcy flow, and the Navier-Stokes equation (including the turbulent regime). Our Fourier neural operator shows state-of-the-art performance compared to existing neural network methodologies and it is up to three orders of magnitude faster compared to traditional PDE solvers. \ No newline at end of file diff --git a/data/2021/iclr/Free Lunch for Few-shot Learning: Distribution Calibration b/data/2021/iclr/Free Lunch for Few-shot Learning: Distribution Calibration new file mode 100644 index 0000000000..a72f1e26a9 --- /dev/null +++ b/data/2021/iclr/Free Lunch for Few-shot Learning: Distribution Calibration @@ -0,0 +1 @@ +Learning from a limited number of samples is challenging since the learned model can easily become overfitted based on the biased distribution formed by only a few training examples. 
In this paper, we calibrate the distribution of these few-sample classes by transferring statistics from the classes with sufficient examples; an adequate number of examples can then be sampled from the calibrated distribution to expand the inputs to the classifier. We assume every dimension in the feature representation follows a Gaussian distribution, so that the mean and the variance of the distribution can be borrowed from those of similar classes whose statistics are better estimated with an adequate number of samples. Our method can be built on top of off-the-shelf pretrained feature extractors and classification models without extra parameters. We show that a simple logistic regression classifier trained using the features sampled from our calibrated distribution can outperform state-of-the-art accuracy on two datasets (~5% improvement on miniImageNet compared to the next best). The visualization of these generated features demonstrates that our calibrated distribution is an accurate estimation. \ No newline at end of file diff --git a/data/2021/iclr/Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders b/data/2021/iclr/Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders new file mode 100644 index 0000000000..83e80c157e --- /dev/null +++ b/data/2021/iclr/Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders @@ -0,0 +1 @@ +Deep Learning based methods have emerged as the indisputable leaders for virtually all image restoration tasks. Especially in the domain of microscopy images, various content-aware image restoration (CARE) approaches are now used to improve the interpretability of acquired data. Naturally, there are limitations to what can be restored in corrupted images, and like for all inverse problems, many potential solutions exist, and one of them must be chosen.
Here, we propose DIVNOISING, a denoising approach based on fully convolutional variational autoencoders (VAEs), overcoming the problem of having to choose a single solution by predicting a whole distribution of denoised images. First, we introduce a principled way of formulating the unsupervised denoising problem within the VAE framework by explicitly incorporating imaging noise models into the decoder. Our approach is fully unsupervised, only requiring noisy images and a suitable description of the imaging noise distribution. We show that such a noise model can either be measured, bootstrapped from noisy data, or co-learned during training. If desired, consensus predictions can be inferred from a set of DIVNOISING predictions, leading to competitive results with other unsupervised methods and, on occasion, even with the supervised state-of-the-art. DIVNOISING samples from the posterior enable a plethora of useful applications. We are (i) showing denoising results for 13 datasets, (ii) discussing how optical character recognition (OCR) applications can benefit from diverse predictions, and (iii) demonstrating how instance cell segmentation improves when using diverse DIVNOISING predictions. \ No newline at end of file diff --git a/data/2021/iclr/Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online b/data/2021/iclr/Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online new file mode 100644 index 0000000000..cdfa1c70f3 --- /dev/null +++ b/data/2021/iclr/Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online @@ -0,0 +1 @@ +Recent work has shown that sparse representations—where only a small percentage of units are active—can significantly reduce interference. Those works, however, relied on relatively complex regularization or meta-learning approaches that have only been used offline in a pre-training phase.
In this work, we pursue a direction that achieves sparsity by design, rather than by learning. Specifically, we design an activation function that produces sparse representations deterministically by construction, and so is more amenable to online training. The idea relies on the simple approach of binning, but overcomes the two key limitations of binning: zero gradients for the flat regions almost everywhere, and lost precision—reduced discrimination—due to coarse aggregation. We introduce a Fuzzy Tiling Activation (FTA) that provides non-negligible gradients and produces overlap between bins that improves discrimination. We first show that FTA is robust under covariate shift in a synthetic online supervised learning problem, where we can vary the level of correlation and drift. Then we move to the deep reinforcement learning setting and investigate both value-based and policy gradient algorithms that use neural networks with FTAs, in classic discrete control and Mujoco continuous control environments. We show that algorithms equipped with FTAs are able to learn a stable policy faster without needing target networks on most domains. \ No newline at end of file diff --git "a/data/2021/iclr/GAN \"Steerability\" without optimization" "b/data/2021/iclr/GAN \"Steerability\" without optimization" new file mode 100644 index 0000000000..913149ca40 --- /dev/null +++ "b/data/2021/iclr/GAN \"Steerability\" without optimization" @@ -0,0 +1 @@ +Recent research has shown remarkable success in revealing "steering" directions in the latent spaces of pre-trained GANs. These directions correspond to semantically meaningful image transformations (e.g., shift, zoom, color manipulations), and have similar interpretable effects across all categories that the GAN can generate. Some methods focus on user-specified transformations, while others discover transformations in an unsupervised manner.
However, all existing techniques rely on an optimization procedure to expose those directions, and offer no control over the degree of allowed interaction between different transformations. In this paper, we show that "steering" trajectories can be computed in closed form directly from the generator's weights without any form of training or optimization. This applies to user-prescribed geometric transformations, as well as to unsupervised discovery of more complex effects. Our approach allows determining both linear and nonlinear trajectories, and has many advantages over previous methods. In particular, we can control whether one transformation is allowed to come at the expense of another (e.g. zoom-in with or without allowing translation to keep the object centered). Moreover, we can determine the natural end-point of the trajectory, which corresponds to the largest extent to which a transformation can be applied without incurring degradation. Finally, we show how transferring attributes between images can be achieved without optimization, even across different categories. \ No newline at end of file diff --git a/data/2021/iclr/GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images b/data/2021/iclr/GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images new file mode 100644 index 0000000000..129eca470c --- /dev/null +++ b/data/2021/iclr/GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images @@ -0,0 +1 @@ +We tackle a challenging blind image denoising problem, in which only single distinct noisy images are available for training a denoiser, and no information about noise is known, except for it being zero-mean, additive, and independent of the clean image.
In such a setting, which often occurs in practice, it is not possible to train a denoiser with the standard discriminative training or with the recently developed Noise2Noise (N2N) training; the former requires the underlying clean image for the given noisy image, and the latter requires an independently realized pair of noisy images for each clean image. To that end, we propose the GAN2GAN (Generated-Artificial-Noise to Generated-Artificial-Noise) method, which first learns a generative model that can 1) simulate the noise in the given noisy images and 2) generate rough, noisy estimates of the clean images, then 3) iteratively trains a denoiser with subsequently synthesized noisy image pairs (as in N2N) obtained from the generative model. Our results show that the denoiser trained with GAN2GAN achieves an impressive denoising performance on both synthetic and real-world datasets for the blind denoising setting; it almost approaches the performance of the standard discriminatively-trained or N2N-trained models that have more information than ours, and it significantly outperforms the recent baseline for the same setting, e.g., Noise2Void, and a more conventional yet strong one, BM3D. The official code of our method is available at https://github.com/csm9493/GAN2GAN. \ No newline at end of file diff --git a/data/2021/iclr/GANs Can Play Lottery Tickets Too b/data/2021/iclr/GANs Can Play Lottery Tickets Too new file mode 100644 index 0000000000..13d129f264 --- /dev/null +++ b/data/2021/iclr/GANs Can Play Lottery Tickets Too @@ -0,0 +1 @@ +Deep generative adversarial networks (GANs) have gained growing popularity in numerous scenarios, while usually suffering from high parameter complexity for resource-constrained real-world applications. However, the compression of GANs has been less explored. A few works show that heuristically applying compression techniques normally leads to unsatisfactory results, due to the notorious training instability of GANs.
In parallel, the lottery ticket hypothesis has shown prevailing success on discriminative models, locating sparse matching subnetworks capable of training in isolation to full model performance. In this work, we for the first time study the existence of such trainable matching subnetworks in deep GANs. For a range of GANs, we consistently find matching subnetworks at 67%-74% sparsity. We observe that pruning the discriminator has only a minor effect on the existence and quality of matching subnetworks, whereas the initialization weights used in the discriminator play a significant role. We then show the powerful transferability of these subnetworks to unseen tasks. Furthermore, extensive experimental results demonstrate that the subnetworks we find substantially outperform previous state-of-the-art GAN compression approaches in both image generation (e.g. SNGAN) and image-to-image translation GANs (e.g. CycleGAN). Code is available at https://github.com/VITA-Group/GAN-LTH. \ No newline at end of file diff --git a/data/2021/iclr/GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding b/data/2021/iclr/GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding new file mode 100644 index 0000000000..dc8c6af331 --- /dev/null +++ b/data/2021/iclr/GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding @@ -0,0 +1 @@ +Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler.
It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art. \ No newline at end of file diff --git a/data/2021/iclr/Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs b/data/2021/iclr/Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs new file mode 100644 index 0000000000..d0a53d775d --- /dev/null +++ b/data/2021/iclr/Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs @@ -0,0 +1 @@ +A common approach to define convolutions on meshes is to interpret them as a graph and apply graph convolutional networks (GCNs). Such GCNs utilize isotropic kernels and are therefore insensitive to the relative orientation of vertices and thus to the geometry of the mesh as a whole. We propose Gauge Equivariant Mesh CNNs which generalize GCNs to apply anisotropic gauge equivariant kernels. Since the resulting features carry orientation information, we introduce a geometric message passing scheme defined by parallel transporting features over mesh edges. Our experiments validate the significantly improved expressivity of the proposed model over conventional GCNs and other methods.
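The Gauge Equivariant Mesh CNN abstract above hinges on the contrast between isotropic kernels (blind to neighbour orientation) and anisotropic, direction-aware ones. A minimal numpy sketch of that contrast follows; the direction-binning scheme and all names are our illustration, not the paper's code, and it omits the gauge-equivariance and parallel-transport machinery:

```python
import numpy as np

# Toy contrast: isotropic aggregation shares one kernel across all neighbours;
# anisotropic aggregation picks a kernel based on the direction of each edge.

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 2))              # 2-D vertex positions
feat = rng.normal(size=(5, 3))             # per-vertex input features
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
edges += [(j, i) for i, j in edges]        # make the graph symmetric

W_iso = rng.normal(size=(3, 3))            # one kernel, blind to orientation
W_dir = rng.normal(size=(4, 3, 3))         # one kernel per angular sector

def direction_bin(i, j, n_bins=4):
    """Quantise the direction of edge i -> j into one of n_bins sectors."""
    d = pos[j] - pos[i]
    angle = np.arctan2(d[1], d[0]) % (2 * np.pi)
    return int(angle // (2 * np.pi / n_bins)) % n_bins

def aggregate(anisotropic):
    out = np.zeros_like(feat)
    for i, j in edges:
        W = W_dir[direction_bin(i, j)] if anisotropic else W_iso
        out[i] += feat[j] @ W.T
    return out

iso, aniso = aggregate(False), aggregate(True)
print(iso.shape, aniso.shape)              # same shapes, different sensitivity
```

The isotropic output is unchanged under any relabelling of neighbour directions, while the anisotropic one responds to where each neighbour sits, which is the sensitivity the paper's kernels provide in a gauge-consistent way.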
\ No newline at end of file diff --git a/data/2021/iclr/Generalization bounds via distillation b/data/2021/iclr/Generalization bounds via distillation new file mode 100644 index 0000000000..178553b1c1 --- /dev/null +++ b/data/2021/iclr/Generalization bounds via distillation @@ -0,0 +1 @@ +This paper theoretically investigates the following empirical phenomenon: given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds. The main contribution is an analysis showing that the original network inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation. This bound is presented both in an abstract and in a concrete form, the latter complemented by a reduction technique to handle modern computation graphs featuring convolutional layers, fully-connected layers, and skip connections, to name a few. To round out the story, a (looser) classical uniform convergence analysis of compression is also presented, as well as a variety of experiments on cifar and mnist demonstrating similar generalization performance between the original network and its distillation. \ No newline at end of file diff --git a/data/2021/iclr/Generalization in data-driven models of primary visual cortex b/data/2021/iclr/Generalization in data-driven models of primary visual cortex new file mode 100644 index 0000000000..121aaa86af --- /dev/null +++ b/data/2021/iclr/Generalization in data-driven models of primary visual cortex @@ -0,0 +1 @@ +Deep neural networks (DNN) have set new standards at predicting responses of neural populations to visual input. Most such DNNs consist of a convolutional network (core) shared across all neurons which learns a representation of neural computation in visual cortex and a neuron-specific readout that linearly combines the relevant features in this representation. 
The goal of this paper is to test whether such a representation is indeed generally characteristic for visual cortex, i.e. generalizes between animals of a species, and what factors contribute to obtaining such a generalizing core. To push all non-linear computations into the core where the generalizing cortical features should be learned, we devise a novel readout that reduces the number of parameters per neuron in the readout by up to two orders of magnitude compared to the previous state-of-the-art. It does so by taking advantage of retinotopy and learns a Gaussian distribution over the neuron’s receptive field position. With this new readout we train our network on neural responses from mouse primary visual cortex (V1) and obtain a gain in performance of 7% compared to the previous state-of-the-art network. We then investigate whether the convolutional core indeed captures general cortical features by using the core in transfer learning to a different animal. When transferring a core trained on thousands of neurons from various animals and scans we exceed the performance of training directly on that animal by 12%, and outperform a commonly used VGG16 core pre-trained on imagenet by 33%. In addition, transfer learning with our data-driven core is more data-efficient than direct training, achieving the same performance with only 40% of the data. Our model with its novel readout thus sets a new state-of-the-art for neural response prediction in mouse visual cortex from natural images, generalizes between animals, and captures better characteristic cortical features than current task-driven pre-training approaches such as VGG16. 
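The parameter saving behind the retinotopic readout described above can be made concrete with a toy sketch: each neuron keeps only a 2-D receptive-field position plus one weight per feature channel, instead of a full spatial weight map. All names and shapes below are invented for illustration, and the paper's readout learns a Gaussian distribution over positions rather than the single point used here:

```python
import numpy as np

# Read a shared convolutional core at one learned retinotopic position per
# neuron, then combine channels with per-neuron feature weights.

rng = np.random.default_rng(0)
C, H, W, N = 8, 16, 16, 4                 # channels, spatial dims, neurons
core = rng.normal(size=(C, H, W))         # output of a shared conv core
mu = rng.uniform(-1.0, 1.0, size=(N, 2))  # per-neuron RF centre in [-1, 1]^2
w = rng.normal(size=(N, C))               # per-neuron feature weights

def bilinear_read(fmap, xy):
    """Read the C-dim feature vector at a continuous (x, y) location."""
    x = (xy[0] + 1) / 2 * (W - 1)
    y = (xy[1] + 1) / 2 * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * fmap[:, y0, x0] + fx * fmap[:, y0, x1]
    bot = (1 - fx) * fmap[:, y1, x0] + fx * fmap[:, y1, x1]
    return (1 - fy) * top + fy * bot

responses = np.array([w[n] @ bilinear_read(core, mu[n]) for n in range(N)])

# Per-neuron parameters: 2 + C here versus C * H * W for a dense spatial
# readout -- the orders-of-magnitude reduction the abstract refers to.
dense, sparse = C * H * W, 2 + C
print(responses.shape, dense // sparse)
```

Even at these toy sizes the dense readout needs over 200 times more parameters per neuron, which is why the shared core carries almost all of the learned computation.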
\ No newline at end of file diff --git a/data/2021/iclr/Generalized Energy Based Models b/data/2021/iclr/Generalized Energy Based Models new file mode 100644 index 0000000000..ae423985a6 --- /dev/null +++ b/data/2021/iclr/Generalized Energy Based Models @@ -0,0 +1 @@ +We introduce the Generalized Energy Based Model (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). GEBMs are trained by alternating between learning the energy and the base. We show that both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of much better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. GEBMs also return state-of-the-art performance on density modelling tasks when using base measures with an explicit form.
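The GEBM sampling step above (MCMC on the latent space guided by the energy) can be caricatured in a few lines. The generator, energy, step size, and temperature below are invented stand-ins; in particular the energy here is hand-crafted rather than learned, and the paper's samplers are more sophisticated than plain Langevin dynamics:

```python
import numpy as np

# Keep a fixed base "generator" g(z) and refine latents by Langevin dynamics
# under an energy E(g(z)), so samples concentrate where the energy is low.

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))

def g(z):                                  # base generator: a linear map
    return A @ z

def energy(x):                             # low energy near the unit circle
    return (np.linalg.norm(x) - 1.0) ** 2

def latent_energy(z):
    return energy(g(z))

def langevin(z, steps=200, lr=0.05, temp=0.01):
    for _ in range(steps):
        grad = np.zeros_like(z)            # finite-difference grad of E(g(z))
        for k in range(z.size):
            e = np.zeros_like(z)
            e[k] = 1e-4
            grad[k] = (latent_energy(z + e) - latent_energy(z - e)) / 2e-4
        z = z - lr * grad + np.sqrt(2 * lr * temp) * rng.normal(size=z.shape)
    return z

z0 = rng.normal(size=2)
z_star = langevin(z0.copy())
# Refined latents map to samples near the energy's preferred region.
print(latent_energy(z0), latent_energy(z_star))
```

The base alone would scatter samples according to its own (here Gaussian) latent prior; the energy-guided chain pulls them onto the learned support's high-quality regions, which is the mechanism behind the quality gap the abstract reports.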
\ No newline at end of file diff --git a/data/2021/iclr/Generalized Multimodal ELBO b/data/2021/iclr/Generalized Multimodal ELBO new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Generalized Variational Continual Learning b/data/2021/iclr/Generalized Variational Continual Learning new file mode 100644 index 0000000000..88d67fcba1 --- /dev/null +++ b/data/2021/iclr/Generalized Variational Continual Learning @@ -0,0 +1 @@ +Continual learning deals with training models on new tasks and datasets in an online fashion. One strand of research has used probabilistic regularization for continual learning, with two of the main approaches in this vein being Online Elastic Weight Consolidation (Online EWC) and Variational Continual Learning (VCL). VCL employs variational inference, which in other settings has been improved empirically by applying likelihood-tempering. We show that applying this modification to VCL recovers Online EWC as a limiting case, allowing for interpolation between the two approaches. We term the general algorithm Generalized VCL (GVCL). In order to mitigate the observed overpruning effect of VI, we take inspiration from a common multi-task architecture, neural networks with task-specific FiLM layers, and find that this addition leads to significant performance gains, specifically for variational methods. In the small-data regime, GVCL strongly outperforms existing baselines. In larger datasets, GVCL with FiLM layers outperforms or is competitive with existing baselines in terms of accuracy, whilst also providing significantly better calibration. 
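The likelihood-tempering that GVCL builds on can be sketched for diagonal Gaussians: a scalar beta rescales the KL term of the ELBO, interpolating between full VCL (beta = 1) and an EWC-like heavily-downweighted-KL regime (beta -> 0). The function names and toy numbers below are ours, and this omits the FiLM layers and the full GVCL derivation:

```python
import numpy as np

# Tempered ELBO for diagonal Gaussians: expected log-likelihood minus a
# beta-scaled KL to the previous task's posterior (used as the new prior).

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between diagonal Gaussians."""
    return 0.5 * np.sum(var_q / var_p + (mu_q - mu_p) ** 2 / var_p
                        - 1.0 + np.log(var_p / var_q))

def tempered_elbo(exp_log_lik, mu_q, var_q, mu_p, var_p, beta):
    return exp_log_lik - beta * kl_diag_gauss(mu_q, var_q, mu_p, var_p)

mu_p, var_p = np.zeros(3), np.ones(3)      # posterior carried over from task t-1
mu_q, var_q = np.array([0.5, -0.2, 0.1]), np.full(3, 0.8)
for beta in (1.0, 0.1):
    print(beta, tempered_elbo(-1.2, mu_q, var_q, mu_p, var_p, beta))
```

Shrinking beta loosens the pull toward the old posterior, trading remembering for plasticity; GVCL's analysis shows the beta -> 0 limit recovers Online EWC's quadratic penalty.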
\ No newline at end of file diff --git a/data/2021/iclr/Generating Adversarial Computer Programs using Optimized Obfuscations b/data/2021/iclr/Generating Adversarial Computer Programs using Optimized Obfuscations new file mode 100644 index 0000000000..4a561df496 --- /dev/null +++ b/data/2021/iclr/Generating Adversarial Computer Programs using Optimized Obfuscations @@ -0,0 +1 @@ +Machine learning (ML) models that learn and predict properties of computer programs are increasingly being adopted and deployed. These models have demonstrated success in applications such as auto-completing code, summarizing large programs, and detecting bugs and malware in programs. In this work, we investigate principled ways to adversarially perturb a computer program to fool such learned models, and thus determine their adversarial robustness. We use program obfuscations, which have conventionally been used to avoid attempts at reverse engineering programs, as adversarial perturbations. These perturbations modify programs in ways that do not alter their functionality but can be crafted to deceive an ML model when making a decision. We provide a general formulation for an adversarial program that allows applying multiple obfuscation transformations to a program in any language. We develop first-order optimization algorithms to efficiently determine two key aspects -- which parts of the program to transform, and what transformations to use. We show that it is important to optimize both these aspects to generate the best adversarially perturbed program. Due to the discrete nature of this problem, we also propose using randomized smoothing to improve the attack loss landscape to ease optimization. We evaluate our work on Python and Java programs on the problem of program summarization. We show that our best attack proposal achieves a 52% improvement over a state-of-the-art attack generation approach for programs trained on a seq2seq model.
We further show that our formulation is better at training models that are robust to adversarial attacks. \ No newline at end of file diff --git a/data/2021/iclr/Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains b/data/2021/iclr/Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains new file mode 100644 index 0000000000..af99e33f79 --- /dev/null +++ b/data/2021/iclr/Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains @@ -0,0 +1 @@ +• University of California, Davis (Fall, 2015 – Spring, 2020) PhD in Computer Science GPA: 3.93 Advisor: Prof. Yong Jae Lee • Robotics Institute, Carnegie Mellon University, USA (August 2013 – December 2014) Masters in Robotics QPA: 4.05 Advisors: Prof. Alexei Efros, Prof. Kayvon Fatahalian • International Institute of Information Technology (IIIT), Hyderabad, India (August 2009 – May 2013) B.Tech ( Honours ) in Computer Science and Engineering GPA: 9.07/10 Advisor: Prof. P. J. Narayanan \ No newline at end of file diff --git a/data/2021/iclr/Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule b/data/2021/iclr/Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule new file mode 100644 index 0000000000..6936c42ded --- /dev/null +++ b/data/2021/iclr/Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule @@ -0,0 +1 @@ +Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node. While most of the previous studies have built and investigated a discriminative approach, we notice that there are in fact two possible approaches to building such a VLN agent: discriminative \textit{and} generative. 
In this paper, we design and investigate a generative language-grounded policy which uses a language model to compute the distribution over all possible instructions, i.e., all possible sequences of vocabulary tokens, given the action and transition history. In experiments, we show that the proposed generative approach outperforms the discriminative approach in the Room-2-Room (R2R) and Room-4-Room (R4R) datasets, especially in unseen environments. We further show that the combination of the generative and discriminative policies achieves close to state-of-the-art results in the R2R dataset, demonstrating that the generative and discriminative policies capture different aspects of VLN. \ No newline at end of file diff --git a/data/2021/iclr/Generative Scene Graph Networks b/data/2021/iclr/Generative Scene Graph Networks new file mode 100644 index 0000000000..ba0deb2489 --- /dev/null +++ b/data/2021/iclr/Generative Scene Graph Networks @@ -0,0 +1 @@ +Human perception excels at building compositional hierarchies of parts and objects from unlabeled scenes that help systematic generalization. Yet most work on generative scene modeling either ignores the part-whole relationship or assumes access to predefined part labels. In this paper, we propose Generative Scene Graph Networks (GSGNs), the first deep generative model that learns to discover the primitive parts and infer the part-whole relationship jointly from multi-object scenes without supervision and in an end-to-end trainable way. We formulate GSGN as a variational autoencoder in which the latent representation is a tree-structured probabilistic scene graph. The leaf nodes in the latent tree correspond to primitive parts, and the edges represent the symbolic pose variables required for recursively composing the parts into whole objects and then the full scene.
This allows novel objects and scenes to be generated both by sampling from the prior and by manual configuration of the pose variables, as we do with graphics engines. We evaluate GSGN on datasets of scenes containing multiple compositional objects, including a challenging Compositional CLEVR dataset that we have developed. We show that GSGN is able to infer the latent scene graph, generalize out of the training regime, and improve data efficiency in downstream tasks. \ No newline at end of file diff --git a/data/2021/iclr/Generative Time-series Modeling with Fourier Flows b/data/2021/iclr/Generative Time-series Modeling with Fourier Flows new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning b/data/2021/iclr/Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning new file mode 100644 index 0000000000..15e21e49fa --- /dev/null +++ b/data/2021/iclr/Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning @@ -0,0 +1 @@ +The combination of Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) has been recently proposed to merge the benefits of both solutions. Existing mixed approaches, however, have been successfully applied only to actor-critic methods and present significant overhead. We address these issues by introducing a novel mixed framework that exploits a periodical genetic evaluation to soft update the weights of a DRL agent. The resulting approach is applicable with any DRL method and, in a worst-case scenario, it does not exhibit detrimental behaviours. Experiments in robotic applications and continuous control benchmarks demonstrate the versatility of our approach that significantly outperforms prior DRL, EAs, and mixed approaches. 
Finally, we employ formal verification to confirm the policy improvement, mitigating the inefficient exploration and hyper-parameter sensitivity of DRL. \ No newline at end of file diff --git a/data/2021/iclr/Geometry-Aware Gradient Algorithms for Neural Architecture Search b/data/2021/iclr/Geometry-Aware Gradient Algorithms for Neural Architecture Search new file mode 100644 index 0000000000..189c383b2c --- /dev/null +++ b/data/2021/iclr/Geometry-Aware Gradient Algorithms for Neural Architecture Search @@ -0,0 +1 @@ +Recent state-of-the-art methods for neural architecture search (NAS) exploit gradient-based optimization by relaxing the problem into continuous optimization over architectures and shared-weights, a noisy process that remains poorly understood. We argue for the study of single-level empirical risk minimization to understand NAS with weight-sharing, reducing the design of NAS methods to devising optimizers and regularizers that can quickly obtain high-quality solutions to this problem. Invoking the theory of mirror descent, we present a geometry-aware framework that exploits the underlying structure of this optimization to return sparse architectural parameters, leading to simple yet novel algorithms that enjoy fast convergence guarantees and achieve state-of-the-art accuracy on the latest NAS benchmarks in computer vision. Notably, we exceed the best published results for both CIFAR and ImageNet on both the DARTS search space and NAS-Bench-201; on the latter we achieve near-oracle-optimal performance on CIFAR-10 and CIFAR-100. Together, our theory and experiments demonstrate a principled way to co-design optimizers and continuous relaxations of discrete NAS search spaces.
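The mirror-descent view in the NAS abstract above has a simple concrete instance: with an entropic mirror map, mirror descent on simplex-constrained architecture weights becomes a multiplicative (exponentiated-gradient) update, which drives the weights toward a sparse solution. The names and numbers below are our toy illustration, not the paper's algorithm:

```python
import numpy as np

# Exponentiated-gradient step: mirror descent with the entropic regulariser
# on the probability simplex over candidate operations.

def exp_grad_step(theta, grad, lr):
    theta = theta * np.exp(-lr * grad)
    return theta / theta.sum()             # re-normalise onto the simplex

theta = np.full(4, 0.25)                   # uniform over 4 candidate operations
grad = np.array([1.0, 0.5, -0.5, -1.0])    # the last op has the best loss slope
for _ in range(50):
    theta = exp_grad_step(theta, grad, lr=0.1)
print(theta.round(3))                      # mass concentrates on the last op
```

Compared with additive SGD followed by a softmax, the multiplicative update concentrates mass on few operations quickly, which is the sparsity property the geometry-aware framework exploits.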
\ No newline at end of file diff --git a/data/2021/iclr/Geometry-aware Instance-reweighted Adversarial Training b/data/2021/iclr/Geometry-aware Instance-reweighted Adversarial Training new file mode 100644 index 0000000000..7522f7503f --- /dev/null +++ b/data/2021/iclr/Geometry-aware Instance-reweighted Adversarial Training @@ -0,0 +1 @@ +In adversarial machine learning, there was a common belief that robustness and accuracy hurt each other. The belief was challenged by recent studies where we can maintain the robustness and improve the accuracy. However, the other direction, whether we can keep the accuracy while improving the robustness, is conceptually and practically more interesting, since robust accuracy should be lower than standard accuracy for any model. In this paper, we show this direction is also promising. Firstly, we find even over-parameterized deep networks may still have insufficient model capacity, because adversarial training has an overwhelming smoothing effect. Secondly, given limited model capacity, we argue adversarial data should have unequal importance: geometrically speaking, a natural data point closer to/farther from the class boundary is less/more robust, and the corresponding adversarial data point should be assigned with larger/smaller weight. Finally, to implement the idea, we propose geometry-aware instance-reweighted adversarial training, where the weights are based on how difficult it is to attack a natural data point. Experiments show that our proposal boosts the robustness of standard adversarial training; combining two directions, we improve both robustness and accuracy of standard adversarial training. 
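The geometric reweighting idea above can be sketched with a toy rule: an example that a fixed-step attack flips in fewer steps is treated as closer to the class boundary (less robust) and receives a larger training weight. Both the pretend attack and the weighting function below are stand-ins we invented; the paper derives the step count from PGD on a real network and uses its own weighting function:

```python
import numpy as np

# Map a distance-to-boundary proxy to an attack-step count kappa, then to a
# weight that is larger for easier-to-attack (small-kappa) examples.

def attack_steps_to_flip(margin, step_size=0.1, max_steps=10):
    """Pretend attack: count fixed-size steps needed to erode a positive margin."""
    steps = int(np.ceil(margin / step_size))
    return min(max(steps, 0), max_steps)

def instance_weight(kappa, max_steps=10):
    """Smaller kappa (easier to attack) -> larger weight, in (0, 1]."""
    return (1 + max_steps - kappa) / (1 + max_steps)

margins = [0.05, 0.35, 0.95]               # proxies for distance to the boundary
kappas = [attack_steps_to_flip(m) for m in margins]
weights = [instance_weight(k) for k in kappas]
print(kappas, [round(w, 2) for w in weights])
```

Weighting the adversarial loss this way spends the limited model capacity on the borderline examples, which is the mechanism behind the robustness boost the abstract reports.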
\ No newline at end of file diff --git a/data/2021/iclr/Getting a CLUE: A Method for Explaining Uncertainty Estimates b/data/2021/iclr/Getting a CLUE: A Method for Explaining Uncertainty Estimates new file mode 100644 index 0000000000..6146f7bf5b --- /dev/null +++ b/data/2021/iclr/Getting a CLUE: A Method for Explaining Uncertainty Estimates @@ -0,0 +1 @@ +Both uncertainty estimation and interpretability are important factors for trustworthy machine learning systems. However, there is little work at the intersection of these two areas. We address this gap by proposing a novel method for interpreting uncertainty estimates from differentiable probabilistic models, like Bayesian Neural Networks (BNNs). Our method, Counterfactual Latent Uncertainty Explanations (CLUE), indicates how to change an input, while keeping it on the data manifold, such that a BNN becomes more confident about the input's prediction. We validate CLUE through 1) a novel framework for evaluating counterfactual explanations of uncertainty, 2) a series of ablation experiments, and 3) a user study. Our experiments show that CLUE outperforms baselines and enables practitioners to better understand which input patterns are responsible for predictive uncertainty. \ No newline at end of file diff --git a/data/2021/iclr/Global Convergence of Three-layer Neural Networks in the Mean Field Regime b/data/2021/iclr/Global Convergence of Three-layer Neural Networks in the Mean Field Regime new file mode 100644 index 0000000000..44a703af1a --- /dev/null +++ b/data/2021/iclr/Global Convergence of Three-layer Neural Networks in the Mean Field Regime @@ -0,0 +1 @@ +In the mean field regime, neural networks are appropriately scaled so that as the width tends to infinity, the learning dynamics tends to a nonlinear and nontrivial dynamical limit, known as the mean field limit. This lends a way to study large-width neural networks via analyzing the mean field limit. 
Recent works have successfully applied such analysis to two-layer networks and provided global convergence guarantees. The extension to multilayer networks, however, has been a highly challenging puzzle, and little is known about the optimization efficiency in the mean field regime when there are more than two layers. In this work, we prove a global convergence result for unregularized feedforward three-layer networks in the mean field regime. We first develop a rigorous framework to establish the mean field limit of three-layer networks under stochastic gradient descent training. To that end, we propose the idea of a neuronal embedding, which comprises a fixed probability space that encapsulates neural networks of arbitrary sizes. The identified mean field limit is then used to prove a global convergence guarantee under suitable regularity and convergence mode assumptions, which -- unlike previous works on two-layer networks -- does not rely critically on convexity. Underlying the result is a universal approximation property, natural to neural networks, which importantly is shown to hold at any finite training time (not necessarily at convergence) via an algebraic topology argument. \ No newline at end of file diff --git a/data/2021/iclr/Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime b/data/2021/iclr/Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime new file mode 100644 index 0000000000..98224ca692 --- /dev/null +++ b/data/2021/iclr/Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime @@ -0,0 +1 @@ +We study the problem of policy optimization for infinite-horizon discounted Markov Decision Processes with softmax policy and nonlinear function approximation trained with policy gradient algorithms.
We concentrate on the training dynamics in the mean-field regime, modeling e.g., the behavior of wide single hidden layer neural networks, when exploration is encouraged through entropy regularization. The dynamics of these models is established as a Wasserstein gradient flow of distributions in parameter space. We further prove global optimality of the fixed points of this dynamics under mild conditions on their initialization. \ No newline at end of file diff --git a/data/2021/iclr/Go with the flow: Adaptive control for Neural ODEs b/data/2021/iclr/Go with the flow: Adaptive control for Neural ODEs new file mode 100644 index 0000000000..7b67b8dd3e --- /dev/null +++ b/data/2021/iclr/Go with the flow: Adaptive control for Neural ODEs @@ -0,0 +1 @@ +Despite their elegant formulation and lightweight memory cost, neural ordinary differential equations (NODEs) suffer from known representational limitations. In particular, the single flow learned by NODEs cannot express all homeomorphisms from a given data space to itself, and their static weight parametrization restricts the type of functions they can learn compared to discrete architectures with layer-dependent weights. Here, we describe a new module called neurally-controlled ODE (N-CODE) designed to improve the expressivity of NODEs. The parameters of N-CODE modules are dynamic variables governed by a trainable map from initial or current activation state, resulting in forms of open-loop and closed-loop control, respectively. A single module is sufficient for learning a distribution on non-autonomous flows that adaptively drive neural representations. We provide theoretical and empirical evidence that N-CODE circumvents limitations of previous models and show how increased model expressivity manifests in several domains. In supervised learning, we demonstrate that our framework achieves better performance than NODEs as measured by both training speed and testing accuracy. 
In unsupervised learning, we apply this control perspective to an image autoencoder endowed with a latent transformation flow, greatly improving representational power over a vanilla model and leading to state-of-the-art image reconstruction on CIFAR-10. \ No newline at end of file diff --git a/data/2021/iclr/GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing b/data/2021/iclr/GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing new file mode 100644 index 0000000000..ea9b1edefc --- /dev/null +++ b/data/2021/iclr/GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing @@ -0,0 +1 @@ +We present GraPPa, an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG) induced from existing text-to-SQL datasets. We pre-train our model on the synthetic data using a novel text-schema linking objective that predicts the syntactic role of a table field in the SQL for each question-SQL pair. To maintain the model's ability to represent real-world data, we also include masked language modeling (MLM) over several existing table-and-language datasets to regularize the pre-training process. On four popular fully supervised and weakly supervised table semantic parsing benchmarks, GraPPa significantly outperforms RoBERTa-large when used as the feature representation layer and establishes new state-of-the-art results on all of them.
\ No newline at end of file diff --git a/data/2021/iclr/Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability b/data/2021/iclr/Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability new file mode 100644 index 0000000000..a2b1138762 --- /dev/null +++ b/data/2021/iclr/Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability @@ -0,0 +1 @@ +We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability. \ No newline at end of file diff --git a/data/2021/iclr/Gradient Projection Memory for Continual Learning b/data/2021/iclr/Gradient Projection Memory for Continual Learning new file mode 100644 index 0000000000..d8c71ad6c4 --- /dev/null +++ b/data/2021/iclr/Gradient Projection Memory for Continual Learning @@ -0,0 +1 @@ +The ability to learn continually without forgetting the past tasks is a desired attribute for artificial learning systems. Existing approaches to enable such learning in artificial neural networks usually rely on network growth, importance based weight update or replay of old data from the memory. 
In contrast, we propose a novel approach where a neural network learns new tasks by taking gradient steps in the orthogonal direction to the gradient subspaces deemed important for the past tasks. We find the bases of these subspaces by analyzing network representations (activations) after learning each task with Singular Value Decomposition (SVD) in a single-shot manner and store them in the memory as Gradient Projection Memory (GPM). With qualitative and quantitative analyses, we show that such orthogonal gradient descent induces minimum to no interference with the past tasks, thereby mitigating forgetting. We evaluate our algorithm on diverse image classification datasets with short and long sequences of tasks and report better or on-par performance compared to the state-of-the-art approaches. \ No newline at end of file diff --git a/data/2021/iclr/Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models b/data/2021/iclr/Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models new file mode 100644 index 0000000000..0b4922a513 --- /dev/null +++ b/data/2021/iclr/Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models @@ -0,0 +1 @@ +Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is a common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black-box of multilingual optimization through the lens of loss function geometry. We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with not only language proximity but also the overall model performance.
This observation helps us identify a critical limitation of existing gradient-based multi-task learning methods, and thus we derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks. Empirically, our method obtains significant model performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. Our work reveals the importance of properly measuring and utilizing language proximity in multilingual optimization, and has broader implications for multi-task learning beyond multilingual modeling. \ No newline at end of file diff --git a/data/2021/iclr/Graph Coarsening with Neural Networks b/data/2021/iclr/Graph Coarsening with Neural Networks new file mode 100644 index 0000000000..e51c79f0a5 --- /dev/null +++ b/data/2021/iclr/Graph Coarsening with Neural Networks @@ -0,0 +1 @@ +As large-scale graphs become increasingly prevalent, processing, extracting, and analyzing large graph data poses significant computational challenges. Graph coarsening is one popular technique to reduce the size of a graph while maintaining essential properties. Despite rich graph coarsening literature, there is only limited exploration of data-driven methods in the field. In this work, we leverage the recent progress of deep learning on graphs for graph coarsening. We first propose a framework for measuring the quality of a coarsening algorithm and show that, depending on the goal, we need to carefully choose the Laplace operator on the coarse graph and associated projection/lift operators. Motivated by the observation that the current choice of edge weight for the coarse graph may be sub-optimal, we parametrize the weight assignment map with graph neural networks and train it to improve the coarsening quality in an unsupervised way.
Through extensive experiments on both synthetic and real networks, we demonstrate that our method significantly improves common graph coarsening methods under various metrics, reduction ratios, graph sizes, and graph types. It generalizes to graphs of larger size ($25\times$ of training graphs), is adaptive to different losses (differentiable and non-differentiable), and scales to much larger graphs than previous work. \ No newline at end of file diff --git a/data/2021/iclr/Graph Convolution with Low-rank Learnable Local Filters b/data/2021/iclr/Graph Convolution with Low-rank Learnable Local Filters new file mode 100644 index 0000000000..0f14045340 --- /dev/null +++ b/data/2021/iclr/Graph Convolution with Low-rank Learnable Local Filters @@ -0,0 +1 @@ +Geometric variations like rotation, scaling, and viewpoint changes pose a significant challenge to visual understanding. One common solution is to directly model certain intrinsic structures, e.g., using landmarks. However, it then becomes non-trivial to build effective deep models, especially when the underlying non-Euclidean grid is irregular and coarse. Recent deep models using graph convolutions provide an appropriate framework to handle such non-Euclidean data, but many of them, particularly those based on global graph Laplacians, lack expressiveness to capture local features required for representation of signals lying on the non-Euclidean grid. The current paper introduces a new type of graph convolution with learnable low-rank local filters, which is provably more expressive than previous spectral graph convolution methods. The model also provides a unified framework for both spectral and spatial graph convolutions. To improve model robustness, regularization by local graph Laplacians is introduced. The representation stability against input graph data perturbation is theoretically proved, making use of the graph filter locality and the local graph regularization. 
Experiments on spherical mesh data, real-world facial expression recognition/skeleton-based action recognition data, and data with simulated graph noise show the empirical advantage of the proposed model. \ No newline at end of file diff --git a/data/2021/iclr/Graph Edit Networks b/data/2021/iclr/Graph Edit Networks new file mode 100644 index 0000000000..2e65efe2a1 --- /dev/null +++ b/data/2021/iclr/Graph Edit Networks @@ -0,0 +1 @@ +a \ No newline at end of file diff --git a/data/2021/iclr/Graph Information Bottleneck for Subgraph Recognition b/data/2021/iclr/Graph Information Bottleneck for Subgraph Recognition new file mode 100644 index 0000000000..5be22746bc --- /dev/null +++ b/data/2021/iclr/Graph Information Bottleneck for Subgraph Recognition @@ -0,0 +1 @@ +Given the input graph and its label/property, several key problems of graph learning, such as finding interpretable subgraphs, graph denoising and graph compression, can be attributed to the fundamental problem of recognizing a subgraph of the original one. This subgraph shall be as informative as possible, yet contain less redundant and noisy structure. This problem setting is closely related to the well-known information bottleneck (IB) principle, which, however, has been less studied for irregular graph data and graph neural networks (GNNs). In this paper, we propose a framework of Graph Information Bottleneck (GIB) for the subgraph recognition problem in deep graph learning. Under this framework, one can recognize the maximally informative yet compressive subgraph, named IB-subgraph. However, the GIB objective is notoriously hard to optimize, mostly due to the intractability of the mutual information of irregular graph data and the unstable optimization process.
In order to tackle these challenges, we propose: i) a GIB objective based on a mutual information estimator for the irregular graph data; ii) a bi-level optimization scheme to maximize the GIB objective; iii) a connectivity loss to stabilize the optimization process. We evaluate the properties of the IB-subgraph in three application scenarios: improvement of graph classification, graph interpretation and graph denoising. Extensive experiments demonstrate that the information-theoretic IB-subgraph enjoys superior graph properties. \ No newline at end of file diff --git a/data/2021/iclr/Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning b/data/2021/iclr/Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning new file mode 100644 index 0000000000..92e4d31834 --- /dev/null +++ b/data/2021/iclr/Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning @@ -0,0 +1 @@ +Graph Representation Learning (GRL) methods have impacted fields from chemistry to social science. However, their algorithmic implementations are specialized to specific use-cases, e.g., message passing methods are run differently from node embedding ones. Despite their apparent differences, all these methods utilize the graph structure, and therefore, their learning can be approximated with stochastic graph traversals. We propose Graph Traversal via Tensor Functionals (GTTF), a unifying meta-algorithm framework for easing the implementation of diverse graph algorithms and enabling transparent and efficient scaling to large graphs. GTTF is founded upon a data structure (stored as a sparse tensor) and a stochastic graph traversal algorithm (described using tensor operations). The algorithm is a functional that accepts two functions, and can be specialized to obtain a variety of GRL models and objectives, simply by changing those two functions.
We show that, for a wide class of methods, our algorithm learns in an unbiased fashion and, in expectation, approximates the learning as if the specialized implementations were run directly. With these capabilities, we scale otherwise non-scalable methods to set state-of-the-art on large graph datasets while being more efficient than existing GRL libraries - with only a handful of lines of code for each method specialization. GTTF and its various GRL implementations are available at: https://github.com/isi-usc-edu/gttf. \ No newline at end of file diff --git a/data/2021/iclr/Graph-Based Continual Learning b/data/2021/iclr/Graph-Based Continual Learning new file mode 100644 index 0000000000..9647235861 --- /dev/null +++ b/data/2021/iclr/Graph-Based Continual Learning @@ -0,0 +1 @@ +Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.
\ No newline at end of file diff --git a/data/2021/iclr/GraphCodeBERT: Pre-training Code Representations with Data Flow b/data/2021/iclr/GraphCodeBERT: Pre-training Code Representations with Data Flow new file mode 100644 index 0000000000..d1fa0896b0 --- /dev/null +++ b/data/2021/iclr/GraphCodeBERT: Pre-training Code Representations with Data Flow @@ -0,0 +1 @@ +Pre-trained models for programming languages have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks.
We further show that the model prefers structure-level attentions over token-level attentions in the task of code search. \ No newline at end of file diff --git a/data/2021/iclr/Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity b/data/2021/iclr/Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity new file mode 100644 index 0000000000..5182e8a9fc --- /dev/null +++ b/data/2021/iclr/Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity @@ -0,0 +1 @@ +Greedy-GQ is a value-based reinforcement learning (RL) algorithm for optimal control. Recently, the finite-time analysis of Greedy-GQ has been developed under linear function approximation and Markovian sampling, and the algorithm is shown to achieve an $\epsilon$-stationary point with a sample complexity in the order of $\mathcal{O}(\epsilon^{-3})$. Such a high sample complexity is due to the large variance induced by the Markovian samples. In this paper, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy optimal control. In particular, the algorithm applies the SVRG-based variance reduction scheme to reduce the stochastic variance of the two time-scale updates. We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling and show that the algorithm achieves a much smaller bias and variance error than the original Greedy-GQ. In particular, we prove that VR-Greedy-GQ achieves an improved sample complexity that is in the order of $\mathcal{O}(\epsilon^{-2})$. We further compare the performance of VR-Greedy-GQ with that of Greedy-GQ in various RL experiments to corroborate our theoretical findings. 
\ No newline at end of file diff --git a/data/2021/iclr/Grounded Language Learning Fast and Slow b/data/2021/iclr/Grounded Language Learning Fast and Slow new file mode 100644 index 0000000000..327068930c --- /dev/null +++ b/data/2021/iclr/Grounded Language Learning Fast and Slow @@ -0,0 +1 @@ +Recent work has shown that large text-based neural language models, trained with conventional supervised learning objectives, acquire a surprising propensity for few- and one-shot learning. Here, we show that an embodied agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional reinforcement learning algorithms. After a single introduction to a novel object via continuous visual perception and a language prompt ("This is a dax"), the agent can re-identify the object and manipulate it as instructed ("Put the dax on the bed"). In doing so, it seamlessly integrates short-term, within-episode knowledge of the appropriate referent for the word "dax" with long-term lexical and motor knowledge acquired across episodes (i.e. "bed" and "putting"). We find that, under certain training conditions and with a particular memory writing mechanism, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful for later executing instructions. Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for 'fast-mapping', a fundamental pillar of human cognitive development and a potentially transformative capacity for agents that interact with human users. 
\ No newline at end of file diff --git a/data/2021/iclr/Grounding Language to Autonomously-Acquired Skills via Goal Generation b/data/2021/iclr/Grounding Language to Autonomously-Acquired Skills via Goal Generation new file mode 100644 index 0000000000..05f293c29e --- /dev/null +++ b/data/2021/iclr/Grounding Language to Autonomously-Acquired Skills via Goal Generation @@ -0,0 +1 @@ +We are interested in the autonomous acquisition of repertoires of skills. Language-conditioned reinforcement learning (LC-RL) approaches are great tools in this quest, as they allow abstract goals to be expressed as sets of constraints on the states. However, most LC-RL agents are not autonomous and cannot learn without external instructions and feedback. Besides, their direct language condition cannot account for the goal-directed behavior of pre-verbal infants and strongly limits the expression of behavioral diversity for a given language input. To resolve these issues, we propose a new conceptual approach to language-conditioned RL: the Language-Goal-Behavior architecture (LGB). LGB decouples skill learning and language grounding via an intermediate semantic representation of the world. To showcase the properties of LGB, we present a specific implementation called DECSTR. DECSTR is an intrinsically motivated learning agent endowed with an innate semantic representation describing spatial relations between physical objects. In a first stage (G→B), it freely explores its environment and targets self-generated semantic configurations. In a second stage (L→G), it trains a language-conditioned goal generator to generate semantic goals that match the constraints expressed in language-based inputs. We showcase the additional properties of LGB w.r.t. both an end-to-end LC-RL approach and a similar approach leveraging non-semantic, continuous intermediate representations.
Intermediate semantic representations help satisfy language commands in a diversity of ways, enable strategy switching after a failure and facilitate language grounding. \ No newline at end of file diff --git a/data/2021/iclr/Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning b/data/2021/iclr/Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning new file mode 100644 index 0000000000..423b686820 --- /dev/null +++ b/data/2021/iclr/Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning @@ -0,0 +1 @@ +We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; currently, state-of-the-art models often require dense supervision on physical object properties and events from simulation, which are impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interaction among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the program to answer the question, leveraging the learned dynamics model. After training, DCL can detect and associate objects across the frames, ground visual properties and physical events, understand the causal relationship between events, make future and counterfactual predictions, and leverage these extracted representations for answering queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training.
We further test DCL on a newly proposed video-retrieval and event localization dataset derived from CLEVRER, showing its strong generalization capacity. \ No newline at end of file diff --git a/data/2021/iclr/Group Equivariant Conditional Neural Processes b/data/2021/iclr/Group Equivariant Conditional Neural Processes new file mode 100644 index 0000000000..f7d752d9d6 --- /dev/null +++ b/data/2021/iclr/Group Equivariant Conditional Neural Processes @@ -0,0 +1 @@ +We present the group equivariant conditional neural process (EquivCNP), a meta-learning method with permutation invariance in a data set, as in conventional conditional neural processes (CNPs), that also has transformation equivariance in data space. Incorporating group equivariance, such as rotation and scaling equivariance, provides a way to consider the symmetry of real-world data. We give a decomposition theorem for permutation-invariant and group-equivariant maps, which leads us to construct EquivCNPs with an infinite-dimensional latent space to handle group symmetries. In this paper, we build the architecture using Lie group convolutional layers for practical implementation. We show that EquivCNP with translation equivariance achieves comparable performance to conventional CNPs in a 1D regression task. Moreover, we demonstrate that, by incorporating an appropriate Lie group equivariance, EquivCNP is capable of zero-shot generalization for an image-completion task. \ No newline at end of file diff --git a/data/2021/iclr/Group Equivariant Generative Adversarial Networks b/data/2021/iclr/Group Equivariant Generative Adversarial Networks new file mode 100644 index 0000000000..68ead5d48e --- /dev/null +++ b/data/2021/iclr/Group Equivariant Generative Adversarial Networks @@ -0,0 +1 @@ +Generative adversarial networks are the state of the art for generative modeling in vision, yet are notoriously unstable in practice.
This instability is further exacerbated with limited training data. However, in the synthesis of domains such as medical or satellite imaging, it is often overlooked that the image label is invariant to global image symmetries (e.g., rotations and reflections). In this work, we improve gradient feedback between generator and discriminator using an inductive symmetry prior via group-equivariant convolutional networks. We replace convolutional layers with equivalent group-convolutional layers in both generator and discriminator, allowing for better optimization steps and increased expressive power with limited samples. In the process, we extend recent GAN developments to the group-equivariant setting. We demonstrate the utility of our methods by improving both sample fidelity and diversity in the class-conditional synthesis of a diverse set of globally-symmetric imaging modalities. \ No newline at end of file diff --git a/data/2021/iclr/Group Equivariant Stand-Alone Self-Attention For Vision b/data/2021/iclr/Group Equivariant Stand-Alone Self-Attention For Vision new file mode 100644 index 0000000000..916d004e37 --- /dev/null +++ b/data/2021/iclr/Group Equivariant Stand-Alone Self-Attention For Vision @@ -0,0 +1 @@ +We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the group considered. Since the group acts on the positional encoding directly, group equivariant self-attention networks (GSA-Nets) are steerable by nature. Our experiments on vision benchmarks demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks. 
\ No newline at end of file diff --git a/data/2021/iclr/Growing Efficient Deep Networks by Structured Continuous Sparsification b/data/2021/iclr/Growing Efficient Deep Networks by Structured Continuous Sparsification new file mode 100644 index 0000000000..de1ea3c604 --- /dev/null +++ b/data/2021/iclr/Growing Efficient Deep Networks by Structured Continuous Sparsification @@ -0,0 +1 @@ +We develop an approach to training deep networks while dynamically adjusting their architecture, driven by a principled combination of accuracy and sparsity objectives. Unlike conventional pruning approaches, our method adopts a gradual continuous relaxation of discrete network structure optimization and then samples sparse subnetworks, enabling efficient deep networks to be trained in a growing and pruning manner. Extensive experiments across CIFAR-10, ImageNet, PASCAL VOC, and Penn Treebank, with convolutional models for image classification and semantic segmentation, and recurrent models for language modeling, show that our training scheme yields efficient networks that are smaller and more accurate than those produced by competing pruning methods. \ No newline at end of file diff --git a/data/2021/iclr/HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark b/data/2021/iclr/HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark new file mode 100644 index 0000000000..1838872652 --- /dev/null +++ b/data/2021/iclr/HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark @@ -0,0 +1 @@ +HardWare-aware Neural Architecture Search (HW-NAS) has recently gained tremendous attention by automating the design of DNNs deployed in more resource-constrained daily life devices. Despite its promising performance, developing optimal HW-NAS solutions can be prohibitively challenging as it requires cross-disciplinary knowledge in the algorithm, micro-architecture, and device-specific compilation. 
First, to determine the hardware-cost to be incorporated into the NAS process, existing works mostly adopt either pre-collected hardware-cost look-up tables or device-specific hardware-cost models. Both of them limit the development of HW-NAS innovations and impose a barrier-to-entry to non-hardware experts. Second, similar to generic NAS, it can be notoriously difficult to benchmark HW-NAS algorithms due to their significant required computational resources and the differences in adopted search spaces, hyperparameters, and hardware devices. To this end, we develop HW-NAS-Bench, the first public dataset for HW-NAS research which aims to democratize HW-NAS research to non-hardware experts and make HW-NAS research more reproducible and accessible. To design HW-NAS-Bench, we carefully collected the measured/estimated hardware performance of all the networks in the search spaces of both NAS-Bench-201 and FBNet, on six hardware devices that fall into three categories (i.e., commercial edge devices, FPGA, and ASIC). Furthermore, we provide a comprehensive analysis of the collected measurements in HW-NAS-Bench to provide insights for HW-NAS research. Finally, we demonstrate exemplary user cases to (1) show that HW-NAS-Bench allows non-hardware experts to perform HW-NAS by simply querying it and (2) verify that dedicated device-specific HW-NAS can indeed lead to optimal accuracy-cost trade-offs. The codes and all collected data are available at https://github.com/RICE-EIC/HW-NAS-Bench. 
\ No newline at end of file diff --git a/data/2021/iclr/HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents b/data/2021/iclr/HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents new file mode 100644 index 0000000000..617255478f --- /dev/null +++ b/data/2021/iclr/HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents @@ -0,0 +1 @@ +of \ No newline at end of file diff --git a/data/2021/iclr/Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds b/data/2021/iclr/Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds new file mode 100644 index 0000000000..aaac9e41b9 --- /dev/null +++ b/data/2021/iclr/Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds @@ -0,0 +1 @@ +In the present work we study classifiers' decision boundaries via Brownian motion processes in ambient data space and associated probabilistic techniques. Intuitively, our ideas correspond to placing a heat source at the decision boundary and observing how effectively the sample points warm up. We are largely motivated by the search for a soft measure that sheds further light on the decision boundary's geometry. En route, we bridge aspects of potential theory and geometric analysis (Mazya, 2011, Grigoryan-Saloff-Coste, 2002) with active fields of ML research such as adversarial examples and generalization bounds. First, we focus on the geometric behavior of decision boundaries in the light of adversarial attack/defense mechanisms. 
Experimentally, we observe a certain capacitory trend over different adversarial defense strategies: decision boundaries locally become flatter as measured by isoperimetric inequalities (Ford et al., 2019); however, our more sensitive heat-diffusion metrics extend this analysis and further reveal that some non-trivial geometry invisible to plain distance-based methods is still preserved. Intuitively, we provide evidence that the decision boundaries nevertheless retain many persistent "wiggly and fuzzy" regions on a finer scale. Second, we show how Brownian hitting probabilities translate to soft generalization bounds, which are in turn connected to compression and noise stability (Arora et al., 2018), and these bounds are significantly stronger if the decision boundary has controlled geometric features. \ No newline at end of file diff --git a/data/2021/iclr/HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients b/data/2021/iclr/HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients new file mode 100644 index 0000000000..6ba56b089f --- /dev/null +++ b/data/2021/iclr/HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients @@ -0,0 +1 @@ +Federated Learning (FL) is a method of training machine learning models on private data distributed over a large number of possibly heterogeneous clients such as mobile phones and IoT devices. In this work, we propose a new federated learning framework named HeteroFL to address heterogeneous clients equipped with very different computation and communication capabilities. Our solution can enable the training of heterogeneous local models with varying computation complexities and still produce a single global inference model. For the first time, our method challenges the underlying assumption of existing work that local models have to share the same architecture as the global model.
We demonstrate several strategies to enhance FL training and conduct extensive empirical evaluations, including five computation complexity levels of three model architectures on three datasets. We show that adaptively distributing subnetworks according to clients' capabilities is both computation and communication efficient. \ No newline at end of file diff --git a/data/2021/iclr/Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization b/data/2021/iclr/Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization new file mode 100644 index 0000000000..4a1e13c133 --- /dev/null +++ b/data/2021/iclr/Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization @@ -0,0 +1 @@ +Real-world large-scale datasets are heteroskedastic and imbalanced -- labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning.
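As an illustrative sketch only (the functional form below is an assumption, not the paper's derived formula), the adaptive idea amounts to scaling a per-example regularization weight up where estimated data density is low and label uncertainty is high:

```python
import numpy as np

def adaptive_reg_strength(density, uncertainty, base=1.0, eps=1e-8):
    """Per-example regularization weight: grows as local data density falls
    or estimated label uncertainty rises (illustrative functional form)."""
    return base * uncertainty / (density + eps)

density = np.array([1.0, 0.1, 0.5])      # estimated local data density
uncertainty = np.array([0.2, 0.9, 0.5])  # estimated label uncertainty
weights = adaptive_reg_strength(density, uncertainty)
# The rare, noisy second example receives the largest regularization weight.
```

These weights could then multiply a standard per-example penalty, regularizing different regions of input space differently as the abstract describes.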
\ No newline at end of file diff --git a/data/2021/iclr/Hierarchical Autoregressive Modeling for Neural Video Compression b/data/2021/iclr/Hierarchical Autoregressive Modeling for Neural Video Compression new file mode 100644 index 0000000000..360aec1c06 --- /dev/null +++ b/data/2021/iclr/Hierarchical Autoregressive Modeling for Neural Video Compression @@ -0,0 +1 @@ +Recent work by Marino et al. (2020) showed improved performance in sequential density estimation by combining masked autoregressive flows with hierarchical latent variable models. We draw a connection between such autoregressive generative models and the task of lossy video compression. Specifically, we view recent neural video compression methods (Lu et al., 2019; Yang et al., 2020b; Agustsson et al., 2020) as instances of a generalized stochastic temporal autoregressive transform, and propose avenues for enhancement based on this insight. Comprehensive evaluations on large-scale video data show improved rate-distortion performance over both state-of-the-art neural and conventional video compression methods. \ No newline at end of file diff --git a/data/2021/iclr/Hierarchical Reinforcement Learning by Discovering Intrinsic Options b/data/2021/iclr/Hierarchical Reinforcement Learning by Discovering Intrinsic Options new file mode 100644 index 0000000000..ce68782f8a --- /dev/null +++ b/data/2021/iclr/Hierarchical Reinforcement Learning by Discovering Intrinsic Options @@ -0,0 +1 @@ +We propose a hierarchical reinforcement learning method, HIDIO, that can learn task-agnostic options in a self-supervised manner while jointly learning to utilize them to solve sparse-reward tasks. Unlike current hierarchical RL approaches that tend to formulate goal-reaching low-level tasks or pre-define ad hoc lower-level policies, HIDIO encourages lower-level option learning that is independent of the task at hand, requiring few assumptions or little knowledge about the task structure.
These options are learned through an intrinsic entropy minimization objective conditioned on the option sub-trajectories. The learned options are diverse and task-agnostic. In experiments on sparse-reward robotic manipulation and navigation tasks, HIDIO achieves higher success rates with greater sample efficiency than regular RL baselines and two state-of-the-art hierarchical RL methods. \ No newline at end of file diff --git a/data/2021/iclr/High-Capacity Expert Binary Networks b/data/2021/iclr/High-Capacity Expert Binary Networks new file mode 100644 index 0000000000..3cd200d47b --- /dev/null +++ b/data/2021/iclr/High-Capacity Expert Binary Networks @@ -0,0 +1 @@ +Network binarization is a promising hardware-aware direction for creating efficient deep models. Despite its memory and computational advantages, reducing the accuracy gap between such models and their real-valued counterparts remains a challenging, unsolved research problem. To this end, we make the following 3 contributions: (a) To increase model capacity, we propose Expert Binary Convolution, which, for the first time, tailors conditional computing to binary networks by learning to select one data-specific expert binary filter at a time conditioned on input features. (b) To increase representation capacity, we propose to address the inherent information bottleneck in binary networks by introducing an efficient width expansion mechanism which keeps the binary operations within the same budget. (c) To improve network design, we propose a principled binary network growth mechanism that unveils a set of network topologies of favorable properties. Overall, our method improves upon prior work by ~6%, with no increase in computational cost, reaching a groundbreaking ~71% on ImageNet classification.
\ No newline at end of file diff --git a/data/2021/iclr/Hopfield Networks is All You Need b/data/2021/iclr/Hopfield Networks is All You Need new file mode 100644 index 0000000000..bbb89c8c2e --- /dev/null +++ b/data/2021/iclr/Hopfield Networks is All You Need @@ -0,0 +1 @@ +We show that the transformer attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors. The number of stored patterns is traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. Transformer and BERT models operate in their first layers preferably in the global averaging regime, while they operate in higher layers in metastable states. The gradient in transformers is maximal for metastable states, is uniformly distributed for global averaging, and vanishes for a fixed point near a stored pattern. Using the Hopfield network interpretation, we analyzed learning of transformer and BERT models. Learning starts with attention heads that average and then most of them switch to metastable states. However, the majority of heads in the first layers still averages and can be replaced by averaging, e.g. our proposed Gaussian weighting. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads seem to be a promising target for improving transformers. Neural networks with Hopfield networks outperform other methods on immune repertoire classification, where the Hopfield net stores several hundreds of thousands of patterns. 
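The claimed equivalence is easy to check directly: one retrieval step of the continuous modern Hopfield network is softmax attention over the stored patterns. A minimal numpy sketch (beta and dimensions chosen arbitrarily for illustration):

```python
import numpy as np

def hopfield_update(X, xi, beta=8.0):
    """One modern-Hopfield update: xi_new = X @ softmax(beta * X.T @ xi).
    Columns of X are the stored patterns; this is exactly the attention update."""
    logits = beta * (X.T @ xi)
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return X @ p

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))                  # 5 stored 64-dimensional patterns
query = X[:, 2] + 0.1 * rng.normal(size=64)   # noisy version of pattern 2
retrieved = hopfield_update(X, query)
# One update already lands (numerically) on the stored pattern, illustrating
# the one-step convergence and small retrieval error described above.
```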
We provide a new PyTorch layer called "Hopfield", which makes it possible to equip deep learning architectures with modern Hopfield networks as a new powerful concept comprising pooling, memory, and attention. GitHub: this https URL \ No newline at end of file diff --git a/data/2021/iclr/Hopper: Multi-hop Transformer for Spatiotemporal Reasoning b/data/2021/iclr/Hopper: Multi-hop Transformer for Spatiotemporal Reasoning new file mode 100644 index 0000000000..f960bec79b --- /dev/null +++ b/data/2021/iclr/Hopper: Multi-hop Transformer for Spatiotemporal Reasoning @@ -0,0 +1 @@ +This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through the video while being occluded, contained or carried by other objects. Existing deep learning based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer for reasoning about object permanence in videos. Given a video and a localization query, Hopper reasons over image and object tracks to automatically hop over critical frames in an iterative fashion to predict the final position of the object of interest. We demonstrate the effectiveness of using a contrastive loss to reduce spatiotemporal biases. We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through a few critical frames. We also demonstrate Hopper can perform long-term reasoning by building a CATER-h dataset that requires multi-step reasoning to localize objects of interest correctly. \ No newline at end of file diff --git a/data/2021/iclr/How Benign is Benign Overfitting ? b/data/2021/iclr/How Benign is Benign Overfitting ?
@@ -0,0 +1 @@ +We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, something referred to as benign overfitting [2, 10]. However, these models are vulnerable to adversarial attacks. We identify label noise as one of the causes for adversarial vulnerability, and provide theoretical and empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets such as MNIST and CIFAR, and that robustly trained models incur training error on some of these, i.e. they don't fit the noise. However, removing noisy labels alone does not suffice to achieve adversarial robustness. Standard training procedures bias neural networks towards learning "simple" classification boundaries, which may be less robust than more complex ones. We observe that adversarial training does produce more complex decision boundaries. We conjecture that in part the need for complex decision boundaries arises from sub-optimal representation learning. By means of simple toy examples, we show theoretically how the choice of representation can drastically affect adversarial robustness. \ No newline at end of file diff --git a/data/2021/iclr/How Does Mixup Help With Robustness and Generalization? b/data/2021/iclr/How Does Mixup Help With Robustness and Generalization? new file mode 100644 index 0000000000..4787d6cd01 --- /dev/null +++ b/data/2021/iclr/How Does Mixup Help With Robustness and Generalization? @@ -0,0 +1 @@ +Mixup is a popular data augmentation technique based on taking convex combinations of pairs of examples and their labels. This simple technique has been shown to substantially improve both the robustness and the generalization of the trained model. 
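Concretely, one mixup step can be sketched as follows (a minimal numpy version; the Beta concentration alpha is the usual tunable hyperparameter):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Return a convex combination of two examples and of their labels,
    with a Beta(alpha, alpha)-distributed mixing weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x1, y1 = np.ones(4), np.array([1.0, 0.0])    # example with a one-hot label
x2, y2 = np.zeros(4), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2)
# The mixed label remains a valid probability distribution.
```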
However, it is not well-understood why such improvement occurs. In this paper, we provide theoretical analysis to demonstrate how using Mixup in training helps model robustness and generalization. For robustness, we show that minimizing the Mixup loss corresponds to approximately minimizing an upper bound of the adversarial loss. This explains why models obtained by Mixup training exhibit robustness to several kinds of adversarial attacks such as the Fast Gradient Sign Method (FGSM). For generalization, we prove that Mixup augmentation corresponds to a specific type of data-adaptive regularization which reduces overfitting. Our analysis provides new insights and a framework to understand Mixup. \ No newline at end of file diff --git a/data/2021/iclr/How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? b/data/2021/iclr/How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? new file mode 100644 index 0000000000..dacd0b277b --- /dev/null +++ b/data/2021/iclr/How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? @@ -0,0 +1 @@ +A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target accuracy $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it has been shown that under a certain margin assumption on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, how much over-parameterization is sufficient to guarantee optimization and generalization for deep neural networks still remains an open question. In this work, we establish sharp optimization and generalization guarantees for deep ReLU networks.
Under various assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings. \ No newline at end of file diff --git a/data/2021/iclr/How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks b/data/2021/iclr/How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks new file mode 100644 index 0000000000..7399d5ef0c --- /dev/null +++ b/data/2021/iclr/How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks @@ -0,0 +1 @@ +We study how neural networks trained by gradient descent extrapolate, i.e., what they learn outside the support of the training distribution. Previous works report mixed empirical results when extrapolating with neural networks: while multilayer perceptrons (MLPs) do not extrapolate well in certain simple tasks, Graph Neural Network (GNN), a structured network with MLP modules, has shown some success in more complex tasks. Working towards a theoretical explanation, we identify conditions under which MLPs and GNNs extrapolate well. First, we quantify the observation that ReLU MLPs quickly converge to linear functions along any direction from the origin, which implies that ReLU MLPs do not extrapolate most non-linear functions. But, they can provably learn a linear target function when the training distribution is sufficiently "diverse". Second, in connection to analyzing successes and limitations of GNNs, these results suggest a hypothesis for which we provide theoretical and empirical evidence: the success of GNNs in extrapolating algorithmic tasks to new data (e.g., larger graphs or edge weights) relies on encoding task-specific non-linearities in the architecture or features. 
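The first claim above is easy to verify numerically in a simplified setting: a bias-free ReLU MLP is positively homogeneous, so along any ray from the origin its output is exactly linear in the ray parameter. A toy sketch with random weights (an illustration of the claim, not the paper's construction):

```python
import numpy as np

# Toy check of the directional-linearity claim for ReLU MLPs: without biases,
# ReLU(t * z) = t * ReLU(z) for t > 0, so the network is linear along any ray.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 8))
W2 = rng.normal(size=(1, 32))

def mlp(x):
    return (W2 @ np.maximum(W1 @ x, 0.0)).item()

v = rng.normal(size=8)                       # an arbitrary direction
ys = [mlp(t * v) for t in (1.0, 2.0, 3.0)]
second_diff = (ys[2] - ys[1]) - (ys[1] - ys[0])
# A vanishing second difference along the ray means the output is linear in t,
# so the network extrapolates linearly in this direction.
```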
\ No newline at end of file diff --git a/data/2021/iclr/How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision b/data/2021/iclr/How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision new file mode 100644 index 0000000000..f96b2206a3 --- /dev/null +++ b/data/2021/iclr/How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision @@ -0,0 +1 @@ +The attention mechanism in graph neural networks is designed to assign larger weights to important neighbor nodes for better representation. However, what graph attention learns is not understood well, particularly when graphs are noisy. In this paper, we propose a self-supervised graph attention network (SuperGAT), an improved graph attention model for noisy graphs. Specifically, we exploit two attention forms compatible with a self-supervised task to predict edges, whose presence and absence contain the inherent information about the importance of the relationships between nodes. By encoding edges, SuperGAT learns more expressive attention in distinguishing mislinked neighbors. We find that two graph characteristics influence the effectiveness of attention forms and self-supervision: homophily and average degree. Thus, our recipe provides guidance on which attention design to use when those two graph characteristics are known. Our experiment on 17 real-world datasets demonstrates that our recipe generalizes across 15 of them, and models designed by our recipe show improved performance over baselines.
\ No newline at end of file diff --git a/data/2021/iclr/Human-Level Performance in No-Press Diplomacy via Equilibrium Search b/data/2021/iclr/Human-Level Performance in No-Press Diplomacy via Equilibrium Search new file mode 100644 index 0000000000..294789a22c --- /dev/null +++ b/data/2021/iclr/Human-Level Performance in No-Press Diplomacy via Equilibrium Search @@ -0,0 +1 @@ +Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via external regret minimization. External regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and achieves a rank of 23 out of 1,128 human players when playing anonymous games on a popular Diplomacy website. \ No newline at end of file diff --git a/data/2021/iclr/HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks b/data/2021/iclr/HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks new file mode 100644 index 0000000000..a77a4f53f5 --- /dev/null +++ b/data/2021/iclr/HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks @@ -0,0 +1 @@ +We propose HyperDynamics, a dynamics meta-learning framework that conditions on an agent's interactions with the environment and optionally its visual observations, and generates the parameters of neural dynamics models based on inferred properties of the dynamical system. 
Physical and visual properties of the environment that are not part of the low-dimensional state yet affect its temporal dynamics are inferred from the interaction history and visual observations, and are implicitly captured in the generated parameters. We test HyperDynamics on a set of object pushing and locomotion tasks. It outperforms existing dynamics models in the literature that adapt to environment variations by learning dynamics over high-dimensional visual observations, capturing the interactions of the agent in recurrent state representations, or using gradient-based meta-optimization. We also show our method matches the performance of an ensemble of separately trained experts, while also being able to generalize well to unseen environment variations at test time. We attribute its good performance to the multiplicative interactions between the inferred system properties -- captured in the generated parameters -- and the low-dimensional state representation of the dynamical system. \ No newline at end of file diff --git a/data/2021/iclr/HyperGrid Transformers: Towards A Single Model for Multiple Tasks b/data/2021/iclr/HyperGrid Transformers: Towards A Single Model for Multiple Tasks new file mode 100644 index 0000000000..63914e59d5 --- /dev/null +++ b/data/2021/iclr/HyperGrid Transformers: Towards A Single Model for Multiple Tasks @@ -0,0 +1 @@ +Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hypernetworks for controlling its feed-forward layers.
Specifically, we propose a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks. In order to construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We conduct an extensive set of experiments on GLUE/SuperGLUE. On the SuperGLUE test set, we match the performance of the state-of-the-art while being 16 times more parameter-efficient. Our method helps bridge the gap between fine-tuning and multi-task learning approaches. \ No newline at end of file diff --git a/data/2021/iclr/Hyperbolic Neural Networks++ b/data/2021/iclr/Hyperbolic Neural Networks++ new file mode 100644 index 0000000000..2aab166e00 --- /dev/null +++ b/data/2021/iclr/Hyperbolic Neural Networks++ @@ -0,0 +1 @@ +Hyperbolic spaces have recently gained momentum in the context of machine learning due to their high capacity and tree-likeness properties. However, the representational power of hyperbolic geometry is not yet on par with Euclidean geometry, mostly because of the absence of corresponding hyperbolic neural network layers. This makes it hard to use hyperbolic embeddings in downstream tasks. Here, we bridge this gap in a principled manner by combining the formalism of Mobius gyrovector spaces with the Riemannian geometry of the Poincare model of hyperbolic spaces. As a result, we derive hyperbolic versions of important deep learning tools: multinomial logistic regression, feed-forward and recurrent neural networks such as gated recurrent units. This allows us to embed sequential data and perform classification in the hyperbolic space. Empirically, we show that, even if hyperbolic optimization tools are limited, hyperbolic sentence embeddings either outperform or are on par with their Euclidean variants on textual entailment and noisy-prefix recognition tasks.
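For reference, the basic gyrovector operation underlying such layers, Mobius addition on the Poincare ball of curvature -c, has a standard closed form. A small numpy sketch of that textbook formula (not code from the paper itself):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball with curvature -c (standard formula)."""
    xy = float(np.dot(x, y))
    x2 = float(np.dot(x, x))
    y2 = float(np.dot(y, y))
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

x = np.array([0.3, 0.1])
y = np.array([0.2, -0.4])
origin = np.zeros(2)
# The origin acts as the additive identity, and sums of points inside the
# unit ball stay inside the ball.
```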
\ No newline at end of file diff --git a/data/2021/iclr/IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression b/data/2021/iclr/IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression new file mode 100644 index 0000000000..532c07b6b9 --- /dev/null +++ b/data/2021/iclr/IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression @@ -0,0 +1 @@ +In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Due to their discrete nature, they can be combined in a straightforward manner with entropy coding schemes for lossless compression without the need for bits-back coding. We discuss the potential difference in flexibility between invertible flows for discrete random variables and flows for continuous random variables and show that (integer) discrete flows are more flexible than previously claimed. We furthermore investigate the influence of quantization operators on optimization and gradient bias in integer discrete flows. Finally, we introduce modifications to the architecture to improve the performance of this model class for lossless compression.
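The core invertibility property is easy to illustrate: an additive coupling layer that shifts one half of an integer vector by a rounded, data-dependent amount is exactly invertible over the integers. A toy sketch (the real model uses learned networks and stacked layers; the linear `shift_net` here is a stand-in):

```python
import numpy as np

def coupling_forward(xa, xb, shift_net):
    """Shift xb by a rounded function of xa; xa passes through unchanged."""
    return xa, xb + np.round(shift_net(xa)).astype(int)

def coupling_inverse(ya, yb, shift_net):
    """Undo the shift exactly: rounding makes the map bijective on integers."""
    return ya, yb - np.round(shift_net(ya)).astype(int)

shift_net = lambda z: 0.7 * z        # stand-in for a learned shift network
xa, xb = np.array([2, -1]), np.array([5, 3])
ya, yb = coupling_forward(xa, xb, shift_net)
xa_rec, xb_rec = coupling_inverse(ya, yb, shift_net)
# Exact round-trip on integers, with no bits-back machinery required.
```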
\ No newline at end of file diff --git a/data/2021/iclr/IEPT: Instance-Level and Episode-Level Pretext Tasks for Few-Shot Learning b/data/2021/iclr/IEPT: Instance-Level and Episode-Level Pretext Tasks for Few-Shot Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving b/data/2021/iclr/INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving new file mode 100644 index 0000000000..f9dfa9cc14 --- /dev/null +++ b/data/2021/iclr/INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving @@ -0,0 +1 @@ +In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark, specifically designed to test agents' generalization ability. INT is based on a procedure for generating theorems and proofs; this procedure's knobs allow us to measure 6 different types of generalization, each reflecting a distinct challenge characteristic to automated theorem proving. In addition, unlike prior benchmarks for learning-assisted theorem proving, INT provides a lightweight and user-friendly theorem proving environment with fast simulations, conducive to performing learning-based and search-based research. We introduce learning-based baselines and evaluate them across 6 dimensions of generalization with the benchmark. We then evaluate the same agents augmented with Monte Carlo Tree Search (MCTS) at test time, and show that MCTS can help to prove new theorems. 
\ No newline at end of file diff --git a/data/2021/iclr/IOT: Instance-wise Layer Reordering for Transformer Structures b/data/2021/iclr/IOT: Instance-wise Layer Reordering for Transformer Structures new file mode 100644 index 0000000000..699ac5a89b --- /dev/null +++ b/data/2021/iclr/IOT: Instance-wise Layer Reordering for Transformer Structures @@ -0,0 +1 @@ +With sequentially stacked self-attention, (optional) encoder-decoder attention, and feed-forward layers, the Transformer has achieved great success in natural language processing (NLP), and many variants have been proposed. Currently, almost all these models assume that the layer order is fixed and kept the same across data samples. We observe that different data samples actually favor different orders of the layers. Based on this observation, in this work, we break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure. Our Instance-wise Ordered Transformer (IOT) can model a variety of functions via reordered layers, which enables each sample to select a better-suited order, improving model performance under the constraint of almost the same number of parameters. To achieve this, we introduce a light predictor with negligible parameter and inference cost to decide the most capable and favorable layer order for any input sequence. Experiments on 3 tasks (neural machine translation, abstractive summarization, and code generation) and 9 datasets demonstrate consistent improvements of our method. We further show that our method can also be applied to other architectures beyond Transformer. Our code is released on GitHub.
\ No newline at end of file diff --git a/data/2021/iclr/Identifying Physical Law of Hamiltonian Systems via Meta-Learning b/data/2021/iclr/Identifying Physical Law of Hamiltonian Systems via Meta-Learning new file mode 100644 index 0000000000..f8281e7319 --- /dev/null +++ b/data/2021/iclr/Identifying Physical Law of Hamiltonian Systems via Meta-Learning @@ -0,0 +1 @@ +Hamiltonian mechanics is an effective tool to represent many physical processes with concise yet well-generalized mathematical expressions. A well-modeled Hamiltonian makes it easy for researchers to analyze and forecast many related phenomena that are governed by the same physical law. However, in general, identifying a functional or shared expression of the Hamiltonian is very difficult. It requires carefully designed experiments and the researcher's insight that comes from years of experience. We propose that meta-learning algorithms can be potentially powerful data-driven tools for identifying the physical law governing Hamiltonian systems without any mathematical assumptions on the representation, but with observations from a set of systems governed by the same physical law. We show that a well meta-trained learner can identify the shared representation of the Hamiltonian by evaluating our method on several types of physical systems with various experimental settings. \ No newline at end of file diff --git a/data/2021/iclr/Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies b/data/2021/iclr/Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies new file mode 100644 index 0000000000..93d1d01c4d --- /dev/null +++ b/data/2021/iclr/Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies @@ -0,0 +1 @@ +A main theoretical interest in biology and physics is to identify the nonlinear dynamical system (DS) that generated observed time series. 
Recurrent Neural Networks (RNNs) are, in principle, powerful enough to approximate any underlying DS, but in their vanilla form suffer from the exploding vs. vanishing gradients problem. Previous attempts to alleviate this problem resulted either in more complicated, mathematically less tractable RNN architectures \ No newline at end of file diff --git a/data/2021/iclr/Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels b/data/2021/iclr/Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels new file mode 100644 index 0000000000..c9587b5637 --- /dev/null +++ b/data/2021/iclr/Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels @@ -0,0 +1 @@ +We propose a simple data augmentation technique that can be applied to standard model-free reinforcement learning algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training. The approach leverages input perturbations commonly used in computer vision tasks to regularize the value function. Existing model-free approaches, such as Soft Actor-Critic (SAC), are not able to train deep networks effectively from image pixels. However, the addition of our augmentation method dramatically improves SAC's performance, enabling it to reach state-of-the-art performance on the DeepMind control suite, surpassing model-based (Dreamer, PlaNet, and SLAC) methods and recently proposed contrastive learning (CURL). Our approach can be combined with any model-free reinforcement learning algorithm, requiring only minor modifications. An implementation can be found at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering b/data/2021/iclr/Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering new file mode 100644 index 0000000000..4a26e828a8 --- /dev/null +++ b/data/2021/iclr/Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering @@ -0,0 +1 @@ +Differentiable rendering has paved the way to training neural networks to perform "inverse graphics" tasks such as predicting 3D geometry from monocular photographs. To train high-performing models, most of the current approaches rely on multi-view imagery, which is not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be manipulated by simply manipulating the latent codes. However, these latent codes often lack further physical interpretation and thus GANs cannot easily be inverted to perform explicit 3D reasoning. In this paper, we aim to extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers. Key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties. The entire architecture is trained iteratively using cycle consistency losses. We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and via user studies. We further showcase the disentangled GAN as a controllable 3D "neural renderer", complementing traditional graphics renderers.
\ No newline at end of file diff --git a/data/2021/iclr/Impact of Representation Learning in Linear Bandits b/data/2021/iclr/Impact of Representation Learning in Linear Bandits new file mode 100644 index 0000000000..2868f5856f --- /dev/null +++ b/data/2021/iclr/Impact of Representation Learning in Linear Bandits @@ -0,0 +1 @@ +We study how representation learning can improve the efficiency of bandit problems. We study the setting where we play $T$ linear bandits with dimension $d$ concurrently, and these $T$ bandit tasks share a common $k (\ll d)$ dimensional linear representation. For the finite-action setting, we present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit. When $T$ is sufficiently large, our algorithm significantly outperforms the naive algorithm (playing $T$ bandits independently) that achieves $\widetilde{O}(T\sqrt{d N})$ regret. We also provide an $\Omega(T\sqrt{kN} + \sqrt{dkNT})$ regret lower bound, showing that our algorithm is minimax-optimal up to poly-logarithmic factors. Furthermore, we extend our algorithm to the infinite-action setting and obtain a corresponding regret bound which demonstrates the benefit of representation learning in certain regimes. We also present experiments on synthetic and real-world data to illustrate our theoretical findings and demonstrate the effectiveness of our proposed algorithms. 
\ No newline at end of file diff --git a/data/2021/iclr/Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time b/data/2021/iclr/Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time new file mode 100644 index 0000000000..32206a811b --- /dev/null +++ b/data/2021/iclr/Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time @@ -0,0 +1 @@ +We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with a polynomial complexity with respect to the number of data samples, the number of neurons, and data dimension. More specifically, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for several two- and three-layer CNN architectures. We first prove that two-layer CNNs can be globally optimized via an $\ell_2$ norm regularized convex program. We then show that three-layer CNN training problems are equivalent to an $\ell_1$ regularized convex program that encourages sparsity in the spectral domain. We also extend these results to multi-layer CNN architectures including three-layer networks with two ReLU layers and deeper circular convolutions with a single ReLU layer. Furthermore, we present extensions of our approach to different pooling methods, which elucidates the implicit architectural bias as convex regularizers. \ No newline at end of file diff --git a/data/2021/iclr/Implicit Gradient Regularization b/data/2021/iclr/Implicit Gradient Regularization new file mode 100644 index 0000000000..52899f3490 --- /dev/null +++ b/data/2021/iclr/Implicit Gradient Regularization @@ -0,0 +1 @@ +Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. 
We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. More broadly, our work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent. \ No newline at end of file diff --git a/data/2021/iclr/Implicit Normalizing Flows b/data/2021/iclr/Implicit Normalizing Flows new file mode 100644 index 0000000000..04983e0e6a --- /dev/null +++ b/data/2021/iclr/Implicit Normalizing Flows @@ -0,0 +1 @@ +Normalizing flows define a probability distribution by an explicit invertible transformation $\boldsymbol{\mathbf{z}}=f(\boldsymbol{\mathbf{x}})$. In this work, we present implicit normalizing flows (ImpFlows), which generalize normalizing flows by allowing the mapping to be implicitly defined by the roots of an equation $F(\boldsymbol{\mathbf{z}}, \boldsymbol{\mathbf{x}})= \boldsymbol{\mathbf{0}}$. ImpFlows build on residual flows (ResFlows) with a proper balance between expressiveness and tractability. Through theoretical analysis, we show that the function space of ImpFlow is strictly richer than that of ResFlows. Furthermore, for any ResFlow with a fixed number of blocks, there exists some function for which ResFlow has a non-negligible approximation error.
However, the function is exactly representable by a single-block ImpFlow. We propose a scalable algorithm to train and draw samples from ImpFlows. Empirically, we evaluate ImpFlow on several classification and density modeling tasks, and ImpFlow outperforms ResFlow with a comparable amount of parameters on all the benchmarks. \ No newline at end of file diff --git a/data/2021/iclr/Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning b/data/2021/iclr/Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning new file mode 100644 index 0000000000..78175fb49d --- /dev/null +++ b/data/2021/iclr/Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning @@ -0,0 +1 @@ +We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity in terms of a drop in the rank of the learned value network features, and show that this corresponds to a drop in performance. We demonstrate this phenomenon on widely studied domains, including Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse improves performance.
\ No newline at end of file diff --git a/data/2021/iclr/Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors b/data/2021/iclr/Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Improved Autoregressive Modeling with Distribution Smoothing b/data/2021/iclr/Improved Autoregressive Modeling with Distribution Smoothing new file mode 100644 index 0000000000..f3a25669e7 --- /dev/null +++ b/data/2021/iclr/Improved Autoregressive Modeling with Distribution Smoothing @@ -0,0 +1 @@ +While autoregressive models excel at image compression, their sample quality is often lacking. Although not realistic, generated images often have high likelihood according to the model, resembling the case of adversarial examples. Inspired by a successful adversarial defense method, we incorporate randomized smoothing into autoregressive generative modeling. We first model a smoothed version of the data distribution, and then reverse the smoothing process to recover the original data distribution. This procedure drastically improves the sample quality of existing autoregressive models on several synthetic and real-world image datasets while obtaining competitive likelihoods on synthetic datasets. \ No newline at end of file diff --git "a/data/2021/iclr/Improved Estimation of Concentration Under \342\204\223p-Norm Distance Metrics Using Half Spaces" "b/data/2021/iclr/Improved Estimation of Concentration Under \342\204\223p-Norm Distance Metrics Using Half Spaces" new file mode 100644 index 0000000000..0be34e6dcd --- /dev/null +++ "b/data/2021/iclr/Improved Estimation of Concentration Under \342\204\223p-Norm Distance Metrics Using Half Spaces" @@ -0,0 +1 @@ +Concentration of measure has been argued to be the fundamental cause of adversarial vulnerability. Mahloujifar et al. 
presented an empirical way to measure the concentration of a data distribution using samples, and employed it to find lower bounds on intrinsic robustness for several benchmark datasets. However, it remains unclear whether these lower bounds are tight enough to provide a useful approximation for the intrinsic robustness of a dataset. To gain a deeper understanding of the concentration of measure phenomenon, we first extend the Gaussian Isoperimetric Inequality to non-spherical Gaussian measures and arbitrary $\ell_p$-norms ($p \geq 2$). We leverage these theoretical insights to design a method that uses half-spaces to estimate the concentration of any empirical dataset under $\ell_p$-norm distance metrics. Our proposed algorithm is more efficient than Mahloujifar et al.'s, and our experiments on synthetic datasets and image benchmarks demonstrate that it is able to find much tighter intrinsic robustness bounds. These tighter estimates provide further evidence that rules out intrinsic dataset concentration as a possible explanation for the adversarial vulnerability of state-of-the-art classifiers. \ No newline at end of file diff --git a/data/2021/iclr/Improving Adversarial Robustness via Channel-wise Activation Suppressing b/data/2021/iclr/Improving Adversarial Robustness via Channel-wise Activation Suppressing new file mode 100644 index 0000000000..1681920813 --- /dev/null +++ b/data/2021/iclr/Improving Adversarial Robustness via Channel-wise Activation Suppressing @@ -0,0 +1 @@ +The study of adversarial examples and their activation has attracted significant attention for secure and robust learning with deep neural networks (DNNs). 
Different from existing works, in this paper, we highlight two new characteristics of adversarial examples from the channel-wise activation perspective: 1) the activation magnitudes of adversarial examples are higher than those of natural examples; and 2) the channels are activated more uniformly by adversarial examples than by natural examples. We find that the state-of-the-art defense, adversarial training, has addressed the first issue of high activation magnitudes via training on adversarial examples, while the second issue of uniform activation remains. This motivates us to suppress redundant activations from being triggered by adversarial perturbations via a Channel-wise Activation Suppressing (CAS) strategy. We show that CAS can train a model that inherently suppresses adversarial activation, and can be easily applied to existing defense methods to further improve their robustness. Our work provides a simple but generic training strategy for robustifying the intermediate layer activation of DNNs. \ No newline at end of file diff --git a/data/2021/iclr/Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein b/data/2021/iclr/Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein new file mode 100644 index 0000000000..be45378156 --- /dev/null +++ b/data/2021/iclr/Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein @@ -0,0 +1 @@ +Relational regularized autoencoder (RAE) is a framework to learn the distribution of data by minimizing a reconstruction loss together with a relational regularization on the latent space. A recent attempt to reduce the inner discrepancy between the prior and aggregated posterior distributions is to incorporate sliced fused Gromov-Wasserstein (SFG) between these distributions. That approach has a weakness since it treats every slicing direction similarly, even though several directions are not useful for the discriminative task.
To improve the discrepancy and consequently the relational regularization, we propose a new relational discrepancy, named spherical sliced fused Gromov Wasserstein (SSFG), that can find an important area of projections characterized by a von Mises-Fisher (vMF) distribution. Then, we introduce two variants of SSFG to improve its performance. The first variant, named mixture spherical sliced fused Gromov Wasserstein (MSSFG), replaces the vMF distribution by a mixture of von Mises-Fisher distributions to capture multiple important areas of directions that are far from each other. The second variant, named power spherical sliced fused Gromov Wasserstein (PSSFG), replaces the vMF distribution by a power spherical distribution to improve the sampling time in high-dimensional settings. We then apply the new discrepancies to the RAE framework to achieve its new variants. Finally, we conduct extensive experiments to show that the new proposed autoencoders have favorable performance in learning latent manifold structure, image generation, and reconstruction. \ No newline at end of file diff --git a/data/2021/iclr/Improving Transformation Invariance in Contrastive Representation Learning b/data/2021/iclr/Improving Transformation Invariance in Contrastive Representation Learning new file mode 100644 index 0000000000..29a5b7d1ba --- /dev/null +++ b/data/2021/iclr/Improving Transformation Invariance in Contrastive Representation Learning @@ -0,0 +1 @@ +We propose methods to strengthen the invariance properties of representations obtained by contrastive learning. While existing approaches implicitly induce a degree of invariance as representations are learned, we look to more directly enforce invariance in the encoding process. To this end, we first introduce a training objective for contrastive learning that uses a novel regularizer to control how the representation changes under transformation.
We show that representations trained with this objective perform better on downstream tasks and are more robust to the introduction of nuisance transformations at test time. Second, we propose a change to how test-time representations are generated by introducing a feature averaging approach that combines encodings from multiple transformations of the original input, finding that this leads to across-the-board performance gains. Finally, we introduce the novel Spirograph dataset to explore our ideas in the context of a differentiable generative process with multiple downstream tasks, showing that our techniques for learning invariance are highly beneficial. \ No newline at end of file diff --git a/data/2021/iclr/Improving VAEs' Robustness to Adversarial Attack b/data/2021/iclr/Improving VAEs' Robustness to Adversarial Attack new file mode 100644 index 0000000000..6ca2fc7568 --- /dev/null +++ b/data/2021/iclr/Improving VAEs' Robustness to Adversarial Attack @@ -0,0 +1 @@ +Variational autoencoders (VAEs) have recently been shown to be vulnerable to adversarial attacks, wherein they are fooled into reconstructing a chosen target image. However, how to defend against such attacks remains an open problem. We make significant advances in addressing this issue by introducing methods for producing adversarially robust VAEs. Namely, we first demonstrate that methods used to obtain disentangled latent representations produce VAEs that are more robust to these attacks. However, this robustness comes at the cost of reducing the quality of the reconstructions. We, therefore, introduce a new hierarchical VAE, the $\textit{Seatbelt-VAE}$, which can produce high-fidelity autoencoders that are also adversarially robust. We confirm the capabilities of the Seatbelt-VAE on several different datasets and with current state-of-the-art VAE adversarial attacks.
\ No newline at end of file diff --git a/data/2021/iclr/Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning b/data/2021/iclr/Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning new file mode 100644 index 0000000000..158d0e3558 --- /dev/null +++ b/data/2021/iclr/Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning @@ -0,0 +1 @@ +Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and pre-known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. We propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed method first encodes speaker-related style and voice content of each input voice into separated low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On real-world VCTK datasets, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness for voice style transfer experiments under both many-to-many and zero-shot setups. 
\ No newline at end of file diff --git a/data/2021/iclr/In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning b/data/2021/iclr/In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning new file mode 100644 index 0000000000..69c59f5a15 --- /dev/null +++ b/data/2021/iclr/In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning @@ -0,0 +1 @@ +The recent research in semi-supervised learning (SSL) is mostly dominated by consistency regularization based methods which achieve strong performance. However, they heavily rely on domain-specific data augmentations, which are not easy to generate for all data modalities. Pseudo-labeling (PL) is a general SSL approach that does not have this constraint but performs relatively poorly in its original formulation. We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models; these predictions generate many incorrect pseudo-labels, leading to noisy training. We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process. Furthermore, UPS generalizes the pseudo-labeling process, allowing for the creation of negative pseudo-labels; these negative pseudo-labels can be used for multi-label classification as well as negative learning to improve the single-label classification. We achieve strong performance when compared to recent SSL methods on the CIFAR-10 and CIFAR-100 datasets. Also, we demonstrate the versatility of our method on the video dataset UCF-101 and the multi-label dataset Pascal VOC. 
\ No newline at end of file diff --git a/data/2021/iclr/In Search of Lost Domain Generalization b/data/2021/iclr/In Search of Lost Domain Generalization new file mode 100644 index 0000000000..3dda7233c6 --- /dev/null +++ b/data/2021/iclr/In Search of Lost Domain Generalization @@ -0,0 +1 @@ +The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions -- datasets, architectures, and model selection criteria -- render fair and realistic comparisons difficult. In this paper, we are interested in understanding how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks. Contrary to prior work, we argue that domain generalization algorithms without a model selection strategy should be regarded as incomplete. Next, we implement DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria. We conduct extensive experiments using DomainBed and find that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets. Looking forward, we hope that the release of DomainBed, along with contributions from fellow researchers, will streamline reproducible and rigorous research in domain generalization. 
\ No newline at end of file diff --git a/data/2021/iclr/In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness b/data/2021/iclr/In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness new file mode 100644 index 0000000000..f82903b723 --- /dev/null +++ b/data/2021/iclr/In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness @@ -0,0 +1 @@ +Consider a prediction setting where a few inputs (e.g., satellite images) are expensively annotated with the prediction targets (e.g., crop types), and many inputs are cheaply annotated with auxiliary information (e.g., climate information). How should we best leverage this auxiliary information for the prediction task? Empirically across three image and time-series datasets, and theoretically in a multi-task linear regression setting, we show that (i) using auxiliary information as input features improves in-distribution error but can hurt out-of-distribution (OOD) error; while (ii) using auxiliary information as outputs of auxiliary tasks to pre-train a model improves OOD error. To get the best of both worlds, we introduce In-N-Out, which first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs, then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training). We show both theoretically and empirically that In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error. 
\ No newline at end of file diff --git a/data/2021/iclr/Incorporating Symmetry into Deep Dynamics Models for Improved Generalization b/data/2021/iclr/Incorporating Symmetry into Deep Dynamics Models for Improved Generalization new file mode 100644 index 0000000000..700870732d --- /dev/null +++ b/data/2021/iclr/Incorporating Symmetry into Deep Dynamics Models for Improved Generalization @@ -0,0 +1 @@ +Recent work has shown deep learning can accelerate the prediction of physical dynamics relative to numerical solvers. However, limited physical accuracy and an inability to generalize under distributional shift limit its applicability to the real world. We propose to improve accuracy and generalization by incorporating symmetries into deep neural networks. Specifically, we employ a variety of methods, each tailored to enforce a different symmetry. Our models are both theoretically and experimentally robust to distributional shift by the symmetry group transformations and enjoy favorable sample complexity. We demonstrate the advantage of our approach on a variety of physical dynamics including Rayleigh-Benard Convection and real-world ocean currents and temperatures. This is the first time that equivariant neural networks have been used to forecast physical dynamics. \ No newline at end of file diff --git a/data/2021/iclr/Incremental few-shot learning via vector quantization in deep embedded space b/data/2021/iclr/Incremental few-shot learning via vector quantization in deep embedded space new file mode 100644 index 0000000000..c2986ec1bc --- /dev/null +++ b/data/2021/iclr/Incremental few-shot learning via vector quantization in deep embedded space @@ -0,0 +1 @@ +The capability of incrementally learning new tasks without forgetting old ones is a challenging problem due to catastrophic forgetting. This challenge becomes greater when novel tasks contain very few labelled training samples.
Currently, most methods are dedicated to class-incremental learning and rely on sufficient training data to learn additional weights for newly added classes. Those methods cannot be easily extended to incremental regression tasks and could suffer from severe overfitting when learning few-shot novel tasks. In this study, we propose a nonparametric method in deep embedded space to tackle incremental few-shot learning problems. The knowledge about the learned tasks is compressed into a small number of quantized reference vectors. The proposed method learns new tasks sequentially by adding more reference vectors to the model using few-shot samples in each novel task. For classification problems, we employ the nearest neighbor scheme to make classification on sparsely available data and incorporate intra-class variation, less forgetting regularization and calibration of reference vectors to mitigate catastrophic forgetting. In addition, the proposed learning vector quantization (LVQ) in deep embedded space can be customized as a kernel smoother to handle incremental few-shot regression tasks. Experimental results demonstrate that the proposed method outperforms other state-of-the-art methods in incremental learning. \ No newline at end of file diff --git a/data/2021/iclr/Individually Fair Gradient Boosting b/data/2021/iclr/Individually Fair Gradient Boosting new file mode 100644 index 0000000000..085bbf42fb --- /dev/null +++ b/data/2021/iclr/Individually Fair Gradient Boosting @@ -0,0 +1 @@ +We consider the task of enforcing individual fairness in gradient boosting. Gradient boosting is a popular method for machine learning from tabular data, which arise often in applications where algorithmic fairness is a concern. At a high level, our approach is a functional gradient descent on a (distributionally) robust loss function that encodes our intuition of algorithmic fairness for the ML task at hand. 
Unlike prior approaches to individual fairness that only work with smooth ML models, our approach also works with non-smooth models such as decision trees. We show that our algorithm converges globally and generalizes. We also demonstrate the efficacy of our algorithm on three ML problems susceptible to algorithmic bias. \ No newline at end of file diff --git a/data/2021/iclr/Individually Fair Rankings b/data/2021/iclr/Individually Fair Rankings new file mode 100644 index 0000000000..90e58d4699 --- /dev/null +++ b/data/2021/iclr/Individually Fair Rankings @@ -0,0 +1 @@ +Rankings on online platforms help their end-users find the relevant information—people, news, media, and products—quickly. Fair ranking tasks, which ask to rank a set of items to maximize utility subject to satisfying group-fairness constraints, have gained significant interest in the Algorithmic Fairness, Information Retrieval, and Machine Learning literature. Recent works, however, identify uncertainty in the utilities of items as a primary cause of unfairness and propose introducing randomness in the output. This randomness is carefully chosen to guarantee an adequate representation of each item (while accounting for the uncertainty). However, due to this randomness, the output rankings may violate group fairness constraints. We give an efficient algorithm that samples rankings from an individually-fair distribution while ensuring that every output ranking is group fair. The expected utility of the output ranking is at least α times the utility of the optimal fair solution. Here, α depends on the utilities, position-discounts, and constraints—it approaches 1 as the range of utilities or the position-discounts shrinks, or when utilities satisfy distributional assumptions. Empirically, we observe that our algorithm achieves individual and group fairness and that it Pareto-dominates the state-of-the-art baselines.
\ No newline at end of file diff --git a/data/2021/iclr/Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks b/data/2021/iclr/Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks new file mode 100644 index 0000000000..5271e4f26f --- /dev/null +++ b/data/2021/iclr/Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks @@ -0,0 +1 @@ +Temporal networks serve as abstractions of many real-world dynamic systems. These networks typically evolve according to certain laws, such as the law of triadic closure, which is universal in social networks. Inductive representation learning of temporal networks should be able to capture such laws and further be applied to systems that follow the same laws but have not been seen during the training stage. Previous works in this area depend on either network node identities or rich edge attributes and typically fail to extract these laws. Here, we propose Causal Anonymous Walks (CAWs) to inductively represent a temporal network. CAWs are extracted by temporal random walks and work as automatic retrieval of temporal network motifs to represent network dynamics while avoiding the time-consuming selection and counting of those motifs. CAWs adopt a novel anonymization strategy that replaces node identities with the hitting counts of the nodes based on a set of sampled walks to keep the method inductive, and simultaneously establish the correlation between motifs. We further propose a neural-network model CAW-N to encode CAWs, and pair it with a CAW sampling strategy with constant memory and time cost to support online training and inference. CAW-N is evaluated to predict links over 6 real temporal networks and uniformly outperforms previous SOTA methods by an average 10% AUC gain in the inductive setting. CAW-N also outperforms previous methods in 4 out of the 6 networks in the transductive setting. 
\ No newline at end of file diff --git a/data/2021/iclr/Influence Estimation for Generative Adversarial Networks b/data/2021/iclr/Influence Estimation for Generative Adversarial Networks new file mode 100644 index 0000000000..3b8bd2ea54 --- /dev/null +++ b/data/2021/iclr/Influence Estimation for Generative Adversarial Networks @@ -0,0 +1 @@ +Identifying harmful instances, whose absence in a training dataset improves model performance, is important for building better machine learning models. Although previous studies have succeeded in estimating harmful instances under supervised settings, they cannot be trivially extended to generative adversarial networks (GANs). This is because previous approaches require that (1) the absence of a training instance directly affects the loss value and that (2) the change in the loss directly measures the harmfulness of the instance for the performance of a model. In GAN training, however, neither of the requirements is satisfied. This is because (1) the generator's loss is not directly affected by the training instances as they are not part of the generator's training steps, and (2) the values of GAN's losses normally do not capture the generative performance of a model. To this end, (1) we propose an influence estimation method that uses the Jacobian of the gradient of the generator's loss with respect to the discriminator's parameters (and vice versa) to trace how the absence of an instance in the discriminator's training affects the generator's parameters, and (2) we propose a novel evaluation scheme, in which we assess the harmfulness of each training instance on the basis of how a GAN evaluation metric (e.g., inception score) is expected to change due to the removal of the instance. We experimentally verified that our influence estimation method correctly inferred the changes in GAN evaluation metrics. 
Further, we demonstrated that the removal of the identified harmful instances effectively improved the model's generative performance with respect to various GAN evaluation metrics. \ No newline at end of file diff --git a/data/2021/iclr/Influence Functions in Deep Learning Are Fragile b/data/2021/iclr/Influence Functions in Deep Learning Are Fragile new file mode 100644 index 0000000000..557602a438 --- /dev/null +++ b/data/2021/iclr/Influence Functions in Deep Learning Are Fragile @@ -0,0 +1 @@ +Influence functions approximate the effect of training samples on test-time predictions and have a wide variety of applications in machine learning interpretability and uncertainty estimation. A commonly-used (first-order) influence function can be implemented efficiently as a post-hoc method requiring access only to the gradients and Hessian of the model. For linear models, influence functions are well-defined due to the convexity of the underlying loss function and are generally accurate even across difficult settings where model changes are fairly large, such as estimating group influences. Influence functions, however, are not well-understood in the context of deep learning with non-convex loss functions. In this paper, we provide a comprehensive and large-scale empirical study of successes and failures of influence functions in neural network models trained on datasets such as Iris, MNIST, CIFAR-10 and ImageNet. Through our extensive experiments, we show that the network architecture, its depth and width, as well as the extent of model parameterization and regularization techniques have strong effects on the accuracy of influence functions. 
In particular, we find that (i) influence estimates are fairly accurate for shallow networks, while for deeper networks the estimates are often erroneous; (ii) for certain network architectures and datasets, training with weight-decay regularization is important to get high-quality influence estimates; and (iii) the accuracy of influence estimates can vary significantly depending on the examined test points. These results suggest that in general influence functions in deep learning are fragile and call for developing improved influence estimation methods to mitigate these issues in non-convex setups. \ No newline at end of file diff --git a/data/2021/iclr/InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective b/data/2021/iclr/InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective new file mode 100644 index 0000000000..cdd01c2c62 --- /dev/null +++ b/data/2021/iclr/InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective @@ -0,0 +1 @@ +Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks. Recent studies, however, show that such BERT-based models are vulnerable to textual adversarial attacks. We aim to address this problem from an information-theoretic perspective, and propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models. InfoBERT contains two mutual-information-based regularizers for model training: (i) an Information Bottleneck regularizer, which suppresses noisy mutual information between the input and the feature representation; and (ii) a Robust Feature regularizer, which increases the mutual information between local robust features and global features. We provide a principled way to theoretically analyze and improve the robustness of representation learning for language models in both standard and adversarial training. 
Extensive experiments demonstrate that InfoBERT achieves state-of-the-art robust accuracy over several adversarial datasets on Natural Language Inference (NLI) and Question Answering (QA) tasks. \ No newline at end of file diff --git a/data/2021/iclr/Information Laundering for Model Privacy b/data/2021/iclr/Information Laundering for Model Privacy new file mode 100644 index 0000000000..5f3929dbbc --- /dev/null +++ b/data/2021/iclr/Information Laundering for Model Privacy @@ -0,0 +1 @@ +In this work, we propose information laundering, a novel framework for enhancing model privacy. Unlike data privacy that concerns the protection of raw data information, model privacy aims to protect an already-learned model that is to be deployed for public use. The private model can be obtained from general learning methods, and its deployment means that it will return a deterministic or random response for a given input query. An information-laundered model consists of probabilistic components that deliberately maneuver the intended input and output for queries to the model, so the model's adversarial acquisition is less likely. Under the proposed framework, we develop an information-theoretic principle to quantify the fundamental tradeoffs between model utility and privacy leakage and derive the optimal design. \ No newline at end of file diff --git a/data/2021/iclr/Initialization and Regularization of Factorized Neural Layers b/data/2021/iclr/Initialization and Regularization of Factorized Neural Layers new file mode 100644 index 0000000000..d52ef78336 --- /dev/null +++ b/data/2021/iclr/Initialization and Regularization of Factorized Neural Layers @@ -0,0 +1 @@ +Factorized layers--operations parameterized by products of two or more matrices--occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. 
We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to that of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight-decay and batch-normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with or pruning a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training. \ No newline at end of file diff --git a/data/2021/iclr/Integrating Categorical Semantics into Unsupervised Domain Translation b/data/2021/iclr/Integrating Categorical Semantics into Unsupervised Domain Translation new file mode 100644 index 0000000000..b3ed94ac64 --- /dev/null +++ b/data/2021/iclr/Integrating Categorical Semantics into Unsupervised Domain Translation @@ -0,0 +1 @@ +While unsupervised domain translation (UDT) has seen a lot of success recently, we argue that allowing its translation to be mediated via categorical semantic features could enable wider applicability. 
In particular, we argue that categorical semantics are important when translating between domains with multiple object categories possessing distinctive styles, or even between domains that are simply too different but still share high-level semantics. We propose a method to learn, in an unsupervised manner, categorical semantic features (such as object labels) that are invariant across the source and target domains. We show that conditioning the style of an unsupervised domain translation method on the learned categorical semantics leads to considerably better preservation of high-level features on tasks such as MNIST$\leftrightarrow$SVHN and to a more realistic stylization on Sketches$\to$Reals. \ No newline at end of file diff --git a/data/2021/iclr/Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling b/data/2021/iclr/Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling new file mode 100644 index 0000000000..c96210a349 --- /dev/null +++ b/data/2021/iclr/Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling @@ -0,0 +1 @@ +Obtaining large annotated datasets is critical for training successful machine learning models and it is often a bottleneck in practice. Weak supervision offers a promising alternative for producing labeled datasets without ground truth annotations by generating probabilistic labels using multiple noisy heuristics. This process can scale to large datasets and has demonstrated state of the art performance in diverse domains such as healthcare and e-commerce. One practical issue with learning from user-generated heuristics is that their creation requires creativity, foresight, and domain expertise from those who hand-craft them, a process which can be tedious and subjective. We develop the first framework for interactive weak supervision in which a method proposes heuristics and learns from user feedback given on each proposed heuristic. 
Our experiments demonstrate that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels. We conduct user studies, which show that users are able to effectively provide feedback on heuristics and that test set results track the performance of simulated oracles. \ No newline at end of file diff --git a/data/2021/iclr/Interpretable Models for Granger Causality Using Self-explaining Neural Networks b/data/2021/iclr/Interpretable Models for Granger Causality Using Self-explaining Neural Networks new file mode 100644 index 0000000000..0783459ca2 --- /dev/null +++ b/data/2021/iclr/Interpretable Models for Granger Causality Using Self-explaining Neural Networks @@ -0,0 +1 @@ +Exploratory analysis of time series data can yield a better understanding of complex dynamical systems. Granger causality is a practical framework for analysing interactions in sequential data, applied in a wide range of domains. In this paper, we propose a novel framework for inferring multivariate Granger causality under nonlinear dynamics based on an extension of self-explaining neural networks. This framework is more interpretable than other neural-network-based techniques for inferring Granger causality, since in addition to relational inference, it also allows detecting signs of Granger-causal effects and inspecting their variability over time. In comprehensive experiments on simulated data, we show that our framework performs on par with several powerful baseline methods at inferring Granger causality and that it achieves better performance at inferring interaction signs. The results suggest that our framework is a viable and more interpretable alternative to sparse-input neural networks for inferring Granger causality. 
\ No newline at end of file diff --git a/data/2021/iclr/Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels b/data/2021/iclr/Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking b/data/2021/iclr/Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking new file mode 100644 index 0000000000..bb42b9c088 --- /dev/null +++ b/data/2021/iclr/Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking @@ -0,0 +1 @@ +Graph neural networks (GNNs) have become a popular approach to integrating structural inductive biases into NLP models. However, there has been little work on interpreting them, and specifically on understanding which parts of the graphs (e.g. syntactic trees or co-reference structures) contribute to a prediction. In this work, we introduce a post-hoc method for interpreting the predictions of GNNs which identifies unnecessary edges. Given a trained GNN model, we learn a simple classifier that, for every edge in every layer, predicts if that edge can be dropped. We demonstrate that such a classifier can be trained in a fully differentiable fashion, employing stochastic gates and encouraging sparsity through the expected $L_0$ norm. We use our technique as an attribution method to analyze GNN models for two tasks -- question answering and semantic role labeling -- providing insights into the information flow in these models. We show that we can drop a large proportion of edges without deteriorating the performance of the model, while we can analyse the remaining edges for interpreting model predictions. 
\ No newline at end of file diff --git a/data/2021/iclr/Interpreting Knowledge Graph Relation Representation from Word Embeddings b/data/2021/iclr/Interpreting Knowledge Graph Relation Representation from Word Embeddings new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Interpreting and Boosting Dropout from a Game-Theoretic View b/data/2021/iclr/Interpreting and Boosting Dropout from a Game-Theoretic View new file mode 100644 index 0000000000..0a99597408 --- /dev/null +++ b/data/2021/iclr/Interpreting and Boosting Dropout from a Game-Theoretic View @@ -0,0 +1 @@ +This paper aims to understand and improve the utility of the dropout operation from the perspective of game-theoretic interactions. We prove that dropout can suppress the strength of interactions between input variables of deep neural networks (DNNs). The theoretic proof is also verified by various experiments. Furthermore, we find that such interactions were strongly related to the over-fitting problem in deep learning. Thus, the utility of dropout can be regarded as decreasing interactions to alleviate the significance of over-fitting. Based on this understanding, we propose an interaction loss to further improve the utility of dropout. Experimental results have shown that the interaction loss can effectively improve the utility of dropout and boost the performance of DNNs. \ No newline at end of file diff --git a/data/2021/iclr/Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds b/data/2021/iclr/Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds new file mode 100644 index 0000000000..3b4d593870 --- /dev/null +++ b/data/2021/iclr/Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds @@ -0,0 +1 @@ +Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. 
However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of mixtures (MoMs) into individual sources, where noisy labels for mixtures are provided by an unsupervised audio-visual coincidence model. Using the noisy labels, along with attention between video and audio features, AudioScope learns to identify audio-visual similarity and to suppress off-screen sounds. We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips. 
\ No newline at end of file diff --git a/data/2021/iclr/Intraclass clustering: an implicit learning ability that regularizes DNNs b/data/2021/iclr/Intraclass clustering: an implicit learning ability that regularizes DNNs new file mode 100644 index 0000000000..61c0bd39a3 --- /dev/null +++ b/data/2021/iclr/Intraclass clustering: an implicit learning ability that regularizes DNNs @@ -0,0 +1 @@ +Several works have shown that the regularization mechanisms underlying deep neural networks' generalization performances are still poorly understood. In this paper, we hypothesize that deep neural networks are regularized through their ability to extract meaningful clusters among the samples of a class. This constitutes an implicit form of regularization, as no explicit training mechanisms or supervision target such behaviour. To support our hypothesis, we design four different measures of intraclass clustering, based on the neuron- and layer-level representations of the training data. We then show that these measures constitute accurate predictors of generalization performance across variations of a large set of hyperparameters (learning rate, batch size, optimizer, weight decay, dropout rate, data augmentation, network depth and width). \ No newline at end of file diff --git a/data/2021/iclr/Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures b/data/2021/iclr/Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures new file mode 100644 index 0000000000..d0b610c482 --- /dev/null +++ b/data/2021/iclr/Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures @@ -0,0 +1 @@ +None of the commonly used algorithms in protein learning were specifically designed for protein data, and none are able to capture all relevant structural levels of a protein during learning. To fill this gap, we propose two new learning operators, specifically designed to process protein structures. 
First, we introduce a novel convolution operator that considers the primary, secondary, and tertiary structure of a protein by using n-D convolutions defined on both the Euclidean distance, as well as multiple geodesic distances between the atoms in a multi-graph. Second, we introduce a set of hierarchical pooling operators that enable multi-scale protein analysis. We further evaluate the accuracy of our algorithms on common downstream tasks, where we outperform state-of-the-art protein learning algorithms. \ No newline at end of file diff --git a/data/2021/iclr/Is Attention Better Than Matrix Decomposition? b/data/2021/iclr/Is Attention Better Than Matrix Decomposition? new file mode 100644 index 0000000000..89678f9e3e --- /dev/null +++ b/data/2021/iclr/Is Attention Better Than Matrix Decomposition? @@ -0,0 +1 @@ +As an essential ingredient of modern deep learning, attention mechanism, especially self-attention, plays a vital role in the global correlation discovery. However, is hand-crafted attention irreplaceable when modeling the global context? Our intriguing finding is that self-attention is not better than the matrix decomposition (MD) model developed 20 years ago regarding the performance and computational cost for encoding the long-distance dependencies. We model the global context issue as a low-rank recovery problem and show that its optimization algorithms can help design global information blocks. This paper then proposes a series of Hamburgers, in which we employ the optimization algorithms for solving MDs to factorize the input representations into sub-matrices and reconstruct a low-rank embedding. Hamburgers with different MDs can perform favorably against the popular global context module self-attention when carefully coping with gradients back-propagated through MDs. 
Comprehensive experiments are conducted in the vision tasks where it is crucial to learn the global context, including semantic segmentation and image generation, demonstrating significant improvements over self-attention and its variants. \ No newline at end of file diff --git a/data/2021/iclr/Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study b/data/2021/iclr/Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study new file mode 100644 index 0000000000..83c2389b6f --- /dev/null +++ b/data/2021/iclr/Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study @@ -0,0 +1 @@ +This work aims to empirically clarify a recently discovered perspective that label smoothing is incompatible with knowledge distillation. We begin by introducing the motivation behind on how this incompatibility is raised, i.e., label smoothing erases relative information between teacher logits. We provide a novel connection on how label smoothing affects distributions of semantically similar and dissimilar classes. Then we propose a metric to quantitatively measure the degree of erased information in sample's representation. After that, we study its one-sidedness and imperfection of the incompatibility view through massive analyses, visualizations and comprehensive experiments on Image Classification, Binary Networks, and Neural Machine Translation. Finally, we broadly discuss several circumstances wherein label smoothing will indeed lose its effectiveness. Project page: http://zhiqiangshen.com/projects/LS_and_KD/index.html. 
\ No newline at end of file diff --git a/data/2021/iclr/IsarStep: a Benchmark for High-level Mathematical Reasoning b/data/2021/iclr/IsarStep: a Benchmark for High-level Mathematical Reasoning new file mode 100644 index 0000000000..7e9e7c7f5e --- /dev/null +++ b/data/2021/iclr/IsarStep: a Benchmark for High-level Mathematical Reasoning @@ -0,0 +1 @@ +A well-defined benchmark is essential for measuring and accelerating research progress of machine learning models. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has a broad coverage of undergraduate and research-level mathematical and computer science theorems. In our defined task, a model is required to fill in a missing intermediate proposition given surrounding proofs. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a hierarchical transformer that outperforms the transformer baseline. We will make the dataset and models publicly available. \ No newline at end of file diff --git a/data/2021/iclr/Isometric Propagation Network for Generalized Zero-shot Learning b/data/2021/iclr/Isometric Propagation Network for Generalized Zero-shot Learning new file mode 100644 index 0000000000..58c16d4469 --- /dev/null +++ b/data/2021/iclr/Isometric Propagation Network for Generalized Zero-shot Learning @@ -0,0 +1 @@ +Zero-shot learning (ZSL) aims to classify images of an unseen class only based on a few attributes describing that class but no access to any training sample. 
A popular strategy is to learn a mapping between the semantic space of class attributes and the visual space of images based on the seen classes and their data. Thus, an unseen class image can be ideally mapped to its corresponding class attributes. The key challenge is how to align the representations in the two spaces. For most ZSL settings, the attributes for each seen/unseen class are only represented by a vector while the seen-class data provide much more information. Thus, the imbalanced supervision from the semantic and the visual space can make the learned mapping easily overfit to the seen classes. To resolve this problem, we propose Isometric Propagation Network (IPN), which learns to strengthen the relation between classes within each space and align the class dependency in the two spaces. Specifically, IPN learns to propagate the class representations on an auto-generated graph within each space. In contrast to only aligning the resulting static representation, we regularize the two dynamic propagation procedures to be isometric in terms of the two graphs' edge weights per step by minimizing a consistency loss between them. IPN achieves state-of-the-art performance on three popular ZSL benchmarks. To evaluate the generalization capability of IPN, we further build two larger benchmarks with more diverse unseen classes and demonstrate the advantages of IPN on them. \ No newline at end of file diff --git a/data/2021/iclr/Isometric Transformation Invariant and Equivariant Graph Convolutional Networks b/data/2021/iclr/Isometric Transformation Invariant and Equivariant Graph Convolutional Networks new file mode 100644 index 0000000000..b195044960 --- /dev/null +++ b/data/2021/iclr/Isometric Transformation Invariant and Equivariant Graph Convolutional Networks @@ -0,0 +1 @@ +Graphs are one of the most important data structures for representing pairwise relations between objects. 
Specifically, a graph embedded in a Euclidean space is essential to solving real problems, such as object detection, structural chemistry analyses, and physical simulation. A crucial requirement to applying a graph in a Euclidean space is learning the isometric transformation invariant and equivariant features. In the present paper, we propose a set of transformation invariant and equivariant models based on graph convolutional networks (GCNs), called IsoGCNs. We demonstrate that the proposed model outperforms state-of-the-art methods on tasks related with geometrical and physical data. Moreover, the proposed model can scale up to the graphs with 1M vertices and conduct an inference faster than a conventional finite element analysis. \ No newline at end of file diff --git a/data/2021/iclr/Isotropy in the Contextual Embedding Space: Clusters and Manifolds b/data/2021/iclr/Isotropy in the Contextual Embedding Space: Clusters and Manifolds new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Iterated learning for emergent systematicity in VQA b/data/2021/iclr/Iterated learning for emergent systematicity in VQA new file mode 100644 index 0000000000..9132f4fbf7 --- /dev/null +++ b/data/2021/iclr/Iterated learning for emergent systematicity in VQA @@ -0,0 +1 @@ +Although neural module networks have an architectural bias towards compositionality, they require gold standard layouts to generalize systematically in practice. When instead learning layouts and modules jointly, compositionality does not arise automatically and an explicit pressure is necessary for the emergence of layouts exhibiting the right structure. We propose to address this problem using iterated learning, a cognitive science theory of the emergence of compositional languages in nature that has primarily been applied to simple referential games in machine learning. 
Considering the layouts of module networks as samples from an emergent language, we use iterated learning to encourage the development of structure within this language. We show that the resulting layouts support systematic generalization in neural agents solving the more complex task of visual question-answering. Our regularized iterated learning method can outperform baselines without iterated learning on SHAPES-SyGeT (SHAPES Systematic Generalization Test), a new split of the SHAPES dataset we introduce to evaluate systematic generalization, and on CLOSURE, an extension of CLEVR also designed to test systematic generalization. We demonstrate superior performance in recovering ground-truth compositional program structure with limited supervision on both SHAPES-SyGeT and CLEVR. \ No newline at end of file diff --git a/data/2021/iclr/Iterative Empirical Game Solving via Single Policy Best Response b/data/2021/iclr/Iterative Empirical Game Solving via Single Policy Best Response new file mode 100644 index 0000000000..617c4f814f --- /dev/null +++ b/data/2021/iclr/Iterative Empirical Game Solving via Single Policy Best Response @@ -0,0 +1 @@ +Policy-Space Response Oracles (PSRO) is a general algorithmic framework for learning policies in multiagent systems by interleaving empirical game analysis with deep reinforcement learning (Deep RL). At each iteration, Deep RL is invoked to train a best response to a mixture of opponent policies. The repeated application of Deep RL poses an expensive computational burden as we look to apply this algorithm to more complex domains. We introduce two variations of PSRO designed to reduce the amount of simulation required during Deep RL training. Both algorithms modify how PSRO adds new policies to the empirical game, based on learned responses to a single opponent policy. The first, Mixed-Oracles, transfers knowledge from previous iterations of Deep RL, requiring training only against the opponent's newest policy. 
The second, Mixed-Opponents, constructs a pure-strategy opponent by mixing existing strategies' action-value estimates, instead of their policies. Learning against a single policy mitigates variance in state outcomes that is induced by an unobserved distribution of opponents. We empirically demonstrate that these algorithms substantially reduce the amount of simulation during training required by PSRO, while producing equivalent or better solutions to the game. \ No newline at end of file diff --git a/data/2021/iclr/Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory b/data/2021/iclr/Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory new file mode 100644 index 0000000000..d8611f50c0 --- /dev/null +++ b/data/2021/iclr/Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory @@ -0,0 +1 @@ +Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work we develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory via a hierarchical latent variable model. We take inspiration from traditional heap allocation and extend the idea of locally contiguous memory to the Kanerva Machine, enabling a novel differentiable block allocated latent memory. In contrast to the Kanerva Machine, we simplify the process of memory writing by treating it as a fully feed forward deterministic process, relying on the stochasticity of the read key distribution to disperse information within the memory.
We demonstrate that this allocation scheme improves performance in memory conditional image generation, resulting in new state-of-the-art conditional likelihood values on binarized MNIST (<=41.58 nats/image), binarized Omniglot (<=66.24 nats/image), as well as presenting competitive performance on CIFAR10, DMLab Mazes, Celeb-A and ImageNet32x32. \ No newline at end of file diff --git a/data/2021/iclr/Knowledge Distillation as Semiparametric Inference b/data/2021/iclr/Knowledge Distillation as Semiparametric Inference new file mode 100644 index 0000000000..2765740dd8 --- /dev/null +++ b/data/2021/iclr/Knowledge Distillation as Semiparametric Inference @@ -0,0 +1 @@ +A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data. To explain and enhance this phenomenon, we cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate. By adapting modern semiparametric tools, we derive new guarantees for the prediction error of standard distillation and develop two enhancements -- cross-fitting and loss correction -- to mitigate the impact of teacher overfitting and underfitting on student performance. We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements.
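The two-step distillation recipe described above (train the student to mimic the teacher's class probabilities) is commonly implemented with a temperature-softened KL objective. A minimal sketch; the temperature `T` and the KL direction are conventional choices, not specified by the abstract:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis, numerically stabilized.
    e = np.exp(z / T - np.max(z / T, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened class probabilities:
    # the teacher's probabilities act as the plug-in soft targets.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())
```

The loss is zero exactly when the student reproduces the teacher's probabilities, and positive otherwise.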
\ No newline at end of file diff --git a/data/2021/iclr/Knowledge distillation via softmax regression representation learning b/data/2021/iclr/Knowledge distillation via softmax regression representation learning new file mode 100644 index 0000000000..b5cbd39dc9 --- /dev/null +++ b/data/2021/iclr/Knowledge distillation via softmax regression representation learning @@ -0,0 +1 @@ +This paper addresses the problem of model compression via knowledge distillation. We advocate for a method that optimizes the output feature of the penultimate layer of the student network and hence is directly related to representation learning. To this end, we firstly propose a direct feature matching approach which focuses on optimizing the student’s penultimate layer only. Secondly, and more importantly, because feature matching does not take into account the classification problem at hand, we propose a second approach that decouples representation learning and classification and utilizes the teacher’s pre-trained classifier to train the student’s penultimate layer feature. In particular, for the same input image, we wish the teacher’s and student’s features to produce the same output when passed through the teacher’s classifier, which is achieved with a simple L2 loss. Our method is extremely simple to implement and straightforward to train and is shown to consistently outperform previous state-of-the-art methods over a large set of experimental settings including different (a) network architectures, (b) teacher-student capacities, (c) datasets, and (d) domains. The code is available at https://github.com/jingyang2017/KD_SRRL.
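The second approach in the abstract above (pass both the teacher's and the student's penultimate feature through the teacher's frozen classifier and match the outputs with an L2 loss) can be sketched as follows; the feature dimension, class count, and the equal weighting of the two terms are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W_t = rng.normal(size=(10, 64))  # teacher's pre-trained classifier, frozen (hypothetical shapes)

def srrl_loss(f_student, f_teacher):
    # Direct feature matching on the penultimate features, plus the
    # softmax-regression term: both features are passed through the
    # teacher's classifier and their outputs matched with L2.
    z_s = f_student @ W_t.T
    z_t = f_teacher @ W_t.T
    feat = np.mean((f_student - f_teacher) ** 2)
    sr = np.mean((z_s - z_t) ** 2)
    return feat + sr
```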
\ No newline at end of file diff --git a/data/2021/iclr/LEAF: A Learnable Frontend for Audio Classification b/data/2021/iclr/LEAF: A Learnable Frontend for Audio Classification new file mode 100644 index 0000000000..9fb175b007 --- /dev/null +++ b/data/2021/iclr/LEAF: A Learnable Frontend for Audio Classification @@ -0,0 +1 @@ +Mel-filterbanks are fixed, engineered audio features which emulate human perception and have been used through the history of audio understanding up to today. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio signals, including speech, music, audio events and animal sounds, providing a general-purpose learned frontend for audio classification. To do so, we introduce a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement of mel-filterbanks. Our system learns all operations of audio features extraction, from filtering to pooling, compression and normalization, and can be integrated into any neural network at a negligible parameter cost. We perform multi-task training on eight diverse audio classification tasks, and show consistent improvements of our model over mel-filterbanks and previous learnable alternatives. Moreover, our system outperforms the current state-of-the-art learnable frontend on Audioset, with orders of magnitude fewer parameters. 
\ No newline at end of file diff --git a/data/2021/iclr/LambdaNetworks: Modeling long-range Interactions without Attention b/data/2021/iclr/LambdaNetworks: Modeling long-range Interactions without Attention new file mode 100644 index 0000000000..584dd7e31a --- /dev/null +++ b/data/2021/iclr/LambdaNetworks: Modeling long-range Interactions without Attention @@ -0,0 +1 @@ +We present lambda layers -- an alternative framework to self-attention -- for capturing long-range interactions between an input and structured contextual information (e.g. a pixel surrounded by other pixels). Lambda layers capture such interactions by transforming available contexts into linear functions, termed lambdas, and applying these linear functions to each input separately. Similar to linear attention, lambda layers bypass expensive attention maps, but in contrast, they model both content and position-based interactions which enables their application to large structured inputs such as images. The resulting neural network architectures, LambdaNetworks, significantly outperform their convolutional and attentional counterparts on ImageNet classification, COCO object detection and COCO instance segmentation, while being more computationally efficient. Additionally, we design LambdaResNets, a family of hybrid architectures across different scales, that considerably improves the speed-accuracy tradeoff of image classification models. LambdaResNets reach excellent accuracies on ImageNet while being 3.2 - 4.4x faster than the popular EfficientNets on modern machine learning accelerators. When training with an additional 130M pseudo-labeled images, LambdaResNets achieve up to a 9.5x speed-up over the corresponding EfficientNet checkpoints. 
\ No newline at end of file diff --git a/data/2021/iclr/Language-Agnostic Representation Learning of Source Code from Structure and Context b/data/2021/iclr/Language-Agnostic Representation Learning of Source Code from Structure and Context new file mode 100644 index 0000000000..99ceb7bad0 --- /dev/null +++ b/data/2021/iclr/Language-Agnostic Representation Learning of Source Code from Structure and Context @@ -0,0 +1 @@ +Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code. 
\ No newline at end of file diff --git a/data/2021/iclr/Large Associative Memory Problem in Neurobiology and Machine Learning b/data/2021/iclr/Large Associative Memory Problem in Neurobiology and Machine Learning new file mode 100644 index 0000000000..ca97a1624e --- /dev/null +++ b/data/2021/iclr/Large Associative Memory Problem in Neurobiology and Machine Learning @@ -0,0 +1 @@ +Dense Associative Memories or modern Hopfield networks permit storage and reliable retrieval of an exponentially large (in the dimension of feature space) number of memories. At the same time, their naive implementation is non-biological, since it seemingly requires the existence of many-body synaptic junctions between the neurons. We show that these models are effective descriptions of a more microscopic (written in terms of biological degrees of freedom) theory that has additional (hidden) neurons and only requires two-body interactions between them. For this reason our proposed microscopic theory is a valid model of large associative memory with a degree of biological plausibility. The dynamics of our network and its reduced dimensional equivalent both minimize energy (Lyapunov) functions. When certain dynamical variables (hidden neurons) are integrated out from our microscopic theory, one can recover many of the models that were previously discussed in the literature, e.g. the model presented in ''Hopfield Networks is All You Need'' paper. We also provide an alternative derivation of the energy function and the update rule proposed in the aforementioned paper and clarify the relationships between various models of this class. 
\ No newline at end of file diff --git a/data/2021/iclr/Large Batch Simulation for Deep Reinforcement Learning b/data/2021/iclr/Large Batch Simulation for Deep Reinforcement Learning new file mode 100644 index 0000000000..a030d9e975 --- /dev/null +++ b/data/2021/iclr/Large Batch Simulation for Deep Reinforcement Learning @@ -0,0 +1 @@ +We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine. The key idea of our approach is to design a 3D renderer and embodied navigation simulator around the principle of "batch simulation": accepting and executing large batches of requests simultaneously. Beyond exposing large amounts of work at once, batch simulation allows implementations to amortize in-memory storage of scene assets, rendering work, data loading, and synchronization costs across many simulation requests, dramatically improving the number of simulated agents per GPU and overall simulation throughput. To balance DNN inference and training costs with faster simulation, we also build a computationally efficient policy DNN that maintains high task performance, and modify training algorithms to maintain sample efficiency when training with large mini-batches. By combining batch simulation and DNN performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days. We provide open-source reference implementations of our batch 3D renderer and simulator to facilitate incorporation of these ideas into RL systems.
\ No newline at end of file diff --git a/data/2021/iclr/Large Scale Image Completion via Co-Modulated Generative Adversarial Networks b/data/2021/iclr/Large Scale Image Completion via Co-Modulated Generative Adversarial Networks new file mode 100644 index 0000000000..c25c1a2023 --- /dev/null +++ b/data/2021/iclr/Large Scale Image Completion via Co-Modulated Generative Adversarial Networks @@ -0,0 +1 @@ +Numerous task-specific variants of conditional generative adversarial networks have been developed for image completion. Yet, a serious limitation remains that all existing algorithms tend to fail when handling large-scale missing regions. To overcome this challenge, we propose a generic new approach that bridges the gap between image-conditional and recent modulated unconditional generative architectures via co-modulation of both conditional and stochastic style representations. Also, due to the lack of good quantitative metrics for image completion, we propose the new Paired/Unpaired Inception Discriminative Score (P-IDS/U-IDS), which robustly measures the perceptual fidelity of inpainted images compared to real images via linear separability in a feature space. Experiments demonstrate superior performance in terms of both quality and diversity over state-of-the-art methods in free-form image completion and easy generalization to image-to-image translation. Code is available at https://github.com/zsyzzsoft/co-mod-gan. 
\ No newline at end of file diff --git a/data/2021/iclr/Large-width functional asymptotics for deep Gaussian neural networks b/data/2021/iclr/Large-width functional asymptotics for deep Gaussian neural networks new file mode 100644 index 0000000000..d09953de3a --- /dev/null +++ b/data/2021/iclr/Large-width functional asymptotics for deep Gaussian neural networks @@ -0,0 +1 @@ +In this paper, we consider fully connected feed-forward deep neural networks where weights and biases are independent and identically distributed according to Gaussian distributions. Extending previous results (Matthews et al., 2018a;b; Yang, 2019) we adopt a function-space perspective, i.e. we look at neural networks as infinite-dimensional random elements on the input space $\mathbb{R}^I$. Under suitable assumptions on the activation function we show that: i) a network defines a continuous Gaussian process on the input space $\mathbb{R}^I$; ii) a network with re-scaled weights converges weakly to a continuous Gaussian process in the large-width limit; iii) the limiting Gaussian process has almost surely locally $\gamma$-H\"older continuous paths, for $0<\gamma<1$. Our results contribute to recent theoretical studies on the interplay between infinitely wide deep neural networks and Gaussian processes by establishing weak convergence in function-space with respect to a stronger metric. \ No newline at end of file diff --git a/data/2021/iclr/Latent Convergent Cross Mapping b/data/2021/iclr/Latent Convergent Cross Mapping new file mode 100644 index 0000000000..a383911f49 --- /dev/null +++ b/data/2021/iclr/Latent Convergent Cross Mapping @@ -0,0 +1 @@ +Discovering causal structures of temporal processes is a major tool of scientific inquiry because it helps us better understand and explain the mechanisms driving a phenomenon of interest, thereby facilitating analysis, reasoning, and synthesis for such systems. 
However, accurately inferring causal structures within a phenomenon based on observational data only is still an open problem. Indeed, this type of data usually consists in short time series with missing or noisy values for which causal inference is increasingly difficult. In this work, we propose a method to uncover causal relations in chaotic dynamical systems from short, noisy and sporadic time series (that is, incomplete observations at infrequent and irregular intervals) where the classical convergent cross mapping (CCM) fails. Our method works by learning a Neural ODE latent process modeling the state-space dynamics of the time series and by checking the existence of a continuous map between the resulting processes. We provide theoretical analysis and show empirically that Latent-CCM can reliably uncover the true causal pattern, unlike traditional methods. \ No newline at end of file diff --git a/data/2021/iclr/Latent Skill Planning for Exploration and Transfer b/data/2021/iclr/Latent Skill Planning for Exploration and Transfer new file mode 100644 index 0000000000..4f975e5e36 --- /dev/null +++ b/data/2021/iclr/Latent Skill Planning for Exploration and Transfer @@ -0,0 +1 @@ +To quickly solve new tasks in complex environments, intelligent agents need to build up reusable knowledge. For example, a learned world model captures knowledge about the environment that applies to new tasks. Similarly, skills capture general behaviors that can apply to new tasks. In this paper, we investigate how these two approaches can be integrated into a single reinforcement learning agent. Specifically, we leverage the idea of partial amortization for fast adaptation at test time. For this, actions are produced by a policy that is learned over time while the skills it conditions on are chosen using online planning. 
We demonstrate the benefits of our design decisions across a suite of challenging locomotion tasks and demonstrate improved sample efficiency in single tasks as well as in transfer from one task to another, as compared to competitive baselines. Videos are available at: https://sites.google.com/view/latent-skill-planning/ \ No newline at end of file diff --git a/data/2021/iclr/Layer-adaptive Sparsity for the Magnitude-based Pruning b/data/2021/iclr/Layer-adaptive Sparsity for the Magnitude-based Pruning new file mode 100644 index 0000000000..dbb36f5f3e --- /dev/null +++ b/data/2021/iclr/Layer-adaptive Sparsity for the Magnitude-based Pruning @@ -0,0 +1 @@ +Recent discoveries on neural network pruning reveal that, with a carefully chosen layerwise sparsity, a simple magnitude-based pruning achieves state-of-the-art tradeoff between sparsity and performance. However, without a clear consensus on "how to choose," the layerwise sparsities are mostly selected algorithm-by-algorithm, often resorting to handcrafted heuristics or an extensive hyperparameter search. To fill this gap, we propose a novel importance score for global pruning, coined layer-adaptive magnitude-based pruning (LAMP) score; the score is a rescaled version of weight magnitude that incorporates the model-level $\ell_2$ distortion incurred by pruning, and does not require any hyperparameter tuning or heavy computation. Under various image classification setups, LAMP consistently outperforms popular existing schemes for layerwise sparsity selection. Furthermore, we observe that LAMP continues to outperform baselines even in weight-rewinding setups, while the connectivity-oriented layerwise sparsity (the strongest baseline overall) performs worse than a simple global magnitude-based pruning in this case.
Code: https://github.com/jaeho-lee/layer-adaptive-sparsity \ No newline at end of file diff --git a/data/2021/iclr/Learnable Embedding sizes for Recommender Systems b/data/2021/iclr/Learnable Embedding sizes for Recommender Systems new file mode 100644 index 0000000000..8fb170d4ad --- /dev/null +++ b/data/2021/iclr/Learnable Embedding sizes for Recommender Systems @@ -0,0 +1 @@ +The embedding-based representation learning is commonly used in deep learning recommendation models to map the raw sparse features to dense vectors. The traditional embedding manner that assigns a uniform size to all features has two issues. First, the numerous features inevitably lead to a gigantic embedding table that causes a high memory usage cost. Second, it is likely to cause the over-fitting problem for those features that do not require too large representation capacity. Existing works that try to address the problem always cause a significant drop in recommendation performance or suffer from the limitation of unaffordable training time cost. In this paper, we propose a novel approach, named PEP (short for Plug-in Embedding Pruning), to reduce the size of the embedding table while avoiding a drop in accuracy and heavy computational cost. PEP prunes embedding parameters, where the pruning threshold(s) can be adaptively learned from data. Therefore we can automatically obtain a mixed-dimension embedding-scheme by pruning redundant parameters for each feature. PEP is a general framework that can be plugged into various base recommendation models. Extensive experiments demonstrate it can efficiently cut down embedding parameters and boost the base model's performance. Specifically, it achieves strong recommendation performance while reducing 97-99% of parameters. As for the computation cost, PEP only brings an additional 20-30% time cost compared with base models. Codes are available at https://github.com/ssui-liu/learnable-embed-sizes-for-RecSys.
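The learnable-threshold pruning idea behind PEP can be illustrated with a soft-threshold reparameterization of the embedding table; the exact parameterization below (a sigmoid-transformed scalar threshold `s`) is our assumption for illustration, not the paper's specification:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prune_embeddings(W, s):
    # Soft-threshold reparameterization: embedding entries whose magnitude
    # falls below the learned threshold sigmoid(s) are zeroed, so each
    # feature ends up with its own effective embedding size.
    thr = sigmoid(s)
    return np.sign(W) * np.maximum(np.abs(W) - thr, 0.0)
```

In training, `s` would be optimized jointly with `W` by gradient descent, letting the data decide how aggressively each region of the table is pruned.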
\ No newline at end of file diff --git "a/data/2021/iclr/Learning \"What-if\" Explanations for Sequential Decision-Making" "b/data/2021/iclr/Learning \"What-if\" Explanations for Sequential Decision-Making" new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning A Minimax Optimizer: A Pilot Study b/data/2021/iclr/Learning A Minimax Optimizer: A Pilot Study new file mode 100644 index 0000000000..eb7f88ffed --- /dev/null +++ b/data/2021/iclr/Learning A Minimax Optimizer: A Pilot Study @@ -0,0 +1 @@ +Solving continuous minimax optimization is of extensive practical interest, yet notoriously unstable and difficult. This paper introduces the learning to optimize (L2O) methodology to minimax problems for the first time and addresses its accompanying unique challenges. We first present Twin-L2O, the first dedicated minimax L2O framework consisting of two LSTMs for updating min and max variables separately. The decoupled design is found to facilitate learning, particularly when the min and max variables are highly asymmetric. Empirical experiments on a variety of minimax problems corroborate the effectiveness of Twin-L2O. We then discuss a crucial concern of Twin-L2O, i.e., its inevitably limited generalizability to unseen optimizees. To address this issue, we present two complementary strategies. Our first solution, Enhanced Twin-L2O, is empirically applicable to general minimax problems, improving L2O training via leveraging curriculum learning. Our second alternative, called Safeguarded Twin-L2O, is a preliminary theoretical exploration stating that under some strong assumptions, it is possible to theoretically establish the convergence of Twin-L2O. We benchmark our algorithms on several testbed problems and compare against state-of-the-art minimax solvers. The code is available at: https://github.
\ No newline at end of file diff --git a/data/2021/iclr/Learning Accurate Entropy Model with Global Reference for Image Compression b/data/2021/iclr/Learning Accurate Entropy Model with Global Reference for Image Compression new file mode 100644 index 0000000000..89225176da --- /dev/null +++ b/data/2021/iclr/Learning Accurate Entropy Model with Global Reference for Image Compression @@ -0,0 +1 @@ +In recent deep image compression neural networks, the entropy model plays a critical role in estimating the prior distribution of deep image encodings. Existing methods combine hyperprior with local context in the entropy estimation function. This greatly limits their performance due to the absence of a global vision. In this work, we propose a novel Global Reference Model for image compression to effectively leverage both the local and the global context information, leading to an enhanced compression rate. The proposed method scans decoded latents and then finds the most relevant latent to assist the distribution estimating of the current latent. A by-product of this work is the innovation of a mean-shifting GDN module that further improves the performance. Experimental results demonstrate that the proposed model outperforms the rate-distortion performance of most of the state-of-the-art methods in the industry. \ No newline at end of file diff --git a/data/2021/iclr/Learning Associative Inference Using Fast Weight Memory b/data/2021/iclr/Learning Associative Inference Using Fast Weight Memory new file mode 100644 index 0000000000..748a6db91b --- /dev/null +++ b/data/2021/iclr/Learning Associative Inference Using Fast Weight Memory @@ -0,0 +1 @@ +Humans can quickly associate stimuli to solve problems in novel contexts. Our novel neural network model learns state representations of facts that can be composed to perform such associative inference. To this end, we augment the LSTM model with an associative memory, dubbed Fast Weight Memory (FWM). 
Through differentiable operations at every step of a given input sequence, the LSTM updates and maintains compositional associations stored in the rapidly changing FWM weights. Our model is trained end-to-end by gradient descent and yields excellent performance on compositional language reasoning problems, meta-reinforcement-learning for POMDPs, and small-scale word-level language modelling. \ No newline at end of file diff --git a/data/2021/iclr/Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing b/data/2021/iclr/Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing new file mode 100644 index 0000000000..9c044852a6 --- /dev/null +++ b/data/2021/iclr/Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing @@ -0,0 +1 @@ +Training with soft targets instead of hard targets has been shown to improve performance and calibration of deep neural networks. Label smoothing is a popular way of computing soft targets, where one-hot encoding of a class is smoothed with a uniform distribution. Owing to its simplicity, label smoothing has found wide-spread use for training deep neural networks on a wide variety of tasks, ranging from image and text classification to machine translation and semantic parsing. Complementing recent empirical justification for label smoothing, we obtain PAC-Bayesian generalization bounds for label smoothing and show that the generalization error depends on the choice of the noise (smoothing) distribution. Then we propose low-rank adaptive label smoothing (LORAS): a simple yet novel method for training with learned soft targets that generalizes label smoothing and adapts to the latent structure of the label space in structured prediction tasks. Specifically, we evaluate our method on semantic parsing tasks and show that training with appropriately smoothed soft targets can significantly improve accuracy and model calibration, especially in low-resource settings. 
Used in conjunction with pre-trained sequence-to-sequence models, our method achieves state-of-the-art performance on four semantic parsing data sets. LORAS can be used with any model, improves performance and implicit model calibration without increasing the number of model parameters, and can be scaled to problems with large label spaces containing tens of thousands of labels. \ No newline at end of file diff --git a/data/2021/iclr/Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency b/data/2021/iclr/Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency new file mode 100644 index 0000000000..2330c3dbef --- /dev/null +++ b/data/2021/iclr/Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency @@ -0,0 +1 @@ +At the heart of many robotics problems is the challenge of learning correspondences across domains. For instance, imitation learning requires obtaining correspondence between humans and robots; sim-to-real requires correspondence between physics simulators and the real world; transfer learning requires correspondences between different robotics environments. This paper aims to learn correspondence across domains differing in representation (vision vs. internal state), physics parameters (mass and friction), and morphology (number of limbs). Importantly, correspondences are learned using unpaired and randomly collected data from the two domains. We propose \textit{dynamics cycles} that align dynamic robot behavior across two domains using a cycle-consistency constraint. Once this correspondence is found, we can directly transfer the policy trained on one domain to the other, without needing any additional fine-tuning on the second domain. We perform experiments across a variety of problem domains, both in simulation and on a real robot.
Our framework is able to align uncalibrated monocular video of a real robot arm to dynamic state-action trajectories of a simulated arm without paired data. Video demonstrations of our results are available at: this https URL . \ No newline at end of file diff --git a/data/2021/iclr/Learning Deep Features in Instrumental Variable Regression b/data/2021/iclr/Learning Deep Features in Instrumental Variable Regression new file mode 100644 index 0000000000..daa044b998 --- /dev/null +++ b/data/2021/iclr/Learning Deep Features in Instrumental Variable Regression @@ -0,0 +1 @@ +Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables from observational data by utilizing an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. We propose a novel method, deep feature instrumental variable regression (DFIV), to address the case where relations between instruments, treatments, and outcomes may be nonlinear. In this case, deep neural nets are trained to define informative nonlinear features on the instruments and treatments. We propose an alternating training regime for these features to ensure good end-to-end performance when composing stages 1 and 2, thus obtaining highly flexible feature maps in a computationally efficient manner. DFIV outperforms recent state-of-the-art methods on challenging IV benchmarks, including settings involving high dimensional image data. DFIV also exhibits competitive performance in off-policy policy evaluation for reinforcement learning, which can be understood as an IV regression task. 
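The classical two-stage IV regression that DFIV generalizes can be sketched directly from the description above: stage 1 regresses the treatment on the instrument, and stage 2 regresses the outcome on the stage-1 predicted treatment. A minimal least-squares sketch:

```python
import numpy as np

def two_stage_least_squares(Z, T, Y):
    # Stage 1: linear regression from instrument Z to treatment T.
    beta1, *_ = np.linalg.lstsq(Z, T, rcond=None)
    T_hat = Z @ beta1
    # Stage 2: linear regression from the predicted treatment to outcome Y;
    # using T_hat instead of T removes the confounding bias.
    beta2, *_ = np.linalg.lstsq(T_hat, Y, rcond=None)
    return beta2
```

On simulated confounded data, this recovers the true causal effect where a direct regression of Y on T would be biased.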
\ No newline at end of file diff --git a/data/2021/iclr/Learning Energy-Based Generative Models via Coarse-to-Fine Expanding and Sampling b/data/2021/iclr/Learning Energy-Based Generative Models via Coarse-to-Fine Expanding and Sampling new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning Energy-Based Models by Diffusion Recovery Likelihood b/data/2021/iclr/Learning Energy-Based Models by Diffusion Recovery Likelihood new file mode 100644 index 0000000000..6477b8da59 --- /dev/null +++ b/data/2021/iclr/Learning Energy-Based Models by Diffusion Recovery Likelihood @@ -0,0 +1 @@ +While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained by maximizing the recovery likelihood: the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. The recovery likelihood objective is more tractable than the marginal likelihood objective, since it only requires MCMC sampling from a relatively concentrated conditional distribution. Moreover, we show that this estimation method is theoretically consistent: it learns the correct conditional and marginal distributions at each noise level, given sufficient data. After training, synthesized images can be generated efficiently by a sampling process that initializes from a spherical Gaussian distribution and progressively samples the conditional distributions at decreasingly lower noise levels. Our method generates high fidelity samples on various image datasets. On unconditional CIFAR-10 our method achieves FID 9.60 and inception score 8.58, superior to the majority of GANs. 
Moreover, we demonstrate that unlike previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets. \ No newline at end of file diff --git a/data/2021/iclr/Learning Generalizable Visual Representations via Interactive Gameplay b/data/2021/iclr/Learning Generalizable Visual Representations via Interactive Gameplay new file mode 100644 index 0000000000..186ff87782 --- /dev/null +++ b/data/2021/iclr/Learning Generalizable Visual Representations via Interactive Gameplay @@ -0,0 +1 @@ +Numerous approaches have recently emerged in the realm of self-supervised visual representation learning. While these methods have demonstrated empirical success, a theoretical foundation that understands and unifies these diverse techniques remains to be established. In this work, we draw inspiration from the principles underlying brain-based learning and propose a new method named self-supervised information bottleneck. Our method aims to maximize the mutual information between representations of views derived from the same image, while maintaining a minimal mutual information between the view and its corresponding representation at the same time. The brain-inspired method provides a unified information-theoretic perspective on various self-supervised approaches. This unified framework also empowers the model to learn generalizable visual representations for diverse downstream tasks and data distributions, achieving state-of-the-art performance across a wide variety of image and video tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Hyperbolic Representations of Topological Features b/data/2021/iclr/Learning Hyperbolic Representations of Topological Features new file mode 100644 index 0000000000..98fad6f81e --- /dev/null +++ b/data/2021/iclr/Learning Hyperbolic Representations of Topological Features @@ -0,0 +1 @@ +Learning task-specific representations of persistence diagrams is an important problem in topological data analysis and machine learning. However, current state of the art methods are restricted in terms of their expressivity as they are focused on Euclidean representations. Persistence diagrams often contain features of infinite persistence (i.e., essential features) and Euclidean spaces shrink their importance relative to non-essential features because they cannot assign infinite distance to finite points. To deal with this issue, we propose a method to learn representations of persistence diagrams on hyperbolic spaces, more specifically on the Poincare ball. By representing features of infinite persistence infinitesimally close to the boundary of the ball, their distance to non-essential features approaches infinity, thereby their relative importance is preserved. This is achieved without utilizing extremely high values for the learnable parameters, thus the representation can be fed into downstream optimization methods and trained efficiently in an end-to-end fashion. We present experimental results on graph and image classification tasks and show that the performance of our method is on par with or exceeds the performance of other state of the art methods. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize b/data/2021/iclr/Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize new file mode 100644 index 0000000000..e92be09f22 --- /dev/null +++ b/data/2021/iclr/Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize @@ -0,0 +1 @@ +Fast and stable fluid simulations are an essential prerequisite for applications ranging from computer-generated imagery to computer-aided design in research and development. However, solving the partial differential equations of incompressible fluids is a challenging task and traditional numerical approximation schemes come at high computational costs. Recent deep learning based approaches promise vast speed-ups but do not generalize to new fluid domains, require fluid simulation data for training, or rely on complex pipelines that outsource major parts of the fluid simulation to traditional methods. In this work, we propose a novel physics-constrained training approach that generalizes to new fluid domains, requires no fluid simulation data, and allows convolutional neural networks to map a fluid state from time-point t to a subsequent state at time t + dt in a single forward pass. This simplifies the pipeline to train and evaluate neural fluid models. After training, the framework yields models that are capable of fast fluid simulations and can handle various fluid phenomena including the Magnus effect and Karman vortex streets. We present an interactive real-time demo to show the speed and generalization capabilities of our trained models. Moreover, the trained neural networks are efficient differentiable fluid solvers as they offer a differentiable update step to advance the fluid simulation in time. 
We exploit this fact in a proof-of-concept optimal control experiment. Our models significantly outperform a recent differentiable fluid solver in terms of computational speed and accuracy. \ No newline at end of file diff --git a/data/2021/iclr/Learning Invariant Representations for Reinforcement Learning without Reconstruction b/data/2021/iclr/Learning Invariant Representations for Reinforcement Learning without Reconstruction new file mode 100644 index 0000000000..2aa4679b25 --- /dev/null +++ b/data/2021/iclr/Learning Invariant Representations for Reinforcement Learning without Reconstruction @@ -0,0 +1 @@ +We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying either on domain knowledge or pixel-reconstruction. Our goal is to learn representations that both provide for effective downstream control and invariance to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, which we propose using to learn robust latent representations which encode only the task-relevant information from observations. Our method trains encoders such that distances in latent space equal bisimulation distances in state space. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks, where the background is replaced with moving distractors and natural videos, while achieving SOTA performance. We also test a first-person highway driving task where our method learns invariance to clouds, weather, and time of day. Finally, we provide generalization results drawn from properties of bisimulation metrics, and links to causal inference. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Long-term Visual Dynamics with Region Proposal Interaction Networks b/data/2021/iclr/Learning Long-term Visual Dynamics with Region Proposal Interaction Networks new file mode 100644 index 0000000000..d7aa90eb64 --- /dev/null +++ b/data/2021/iclr/Learning Long-term Visual Dynamics with Region Proposal Interaction Networks @@ -0,0 +1 @@ +Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches to learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be super accurate but also limits them only to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage the ideas from success stories in visual recognition tasks to build object representations that can capture inter-object and object-environment interactions over a long range. To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin both in terms of prediction quality and its ability to plan for downstream tasks, and also generalizes well to novel environments. Code, pre-trained models, and more visualization results are available at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Manifold Patch-Based Representations of Man-Made Shapes b/data/2021/iclr/Learning Manifold Patch-Based Representations of Man-Made Shapes new file mode 100644 index 0000000000..54771d1e01 --- /dev/null +++ b/data/2021/iclr/Learning Manifold Patch-Based Representations of Man-Made Shapes @@ -0,0 +1 @@ +Choosing the right shape representation for geometry is crucial for making 3D models compatible with existing applications. Focusing on piecewise-smooth man-made shapes, we propose a new representation that is usable in conventional CAD modeling pipelines and can also be learned by deep neural networks. We demonstrate the benefits of our representation by applying it to the task of sketch-based modeling. Given a raster image, our system infers a set of parametric surfaces that realize the input in 3D. To capture the piecewise smooth geometry of man-made shapes, we learn a special shape representation: a deformable parametric template composed of Coons patches. Naively training such a system, however, would suffer from non-manifold artifacts of the parametric shapes as well as from a lack of data. To address this, we introduce loss functions that bias the network to output non-self-intersecting shapes and implement them as part of a fully self-supervised system, automatically generating both shape templates and synthetic training data. To test the efficacy of our system, we develop a testbed for sketch-based modeling and show results on a gallery of synthetic and real artist sketches. As additional applications, we also demonstrate shape interpolation and provide comparison to related work. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Mesh-Based Simulation with Graph Networks b/data/2021/iclr/Learning Mesh-Based Simulation with Graph Networks new file mode 100644 index 0000000000..b7487b8eb0 --- /dev/null +++ b/data/2021/iclr/Learning Mesh-Based Simulation with Graph Networks @@ -0,0 +1 @@ +Mesh-based simulations are central to modeling complex physical systems in many disciplines across science and engineering. Mesh representations support powerful numerical integration methods and their resolution can be adapted to strike favorable trade-offs between accuracy and efficiency. However, high-dimensional scientific simulations are very expensive to run, and solvers and parameters must often be tuned individually to each system studied. Here we introduce MeshGraphNets, a framework for learning mesh-based simulations using graph neural networks. Our model can be trained to pass messages on a mesh graph and to adapt the mesh discretization during forward simulation. Our results show it can accurately predict the dynamics of a wide range of physical systems, including aerodynamics, structural mechanics, and cloth. The model's adaptivity supports learning resolution-independent dynamics and can scale to more complex state spaces at test time. Our method is also highly efficient, running 1-2 orders of magnitude faster than the simulation on which it is trained. Our approach broadens the range of problems on which neural network simulators can operate and promises to improve the efficiency of complex, scientific modeling tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch b/data/2021/iclr/Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch new file mode 100644 index 0000000000..92a3e0c15b --- /dev/null +++ b/data/2021/iclr/Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch @@ -0,0 +1 @@ +Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity, which zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity, which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware friendly and hence receives limited speed gains. On the other hand, coarse-grained sparsity cannot simultaneously achieve both apparent acceleration on modern GPUs and decent performance. In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2:4 sparse network could achieve a 2× speed-up without performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel and effective ingredient, the sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure the sparse network's topology change during the training process. Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. 
Anonymous code and model will be available at https://github.com/anonymous-NM-sparsity/NM-sparsity. \ No newline at end of file diff --git a/data/2021/iclr/Learning Neural Event Functions for Ordinary Differential Equations b/data/2021/iclr/Learning Neural Event Functions for Ordinary Differential Equations new file mode 100644 index 0000000000..904eb9e07d --- /dev/null +++ b/data/2021/iclr/Learning Neural Event Functions for Ordinary Differential Equations @@ -0,0 +1 @@ +The existing Neural ODE formulation relies on an explicit knowledge of the termination time. We extend Neural ODEs to implicitly defined termination criteria modeled by neural event functions, which can be chained together and differentiated through. Neural Event ODEs are capable of modeling discrete (instantaneous) changes in a continuous-time system, without prior knowledge of when these changes should occur or how many such changes should exist. We test our approach in modeling hybrid discrete and continuous systems such as switching dynamical systems and collisions in multi-body systems, and we propose simulation-based training of point processes with applications in discrete control. \ No newline at end of file diff --git a/data/2021/iclr/Learning Neural Generative Dynamics for Molecular Conformation Generation b/data/2021/iclr/Learning Neural Generative Dynamics for Molecular Conformation Generation new file mode 100644 index 0000000000..7c7b6be996 --- /dev/null +++ b/data/2021/iclr/Learning Neural Generative Dynamics for Molecular Conformation Generation @@ -0,0 +1 @@ +We study how to generate molecule conformations (\textit{i.e.}, 3D structures) from a molecular graph. Traditional methods, such as molecular dynamics, sample conformations via computationally expensive simulations. Recently, machine learning methods have shown great potential by training on a large collection of conformation data. 
Challenges arise from the limited model capacity for capturing complex distributions of conformations and the difficulty in modeling long-range dependencies between atoms. Inspired by the recent progress in deep generative models, in this paper, we propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph. We propose a method combining the advantages of both flow-based and energy-based models, enjoying: (1) a high model capacity to estimate the multimodal conformation distribution; (2) explicitly capturing the complex long-range dependencies between atoms in the observation space. Extensive experiments demonstrate the superior performance of the proposed method on several benchmarks, including conformation generation and distance modeling tasks, with a significant improvement over existing generative models for molecular conformation sampling. \ No newline at end of file diff --git a/data/2021/iclr/Learning Parametrised Graph Shift Operators b/data/2021/iclr/Learning Parametrised Graph Shift Operators new file mode 100644 index 0000000000..0639224726 --- /dev/null +++ b/data/2021/iclr/Learning Parametrised Graph Shift Operators @@ -0,0 +1 @@ +In many domains, data is currently represented as graphs, and therefore the graph representation of this data is becoming increasingly important in machine learning. Network data is, implicitly or explicitly, always represented using a graph shift operator (GSO), with the most common choices being the adjacency and Laplacian matrices and their normalisations. In this paper, a novel parametrised GSO (PGSO) is proposed, where specific parameter values result in the most commonly used GSOs and message-passing operators in graph neural network (GNN) frameworks. The PGSO is suggested as a replacement for the standard GSOs used in state-of-the-art GNN architectures, and the optimisation of the PGSO parameters is seamlessly included in the model training. 
It is proved that the PGSO has real eigenvalues and a set of real eigenvectors independent of the parameter values, and spectral bounds on the PGSO are derived. PGSO parameters are shown to adapt to the sparsity of the graph structure in a study on stochastic blockmodel networks, where they are found to automatically replicate the GSO regularisation found in the literature. On several real-world datasets the accuracy of state-of-the-art GNN architectures is improved by the inclusion of the PGSO in both node- and graph-classification tasks. \ No newline at end of file diff --git a/data/2021/iclr/Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues b/data/2021/iclr/Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues new file mode 100644 index 0000000000..c3683082b9 --- /dev/null +++ b/data/2021/iclr/Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues @@ -0,0 +1 @@ +Compared to traditional visual question answering, video-grounded dialogues require additional reasoning over dialogue context to answer questions in a multi-turn setting. Previous approaches to video-grounded dialogues mostly use dialogue context as a simple text input without modelling the inherent information flows at the turn level. In this paper, we propose a novel framework of Reasoning Paths in Dialogue Context (PDC). The PDC model discovers information flows among dialogue turns through a semantic graph constructed based on lexical components in each question and answer. The PDC model then learns to predict reasoning paths over this semantic graph. Our path prediction model predicts a path from the current turn through past dialogue turns that contain additional visual cues to answer the current question. Our reasoning model sequentially processes both visual and textual information through this reasoning path, and the propagated features are used to generate the answer. 
Our experimental results demonstrate the effectiveness of our method and provide additional insights on how models use semantic dependencies in a dialogue context to retrieve visual cues. \ No newline at end of file diff --git a/data/2021/iclr/Learning Robust State Abstractions for Hidden-Parameter Block MDPs b/data/2021/iclr/Learning Robust State Abstractions for Hidden-Parameter Block MDPs new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates b/data/2021/iclr/Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates new file mode 100644 index 0000000000..77fc9ef466 --- /dev/null +++ b/data/2021/iclr/Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates @@ -0,0 +1 @@ +We study the multi-agent safe control problem where agents should avoid collisions with static obstacles and with each other while reaching their goals. Our core idea is to learn the multi-agent control policy jointly with learning the control barrier functions as safety certificates. We propose a novel joint-learning framework that can be implemented in a decentralized fashion, with generalization guarantees for certain function classes. Such a decentralized framework can adapt to an arbitrarily large number of agents. Building upon this framework, we further improve the scalability by incorporating neural network architectures that are invariant to the quantity and permutation of neighboring agents. In addition, we propose a new spontaneous policy refinement method to further enforce the certificate condition during testing. We provide extensive experiments to demonstrate that our method significantly outperforms other leading multi-agent control approaches in terms of maintaining safety and completing original tasks. 
Our approach also shows exceptional generalization capability in that the control policy can be trained with 8 agents in one scenario, while being used on other scenarios with up to 1024 agents in complex multi-agent environments and dynamics. \ No newline at end of file diff --git a/data/2021/iclr/Learning Structural Edits via Incremental Tree Transformations b/data/2021/iclr/Learning Structural Edits via Incremental Tree Transformations new file mode 100644 index 0000000000..4d546e569b --- /dev/null +++ b/data/2021/iclr/Learning Structural Edits via Incremental Tree Transformations @@ -0,0 +1 @@ +While most neural generative models generate outputs in a single pass, the human creative process is usually one of iterative building and refinement. Recent work has proposed models of editing processes, but these mostly focus on editing sequential data and/or only model a single editing pass. In this paper, we present a generic model for incremental editing of structured data (i.e. ''structural edits''). Particularly, we focus on tree-structured data, taking abstract syntax trees of computer programs as our canonical example. Our editor learns to iteratively generate tree edits (e.g. deleting or adding a subtree) and applies them to the partially edited data, thereby the entire editing process can be formulated as consecutive, incremental tree transformations. To show the unique benefits of modeling tree edits directly, we further propose a novel edit encoder for learning to represent edits, as well as an imitation learning method that allows the editor to be more robust. We evaluate our proposed editor on two source code edit datasets, where results show that, with the proposed edit encoder, our editor significantly improves accuracy over previous approaches that generate the edited program directly in one pass. Finally, we demonstrate that training our editor to imitate experts and correct its mistakes dynamically can further improve its performance. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Subgoal Representations with Slow Dynamics b/data/2021/iclr/Learning Subgoal Representations with Slow Dynamics new file mode 100644 index 0000000000..5d05fc6234 --- /dev/null +++ b/data/2021/iclr/Learning Subgoal Representations with Slow Dynamics @@ -0,0 +1 @@ +In goal-conditioned Hierarchical Reinforcement Learning (HRL), a high-level policy periodically sets subgoals for a low-level policy, and the low-level policy is trained to reach those subgoals. A proper subgoal representation function, which abstracts a state space to a latent subgoal space, is crucial for effective goal-conditioned HRL, since different low-level behaviors are induced by reaching subgoals in the compressed representation space. Observing that the high-level agent operates at an abstract temporal scale, we propose a slowness objective to effectively learn the subgoal representation (i.e., the high-level action space). We provide a theoretical grounding for the slowness objective. That is, selecting slow features as the subgoal space can achieve efficient hierarchical exploration. As a result of better exploration ability, our approach significantly outperforms state-of-the-art HRL and exploration methods on a number of benchmark continuous-control tasks. Thanks to the generality of the proposed subgoal representation learning method, empirical results also demonstrate that the learned representation and corresponding low-level policies can be transferred between distinct tasks. \ No newline at end of file diff --git a/data/2021/iclr/Learning Task Decomposition with Ordered Memory Policy Network b/data/2021/iclr/Learning Task Decomposition with Ordered Memory Policy Network new file mode 100644 index 0000000000..3409b6431f --- /dev/null +++ b/data/2021/iclr/Learning Task Decomposition with Ordered Memory Policy Network @@ -0,0 +1 @@ +Many complex real-world tasks are composed of several levels of sub-tasks. 
Humans leverage these hierarchical structures to accelerate the learning process and achieve better generalization. In this work, we study the inductive bias and propose Ordered Memory Policy Network (OMPN) to discover subtask hierarchy by learning from demonstration. The discovered subtask hierarchy could be used to perform task decomposition, recovering the subtask boundaries in an unstructured demonstration. Experiments on Craft and Dial demonstrate that our model can achieve higher task decomposition performance under both unsupervised and weakly supervised settings, compared with strong baselines. OMPN can also be directly applied to partially observable environments and still achieve higher task decomposition performance. Our visualization further confirms that the subtask hierarchy can emerge in our model. \ No newline at end of file diff --git a/data/2021/iclr/Learning Task-General Representations with Generative Neuro-Symbolic Modeling b/data/2021/iclr/Learning Task-General Representations with Generative Neuro-Symbolic Modeling new file mode 100644 index 0000000000..77c9161bc3 --- /dev/null +++ b/data/2021/iclr/Learning Task-General Representations with Generative Neuro-Symbolic Modeling @@ -0,0 +1 @@ +A hallmark of human intelligence is the ability to interact directly with raw data and acquire rich, general-purpose conceptual representations. In machine learning, symbolic models can capture the compositional and causal knowledge that enables flexible generalization, but they struggle to learn from raw inputs, relying on strong abstractions and simplifying assumptions. Neural network models can learn directly from raw data, but they struggle to capture compositional and causal structure and typically must retrain to tackle new tasks. To help bridge this gap, we propose Generative Neuro-Symbolic (GNS) Modeling, a framework for learning task-general representations by combining the structure of symbolic models with the expressivity of neural networks. 
Concepts and conceptual background knowledge are represented as probabilistic programs with neural network sub-routines, maintaining explicit causal and compositional structure while capturing nonparametric relationships and learning directly from raw data. We apply GNS to the Omniglot challenge of learning simple visual concepts at a human level. We report competitive results on 4 unique tasks including one-shot classification, parsing, generating new exemplars, and generating new concepts. To our knowledge, this is the strongest neurally-grounded model to complete a diverse set of Omniglot tasks. \ No newline at end of file diff --git a/data/2021/iclr/Learning Value Functions in Deep Policy Gradients using Residual Variance b/data/2021/iclr/Learning Value Functions in Deep Policy Gradients using Residual Variance new file mode 100644 index 0000000000..e182b97f77 --- /dev/null +++ b/data/2021/iclr/Learning Value Functions in Deep Policy Gradients using Residual Variance @@ -0,0 +1 @@ +Policy gradient algorithms have proven to be successful in diverse decision making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the critic uses a new state-value (resp. state-actionvalue) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than the absolute value as in conventional actor-critic. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. 
Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights. \ No newline at end of file diff --git a/data/2021/iclr/Learning What To Do by Simulating the Past b/data/2021/iclr/Learning What To Do by Simulating the Past new file mode 100644 index 0000000000..45c905d4e1 --- /dev/null +++ b/data/2021/iclr/Learning What To Do by Simulating the Past @@ -0,0 +1 @@ +Since reward functions are hard to specify, recent work has focused on learning policies from human feedback. However, such approaches are impeded by the expense of acquiring such feedback. Recent work proposed that agents have access to a source of information that is effectively free: in any environment that humans have acted in, the state will already be optimized for human preferences, and thus an agent can extract information about what humans want from the state. Such learning is possible in principle, but requires simulating all possible past trajectories that could have led to the observed state. This is feasible in gridworlds, but how do we scale it to complex tasks? In this work, we show that by combining a learned feature encoder with learned inverse models, we can enable agents to simulate human actions backwards in time to infer what they must have done. The resulting algorithm is able to reproduce a specific skill in MuJoCo environments given a single state sampled from the optimal policy for that skill. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning a Latent Search Space for Routing Problems using Variational Autoencoders b/data/2021/iclr/Learning a Latent Search Space for Routing Problems using Variational Autoencoders new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning a Latent Simplex in Input Sparsity Time b/data/2021/iclr/Learning a Latent Simplex in Input Sparsity Time new file mode 100644 index 0000000000..ea5a4bbfbe --- /dev/null +++ b/data/2021/iclr/Learning a Latent Simplex in Input Sparsity Time @@ -0,0 +1 @@ +We consider the problem of learning a latent $k$-vertex simplex $K\subset\mathbb{R}^d$, given access to $A\in\mathbb{R}^{d\times n}$, which can be viewed as a data matrix with $n$ points that are obtained by randomly perturbing latent points in the simplex $K$ (potentially beyond $K$). A large class of latent variable models, such as adversarial clustering, mixed membership stochastic block models, and topic models can be cast as learning a latent simplex. Bhattacharyya and Kannan (SODA, 2020) give an algorithm for learning such a latent simplex in time roughly $O(k\cdot\textrm{nnz}(A))$, where $\textrm{nnz}(A)$ is the number of non-zeros in $A$. We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$, which holds in many of these applications. Further, we show this assumption is necessary, as otherwise an algorithm for learning a latent simplex would imply an algorithmic breakthrough for spectral low rank approximation. At a high level, Bhattacharyya and Kannan provide an adaptive algorithm that makes $k$ matrix-vector product queries to $A$ and each query is a function of all queries preceding it. Since each matrix-vector product requires $\textrm{nnz}(A)$ time, their overall running time appears unavoidable. 
Instead, we obtain a low-rank approximation to $A$ in input-sparsity time and show that the column space thus obtained has small $\sin\Theta$ (angular) distance to the right top-$k$ singular space of $A$. Our algorithm then selects $k$ points in the low-rank subspace with the largest inner product with $k$ carefully chosen random vectors. By working in the low-rank subspace, we avoid reading the entire matrix in each iteration and thus circumvent the $\Theta(k\cdot\textrm{nnz}(A))$ running time. \ No newline at end of file diff --git a/data/2021/iclr/Learning advanced mathematical computations from examples b/data/2021/iclr/Learning advanced mathematical computations from examples new file mode 100644 index 0000000000..d46088aa68 --- /dev/null +++ b/data/2021/iclr/Learning advanced mathematical computations from examples @@ -0,0 +1 @@ +Using transformers over large generated datasets, we train models to learn mathematical properties of differential systems, such as local stability, behavior at infinity and controllability. We achieve near perfect prediction of qualitative characteristics, and good approximations of numerical features of the system. This demonstrates that neural networks can learn to perform complex computations, grounded in advanced theory, from examples, without built-in mathematical knowledge \ No newline at end of file diff --git a/data/2021/iclr/Learning and Evaluating Representations for Deep One-Class Classification b/data/2021/iclr/Learning and Evaluating Representations for Deep One-Class Classification new file mode 100644 index 0000000000..8de35d785b --- /dev/null +++ b/data/2021/iclr/Learning and Evaluating Representations for Deep One-Class Classification @@ -0,0 +1 @@ +We present a two-stage framework for deep one-class classification. We first learn self-supervised representations from one-class data, and then build one-class classifiers on learned representations. 
The framework not only allows us to learn better representations, but also permits building one-class classifiers that are faithful to the target task. In particular, we present a novel distribution-augmented contrastive learning that extends training distributions via data augmentation to obstruct the uniformity of contrastive representations. Moreover, we argue that classifiers inspired by the statistical perspective in generative or discriminative models are more effective than existing approaches, such as an average of normality scores from a surrogate classifier. In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks. Finally, we present visual explanations, confirming that the decision-making process of our deep one-class classifier is intuitive to humans. The code is available at: this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Learning continuous-time PDEs from sparse data with graph neural networks b/data/2021/iclr/Learning continuous-time PDEs from sparse data with graph neural networks new file mode 100644 index 0000000000..a45437b526 --- /dev/null +++ b/data/2021/iclr/Learning continuous-time PDEs from sparse data with graph neural networks @@ -0,0 +1 @@ +The behavior of many dynamical systems follows complex, yet still unknown partial differential equations (PDEs). While several machine learning methods have been proposed to learn PDEs directly from data, previous methods are limited to discrete-time approximations or make the limiting assumption of the observations arriving at regular grids. We propose a general continuous-time differential model for dynamical systems whose governing equations are parameterized by message passing graph neural networks. The model admits arbitrary space and time discretizations, which removes constraints on the locations of observation points and time intervals between the observations.
The model is trained with the continuous-time adjoint method, enabling efficient neural PDE inference. We demonstrate the model's ability to work with unstructured grids, arbitrary time steps, and noisy observations. We compare our method with existing approaches on several well-known physical systems that involve first and higher-order PDEs with state-of-the-art predictive performance. \ No newline at end of file diff --git a/data/2021/iclr/Learning explanations that are hard to vary b/data/2021/iclr/Learning explanations that are hard to vary new file mode 100644 index 0000000000..167e367725 --- /dev/null +++ b/data/2021/iclr/Learning explanations that are hard to vary @@ -0,0 +1 @@ +In this paper, we investigate the principle that `good explanations are hard to vary' in the context of deep learning. We show that averaging gradients across examples -- akin to a logical OR of patterns -- can favor memorization and `patchwork' solutions that sew together different strategies, instead of identifying invariances. To inspect this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. We then propose and experimentally validate a simple alternative algorithm based on a logical AND, that focuses on invariances and prevents memorization in a set of real-world tasks. Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers.
\ No newline at end of file diff --git a/data/2021/iclr/Learning from Demonstration with Weakly Supervised Disentanglement b/data/2021/iclr/Learning from Demonstration with Weakly Supervised Disentanglement new file mode 100644 index 0000000000..890e019c01 --- /dev/null +++ b/data/2021/iclr/Learning from Demonstration with Weakly Supervised Disentanglement @@ -0,0 +1 @@ +Robotic manipulation tasks, such as wiping with a soft sponge, require control from multiple rich sensory modalities. Human-robot interaction, aimed at teaching robots, is difficult in this setting as there is potential for mismatch between human and machine comprehension of the rich data streams. We treat the task of interpretable learning from demonstration as an optimisation problem over a probabilistic generative model. To account for the high-dimensionality of the data, a high-capacity neural network is chosen to represent the model. The latent variables in this model are explicitly aligned with high-level notions and concepts that are manifested in a set of demonstrations. We show that such alignment is best achieved through the use of labels from the end user, in an appropriately restricted vocabulary, in contrast to the conventional approach of the designer picking a prior over the latent variables. Our approach is evaluated in the context of a table-top robot manipulation task performed by a PR2 robot -- that of dabbing liquids with a sponge (forcefully pressing a sponge and moving it along a surface). The robot provides visual information, arm joint positions and arm joint efforts. 
We have made videos of the task and data available - see supplementary materials at this https URL \ No newline at end of file diff --git a/data/2021/iclr/Learning from Protein Structure with Geometric Vector Perceptrons b/data/2021/iclr/Learning from Protein Structure with Geometric Vector Perceptrons new file mode 100644 index 0000000000..0fe6d0c048 --- /dev/null +++ b/data/2021/iclr/Learning from Protein Structure with Geometric Vector Perceptrons @@ -0,0 +1 @@ +Learning on 3D structures of large biomolecules is emerging as a distinct area in machine learning, but there has yet to emerge a unifying network architecture that simultaneously leverages the graph-structured and geometric aspects of the problem domain. To address this gap, we introduce geometric vector perceptrons, which extend standard dense layers to operate on collections of Euclidean vectors. Graph neural networks equipped with such layers are able to perform both geometric and relational reasoning on efficient and natural representations of macromolecular structure. We demonstrate our approach on two important problems in learning from protein structure: model quality assessment and computational protein design. Our approach improves over existing classes of architectures, including state-of-the-art graph-based and voxel-based methods. \ No newline at end of file diff --git a/data/2021/iclr/Learning from others' mistakes: Avoiding dataset biases without modeling them b/data/2021/iclr/Learning from others' mistakes: Avoiding dataset biases without modeling them new file mode 100644 index 0000000000..5dfd0acf6a --- /dev/null +++ b/data/2021/iclr/Learning from others' mistakes: Avoiding dataset biases without modeling them @@ -0,0 +1 @@ +State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended underlying task. 
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available. We consider cases where the bias issues may not be explicitly identified, and show a method for training models that learn to ignore these problematic correlations. Our approach relies on the observation that models with limited capacity primarily learn to exploit biases in the dataset. We can leverage the errors of such limited capacity models to train a more robust model in a product of experts, thus bypassing the need to hand-craft a biased model. We show the effectiveness of this method to retain improvements in out-of-distribution settings even if no particular bias is targeted by the biased model. \ No newline at end of file diff --git a/data/2021/iclr/Learning perturbation sets for robust machine learning b/data/2021/iclr/Learning perturbation sets for robust machine learning new file mode 100644 index 0000000000..5db32894fe --- /dev/null +++ b/data/2021/iclr/Learning perturbation sets for robust machine learning @@ -0,0 +1 @@ +Although much progress has been made towards robust deep learning, a significant gap in robustness remains between real-world perturbations and more narrowly defined sets typically studied in adversarial defenses. In this paper, we aim to bridge this gap by learning perturbation sets from data, in order to characterize real-world effects for robust training and evaluation. Specifically, we use a conditional generator that defines the perturbation set over a constrained region of the latent space. We formulate desirable properties that measure the quality of a learned perturbation set, and theoretically prove that a conditional variational autoencoder naturally satisfies these criteria. Using this framework, our approach can generate a variety of perturbations at different complexities and scales, ranging from baseline spatial transformations, through common image corruptions, to lighting variations. 
We measure the quality of our learned perturbation sets both quantitatively and qualitatively, finding that our models are capable of producing a diverse set of meaningful perturbations beyond the limited data seen during training. Finally, we leverage our learned perturbation sets to train models which are empirically and certifiably robust to adversarial image corruptions and adversarial lighting variations, while improving generalization on non-adversarial data. All code and configuration files for reproducing the experiments as well as pretrained model weights can be found at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Learning the Pareto Front with Hypernetworks b/data/2021/iclr/Learning the Pareto Front with Hypernetworks new file mode 100644 index 0000000000..0b5c78ecb5 --- /dev/null +++ b/data/2021/iclr/Learning the Pareto Front with Hypernetworks @@ -0,0 +1,2 @@ +Multi-objective optimization problems are prevalent in machine learning. These problems have a set of optimal solutions, called the Pareto front, where each point on the front represents a different trade-off between possibly conflicting objectives. Recent optimization algorithms can target a specific desired ray in loss space, but still face two grave limitations: (i) A separate model has to be trained for each point on the front; and (ii) The exact trade-off must be known prior to the optimization process. Here, we tackle the problem of learning the entire Pareto front, with the capability of selecting a desired operating point on the front after training. We call this new setup Pareto-Front Learning (PFL). +We describe an approach to PFL implemented using HyperNetworks, which we term Pareto HyperNetworks (PHNs). PHN learns the entire Pareto front simultaneously using a single hypernetwork, which receives as input a desired preference vector and returns a Pareto-optimal model whose loss vector is in the desired ray. 
The unified model is runtime efficient compared to training multiple models, and generalizes to new operating points not used during training. We evaluate our method on a wide set of problems, from multi-task regression and classification to fairness. PHNs learn the entire Pareto front in roughly the same time as learning a single point on the front, and also reach a better solution set. PFL opens the door to new applications where models are selected based on preferences that are only available at run time. \ No newline at end of file diff --git a/data/2021/iclr/Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation b/data/2021/iclr/Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation new file mode 100644 index 0000000000..3ab099d923 --- /dev/null +++ b/data/2021/iclr/Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation @@ -0,0 +1 @@ +Knowledge graphs (KGs) have helped neural-symbolic models improve performance on various knowledge-intensive tasks, like question answering and item recommendation. By using attention over the KG, such models can also "explain" which KG information was most relevant for making a given prediction. In this paper, we question whether these models are really behaving as we expect. We demonstrate that, through a reinforcement learning policy (or even simple heuristics), one can produce deceptively perturbed KGs which maintain the downstream performance of the original KG while significantly deviating from the original semantics and structure. Our findings raise doubts about KG-augmented models' ability to leverage KG information and provide plausible explanations.
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Generate 3D Shapes with Generative Cellular Automata b/data/2021/iclr/Learning to Generate 3D Shapes with Generative Cellular Automata new file mode 100644 index 0000000000..b29c37a1b8 --- /dev/null +++ b/data/2021/iclr/Learning to Generate 3D Shapes with Generative Cellular Automata @@ -0,0 +1 @@ +We present a probabilistic 3D generative model, named Generative Cellular Automata, which is able to produce diverse and high quality shapes. We formulate the shape generation process as sampling from the transition kernel of a Markov chain, where the sampling chain eventually evolves to the full shape of the learned distribution. The transition kernel employs the local update rules of cellular automata, effectively reducing the search space in a high-resolution 3D grid space by exploiting the connectivity and sparsity of 3D shapes. Our progressive generation only focuses on the sparse set of occupied voxels and their neighborhood, thus enabling the utilization of an expressive sparse convolutional network. We propose an effective training scheme to obtain the local homogeneous rule of generative cellular automata with sequences that are slightly different from the sampling chain but converge to the full shapes in the training data. Extensive experiments on probabilistic shape completion and shape generation demonstrate that our method achieves competitive performance against recent methods. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Make Decisions via Submodular Regularization b/data/2021/iclr/Learning to Make Decisions via Submodular Regularization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning to Reach Goals via Iterated Supervised Learning b/data/2021/iclr/Learning to Reach Goals via Iterated Supervised Learning new file mode 100644 index 0000000000..ef0472ce0e --- /dev/null +++ b/data/2021/iclr/Learning to Reach Goals via Iterated Supervised Learning @@ -0,0 +1 @@ +Current reinforcement learning (RL) algorithms can be brittle and difficult to use, especially when learning goal-reaching behaviors from sparse rewards. Although supervised imitation learning provides a simple and stable alternative, it requires access to demonstrations from a human supervisor. In this paper, we study RL algorithms that use imitation learning to acquire goal reaching policies from scratch, without the need for expert demonstrations or a value function. In lieu of demonstrations, we leverage the property that any trajectory is a successful demonstration for reaching the final state in that same trajectory. We propose a simple algorithm in which an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch. Each iteration, the agent collects new trajectories using the latest policy, and maximizes the likelihood of the actions along these trajectories under the goal that was actually reached, so as to improve the policy. We formally show that this iterated supervised learning procedure optimizes a bound on the RL objective, derive performance bounds of the learned policy, and empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Recombine and Resample Data For Compositional Generalization b/data/2021/iclr/Learning to Recombine and Resample Data For Compositional Generalization new file mode 100644 index 0000000000..330e4ac9e9 --- /dev/null +++ b/data/2021/iclr/Learning to Recombine and Resample Data For Compositional Generalization @@ -0,0 +1 @@ +Flexible neural models outperform grammar- and automaton-based counterparts on a variety of sequence modeling tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data -- particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. Here we present a family of learned data augmentation schemes that support a large category of compositional generalizations without appeal to latent symbolic structure. Our approach to data augmentation has two components: recombination of original training examples via a prototype-based generative model and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems---instruction following (SCAN) and morphological analysis (Sigmorphon 2018)---where our approach enables learning of new constructions and tenses from as few as eight initial examples. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Represent Action Values as a Hypergraph on the Action Vertices b/data/2021/iclr/Learning to Represent Action Values as a Hypergraph on the Action Vertices new file mode 100644 index 0000000000..6463e0d1ce --- /dev/null +++ b/data/2021/iclr/Learning to Represent Action Values as a Hypergraph on the Action Vertices @@ -0,0 +1 @@ +Action-value estimation is a critical component of many reinforcement learning (RL) methods whereby sample complexity relies heavily on how fast a good estimator for action value can be learned. By viewing this problem through the lens of representation learning, good representations of both state and action can facilitate action-value estimation. While advances in deep learning have seamlessly driven progress in learning state representations, given the specificity of the notion of agency to RL, little attention has been paid to learning action representations. We conjecture that leveraging the combinatorial structure of multi-dimensional action spaces is a key ingredient for learning good representations of action. To test this, we set forth the action hypergraph networks framework---a class of functions for learning action representations with a relational inductive bias. Using this framework we realise an agent class based on a combination with deep Q-networks, which we dub hypergraph Q-networks. We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and physical control benchmarks. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Sample with Local and Global Contexts in Experience Replay Buffer b/data/2021/iclr/Learning to Sample with Local and Global Contexts in Experience Replay Buffer new file mode 100644 index 0000000000..b78c782d74 --- /dev/null +++ b/data/2021/iclr/Learning to Sample with Local and Global Contexts in Experience Replay Buffer @@ -0,0 +1 @@ +Experience replay, which enables the agents to remember and reuse experience from the past, plays a significant role in the success of off-policy reinforcement learning (RL). To utilize the experience replay efficiently, experience transitions should be sampled with consideration of their significance, such that the known prioritized experience replay (PER) further allows to sample more important experience. Yet, the conventional PER may result in generating highly biased samples due to considering a single metric such as TD-error and computing the sampling rate independently for each experience. To tackle this issue, we propose a Neural Experience Replay Sampler (NERS), which adaptively evaluates the relative importance of a sampled transition by obtaining context from not only its (local) values that characterize itself such as TD-error or the raw features but also other (global) transitions. We validate our framework on multiple benchmark tasks for both continuous and discrete controls and show that the proposed framework significantly improves the performance of various off-policy RL methods. Further analysis confirms that the improvements indeed come from the use of diverse features and the consideration of the relative importance of experiences. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Set Waypoints for Audio-Visual Navigation b/data/2021/iclr/Learning to Set Waypoints for Audio-Visual Navigation new file mode 100644 index 0000000000..fcfc7a16a7 --- /dev/null +++ b/data/2021/iclr/Learning to Set Waypoints for Audio-Visual Navigation @@ -0,0 +1 @@ +In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units b/data/2021/iclr/Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units new file mode 100644 index 0000000000..401dab8464 --- /dev/null +++ b/data/2021/iclr/Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units @@ -0,0 +1 @@ +The units in artificial neural networks (ANNs) can be thought of as abstractions of biological neurons, and ANNs are increasingly used in neuroscience research. However, there are many important differences between ANN units and real neurons. One of the most notable is the absence of Dale’s principle, which ensures that biological neurons are either exclusively excitatory or inhibitory. Dale’s principle is typically left out of ANNs because its inclusion impairs learning. This is problematic, because one of the great advantages of ANNs for neuroscience research is their ability to learn complicated, realistic tasks. Here, by taking inspiration from feedforward inhibitory interneurons in the brain we show that we can develop ANNs with separate populations of excitatory and inhibitory units that learn just as well as standard ANNs. We call these networks Dale’s ANNs (DANNs). We present two insights that enable DANNs to learn well: (1) DANNs are related to normalization schemes, and can be initialized such that the inhibition centres and standardizes the excitatory activity, (2) updates to inhibitory neuron parameters should be scaled using corrections based on the Fisher Information matrix. These results demonstrate how ANNs that respect Dale’s principle can be built without sacrificing learning performance, which is important for future work using ANNs as models of the brain. The results also may have interesting implications for how inhibitory plasticity in the real brain operates. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning with AMIGo: Adversarially Motivated Intrinsic Goals b/data/2021/iclr/Learning with AMIGo: Adversarially Motivated Intrinsic Goals new file mode 100644 index 0000000000..7d56b39787 --- /dev/null +++ b/data/2021/iclr/Learning with AMIGo: Adversarially Motivated Intrinsic Goals @@ -0,0 +1 @@ +A key challenge for reinforcement learning (RL) consists of learning in environments with sparse extrinsic rewards. In contrast to current RL methods, humans are able to learn new skills with little or no reward by using various forms of intrinsic motivation. We propose AMIGo, a novel agent incorporating a goal-generating teacher that proposes Adversarially Motivated Intrinsic Goals to train a goal-conditioned "student" policy in the absence of (or alongside) environment reward. Specifically, through a simple but effective "constructively adversarial" objective, the teacher learns to propose increasingly challenging---yet achievable---goals that allow the student to learn general skills for acting in a new environment, independent of the task to be solved. We show that our method generates a natural curriculum of self-proposed goals which ultimately allows the agent to solve challenging procedurally-generated tasks where other forms of intrinsic motivation and state-of-the-art RL methods fail. \ No newline at end of file diff --git a/data/2021/iclr/Learning with Feature-Dependent Label Noise: A Progressive Approach b/data/2021/iclr/Learning with Feature-Dependent Label Noise: A Progressive Approach new file mode 100644 index 0000000000..f9b70229fe --- /dev/null +++ b/data/2021/iclr/Learning with Feature-Dependent Label Noise: A Progressive Approach @@ -0,0 +1 @@ +Label noise is frequently observed in real-world large-scale datasets. The noise is introduced due to a variety of reasons; it is heterogeneous and feature-dependent. 
Most existing approaches to handling noisy labels fall into two categories: they either assume an ideal feature-independent noise, or remain heuristic without theoretical guarantees. In this paper, we propose to target a new family of feature-dependent label noise, which is much more general than commonly used i.i.d. label noise and encompasses a broad spectrum of noise patterns. Focusing on this general noise family, we propose a progressive label correction algorithm that iteratively corrects labels and refines the model. We provide theoretical guarantees showing that for a wide variety of (unknown) noise patterns, a classifier trained with this strategy converges to be consistent with the Bayes classifier. In experiments, our method outperforms SOTA baselines and is robust to various noise types and levels. \ No newline at end of file diff --git a/data/2021/iclr/Learning with Instance-Dependent Label Noise: A Sample Sieve Approach b/data/2021/iclr/Learning with Instance-Dependent Label Noise: A Sample Sieve Approach new file mode 100644 index 0000000000..759e1f914d --- /dev/null +++ b/data/2021/iclr/Learning with Instance-Dependent Label Noise: A Sample Sieve Approach @@ -0,0 +1 @@ +Human-annotated labels are often prone to noise, and the presence of such noise will degrade the performance of the resulting deep neural network (DNN) models. Much of the literature (with several recent exceptions) of learning with noisy labels focuses on the case when the label noise is independent from features. Practically, annotations errors tend to be instance-dependent and often depend on the difficulty levels of recognizing a certain task. Applying existing results from instance-independent settings would require a significant amount of estimation of noise rates. Therefore, learning with instance-dependent label noise remains a challenge. In this paper, we propose CORES^2 (COnfidence REgularized Sample Sieve), which progressively sieves out corrupted samples. 
The implementation of CORES^2 does not require specifying noise rates and yet we are able to provide theoretical guarantees of CORES^2 in filtering out the corrupted examples. This high-quality sample sieve allows us to treat clean examples and the corrupted ones separately in training a DNN solution, and such a separation is shown to be advantageous in the instance-dependent noise setting. We demonstrate the performance of CORES^2 on CIFAR10 and CIFAR100 datasets with synthetic instance-dependent label noise and Clothing1M with real-world human noise. Of independent interest, our sample sieve provides generic machinery for anatomizing noisy datasets and provides a flexible interface for various robust training techniques to further improve the performance. \ No newline at end of file diff --git a/data/2021/iclr/Learning-based Support Estimation in Sublinear Time b/data/2021/iclr/Learning-based Support Estimation in Sublinear Time new file mode 100644 index 0000000000..135f70f29d --- /dev/null +++ b/data/2021/iclr/Learning-based Support Estimation in Sublinear Time @@ -0,0 +1 @@ +We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $ \pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency.
We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \log (1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from {Hsu et al, ICLR'19} as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state-of-the-art algorithm. \ No newline at end of file diff --git a/data/2021/iclr/Lifelong Learning of Compositional Structures b/data/2021/iclr/Lifelong Learning of Compositional Structures new file mode 100644 index 0000000000..f9bd259cec --- /dev/null +++ b/data/2021/iclr/Lifelong Learning of Compositional Structures @@ -0,0 +1 @@ +A hallmark of human intelligence is the ability to construct self-contained chunks of knowledge and adequately reuse them in novel combinations for solving different yet structurally related problems. Learning such compositional structures has been a significant challenge for artificial systems, due to the combinatorial nature of the underlying search problem. To date, research into compositional learning has largely proceeded separately from work on lifelong or continual learning. We integrate these two lines of work to present a general-purpose framework for lifelong learning of compositional structures that can be used for solving a stream of related tasks. Our framework separates the learning process into two broad stages: learning how to best combine existing components in order to assimilate a novel problem, and learning how to adapt the set of existing components to accommodate the new problem. This separation explicitly handles the trade-off between the stability required to remember how to solve earlier tasks and the flexibility required to solve new tasks, as we show empirically in an extensive evaluation.
\ No newline at end of file diff --git a/data/2021/iclr/LiftPool: Bidirectional ConvNet Pooling b/data/2021/iclr/LiftPool: Bidirectional ConvNet Pooling new file mode 100644 index 0000000000..23ec92f96a --- /dev/null +++ b/data/2021/iclr/LiftPool: Bidirectional ConvNet Pooling @@ -0,0 +1 @@ +Pooling is a critical operation in convolutional neural networks for increasing receptive fields and improving robustness to input variations. Most existing pooling operations downsample the feature maps, which is a lossy process. Moreover, they are not invertible: upsampling a downscaled feature map cannot recover the lost information in the downsampling. By adopting the philosophy of the classical Lifting Scheme from signal processing, we propose LiftPool for bidirectional pooling layers, including LiftDownPool and LiftUpPool. LiftDownPool decomposes a feature map into various downsized sub-bands, each of which contains information with different frequencies. As the pooling function in LiftDownPool is perfectly invertible, by performing LiftDownPool backward, a corresponding up-pooling layer LiftUpPool is able to generate a refined upsampled feature map using the detail sub-bands, which is useful for image-to-image translation challenges. Experiments show the proposed methods achieve better results on image classification and semantic segmentation, using various backbones. Moreover, LiftDownPool offers better robustness to input corruptions and perturbations. \ No newline at end of file diff --git a/data/2021/iclr/Linear Convergent Decentralized Optimization with Compression b/data/2021/iclr/Linear Convergent Decentralized Optimization with Compression new file mode 100644 index 0000000000..43f16ddc3e --- /dev/null +++ b/data/2021/iclr/Linear Convergent Decentralized Optimization with Compression @@ -0,0 +1 @@ +Communication compression has been extensively adopted to speed up large-scale distributed optimization.
However, most existing decentralized algorithms with compression are unsatisfactory in terms of convergence rate and stability. In this paper, we delineate two key obstacles in the algorithm design -- data heterogeneity and compression error. Our attempt to explicitly overcome these obstacles leads to a novel decentralized algorithm named LEAD. This algorithm is the first \underline{L}in\underline{EA}r convergent \underline{D}ecentralized algorithm with communication compression. Our theory describes the coupled dynamics of the inaccurate model propagation and optimization process. We also provide the first consensus error bound without assuming bounded gradients. Empirical experiments validate our theoretical analysis and show that the proposed algorithm achieves state-of-the-art computation and communication efficiency. \ No newline at end of file diff --git a/data/2021/iclr/Linear Last-iterate Convergence in Constrained Saddle-point Optimization b/data/2021/iclr/Linear Last-iterate Convergence in Constrained Saddle-point Optimization new file mode 100644 index 0000000000..79ea474d82 --- /dev/null +++ b/data/2021/iclr/Linear Last-iterate Convergence in Constrained Saddle-point Optimization @@ -0,0 +1,2 @@ +Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU) for saddle-point optimization have received growing attention due to their favorable last-iterate convergence. However, their behaviors for simple bilinear games over the probability simplex are still not fully understood -- previous analysis lacks explicit convergence rates, only applies to an exponentially small learning rate, or requires additional assumptions such as the uniqueness of the optimal solution. +In this work, we significantly expand the understanding of last-iterate convergence for OGDA and OMWU in the constrained setting. 
Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achievable with a constant learning rate, which improves the result of (Daskalakis & Panageas, 2019) under the same assumption. We then significantly extend the results to more general objectives and feasible sets for the projected OGDA algorithm, by introducing a sufficient condition under which OGDA exhibits concrete last-iterate convergence rates with a constant learning rate. We show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption. Our condition also holds for strongly-convex-strongly-concave functions, recovering the result of (Hsieh et al., 2019). Finally, we provide experimental results to further support our theory. \ No newline at end of file diff --git a/data/2021/iclr/Linear Mode Connectivity in Multitask and Continual Learning b/data/2021/iclr/Linear Mode Connectivity in Multitask and Continual Learning new file mode 100644 index 0000000000..2371b78ef2 --- /dev/null +++ b/data/2021/iclr/Linear Mode Connectivity in Multitask and Continual Learning @@ -0,0 +1 @@ +Continual (sequential) training and multitask (simultaneous) training are often attempting to solve the same overall objective: to find a solution that performs well on all considered tasks. The main difference is in the training regimes, where continual learning can only have access to one task at a time, which for neural networks typically leads to catastrophic forgetting. That is, the solution found for a subsequent task does not perform well on the previous ones anymore. However, the relationship between the different minima that the two training regimes arrive at is not well understood. What sets them apart? Is there a local structure that could explain the difference in performance achieved by the two different schemes? 
Motivated by recent work showing that different minima of the same task are typically connected by very simple curves of low error, we investigate whether multitask and continual solutions are similarly connected. We empirically find that indeed such connectivity can be reliably achieved and, more interestingly, it can be done by a linear path, conditioned on having the same initialization for both. We thoroughly analyze this observation and discuss its significance for the continual learning process. Furthermore, we exploit this finding to propose an effective algorithm that constrains the sequentially learned minima to behave as the multitask solution. We show that our method outperforms several state-of-the-art continual learning algorithms on various vision benchmarks. \ No newline at end of file diff --git a/data/2021/iclr/Local Convergence Analysis of Gradient Descent Ascent with Finite Timescale Separation b/data/2021/iclr/Local Convergence Analysis of Gradient Descent Ascent with Finite Timescale Separation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Local Search Algorithms for Rank-Constrained Convex Optimization b/data/2021/iclr/Local Search Algorithms for Rank-Constrained Convex Optimization new file mode 100644 index 0000000000..c6807ad0ca --- /dev/null +++ b/data/2021/iclr/Local Search Algorithms for Rank-Constrained Convex Optimization @@ -0,0 +1 @@ +We propose greedy and local search algorithms for rank-constrained convex optimization, namely solving $\underset{\mathrm{rank}(A)\leq r^*}{\min}\, R(A)$ given a convex function $R:\mathbb{R}^{m\times n}\rightarrow \mathbb{R}$ and a parameter $r^*$. These algorithms consist of repeating two steps: (a) adding a new rank-1 matrix to $A$ and (b) enforcing the rank constraint on $A$. We refine and improve the theoretical analysis of Shalev-Shwartz et al.
(2011), and show that if the rank-restricted condition number of $R$ is $\kappa$, a solution $A$ with rank $O(r^*\cdot \min\{\kappa \log \frac{R(\mathbf{0})-R(A^*)}{\epsilon}, \kappa^2\})$ and $R(A) \leq R(A^*) + \epsilon$ can be recovered, where $A^*$ is the optimal solution. This significantly generalizes associated results on sparse convex optimization, as well as rank-constrained convex optimization for smooth functions. We then introduce new practical variants of these algorithms that have superior runtime and recover better solutions in practice. We demonstrate the versatility of these methods on a wide range of applications involving matrix completion and robust principal component analysis. \ No newline at end of file diff --git a/data/2021/iclr/Locally Free Weight Sharing for Network Width Search b/data/2021/iclr/Locally Free Weight Sharing for Network Width Search new file mode 100644 index 0000000000..f1da214533 --- /dev/null +++ b/data/2021/iclr/Locally Free Weight Sharing for Network Width Search @@ -0,0 +1 @@ +Searching for network width is an effective way to slim deep neural networks with hardware budgets. With this aim, a one-shot supernet is usually leveraged as a performance evaluator to rank the performance w.r.t. different widths. Nevertheless, current methods mainly follow a manually fixed weight sharing pattern, which limits their ability to distinguish the performance gaps of different widths. In this paper, to better evaluate each width, we propose a locally free weight sharing strategy (CafeNet). In CafeNet, weights are more freely shared, and each width is jointly indicated by its base channels and free channels, where free channels are supposed to loCAte FrEely in a local zone to better represent each width. Besides, we propose to further reduce the search space by leveraging our introduced FLOPs-sensitive bins. As a result, our CafeNet can be trained stochastically and optimized within a min-min strategy.
Extensive experiments on the ImageNet, CIFAR-10, CelebA and MS COCO datasets verify our superiority compared with other state-of-the-art baselines. For example, our method can further boost the benchmark NAS network EfficientNet-B0 by 0.41\% via searching its width more delicately. \ No newline at end of file diff --git a/data/2021/iclr/Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning b/data/2021/iclr/Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning new file mode 100644 index 0000000000..62d4f32c8a --- /dev/null +++ b/data/2021/iclr/Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning @@ -0,0 +1 @@ +The lottery ticket hypothesis states that a highly sparsified sub-network can be trained in isolation, given the appropriate weight initialization. This paper extends that hypothesis from one-shot task learning, and demonstrates for the first time that such extremely compact and independently trainable sub-networks can also be identified in the lifelong learning scenario, which we call lifelong tickets. We show that the resulting lifelong ticket can further be leveraged to improve the performance of learning over continual tasks. However, it is highly non-trivial to conduct network pruning in the lifelong setting. Two critical roadblocks arise: i) As many tasks now arrive sequentially, finding tickets in a greedy weight pruning fashion will inevitably suffer from an intrinsic bias, in which earlier-arriving tasks have a larger impact; ii) As lifelong learning is consistently challenged by catastrophic forgetting, the compact network capacity of tickets might amplify the risk of forgetting. In view of those, we introduce two pruning options, namely top-down and bottom-up, for finding lifelong tickets.
Compared to the top-down pruning that extends vanilla (iterative) pruning over sequential tasks, we show that the bottom-up one, which can dynamically shrink and (re-)expand model capacity, effectively avoids the undesirable excessive pruning in the early stage. We additionally introduce lottery teaching that further overcomes forgetting via knowledge distillation aided by external unlabeled data. Unifying those ingredients, we demonstrate the existence of very competitive lifelong tickets, e.g., achieving 3-8% of the dense model size with even higher accuracy, compared to strong class-incremental learning baselines on CIFAR-10/CIFAR-100/Tiny-ImageNet datasets. \ No newline at end of file diff --git a/data/2021/iclr/Long Range Arena : A Benchmark for Efficient Transformers b/data/2021/iclr/Long Range Arena : A Benchmark for Efficient Transformers new file mode 100644 index 0000000000..c88a4b2ccc --- /dev/null +++ b/data/2021/iclr/Long Range Arena : A Benchmark for Efficient Transformers @@ -0,0 +1 @@ +Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers has been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios.
Our benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions, requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Long-tail learning via logit adjustment b/data/2021/iclr/Long-tail learning via logit adjustment new file mode 100644 index 0000000000..d21eba522f --- /dev/null +++ b/data/2021/iclr/Long-tail learning via logit adjustment @@ -0,0 +1 @@ +Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naive learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques revisit the classic idea of logit adjustment based on the label frequencies, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a large relative margin between logits of rare versus dominant labels. These techniques unify and generalise several recent proposals in the literature, while possessing firmer statistical grounding and empirical performance.
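The post-hoc variant of logit adjustment described in this abstract can be sketched in a few lines (a toy illustration, not the authors' code; the function name and the example numbers are hypothetical, and the loss-based variant instead applies the same offset inside the softmax during training):

```python
import math

def logit_adjust(logits, class_priors, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) from each
    class logit, enlarging the relative margin of rare classes over
    dominant ones."""
    return [z - tau * math.log(p) for z, p in zip(logits, class_priors)]

# Class 0 is dominant (90% of labels), class 1 is rare (10%).  The raw
# logits favour the head class; after adjustment the rare class wins
# the argmax, since 1.8 - log(0.1) > 2.0 - log(0.9).
logits = [2.0, 1.8]
priors = [0.9, 0.1]
adjusted = logit_adjust(logits, priors)
```

Because the offset depends only on the label frequencies, it can be applied to any trained classifier's logits without retraining.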
\ No newline at end of file diff --git a/data/2021/iclr/Long-tailed Recognition by Routing Diverse Distribution-Aware Experts b/data/2021/iclr/Long-tailed Recognition by Routing Diverse Distribution-Aware Experts new file mode 100644 index 0000000000..2134575b69 --- /dev/null +++ b/data/2021/iclr/Long-tailed Recognition by Routing Diverse Distribution-Aware Experts @@ -0,0 +1 @@ +Natural data are often long-tail distributed over semantic classes. Existing recognition methods tend to focus on tail performance gain, often at the expense of head performance loss from increased classifier variance. The low tail performance manifests itself in large inter-class confusion and high classifier variance. We aim to reduce both the bias and the variance of a long-tailed classifier by RoutIng Diverse Experts (RIDE). It has three components: 1) a shared architecture for multiple classifiers (experts); 2) a distribution-aware diversity loss that encourages more diverse decisions for classes with fewer training instances; and 3) an expert routing module that dynamically assigns more ambiguous instances to additional experts. With on-par computational complexity, RIDE significantly outperforms the state-of-the-art methods by 5% to 7% on all the benchmarks including CIFAR100-LT, ImageNet-LT and iNaturalist. RIDE is also a universal framework that can be applied to different backbone networks and integrated into various long-tailed algorithms and training mechanisms for consistent performance gains. 
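The routing idea in the RIDE abstract, assigning more ambiguous instances to additional experts, can be sketched as follows. This is a minimal stand-in, not the paper's method: RIDE learns its router, whereas here a hypothetical fixed softmax-confidence threshold decides when to consult another expert.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_experts(expert_logits, confidence=0.7):
    """Consult experts one at a time, averaging their logits, and stop
    once the averaged prediction is confident.  Easy instances use few
    experts; ambiguous ones are routed to additional experts."""
    for k in range(1, len(expert_logits) + 1):
        avg = [sum(col) / k for col in zip(*expert_logits[:k])]
        if max(softmax(avg)) >= confidence:
            break
    return avg, k  # averaged logits, number of experts consulted

# An easy instance resolves with one expert; an ambiguous one needs two.
easy_k = route_experts([[4.0, 0.0], [3.5, 0.5]])[1]
hard_k = route_experts([[0.2, 0.0], [2.0, -2.0]])[1]
```

This illustrates why the scheme keeps computational complexity on par with a single model: extra experts are only evaluated for the hard instances.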
\ No newline at end of file diff --git a/data/2021/iclr/Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search b/data/2021/iclr/Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search new file mode 100644 index 0000000000..c000182b59 --- /dev/null +++ b/data/2021/iclr/Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search @@ -0,0 +1 @@ +Designing proper loss functions for vision tasks has been a long-standing research direction to advance the capability of existing models. For object detection, the well-established classification and regression loss functions have been carefully designed by considering diverse learning challenges. Inspired by the recent progress in network architecture search, it is interesting to explore the possibility of discovering new loss function formulations via directly searching primitive operation combinations, so that the learned losses not only fit diverse object detection challenges, alleviating huge human effort, but also align better with the evaluation metric and enjoy good mathematical convergence properties. Beyond the previous auto-loss works on face recognition and image classification, our work makes the first attempt to discover new loss functions for the challenging object detection from primitive operation levels. We propose an effective convergence-simulation driven evolutionary search algorithm, called CSE-Autoloss, for speeding up the search progress by regularizing the mathematical rationality of loss candidates via convergence property verification and model optimization simulation. CSE-Autoloss searches a space that covers a wide range of possible variants of existing losses and discovers the best loss function combination within a short time (around 1.5 wall-clock days).
We conduct extensive evaluations of loss function search on popular detectors and validate the good generalization capability of searched losses across diverse architectures and datasets. Our experiments show that the best-discovered loss function combinations outperform default combinations by 1.1% and 0.8% in terms of mAP for two-stage and one-stage detectors on COCO respectively. Our searched losses are available at https://github.com/PerdonLiu/CSE-Autoloss. \ No newline at end of file diff --git a/data/2021/iclr/Lossless Compression of Structured Convolutional Models via Lifting b/data/2021/iclr/Lossless Compression of Structured Convolutional Models via Lifting new file mode 100644 index 0000000000..bfe6bc269b --- /dev/null +++ b/data/2021/iclr/Lossless Compression of Structured Convolutional Models via Lifting @@ -0,0 +1 @@ +Lifting is an efficient technique to scale up graphical models generalized to relational domains by exploiting the underlying symmetries. Concurrently, neural models are continuously expanding from grid-like tensor data into structured representations, such as various attributed graphs and relational databases. To address the irregular structure of the data, the models typically extrapolate on the idea of convolution, effectively introducing parameter sharing in their dynamically unfolded computation graphs. The computation graphs themselves then reflect the symmetries of the underlying data, similarly to the lifted graphical models. Inspired by lifting, we introduce a simple and efficient technique to detect the symmetries and compress the neural models without loss of any information. We demonstrate through experiments that such compression can lead to significant speedups of structured convolutional models, such as various Graph Neural Networks, across various tasks, such as molecule classification and knowledge-base completion.
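The symmetry-detection idea behind such lossless compression can be sketched on a toy unfolded computation graph. This is an illustrative sketch under stated assumptions (node/weight names are hypothetical, and sorting children assumes a permutation-invariant aggregation, as in common GNN message passing), not the paper's implementation:

```python
def compress(nodes):
    """Losslessly merge nodes of an unfolded computation graph.
    nodes: topologically ordered tuples (node_id, op, shared_weight_id,
    child_ids).  Nodes with the same operation, the same shared
    parameters and canonically identical inputs compute identical
    values, so they can share a single evaluation."""
    sig_to_canon = {}  # signature -> canonical node id
    canon = {}         # node id   -> canonical node id
    for nid, op, wid, children in nodes:
        sig = (op, wid, tuple(sorted(canon[c] for c in children)))
        canon[nid] = sig_to_canon.setdefault(sig, nid)
    return canon

# Two symmetric leaves "a" and "b" share weights w0, so they merge;
# consequently the aggregation nodes "u" and "v" merge as well.
nodes = [
    ("a", "embed", "w0", []),
    ("b", "embed", "w0", []),
    ("c", "embed", "w1", []),
    ("u", "agg", "w2", ["a", "c"]),
    ("v", "agg", "w2", ["b", "c"]),
]
canon = compress(nodes)
```

The merge propagates bottom-up, mirroring how lifted inference groups interchangeable variables, which is where the speedup with no change in output comes from.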
\ No newline at end of file diff --git a/data/2021/iclr/LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition b/data/2021/iclr/LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition new file mode 100644 index 0000000000..86981ed9c8 --- /dev/null +++ b/data/2021/iclr/LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition @@ -0,0 +1 @@ +Facial recognition systems are increasingly deployed by private corporations, government agencies, and contractors for consumer services and mass surveillance programs alike. These systems are typically built by scraping social media profiles for user images. Adversarial perturbations have been proposed for bypassing facial recognition systems. However, existing methods fail on full-scale systems and commercial APIs. We develop our own adversarial filter that accounts for the entire image processing pipeline and is demonstrably effective against industrial-grade pipelines that include face detection and large scale databases. Additionally, we release an easy-to-use webtool that significantly degrades the accuracy of Amazon Rekognition and the Microsoft Azure Face Recognition API, reducing the accuracy of each to below 1%. \ No newline at end of file diff --git a/data/2021/iclr/MALI: A memory efficient and reverse accurate integrator for Neural ODEs b/data/2021/iclr/MALI: A memory efficient and reverse accurate integrator for Neural ODEs new file mode 100644 index 0000000000..b623dd9f1f --- /dev/null +++ b/data/2021/iclr/MALI: A memory efficient and reverse accurate integrator for Neural ODEs @@ -0,0 +1 @@ +Neural ordinary differential equations (Neural ODEs) are a new family of deep-learning models with continuous depth. 
However, the numerical estimation of the gradient in the continuous case is not well solved: existing implementations of the adjoint method suffer from inaccuracy in reverse-time trajectory, while the naive method and the adaptive checkpoint adjoint method (ACA) have a memory cost that grows with integration time. In this project, based on the asynchronous leapfrog (ALF) solver, we propose the Memory-efficient ALF Integrator (MALI), which has a constant memory cost \textit{w.r.t.} the number of solver steps in integration, similar to the adjoint method, and guarantees accuracy in reverse-time trajectory (hence accuracy in gradient estimation). We validate MALI in various tasks: on image recognition tasks, to our knowledge, MALI is the first to enable feasible training of a Neural ODE on ImageNet and outperform a well-tuned ResNet, while existing methods fail due to either heavy memory burden or inaccuracy; for time series modeling, MALI significantly outperforms the adjoint method; and for continuous generative models, MALI achieves new state-of-the-art performance. We provide a pypi package at \url{https://jzkay12.github.io/TorchDiffEqPack/} \ No newline at end of file diff --git a/data/2021/iclr/MARS: Markov Molecular Sampling for Multi-objective Drug Discovery b/data/2021/iclr/MARS: Markov Molecular Sampling for Multi-objective Drug Discovery new file mode 100644 index 0000000000..2345b94b0b --- /dev/null +++ b/data/2021/iclr/MARS: Markov Molecular Sampling for Multi-objective Drug Discovery @@ -0,0 +1 @@ +Searching for novel molecules with desired chemical properties is crucial in drug discovery. Existing work focuses on developing neural models to generate either molecular sequences or chemical graphs. However, it remains a big challenge to find novel and diverse compounds satisfying several properties. In this paper, we propose MARS, a method for multi-objective drug molecule discovery.
MARS is based on the idea of generating the chemical candidates by iteratively editing fragments of molecular graphs. To search for high-quality candidates, it employs Markov chain Monte Carlo sampling (MCMC) on molecules with an annealing scheme and an adaptive proposal. To further improve sample efficiency, MARS uses a graph neural network (GNN) to represent and select candidate edits, where the GNN is trained on-the-fly with samples from MCMC. Experiments show that MARS achieves state-of-the-art performance in various multi-objective settings where molecular bio-activity, drug-likeness, and synthesizability are considered. Remarkably, in the most challenging setting where all four objectives are simultaneously optimized, our approach outperforms previous methods significantly in comprehensive evaluations. The code is available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/MELR: Meta-Learning via Modeling Episode-Level Relationships for Few-Shot Learning b/data/2021/iclr/MELR: Meta-Learning via Modeling Episode-Level Relationships for Few-Shot Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space b/data/2021/iclr/MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space new file mode 100644 index 0000000000..1c71e08e3a --- /dev/null +++ b/data/2021/iclr/MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space @@ -0,0 +1 @@ +Data augmentation is an efficient way to expand a training dataset by creating additional artificial data. While data augmentation is found to be effective in improving the generalization capabilities of models for various machine learning tasks, the underlying augmentation methods are usually manually designed and carefully evaluated for each data modality separately. These include image processing functions for image data and word-replacing rules for text data. 
In this work, we propose an automated data augmentation approach called MODALS (Modality-agnostic Automated Data Augmentation in the Latent Space) to augment data for any modality in a generic way. MODALS exploits automated data augmentation to fine-tune four universal data transformation operations in the latent space to adapt the transform to data of different modalities. Through comprehensive experiments, we demonstrate the effectiveness of MODALS on multiple datasets for text, tabular, time-series and image modalities. \ No newline at end of file diff --git a/data/2021/iclr/MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training b/data/2021/iclr/MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training new file mode 100644 index 0000000000..58b7180543 --- /dev/null +++ b/data/2021/iclr/MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training @@ -0,0 +1 @@ +Recent advances by practitioners in the deep learning community have breathed new life into Locality Sensitive Hashing (LSH), using it to reduce memory and time bottlenecks in neural network (NN) training. However, while LSH has sublinear guarantees for approximate near-neighbor search in theory, it is known to have inefficient query time in practice due to its use of random hash functions. Moreover, when model parameters are changing, LSH suffers from update overhead. This work is motivated by an observation that model parameters evolve slowly, such that the changes do not always require an LSH update to maintain performance. This phenomenon points to the potential for a reduction in update time and allows for a modified learnable version of data-dependent LSH to improve query time at a low cost. We use the above insights to build MONGOOSE, an end-to-end LSH framework for efficient NN training.
In particular, MONGOOSE is equipped with a scheduling algorithm to adaptively perform LSH updates with provable guarantees and learnable hash functions to improve query efficiency. Empirically, we validate MONGOOSE on large-scale deep learning models for recommendation systems and language modeling. We find that it achieves up to 8% better accuracy compared to previous LSH approaches, with 6.5× speed-up and 6× reduction in memory usage. \ No newline at end of file diff --git a/data/2021/iclr/Mapping the Timescale Organization of Neural Language Models b/data/2021/iclr/Mapping the Timescale Organization of Neural Language Models new file mode 100644 index 0000000000..bf7219a0ef --- /dev/null +++ b/data/2021/iclr/Mapping the Timescale Organization of Neural Language Models @@ -0,0 +1 @@ +In the human brain, sequences of language input are processed within a distributed and hierarchical architecture, in which higher stages of processing encode contextual information over longer timescales. In contrast, in recurrent neural networks which perform natural language processing, we know little about how the multiple timescales of contextual information are functionally organized. Therefore, we applied tools developed in neuroscience to map the "processing timescales" of individual units within a word-level LSTM language model. This timescale-mapping method assigned long timescales to units previously found to track long-range syntactic dependencies, and revealed a new cluster of previously unreported long-timescale units. Next, we explored the functional role of units by examining the relationship between their processing timescales and network connectivity. 
We identified two classes of long-timescale units: "Controller" units composed a densely interconnected subnetwork and strongly projected to the forget and input gates of the rest of the network, while "Integrator" units showed the longest timescales in the network, and expressed projection profiles closer to the mean projection profile. Ablating integrator and controller units affected model performance at different positions in a sentence, suggesting distinctive functions of these two sets of units. Finally, we tested the generalization of these results to a character-level LSTM model. In summary, we demonstrated a model-free technique for mapping the timescale organization in neural network models, and we applied this method to reveal the timescale and functional organization of LSTM language models. \ No newline at end of file diff --git a/data/2021/iclr/Mathematical Reasoning via Self-supervised Skip-tree Training b/data/2021/iclr/Mathematical Reasoning via Self-supervised Skip-tree Training new file mode 100644 index 0000000000..22eda5b4a5 --- /dev/null +++ b/data/2021/iclr/Mathematical Reasoning via Self-supervised Skip-tree Training @@ -0,0 +1 @@ +We examine whether self-supervised language modeling applied to mathematical formulas enables logical reasoning. We suggest several logical reasoning tasks that can be used to evaluate language models trained on formal mathematical statements, such as type inference, suggesting missing assumptions and completing equalities. To train language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks. We also analyze the models' ability to formulate new conjectures by measuring how often the predictions are provable and useful in other proofs.
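The skip-tree idea, masking a whole subtree of a formula and predicting it, can be sketched on toy s-expressions. This is a hypothetical illustration (the paper operates on formal mathematical terms, not nested Python tuples, and all names here are invented):

```python
import random

def subtrees(tree, path=()):
    """Enumerate (path, subtree) pairs of a formula given as nested
    tuples, e.g. ("add", ("mul", "x", "y"), "z")."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def skip_tree_example(tree, rng):
    """Build one skip-tree training pair: mask a random proper subtree;
    the masked formula is the model input, the removed subtree is the
    prediction target."""
    path, target = rng.choice([pt for pt in subtrees(tree) if pt[0] != ()])
    def mask(t, p):
        if not p:
            return "<MASK>"
        i = p[0]
        return t[:i] + (mask(t[i], p[1:]),) + t[i + 1:]
    return mask(tree, path), target

formula = ("add", ("mul", "x", "y"), "z")
masked, target = skip_tree_example(formula, random.Random(0))
```

Unlike skip-sequence masking over flat token spans, the masked region here is always a syntactically complete subterm, which is the property the abstract credits for the stronger reasoning performance.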
\ No newline at end of file diff --git a/data/2021/iclr/Measuring Massive Multitask Language Understanding b/data/2021/iclr/Measuring Massive Multitask Language Understanding new file mode 100644 index 0000000000..0585f8cfd5 --- /dev/null +++ b/data/2021/iclr/Measuring Massive Multitask Language Understanding @@ -0,0 +1 @@ +We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings. \ No newline at end of file diff --git a/data/2021/iclr/Memory Optimization for Deep Networks b/data/2021/iclr/Memory Optimization for Deep Networks new file mode 100644 index 0000000000..a551b8ecb6 --- /dev/null +++ b/data/2021/iclr/Memory Optimization for Deep Networks @@ -0,0 +1 @@ +Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor computation in top-of-the-line GPUs increased by 32x over the last five years, the total available memory only grew by 2.5x. This prevents researchers from exploring larger architectures, as training large networks requires more memory for storing intermediate outputs. 
In this paper, we present MONeT, an automatic framework that minimizes both the memory footprint and computational overhead of deep networks. MONeT jointly optimizes the checkpointing schedule and the implementation of various operators. MONeT is able to outperform all prior hand-tuned operations as well as automated checkpointing. MONeT reduces the overall memory requirement by 3x for various PyTorch models, with a 9-16% overhead in computation. For the same computation cost, MONeT requires 1.2-1.8x less memory than current state-of-the-art automated checkpointing frameworks. Our code is available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Meta Back-Translation b/data/2021/iclr/Meta Back-Translation new file mode 100644 index 0000000000..8d52e9ff88 --- /dev/null +++ b/data/2021/iclr/Meta Back-Translation @@ -0,0 +1 @@ +Back-translation is an effective strategy to improve the performance of Neural Machine Translation (NMT) by generating pseudo-parallel data. However, several recent works have found that better translation quality of the pseudo-parallel data does not necessarily lead to better final translation models, while lower-quality but more diverse data often yields stronger results. In this paper, we propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model. Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set. In our evaluations on both the standard datasets WMT En-De'14 and WMT En-Fr'14, as well as in a multilingual translation setting, our method leads to significant improvements over strong baselines. Our code will be made available.
\ No newline at end of file diff --git a/data/2021/iclr/Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning b/data/2021/iclr/Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Meta-Learning of Structured Task Distributions in Humans and Machines b/data/2021/iclr/Meta-Learning of Structured Task Distributions in Humans and Machines new file mode 100644 index 0000000000..b5c73fafa4 --- /dev/null +++ b/data/2021/iclr/Meta-Learning of Structured Task Distributions in Humans and Machines @@ -0,0 +1 @@ +In recent years, meta-learning, in which a model is trained on a family of tasks (i.e. a task distribution), has emerged as an approach to training neural networks to perform tasks that were previously assumed to require structured representations, making strides toward closing the gap between humans and machines. However, we argue that evaluating meta-learning remains a challenge, and can miss whether meta-learning actually uses the structure embedded within the tasks. These meta-learners might therefore still be significantly different from human learners. To demonstrate this difference, we first define a new meta-reinforcement learning task in which a structured task distribution is generated using a compositional grammar. We then introduce a novel approach to constructing a "null task distribution" with the same statistical complexity as this structured task distribution but without the explicit rule-based structure used to generate the structured task. We train a standard meta-learning agent, a recurrent network trained with model-free reinforcement learning, and compare it with human performance across the two task distributions. We find a double dissociation in which humans do better in the structured task distribution whereas agents do better in the null task distribution -- despite comparable statistical complexity.
This work highlights that multiple strategies can achieve reasonable meta-test performance, and that careful construction of control task distributions is a valuable way to understand which strategies meta-learners acquire, and how they might differ from humans. \ No newline at end of file diff --git a/data/2021/iclr/Meta-Learning with Neural Tangent Kernels b/data/2021/iclr/Meta-Learning with Neural Tangent Kernels new file mode 100644 index 0000000000..c7f7048a6e --- /dev/null +++ b/data/2021/iclr/Meta-Learning with Neural Tangent Kernels @@ -0,0 +1 @@ +Model Agnostic Meta-Learning (MAML) has emerged as a standard framework for meta-learning, where a meta-model is learned with the ability to adapt quickly to new tasks. However, as a double-looped optimization problem, MAML needs to differentiate through the whole inner-loop optimization path for every outer-loop training step, which may lead to both computational inefficiency and sub-optimal solutions. In this paper, we generalize MAML to allow meta-learning to be defined in function spaces, and propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK). Within this paradigm, we introduce two meta-learning algorithms in the RKHS, which no longer need a sub-optimal iterative inner-loop adaptation as in the MAML framework. We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on the NTK theory. Extensive experimental studies demonstrate the advantages of our paradigm in both efficiency and quality of solutions compared to related meta-learning algorithms. Another interesting feature of our proposed methods is that they are more robust to adversarial attacks and out-of-distribution adaptation than popular baselines, as demonstrated in our experiments.
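The "solving the adaptation analytically" step in the NTK abstract admits a standard closed form in an RKHS. As an illustration only (the abstract does not state the paper's exact rule), kernel ridge regression with a kernel $k$, here the NTK, adapts the predictor on support data $(X, y)$ via:

```latex
f(x) = k(x, X)\,\bigl(K + \lambda I\bigr)^{-1} y,
\qquad K_{ij} = k(x_i, x_j),
```

where $\lambda$ is a regularization weight. No iterative inner loop is needed, because the adapted function is available in closed form; this is the generic mechanism by which NTK theory can replace MAML's inner-loop gradient steps.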
\ No newline at end of file diff --git a/data/2021/iclr/Meta-learning Symmetries by Reparameterization b/data/2021/iclr/Meta-learning Symmetries by Reparameterization new file mode 100644 index 0000000000..9f03d80f6f --- /dev/null +++ b/data/2021/iclr/Meta-learning Symmetries by Reparameterization @@ -0,0 +1 @@ +Many successful deep learning architectures are equivariant to certain transformations in order to conserve parameters and improve generalization: most famously, convolution layers are equivariant to shifts of the input. This approach only works when practitioners know the symmetries of the task a priori and can manually construct an architecture with the corresponding equivariances. Our goal is a general approach for learning equivariances from data, without needing prior knowledge of a task's symmetries or custom task-specific architectures. We present a method for learning and encoding equivariances into networks by learning corresponding parameter sharing patterns from data. Our method can provably encode equivariance-inducing parameter sharing for any finite group of symmetry transformations, and we find experimentally that it can automatically learn a variety of equivariances from symmetries in data. We provide our experiment code and pre-trained models at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Meta-learning with negative learning rates b/data/2021/iclr/Meta-learning with negative learning rates new file mode 100644 index 0000000000..5877e732f2 --- /dev/null +++ b/data/2021/iclr/Meta-learning with negative learning rates @@ -0,0 +1 @@ +Deep learning models require a large amount of data to perform well. When data is scarce for a target task, we can transfer the knowledge gained by training on similar tasks to quickly learn the target. A successful approach is meta-learning, or learning to learn a distribution of tasks, where learning to learn is represented by an outer loop and learning itself by an inner loop of gradient descent.
However, a number of recent empirical studies argue that the inner loop is unnecessary and simpler models work equally well or even better. We study the performance of MAML as a function of the learning rate of the inner loop, where zero learning rate implies that there is no inner loop. Using random matrix theory and exact solutions of linear models, we calculate an algebraic expression for the test loss of MAML applied to mixed linear regression and nonlinear regression with overparameterized models. Surprisingly, while the optimal learning rate for adaptation is positive, we find that the optimal learning rate for training is always negative, a setting that has never been considered before. Therefore, not only does the performance increase by decreasing the learning rate to zero, as suggested by recent work, but it can be increased even further by decreasing the learning rate to negative values. These results help clarify under what circumstances meta-learning performs best. \ No newline at end of file diff --git a/data/2021/iclr/MetaNorm: Learning to Normalize Few-Shot Batches Across Domains b/data/2021/iclr/MetaNorm: Learning to Normalize Few-Shot Batches Across Domains new file mode 100644 index 0000000000..0547b6cb13 --- /dev/null +++ b/data/2021/iclr/MetaNorm: Learning to Normalize Few-Shot Batches Across Domains @@ -0,0 +1 @@ +Batch normalization plays a crucial role when training deep neural networks. However, batch statistics become unstable with small batch sizes and are unreliable in the presence of distribution shifts. We propose MetaNorm, a simple yet effective meta-learning normalization. It tackles the aforementioned issues in a unified way by leveraging the meta-learning setting and learns to infer adaptive statistics for batch normalization. MetaNorm is generic, flexible and model-agnostic, making it a simple plug-and-play module that is seamlessly embedded into existing meta-learning approaches.
It can be efficiently implemented by lightweight hyper-networks with low computational cost. We verify its effectiveness by extensive evaluation on representative tasks suffering from the small batch and domain shift problems: few-shot learning and domain generalization. We further introduce an even more challenging setting: few-shot domain generalization. Results demonstrate that MetaNorm consistently achieves better, or at least competitive, accuracy compared to existing batch normalization methods. \ No newline at end of file diff --git a/data/2021/iclr/MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering b/data/2021/iclr/MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering new file mode 100644 index 0000000000..f64d57560b --- /dev/null +++ b/data/2021/iclr/MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering @@ -0,0 +1 @@ +We present Mixture of Contrastive Experts (MiCE), a unified probabilistic clustering framework that simultaneously exploits the discriminative representations learned by contrastive learning and the semantic structures captured by a latent mixture model. Motivated by the mixture of experts, MiCE employs a gating function to partition an unlabeled dataset into subsets according to the latent semantics and multiple experts to discriminate distinct subsets of instances assigned to them in a contrastive learning manner. To solve the nontrivial inference and learning problems caused by the latent variables, we further develop a scalable variant of the Expectation-Maximization (EM) algorithm for MiCE and provide proof of the convergence. Empirically, we evaluate the clustering performance of MiCE on four widely adopted natural image datasets. MiCE achieves significantly better results than various previous methods and a strong contrastive learning baseline. 
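The gating step the MiCE abstract describes, partitioning instances among experts according to latent semantics, can be sketched as a softmax over instance-prototype similarities (the E-step responsibilities in an EM-style scheme). This is a generic illustration; the function names, shapes, and the temperature value are assumptions, not details from the paper.

```python
import numpy as np

def gating_posterior(z, expert_means, tau=0.5):
    """Soft assignment of one instance embedding to K experts.

    z: (C,) L2-normalized instance embedding.
    expert_means: (K, C) one mean direction per expert.
    tau: assumed temperature controlling assignment sharpness.
    Returns a length-K probability vector.
    """
    logits = expert_means @ z / tau
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# toy usage: 3 experts, an instance aligned with expert 1
means = np.eye(3)
z = np.array([0.1, 0.9, 0.1])
z /= np.linalg.norm(z)
resp = gating_posterior(z, means)
```

In a full EM scheme these responsibilities would weight each expert's contrastive loss in the M-step.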
\ No newline at end of file diff --git a/data/2021/iclr/Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models b/data/2021/iclr/Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models new file mode 100644 index 0000000000..ac9643bf50 --- /dev/null +++ b/data/2021/iclr/Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models @@ -0,0 +1 @@ +Amortised inference enables scalable learning of sequential latent-variable models (LVMs) with the evidence lower bound (ELBO). In this setting, variational posteriors are often only partially conditioned. While the true posteriors depend, e.g., on the entire sequence of observations, approximate posteriors are only informed by past observations. This mimics the Bayesian filter -- a mixture of smoothing posteriors. Yet, we show that the ELBO objective forces partially-conditioned amortised posteriors to approximate products of smoothing posteriors instead. Consequently, the learned generative model is compromised. We demonstrate these theoretical findings in three scenarios: traffic flow, handwritten digits, and aerial vehicle dynamics. Using fully-conditioned approximate posteriors, performance improves in terms of generative modelling and multi-step prediction. \ No newline at end of file diff --git a/data/2021/iclr/Mind the Pad - CNNs Can Develop Blind Spots b/data/2021/iclr/Mind the Pad - CNNs Can Develop Blind Spots new file mode 100644 index 0000000000..e2854513a9 --- /dev/null +++ b/data/2021/iclr/Mind the Pad - CNNs Can Develop Blind Spots @@ -0,0 +1 @@ +We show how feature maps in convolutional networks are susceptible to spatial bias. Due to a combination of architectural choices, the activation at certain locations is systematically elevated or weakened. The major source of this bias is the padding mechanism. 
Depending on several aspects of convolution arithmetic, this mechanism can apply the padding unevenly, leading to asymmetries in the learned weights. We demonstrate how such bias can be detrimental to certain tasks such as small object detection: the activation is suppressed if the stimulus lies in the impacted area, leading to blind spots and misdetection. We propose solutions to mitigate spatial bias and demonstrate how they can improve model accuracy. \ No newline at end of file diff --git a/data/2021/iclr/Minimum Width for Universal Approximation b/data/2021/iclr/Minimum Width for Universal Approximation new file mode 100644 index 0000000000..144b7ffb28 --- /dev/null +++ b/data/2021/iclr/Minimum Width for Universal Approximation @@ -0,0 +1 @@ +The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. However, the critical width enabling the universal approximation has not been exactly characterized in terms of the input dimension $d_x$ and the output dimension $d_y$. In this work, we provide the first definitive result in this direction for networks using the ReLU activation functions: The minimum width required for the universal approximation of the $L^p$ functions is exactly $\max\{d_x+1,d_y\}$. We also prove that the same conclusion does not hold for the uniform approximation with ReLU, but does hold with an additional threshold activation function. Our proof technique can be also used to derive a tighter upper bound on the minimum width required for the universal approximation using networks with general activation functions. 
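The minimum-width result above can be restated compactly, with a worked instance that follows directly from the stated theorem:

```latex
w_{\min}(d_x, d_y) = \max\{d_x + 1,\; d_y\}
```

For example, for $L^p$ approximation of functions $f: \mathbb{R}^2 \to \mathbb{R}$ ($d_x = 2$, $d_y = 1$), the minimum width is $\max\{3, 1\} = 3$: width-3 ReLU networks are universal approximators in this setting, while width-2 networks are not.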
\ No newline at end of file diff --git a/data/2021/iclr/Mirostat: a Neural Text decoding Algorithm that directly controls perplexity b/data/2021/iclr/Mirostat: a Neural Text decoding Algorithm that directly controls perplexity new file mode 100644 index 0000000000..d64b974c7a --- /dev/null +++ b/data/2021/iclr/Mirostat: a Neural Text decoding Algorithm that directly controls perplexity @@ -0,0 +1 @@ +Neural text decoding algorithms strongly influence the quality of texts generated using language models, but popular algorithms like top-k, top-p (nucleus), and temperature-based sampling may yield texts that have objectionable repetition or incoherence. Although these methods generate high-quality text after ad hoc parameter tuning that depends on the language model and the length of generated text, not much is known about the control they provide over the statistics of the output. This is important, however, since recent reports show that humans prefer text whose perplexity is neither too high nor too low, and since we experimentally show that cross-entropy (log of perplexity) has a near-linear relation with repetition. First we provide a theoretical analysis of perplexity in top-k, top-p, and temperature sampling, under Zipfian statistics. Then, we use this analysis to design a feedback-based adaptive top-k text decoding algorithm called mirostat that generates text (of any length) with a predetermined target value of perplexity without any tuning. Experiments show that for low values of k and p, perplexity drops significantly with generated text length and leads to excessive repetitions (the boredom trap). Conversely, for large values of k and p, perplexity increases with generated text length and leads to incoherence (the confusion trap). Mirostat avoids both traps. Specifically, we show that setting the target perplexity value beyond a threshold yields negligible sentence-level repetitions.
Experiments with human raters for fluency, coherence, and quality further verify our findings. \ No newline at end of file diff --git a/data/2021/iclr/MixKD: Towards Efficient Distillation of Large-scale Language Models b/data/2021/iclr/MixKD: Towards Efficient Distillation of Large-scale Language Models new file mode 100644 index 0000000000..c389eb96c5 --- /dev/null +++ b/data/2021/iclr/MixKD: Towards Efficient Distillation of Large-scale Language Models @@ -0,0 +1 @@ +Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on the linear interpolation of example pairs as well. We prove, from a theoretical perspective, that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. 
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach. \ No newline at end of file diff --git a/data/2021/iclr/Mixed-Features Vectors and Subspace Splitting b/data/2021/iclr/Mixed-Features Vectors and Subspace Splitting new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/MoPro: Webly Supervised Learning with Momentum Prototypes b/data/2021/iclr/MoPro: Webly Supervised Learning with Momentum Prototypes new file mode 100644 index 0000000000..c1bb3ec35f --- /dev/null +++ b/data/2021/iclr/MoPro: Webly Supervised Learning with Momentum Prototypes @@ -0,0 +1 @@ +We propose a webly-supervised representation learning method that does not suffer from the annotation unscalability of supervised learning, nor the computation unscalability of self-supervised learning. Most existing works on webly-supervised representation learning adopt a vanilla supervised learning method without accounting for the prevalent noise in the training data, whereas most prior methods in learning with label noise are less effective for real-world large-scale noisy data. We propose momentum prototypes (MoPro), a simple contrastive learning method that achieves online label noise correction, out-of-distribution sample removal, and representation learning. MoPro achieves state-of-the-art performance on WebVision, a weakly-labeled noisy dataset. MoPro also shows superior performance when the pretrained model is transferred to down-stream image classification and detection tasks. It outperforms the ImageNet supervised pretrained model by +10.5 on 1-shot classification on VOC, and outperforms the best self-supervised pretrained model by +17.3 when finetuned on 1\% of ImageNet labeled samples. Furthermore, MoPro is more robust to distribution shifts. Code and pretrained models are available at this https URL. 
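The "momentum prototypes" named in the MoPro abstract suggest an exponential-moving-average update of per-class prototype vectors, with noisy labels corrected by reassigning samples to the nearest prototype. The sketch below illustrates that generic mechanism; the function names, the momentum value, and the toy data are assumptions, not details from the paper.

```python
import numpy as np

def update_prototype(proto, z, m=0.9):
    """EMA update of a class prototype toward an embedding z.

    proto, z: L2-normalized feature vectors; m is an assumed
    momentum coefficient. Returns the renormalized prototype.
    """
    proto = m * proto + (1.0 - m) * z
    return proto / np.linalg.norm(proto)   # keep prototype on the unit sphere

def pseudo_label(z, prototypes):
    """Correct a possibly noisy label: pick the prototype most
    similar (cosine) to the embedding."""
    sims = prototypes @ z
    return int(np.argmax(sims))

# toy usage: two class prototypes, one embedding close to class 0
protos = np.eye(2)
z = np.array([0.9, 0.1])
z /= np.linalg.norm(z)
protos[0] = update_prototype(protos[0], z, m=0.9)
label = pseudo_label(z, protos)
```

The EMA keeps prototypes stable against individual noisy samples, which is what makes prototype-based label correction usable on weakly-labeled web data.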
\ No newline at end of file diff --git a/data/2021/iclr/MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond b/data/2021/iclr/MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond new file mode 100644 index 0000000000..1ef6c73089 --- /dev/null +++ b/data/2021/iclr/MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond @@ -0,0 +1 @@ +This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g., a question or a category). Unlike most prior works that use explicit, symbolic models which can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the design of residual bottlenecks, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie reasons implicitly and holistically and only needs a single forward pass during inference. Nevertheless, MoVie showcases strong performance for counting: 1) advancing the state-of-the-art on counting-specific VQA tasks while being more efficient; 2) outperforming prior art on difficult benchmarks like COCO for common object counting; 3) it helped us secure first place in the 2020 VQA challenge when integrated as a module for ‘number’-related questions in generic VQA models. Finally, we show evidence that modulated convolutions such as MoVie can serve as a general mechanism for reasoning tasks beyond counting.
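"Modulated convolutions that fuse the query and the image locally" generally means the query embedding produces per-channel scale and shift parameters applied to a convolutional feature map (FiLM-style conditioning). The sketch below shows that generic pattern; all names, shapes, and weight scales are assumptions rather than the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def modulate(features, query_emb, w_gamma, w_beta):
    """FiLM-style modulation of a conv feature map by a query.

    features: (H, W, C) feature map from a conv backbone.
    query_emb: (D,) query embedding.
    w_gamma, w_beta: (D, C) assumed projection weights producing
    per-channel scale and shift.
    """
    gamma = query_emb @ w_gamma          # (C,) per-channel scale offset
    beta = query_emb @ w_beta            # (C,) per-channel shift
    # broadcast over spatial positions: the query modulates every location
    return features * (1.0 + gamma) + beta

H, W, C, D = 4, 4, 8, 16
feats = rng.standard_normal((H, W, C))
q = rng.standard_normal(D)
out = modulate(feats, q,
               rng.standard_normal((D, C)) * 0.01,
               rng.standard_normal((D, C)) * 0.01)
```

Because the same modulation is applied at every spatial location, the fusion stays local and requires only one forward pass, matching the efficiency claim in the abstract.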
\ No newline at end of file diff --git a/data/2021/iclr/Model Patching: Closing the Subgroup Performance Gap with Data Augmentation b/data/2021/iclr/Model Patching: Closing the Subgroup Performance Gap with Data Augmentation new file mode 100644 index 0000000000..43e2bee46c --- /dev/null +++ b/data/2021/iclr/Model Patching: Closing the Subgroup Performance Gap with Data Augmentation @@ -0,0 +1 @@ +Classifiers in machine learning are often brittle when deployed. Particularly concerning are models with inconsistent performance on specific subgroups of a class, e.g., exhibiting disparities in skin cancer classification in the presence or absence of a spurious bandage. To mitigate these performance differences, we introduce model patching, a two-stage framework for improving robustness that encourages the model to be invariant to subgroup differences, and to focus on class information shared by subgroups. Model patching first models subgroup features within a class and learns semantic transformations between them, and then trains a classifier with data augmentations that deliberately manipulate subgroup features. We instantiate model patching with CAMEL, which (1) uses a CycleGAN to learn the intra-class, inter-subgroup augmentations, and (2) balances subgroup performance using a theoretically-motivated subgroup consistency regularizer, accompanied by a new robust objective. We demonstrate CAMEL's effectiveness on 3 benchmark datasets, with reductions in robust error of up to 33% relative to the best baseline. Lastly, CAMEL successfully patches a model that fails due to spurious features on a real-world skin cancer dataset. \ No newline at end of file diff --git a/data/2021/iclr/Model-Based Offline Planning b/data/2021/iclr/Model-Based Offline Planning new file mode 100644 index 0000000000..4e9fafc1ad --- /dev/null +++ b/data/2021/iclr/Model-Based Offline Planning @@ -0,0 +1 @@ +Offline learning is a key part of making reinforcement learning (RL) usable in real systems.
Offline RL looks at scenarios where there is data from a system's operation, but no direct access to the system when learning a policy. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data and with planning on top of learnt models of the data. Model-free policies tend to be more performant, but are more opaque, harder to command externally, and less easy to integrate into larger systems. We propose an offline learner that generates a model that can be used to control the system directly through planning. This allows us to have easily controllable policies directly from data, without ever interacting with the system. We show the performance of our algorithm, Model-Based Offline Planning (MBOP), on a series of robotics-inspired tasks, and demonstrate its ability to leverage planning to respect environmental constraints. We are able to find near-optimal policies for certain simulated systems from as little as 50 seconds of real-time system interaction, and create zero-shot goal-conditioned policies on a series of environments. \ No newline at end of file diff --git a/data/2021/iclr/Model-Based Visual Planning with Self-Supervised Functional Distances b/data/2021/iclr/Model-Based Visual Planning with Self-Supervised Functional Distances new file mode 100644 index 0000000000..908ae60b50 --- /dev/null +++ b/data/2021/iclr/Model-Based Visual Planning with Self-Supervised Functional Distances @@ -0,0 +1 @@ +A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when hand-engineered reward functions are not available.
Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity between observations and goal states. We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model and a dynamical distance function learned using model-free reinforcement learning. Our approach learns entirely using offline, unlabeled data, making it practical to scale to large and diverse datasets. In our experiments, we find that our method can successfully learn models that perform a variety of tasks at test-time, moving objects amid distractors with a simulated robotic arm and even learning to open and close a drawer using a real-world robot. In comparisons, we find that this approach substantially outperforms both model-free and model-based prior methods. Videos and visualizations are available here: http://sites.google.com/berkeley.edu/mbold. \ No newline at end of file diff --git a/data/2021/iclr/Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? b/data/2021/iclr/Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? new file mode 100644 index 0000000000..c7abaf9590 --- /dev/null +++ b/data/2021/iclr/Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? @@ -0,0 +1 @@ +We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models using a fixed (random shooting) control agent. We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin.
When multimodality is not required, our surprising finding is that we do not need probabilistic posterior predictives: deterministic models are on par, in fact they consistently (although non-significantly) outperform their probabilistic counterparts. We also found that heteroscedasticity at training time, perhaps acting as a regularizer, improves predictions at longer horizons. On the methodological side, we design metrics and an experimental protocol which can be used to evaluate the various models, predicting their asymptotic performance when using them on the control problem. Using this framework, we improve the state-of-the-art sample complexity of MBRL on Acrobot by a factor of two to four, using an aggressive training schedule which is outside of the hyperparameter interval usually considered. \ No newline at end of file diff --git a/data/2021/iclr/Modeling the Second Player in Distributionally Robust Optimization b/data/2021/iclr/Modeling the Second Player in Distributionally Robust Optimization new file mode 100644 index 0000000000..521d60ed52 --- /dev/null +++ b/data/2021/iclr/Modeling the Second Player in Distributionally Robust Optimization @@ -0,0 +1 @@ +Distributionally robust optimization (DRO) provides a framework for training machine learning models that are able to perform well on a collection of related data distributions (the "uncertainty set"). This is done by solving a min-max game: the model is trained to minimize its maximum expected loss among all distributions in the uncertainty set. While careful design of the uncertainty set is critical to the success of the DRO procedure, previous work has been limited to relatively simple alternatives that keep the min-max optimization problem exactly tractable, such as $f$-divergence balls. In this paper, we argue instead for the use of neural generative models to characterize the worst-case distribution, allowing for more flexible and problem-specific selection of the uncertainty set.
However, while simple conceptually, this approach poses a number of implementation and optimization challenges. To circumvent these issues, we propose a relaxation of the KL-constrained inner maximization objective that makes the DRO problem more amenable to gradient-based optimization of large scale generative models, and develop model selection heuristics to guide hyper-parameter search. On both toy settings and realistic NLP tasks, we find that the proposed approach yields models that are more robust than comparable baselines. \ No newline at end of file diff --git a/data/2021/iclr/Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System b/data/2021/iclr/Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System new file mode 100644 index 0000000000..666c25581d --- /dev/null +++ b/data/2021/iclr/Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System @@ -0,0 +1 @@ +Designing task-oriented dialogue systems is a challenging research topic, since it needs not only to generate utterances fulfilling user requests but also to guarantee the comprehensibility. Many previous works trained end-to-end (E2E) models with supervised learning (SL), however, the bias in annotated system utterances remains as a bottleneck. Reinforcement learning (RL) deals with the problem through using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing works with RL showed that the comprehensibility of generated system utterances could be corrupted when improving the performance on fulfilling user requests. 
In our work, we (1) propose modelling the hierarchical structure between dialogue policy and natural language generator (NLG) with the option framework, called HDNO, where the latent dialogue act is applied to avoid designing specific dialogue act representations; (2) train HDNO via hierarchical reinforcement learning (HRL), as well as suggest the asynchronous updates between dialogue policy and NLG during training to theoretically guarantee their convergence to a local maximizer; and (3) propose using a discriminator modelled with language models as an additional reward to further improve the comprehensibility. We test HDNO on MultiWoz 2.0 and MultiWoz 2.1, the datasets on multi-domain dialogues, in comparison with a word-level E2E model trained with RL, LaRL and HDSA, showing improvements in the performance evaluated by automatic evaluation metrics and human evaluation. Finally, we demonstrate the semantic meanings of latent dialogue acts to show their interpretability. \ No newline at end of file diff --git a/data/2021/iclr/Molecule Optimization by Explainable Evolution b/data/2021/iclr/Molecule Optimization by Explainable Evolution new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Monotonic Kronecker-Factored Lattice b/data/2021/iclr/Monotonic Kronecker-Factored Lattice new file mode 100644 index 0000000000..37e74dcf7a --- /dev/null +++ b/data/2021/iclr/Monotonic Kronecker-Factored Lattice @@ -0,0 +1 @@ +It is computationally challenging to learn flexible monotonic functions that guarantee model behavior and provide interpretability beyond a few input features, and at a time when minimizing resource use is increasingly important, we must be able to learn such models that are still efficient. In this paper we show how to effectively and efficiently learn such functions using Kronecker-Factored Lattice (KFL), an efficient reparameterization of flexible monotonic lattice regression via Kronecker product.
Both computational and storage costs scale linearly in the number of input features, which is a significant improvement over existing methods that grow exponentially. We also show that we can still properly enforce monotonicity and other shape constraints. The KFL function class consists of products of piecewise-linear functions, and the size of the function class can be further increased through ensembling. We prove that the function class of an ensemble of M base KFL models strictly increases as M increases up to a certain threshold. Beyond this threshold, every multilinear interpolated lattice function can be expressed. Our experimental results demonstrate that KFL trains faster with fewer parameters while still achieving accuracy and evaluation speeds comparable to or better than the baseline methods and preserving monotonicity guarantees on the learned model. \ No newline at end of file diff --git a/data/2021/iclr/Monte-Carlo Planning and Learning with Language Action Value Estimates b/data/2021/iclr/Monte-Carlo Planning and Learning with Language Action Value Estimates new file mode 100644 index 0000000000..ddbd6b6489 --- /dev/null +++ b/data/2021/iclr/Monte-Carlo Planning and Learning with Language Action Value Estimates @@ -0,0 +1 @@ +Interactive Fiction (IF) games provide a useful testbed for language-based reinforcement learning agents, posing significant challenges of natural language understanding, commonsense reasoning, and non-myopic planning in the combinatorial search space. Agents using standard planning algorithms struggle to play IF games due to the massive search space of language actions. Thus, language-grounded planning is a key ability of such agents, since inferring the consequences of language actions based on semantic understanding can drastically improve search. In this paper, we introduce Monte-Carlo planning with Language Action Value Estimates (MC-LAVE), which combines Monte-Carlo tree search with language-driven exploration.
MC-LAVE concentrates search effort on semantically promising language actions using locally optimistic language value estimates, yielding a significant reduction in the effective search space of language actions. We then present a reinforcement learning approach built on MC-LAVE, which alternates between MC-LAVE planning and supervised learning of the self-generated language actions. In the experiments, we demonstrate that our method achieves new high scores in various IF games. \ No newline at end of file diff --git a/data/2021/iclr/More or Less: When and How to Build Convolutional Neural Network Ensembles b/data/2021/iclr/More or Less: When and How to Build Convolutional Neural Network Ensembles new file mode 100644 index 0000000000..b7cab75542 --- /dev/null +++ b/data/2021/iclr/More or Less: When and How to Build Convolutional Neural Network Ensembles @@ -0,0 +1 @@ +provide \ No newline at end of file diff --git a/data/2021/iclr/Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning b/data/2021/iclr/Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning new file mode 100644 index 0000000000..345b7ff010 --- /dev/null +++ b/data/2021/iclr/Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning @@ -0,0 +1 @@ +Post-hoc calibration is a common approach for providing high-quality confidence estimates of deep neural network predictions. Recent work has shown that widely used scaling methods underestimate their calibration error, while alternative Histogram Binning (HB) methods with verifiable calibration performance often fail to preserve classification accuracy. In the case of multi-class calibration with a large number of classes K, HB also faces the issue of severe sample-inefficiency due to a large class imbalance resulting from the conversion into K one-vs-rest class-wise calibration problems.
The goal of this paper is to resolve the identified issues of HB in order to provide verified and calibrated confidence estimates using only a small holdout calibration dataset for bin optimization while preserving multi-class ranking accuracy. From an information-theoretic perspective, we derive the I-Max concept for binning, which maximizes the mutual information between labels and binned (quantized) logits. This concept mitigates potential loss in ranking performance due to lossy quantization, and by disentangling the optimization of bin edges and representatives allows simultaneous improvement of ranking and calibration performance. In addition, we propose a shared class-wise (sCW) binning strategy that fits a single calibrator on the merged training sets of all K class-wise problems, yielding reliable estimates from a small calibration set. The combination of sCW and I-Max binning outperforms state-of-the-art calibration methods on various evaluation metrics across different benchmark datasets and models, even when using only a small set of calibration data, e.g. 1k samples for ImageNet. \ No newline at end of file diff --git a/data/2021/iclr/Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks b/data/2021/iclr/Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks new file mode 100644 index 0000000000..6830b3baad --- /dev/null +++ b/data/2021/iclr/Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks @@ -0,0 +1 @@ +We propose Multi-Level Local SGD, a distributed gradient method for learning a smooth, non-convex objective in a heterogeneous multi-level network. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple worker nodes; further, worker nodes may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete communication network.
In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We back up our theoretical results via simulation-based experiments using both convex and non-convex objectives. \ No newline at end of file diff --git a/data/2021/iclr/Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network b/data/2021/iclr/Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network new file mode 100644 index 0000000000..af98083381 --- /dev/null +++ b/data/2021/iclr/Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network @@ -0,0 +1 @@ +Recently, Frankle & Carbin (2019) demonstrated that randomly-initialized dense networks contain subnetworks that, once found, can be trained to reach test accuracy comparable to the trained dense network. However, finding these high-performing trainable subnetworks is expensive, requiring an iterative process of training and pruning weights. In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis: A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) are robust to extreme forms of quantization (i.e., binary weights and/or activation) (prize 3).
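The hub-and-spoke averaging scheme the Multi-Level Local SGD abstract describes can be sketched in a few lines. This is a toy illustration under assumed settings (a shared quadratic objective, two hubs with two workers each, fixed local and global periods), not the authors' implementation:

```python
import numpy as np

# Toy sketch: two hubs, each with two workers, all minimizing the shared
# quadratic loss f(w) = 0.5 * ||w - target||^2. Workers run local SGD;
# each hub averages its workers every LOCAL steps, and the hubs average
# with each other every GLOBAL rounds.

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])

def grad(w):
    return w - target  # gradient of 0.5 * ||w - target||^2

workers = [rng.normal(size=2) for _ in range(4)]  # workers 0,1 -> hub A; 2,3 -> hub B
LOCAL, GLOBAL, LR = 5, 3, 0.1

for rnd in range(GLOBAL * 4):
    for _ in range(LOCAL):                      # local SGD steps on each worker
        workers = [w - LR * grad(w) for w in workers]
    hub_a = np.mean(workers[:2], axis=0)        # sub-network (hub) averaging
    hub_b = np.mean(workers[2:], axis=0)
    if (rnd + 1) % GLOBAL == 0:                 # periodic hub-to-hub averaging
        hub_a = hub_b = 0.5 * (hub_a + hub_b)
    workers = [hub_a, hub_a, hub_b, hub_b]      # broadcast hub model back to workers
```

Because every local step contracts each worker toward the minimizer and averaging preserves that contraction, all workers end up near the shared target, which is the qualitative behavior the paper's analysis quantifies.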
This provides a new paradigm for learning compact yet highly accurate binary neural networks simply by pruning and quantizing randomly weighted full precision neural networks. We also propose an algorithm for finding multi-prize tickets (MPTs) and test it by performing a series of experiments on CIFAR-10 and ImageNet datasets. Empirical results indicate that as models grow deeper and wider, multi-prize tickets start to reach similar (and sometimes even higher) test accuracy compared to their significantly larger and full-precision counterparts that have been weight-trained. Without ever updating the weight values, our MPTs-1/32 not only set new binary weight network state-of-the-art (SOTA) Top-1 accuracy -- 94.8% on CIFAR-10 and 74.03% on ImageNet -- but also outperform their full-precision counterparts by 1.78% and 0.76%, respectively. Further, our MPT-1/1 achieves SOTA Top-1 accuracy (91.9%) for binary neural networks on CIFAR-10. Code and pre-trained models are available at: https://github.com/chrundle/biprop. \ No newline at end of file diff --git a/data/2021/iclr/Multi-Time Attention Networks for Irregularly Sampled Time Series b/data/2021/iclr/Multi-Time Attention Networks for Irregularly Sampled Time Series new file mode 100644 index 0000000000..c826c47d48 --- /dev/null +++ b/data/2021/iclr/Multi-Time Attention Networks for Irregularly Sampled Time Series @@ -0,0 +1 @@ +Irregular sampling occurs in many time series modeling applications where it presents a significant challenge to standard deep learning models. This work is motivated by the analysis of physiological time series data in electronic health records, which are sparse, irregularly sampled, and multivariate. In this paper, we propose a new deep learning framework for this setting that we call Multi-Time Attention Networks. 
Multi-Time Attention Networks learn an embedding of continuous time values and use an attention mechanism to produce a fixed-length representation of a time series containing a variable number of observations. We investigate the performance of our framework on interpolation and classification tasks using multiple datasets. Our results show that our approach performs as well as or better than a range of baseline and recently proposed models while offering significantly faster training times than current state-of-the-art methods. \ No newline at end of file diff --git a/data/2021/iclr/Multi-resolution modeling of a discrete stochastic process identifies causes of cancer b/data/2021/iclr/Multi-resolution modeling of a discrete stochastic process identifies causes of cancer new file mode 100644 index 0000000000..fdfecb6950 --- /dev/null +++ b/data/2021/iclr/Multi-resolution modeling of a discrete stochastic process identifies causes of cancer @@ -0,0 +1 @@ +Detection of cancer-causing mutations within the vast and mostly unexplored human genome is a major challenge. Doing so requires modeling the background mutation rate, a highly non-stationary stochastic process, across regions of interest varying in size from one to millions of positions. Here, we present the split-Poisson-Gamma (SPG) distribution, an extension of the classical Poisson-Gamma formulation, to model a discrete stochastic process at multiple resolutions. We demonstrate that the probability model has a closed-form posterior, enabling efficient and accurate linear-time prediction over any length scale after the parameters of the model have been inferred a single time. We apply our framework to model mutation rates in tumors and show that model parameters can be accurately inferred from high-dimensional epigenetic data using a convolutional neural network, Gaussian process, and maximum-likelihood estimation.
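The fixed-length attention summary that the Multi-Time Attention Networks abstract describes can be illustrated with a minimal numpy sketch. The fixed sinusoidal time embedding and single attention head below are simplifying assumptions (the paper learns the time embedding and uses a richer architecture):

```python
import numpy as np

# Minimal sketch: embed continuous observation times with sinusoids, then
# attend from a fixed grid of reference times to the observed times,
# producing a fixed-length representation regardless of how many
# observations the irregularly sampled series contains.

rng = np.random.default_rng(0)

def time_embed(t, dim=8):
    freqs = 2.0 ** np.arange(dim // 2)          # fixed frequencies (assumed; the paper learns them)
    ang = np.outer(t, freqs)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def mtan_repr(obs_times, obs_values, ref_times):
    q = time_embed(ref_times)                   # queries: reference time grid
    k = time_embed(obs_times)                   # keys: irregular observation times
    scores = q @ k.T / np.sqrt(q.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)     # softmax over observations
    return attn @ obs_values[:, None]           # shape (len(ref_times), 1), fixed length

obs_t = np.sort(rng.uniform(0, 1, size=7))      # 7 irregularly spaced observations
obs_v = np.sin(2 * np.pi * obs_t)
rep = mtan_repr(obs_t, obs_v, np.linspace(0, 1, 16))
print(rep.shape)
```

The output shape depends only on the reference grid, which is what lets downstream classifiers consume series with any number of observations.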
Our method is both more accurate and more efficient than existing models over a large range of length scales. We demonstrate the usefulness of multi-resolution modeling by detecting genomic elements that drive tumor emergence and are of vastly differing sizes. \ No newline at end of file diff --git a/data/2021/iclr/Multi-timescale Representation Learning in LSTM Language Models b/data/2021/iclr/Multi-timescale Representation Learning in LSTM Language Models new file mode 100644 index 0000000000..e7ac24b1f0 --- /dev/null +++ b/data/2021/iclr/Multi-timescale Representation Learning in LSTM Language Models @@ -0,0 +1 @@ +Although neural language models are effective at capturing statistics of natural language, their representations are challenging to interpret. In particular, it is unclear how these models retain information over multiple timescales. In this work, we construct explicitly multi-timescale language models by manipulating the input and forget gate biases in a long short-term memory (LSTM) network. The distribution of timescales is selected to approximate power law statistics of natural language through a combination of exponentially decaying memory cells. We then empirically analyze the timescale of information routed through each part of the model using word ablation experiments and forget gate visualizations. These experiments show that the multi-timescale model successfully learns representations at the desired timescales, and that the distribution includes longer timescales than a standard LSTM. Further, information about high-, mid-, and low-frequency words is routed preferentially through units with the appropriate timescales. Thus, we show how to construct language models with interpretable representations of different information timescales.
\ No newline at end of file diff --git a/data/2021/iclr/MultiModalQA: complex question answering over text, tables and images b/data/2021/iclr/MultiModalQA: complex question answering over text, tables and images new file mode 100644 index 0000000000..eaa3c2e112 --- /dev/null +++ b/data/2021/iclr/MultiModalQA: complex question answering over text, tables and images @@ -0,0 +1 @@ +When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been relatively little work on question answering models that reason across multiple modalities. In this paper, we present MultiModalQA (MMQA): a challenging question answering dataset that requires joint reasoning over text, tables and images. We create MMQA using a new framework for generating complex multi-modal questions at scale, harvesting tables from Wikipedia, and attaching images and text paragraphs using entities that appear in each table. We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions. Lastly, crowdsourcing workers take these automatically-generated questions and rephrase them into more fluent language.
We create 29,918 questions through this procedure, and empirically demonstrate the necessity of a multi-modal multi-hop approach to solve our task: our multi-hop model, ImplicitDecomp, achieves an average F1 of 51.7 over cross-modal questions, substantially outperforming a strong baseline that achieves 38.2 F1, but still lagging significantly behind human performance, which is at 90.1 F1. \ No newline at end of file diff --git a/data/2021/iclr/Multiplicative Filter Networks b/data/2021/iclr/Multiplicative Filter Networks new file mode 100644 index 0000000000..d72f65ed66 --- /dev/null +++ b/data/2021/iclr/Multiplicative Filter Networks @@ -0,0 +1 @@ +Although deep networks are typically used to approximate functions over high-dimensional inputs, recent work has increased interest in neural networks as function approximators for low-dimensional-but-complex functions, such as representing images as a function of pixel coordinates, solving differential equations, or representing signed distance functions or neural radiance fields. Key to these recent successes has been the use of new elements such as sinusoidal nonlinearities or Fourier features in positional encodings, which vastly outperform simple ReLU networks. In this paper, we propose and empirically demonstrate that an arguably simpler class of function approximators can work just as well for such problems: multiplicative filter networks. In these networks, we avoid traditional compositional depth altogether, and simply multiply together (linear functions of) sinusoidal or Gabor wavelet functions applied to the input. This representation has the notable advantage that the entire function can simply be viewed as a linear function approximator over an exponential number of Fourier or Gabor basis functions, respectively.
Despite this simplicity, when compared to recent approaches that use Fourier features with ReLU networks or sinusoidal activation networks, we show that these multiplicative filter networks largely outperform or match the performance of these approaches on the domains highlighted in these past works. \ No newline at end of file diff --git a/data/2021/iclr/Multiscale Score Matching for Out-of-Distribution Detection b/data/2021/iclr/Multiscale Score Matching for Out-of-Distribution Detection new file mode 100644 index 0000000000..1bf7f91737 --- /dev/null +++ b/data/2021/iclr/Multiscale Score Matching for Out-of-Distribution Detection @@ -0,0 +1 @@ +We present a new methodology for detecting out-of-distribution (OOD) images by utilizing norms of the score estimates at multiple noise scales. A score is defined to be the gradient of the log density with respect to the input data. Our methodology is completely unsupervised and follows a straightforward training scheme. First, we train a deep network to estimate scores for L levels of noise. Once trained, we calculate the noisy score estimates for N in-distribution samples and take the L2-norms across the input dimensions (resulting in an NxL matrix). Then we train an auxiliary model (such as a Gaussian Mixture Model) to learn the in-distribution spatial regions in this L-dimensional space. This auxiliary model can now be used to identify points that reside outside the learned space. Despite its simplicity, our experiments show that this methodology significantly outperforms the state-of-the-art in detecting out-of-distribution images. For example, our method can effectively separate CIFAR-10 (inlier) and SVHN (OOD) images, a setting which has been previously shown to be difficult for deep likelihood models.
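The multiscale score-matching pipeline above can be mimicked end to end on synthetic data. In this sketch an analytic Gaussian score stands in for the trained score network, and a single fitted Gaussian replaces the auxiliary Gaussian Mixture Model; both substitutions are assumptions for illustration only:

```python
import numpy as np

# Toy version of the pipeline: for Gaussian data N(0, I) corrupted with
# noise scale sigma, the score of the noisy density is -x / (1 + sigma^2),
# so no network training is needed here. We build the N x L matrix of
# score norms at L noise scales, fit a Gaussian to the in-distribution
# features, and use Mahalanobis distance as the OOD score.

rng = np.random.default_rng(0)
D, SIGMAS = 16, np.array([0.1, 0.5, 1.0])

def score_norms(x):                             # x: (N, D) -> (N, L) feature matrix
    feats = [np.linalg.norm(-x / (1.0 + s**2), axis=1) for s in SIGMAS]
    return np.stack(feats, axis=1)

inliers = rng.normal(0.0, 1.0, size=(500, D))   # in-distribution: N(0, I)
outliers = rng.normal(3.0, 1.0, size=(50, D))   # mean-shifted OOD samples

F = score_norms(inliers)
mu, cov = F.mean(axis=0), np.cov(F.T) + 1e-6 * np.eye(len(SIGMAS))
inv = np.linalg.inv(cov)

def maha(x):
    d = score_norms(x) - mu
    return np.einsum('ij,jk,ik->i', d, inv, d)  # squared Mahalanobis distance

print(np.median(maha(inliers)), np.median(maha(outliers)))
```

The shifted samples produce much larger score norms at every scale, so their Mahalanobis distances separate cleanly from the inliers, mirroring the CIFAR-10 vs. SVHN separation the abstract reports.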
\ No newline at end of file diff --git a/data/2021/iclr/Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows b/data/2021/iclr/Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows new file mode 100644 index 0000000000..f42c74b65f --- /dev/null +++ b/data/2021/iclr/Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows @@ -0,0 +1 @@ +Probabilistic forecasting of irregularly sampled multivariate time series with missing values is an important problem in many fields, including health care, astronomy, and climate. State-of-the-art methods for the task estimate only marginal distributions of observations in single channels and at single timepoints, assuming a fixed-shape parametric distribution. In this work, we propose a novel model, ProFITi, for probabilistic forecasting of irregularly sampled time series with missing values using conditional normalizing flows. The model learns joint distributions over the future values of the time series conditioned on past observations and queried channels and times, without assuming any fixed shape of the underlying distribution. As model components, we introduce a novel invertible triangular attention layer and an invertible non-linear activation function on and onto the whole real line. We conduct extensive experiments on four datasets and demonstrate that the proposed model provides $4$ times higher likelihood over the previously best model. \ No newline at end of file diff --git a/data/2021/iclr/Mutual Information State Intrinsic Control b/data/2021/iclr/Mutual Information State Intrinsic Control new file mode 100644 index 0000000000..269659c040 --- /dev/null +++ b/data/2021/iclr/Mutual Information State Intrinsic Control @@ -0,0 +1 @@ +Reinforcement learning has been shown to be highly successful at many challenging tasks. However, success heavily relies on well-shaped rewards. 
Intrinsically motivated RL attempts to remove this constraint by defining an intrinsic reward function. Motivated by the self-consciousness concept in psychology, we make a natural assumption that the agent knows what constitutes itself, and propose a new intrinsic objective that encourages the agent to have maximum control over the environment. We mathematically formalize this reward as the mutual information between the agent state and the surrounding state under the current agent policy. With this new intrinsic motivation, we are able to outperform previous methods, including being able to complete the pick-and-place task for the first time without using any task reward. A video showing experimental results is available at https://youtu.be/AUCwc9RThpk. \ No newline at end of file diff --git a/data/2021/iclr/My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control b/data/2021/iclr/My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control new file mode 100644 index 0000000000..aa1ce0e23d --- /dev/null +++ b/data/2021/iclr/My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control @@ -0,0 +1 @@ +Multitask Reinforcement Learning is a promising way to obtain models with better performance, generalisation, data efficiency, and robustness. Most existing work is limited to compatible settings, where the state and action space dimensions are the same across tasks. Graph Neural Networks (GNN) are one way to address incompatible environments, because they can process graphs of arbitrary size. They also allow practitioners to inject biases encoded in the structure of the input graph. Existing work in graph-based continuous control uses the physical morphology of the agent to construct the input graph, i.e., encoding limb features as node labels and using edges to connect the nodes if their corresponding limbs are physically connected.
In this work, we present a series of ablations on existing methods that show that morphological information encoded in the graph does not improve their performance. Motivated by the hypothesis that any benefits GNNs extract from the graph structure are outweighed by difficulties they create for message passing, we also propose Amorpheus, a transformer-based approach. Further results show that, while Amorpheus ignores the morphological information that GNNs encode, it nonetheless substantially outperforms GNN-based methods. \ No newline at end of file diff --git a/data/2021/iclr/NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition b/data/2021/iclr/NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition new file mode 100644 index 0000000000..788636ffba --- /dev/null +++ b/data/2021/iclr/NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition @@ -0,0 +1 @@ +to \ No newline at end of file diff --git a/data/2021/iclr/NBDT: Neural-Backed Decision Tree b/data/2021/iclr/NBDT: Neural-Backed Decision Tree new file mode 100644 index 0000000000..be766b5f02 --- /dev/null +++ b/data/2021/iclr/NBDT: Neural-Backed Decision Tree @@ -0,0 +1 @@ +Machine learning applications such as finance and medicine demand accurate and justifiable predictions, barring most deep learning methods from use. In response, previous work combines decision trees with deep learning, yielding models that (1) sacrifice interpretability for accuracy or (2) sacrifice accuracy for interpretability. We forgo this dilemma by jointly improving accuracy and interpretability using Neural-Backed Decision Trees (NBDTs). NBDTs replace a neural network’s final linear layer with a differentiable sequence of decisions and a surrogate loss. 
This forces the model to learn high-level concepts and lessens reliance on highly uncertain decisions, yielding (1) accuracy: NBDTs match or outperform modern neural networks on CIFAR and ImageNet, and generalize better to unseen classes by up to 16%. Furthermore, our surrogate loss improves the original model’s accuracy by up to 2%. NBDTs also afford (2) interpretability: improving human trust by clearly identifying model mistakes and assisting in dataset debugging. Code and pretrained NBDTs are at github.com/alvinwan/neural-backed-decision-trees. \ No newline at end of file diff --git a/data/2021/iclr/NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control b/data/2021/iclr/NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control new file mode 100644 index 0000000000..49b3e7cae5 --- /dev/null +++ b/data/2021/iclr/NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control @@ -0,0 +1 @@ +In this work we propose the use of adaptive stochastic search as a building block for general, non-convex optimization operations within deep neural network architectures. Specifically, for an objective function located at some layer in the network and parameterized by some network parameters, we employ adaptive stochastic search to perform optimization over its output. This operation is differentiable and does not obstruct the passing of gradients during backpropagation, thus enabling us to incorporate it as a component in end-to-end learning. We study the proposed optimization module's properties and benchmark it against two existing alternatives on a synthetic energy-based structured prediction task, and further showcase its use in stochastic optimal control applications.
\ No newline at end of file diff --git a/data/2021/iclr/NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation b/data/2021/iclr/NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation new file mode 100644 index 0000000000..ae21f5baf8 --- /dev/null +++ b/data/2021/iclr/NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation @@ -0,0 +1 @@ +3D pose estimation is a challenging but important task in computer vision. In this work, we show that standard deep learning approaches to 3D pose estimation are not robust when objects are partially occluded or viewed from a previously unseen pose. Inspired by the robustness of generative vision models to partial occlusion, we propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo. In particular, NeMo learns a generative model of neural feature activations at each vertex on a dense 3D mesh. Using differentiable rendering we estimate the 3D object pose by minimizing the reconstruction error between NeMo and the feature representation of the target image. To avoid local optima in the reconstruction loss, we train the feature extractor to maximize the distance between the individual feature representations on the mesh using contrastive learning. Our extensive experiments on PASCAL3D+, occluded-PASCAL3D+ and ObjectNet3D show that NeMo is much more robust to partial occlusion and unseen pose compared to standard deep networks, while retaining competitive performance on regular data. Interestingly, our experiments also show that NeMo performs reasonably well even when the mesh representation only crudely approximates the true object geometry with a cuboid, hence revealing that the detailed 3D geometry is not needed for accurate 3D pose estimation. The code is publicly available at https://github.com/Angtian/NeMo. 
\ No newline at end of file diff --git a/data/2021/iclr/Nearest Neighbor Machine Translation b/data/2021/iclr/Nearest Neighbor Machine Translation new file mode 100644 index 0000000000..fa02398610 --- /dev/null +++ b/data/2021/iclr/Nearest Neighbor Machine Translation @@ -0,0 +1 @@ +We introduce $k$-nearest-neighbor machine translation ($k$NN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. $k$NN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results---without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. Qualitatively, $k$NN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples. \ No newline at end of file diff --git a/data/2021/iclr/Negative Data Augmentation b/data/2021/iclr/Negative Data Augmentation new file mode 100644 index 0000000000..a8ec3fbf2c --- /dev/null +++ b/data/2021/iclr/Negative Data Augmentation @@ -0,0 +1 @@ +In practical applications, the generalization capability of face anti-spoofing (FAS) models on unseen domains is of paramount importance to adapt to diverse camera sensors, device drift, environmental variation, and unpredictable attack types. 
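The kNN-MT prediction rule from the Nearest Neighbor Machine Translation abstract reduces to a few operations per decoding step. The toy datastore, vocabulary size, temperature, and interpolation weight below are assumptions for illustration; a real system would use cached decoder hidden states and approximate nearest-neighbor search over billions of entries:

```python
import numpy as np

# Schematic kNN-MT decoding step: a datastore maps cached decoder states to
# the next tokens observed after them; at test time we retrieve the k
# nearest states, turn negative distances into a token distribution, and
# interpolate it with the base translation model's distribution.

rng = np.random.default_rng(0)
VOCAB, DIM, K, LAM = 5, 4, 3, 0.5

keys = rng.normal(size=(100, DIM))              # cached decoder states (datastore keys)
vals = rng.integers(0, VOCAB, size=100)         # next token observed for each state

def knn_mt_step(query, p_model, temp=1.0):
    d = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(d)[:K]                      # k nearest datastore entries
    w = np.exp(-d[nn] / temp)
    w /= w.sum()                                # softmax over negative distances
    p_knn = np.zeros(VOCAB)
    np.add.at(p_knn, vals[nn], w)               # aggregate weight per retrieved token
    return LAM * p_knn + (1 - LAM) * p_model    # interpolate with the base model

p_model = np.full(VOCAB, 1.0 / VOCAB)
p = knn_mt_step(keys[0], p_model)               # query identical to a cached state
print(p)
```

Because retrieval happens only at test time, swapping in a different datastore adapts the same base model to a new domain, which is the mechanism behind the paper's domain-adaptation results.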
Recently, various domain generalization (DG) methods have been developed to improve the generalization capability of FAS models via training on multiple source domains. These DG methods commonly require collecting sufficient real-world attack samples of different attack types for each source domain. This work aims to learn a FAS model that can generalize well to unseen domains without using any real-world attack samples from any source domain, which can significantly reduce the learning cost. Toward this goal, we draw inspiration from the theoretical error bound of domain generalization to use negative data augmentation instead of real-world attack samples for training. We show that using only a few types of simple synthesized negative samples, e.g., color jitter and color mask, the learned model can achieve competitive performance over state-of-the-art DG methods trained using real-world attack samples. Moreover, a dynamic global common loss and a local contrast loss are proposed to prompt the model to learn a compact and common feature representation for real face samples from different source domains, which can further improve the generalization capability. Experimental results of extensive cross-dataset testing demonstrate that our method can even outperform state-of-the-art DG methods using real-world attack samples for training. The code for reproducing the results of our method is available at https://github.com/WeihangWANG/NDA-FAS. \ No newline at end of file diff --git a/data/2021/iclr/Net-DNF: Effective Deep Modeling of Tabular Data b/data/2021/iclr/Net-DNF: Effective Deep Modeling of Tabular Data new file mode 100644 index 0000000000..19c5fbd5f1 --- /dev/null +++ b/data/2021/iclr/Net-DNF: Effective Deep Modeling of Tabular Data @@ -0,0 +1 @@ +A challenging open question in deep learning is how to handle tabular data.
Unlike domains such as image and natural language processing, where deep architectures prevail, there is still no widely accepted neural architecture that dominates tabular data. As a step toward bridging this gap, we present Net-DNF, a novel generic architecture whose inductive bias elicits models whose structure corresponds to logical Boolean formulas in disjunctive normal form (DNF) over affine soft-threshold decision terms. Net-DNFs also promote localized decisions that are taken over small subsets of the features. We present extensive experiments showing that Net-DNFs significantly and consistently outperform fully connected networks over tabular data. With relatively few hyperparameters, Net-DNFs open the door to practical end-to-end handling of tabular data using neural networks. We present ablation studies, which justify the design choices of Net-DNF including the inductive bias elements, namely, Boolean formulation, locality, and feature selection. \ No newline at end of file diff --git a/data/2021/iclr/Network Pruning That Matters: A Case Study on Retraining Variants b/data/2021/iclr/Network Pruning That Matters: A Case Study on Retraining Variants new file mode 100644 index 0000000000..6cd0195ddf --- /dev/null +++ b/data/2021/iclr/Network Pruning That Matters: A Case Study on Retraining Variants @@ -0,0 +1 @@ +Network pruning is an effective method to reduce the computational expense of over-parameterized neural networks for deployment on low-resource systems. Recent state-of-the-art techniques for retraining pruned networks such as weight rewinding and learning rate rewinding have been shown to outperform the traditional fine-tuning technique in recovering the lost accuracy (Renda et al., 2020), but so far it is unclear what accounts for such performance. In this work, we conduct extensive experiments to verify and analyze the uncanny effectiveness of learning rate rewinding.
We find that the reason behind the success of learning rate rewinding is the use of a large learning rate. A similar phenomenon can be observed in other learning rate schedules that involve large learning rates, e.g., the 1-cycle learning rate schedule (Smith et al., 2019). By leveraging the right learning rate schedule in retraining, we demonstrate a counter-intuitive phenomenon: randomly pruned networks can even achieve better performance than methodically pruned networks (fine-tuned with the conventional approach). Our results emphasize how crucial the learning rate schedule is in pruned-network retraining - a detail often overlooked by practitioners during the implementation of network pruning. One-sentence Summary: We study the effectiveness of different retraining mechanisms when pruning. \ No newline at end of file diff --git a/data/2021/iclr/Neural Approximate Sufficient Statistics for Implicit Models b/data/2021/iclr/Neural Approximate Sufficient Statistics for Implicit Models new file mode 100644 index 0000000000..d5f291f297 --- /dev/null +++ b/data/2021/iclr/Neural Approximate Sufficient Statistics for Implicit Models @@ -0,0 +1 @@ +We consider the fundamental problem of how to automatically construct summary statistics for implicit generative models, where evaluating the likelihood function is intractable but sampling / simulating data from the model is possible. The idea is to frame the task of constructing sufficient statistics as learning a mutual-information-maximizing representation of the data. This representation is computed by a deep neural network trained with a joint statistic-posterior learning strategy. We apply our approach to both traditional approximate Bayesian computation (ABC) and recent neural likelihood approaches, boosting their performance on a range of tasks.
\ No newline at end of file diff --git a/data/2021/iclr/Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective b/data/2021/iclr/Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective new file mode 100644 index 0000000000..91eee41550 --- /dev/null +++ b/data/2021/iclr/Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective @@ -0,0 +1 @@ +Neural Architecture Search (NAS) has been studied intensively to automate the discovery of top-performing neural networks. Current works require heavy training of a supernet or intensive architecture evaluations, thus suffering from heavy resource consumption and often incurring search bias due to truncated training or approximations. Can we select the best neural architectures without involving any training and eliminate a drastic portion of the search cost? We provide an affirmative answer by proposing a novel framework called training-free neural architecture search (TE-NAS). TE-NAS ranks architectures by analyzing the spectrum of the neural tangent kernel (NTK) and the number of linear regions in the input space. Both are motivated by recent theory advances in deep networks and can be computed without any training and any label. We show that: (1) these two measurements imply the trainability and expressivity of a neural network; (2) they strongly correlate with the network's test accuracy. Furthermore, we design a pruning-based NAS mechanism to achieve a more flexible and superior trade-off between trainability and expressivity during the search. In the NAS-Bench-201 and DARTS search spaces, TE-NAS completes a high-quality search at a cost of only 0.5 and 4 GPU hours with one 1080Ti on CIFAR-10 and ImageNet, respectively. We hope our work inspires more attempts at bridging the theoretical findings of deep networks and practical impacts in real NAS applications. Code is available at: https://github.com/VITA-Group/TENAS.
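As a toy illustration of the linear-region measurement that TE-NAS combines with the NTK spectrum, one can count distinct ReLU activation patterns over random inputs. This is a sketch under my own assumptions (a plain fully connected ReLU network, Monte-Carlo sampling), not the authors' implementation:

```python
import numpy as np

def activation_pattern(x, Ws, bs):
    """Return the ReLU on/off pattern a single input induces in an MLP.
    Inputs in the same linear region share the same pattern."""
    pattern = []
    h = x
    for W, b in zip(Ws, bs):
        pre = h @ W + b
        pattern.append(tuple(bool(v) for v in (pre > 0)))
        h = np.maximum(pre, 0)
    return tuple(pattern)

def estimate_linear_regions(Ws, bs, n_samples=1000, seed=0):
    """Monte-Carlo estimate of the number of linear regions: count distinct
    activation patterns over random inputs (no training, no labels needed)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, Ws[0].shape[0]))
    return len({activation_pattern(x, Ws, bs) for x in X})
```

A randomly initialized architecture that carves the input space into more regions would score higher on the expressivity half of such a training-free ranking.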
\ No newline at end of file diff --git a/data/2021/iclr/Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks b/data/2021/iclr/Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks new file mode 100644 index 0000000000..1604e0b778 --- /dev/null +++ b/data/2021/iclr/Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks @@ -0,0 +1 @@ +Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks, a training-time attack that injects a trigger pattern into a small proportion of training data so as to control the model's prediction at test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model into making incorrect predictions whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework, Neural Attention Distillation (NAD), to erase backdoor triggers from backdoored DNNs. NAD utilizes a teacher network to guide the finetuning of the backdoored student network on a small clean subset of data such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent finetuning process on the same clean subset. We empirically show that, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5\% of clean training data without causing obvious performance degradation on clean examples. Code is available at https://github.com/bboylyg/NAD. \ No newline at end of file diff --git a/data/2021/iclr/Neural Delay Differential Equations b/data/2021/iclr/Neural Delay Differential Equations new file mode 100644 index 0000000000..8a7ef8edac --- /dev/null +++ b/data/2021/iclr/Neural Delay Differential Equations @@ -0,0 +1 @@ +The intersection of machine learning and dynamical systems has generated considerable interest recently.
Neural Ordinary Differential Equations (NODEs) represent a rich overlap between these fields. In this paper, we develop a continuous-time neural network approach based on Delay Differential Equations (DDEs). Our model uses the adjoint sensitivity method to learn the model parameters and delay directly from data. Our approach is inspired by that of NODEs and extends earlier neural DDE models, which have assumed that the value of the delay is known a priori. We perform a sensitivity analysis on our proposed approach and demonstrate its ability to learn DDE parameters from benchmark systems. We conclude our discussion with potential future directions and applications. \ No newline at end of file diff --git a/data/2021/iclr/Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering b/data/2021/iclr/Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering new file mode 100644 index 0000000000..254aa6822b --- /dev/null +++ b/data/2021/iclr/Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering @@ -0,0 +1 @@ +Combinations of neural ODEs with recurrent neural networks (RNNs), like GRU-ODE-Bayes or ODE-RNN, are well suited to model irregularly-sampled time series. While those models outperform existing discrete-time approaches, no theoretical guarantees for their predictive capabilities are available. Assuming that the irregularly-sampled time series data originates from a continuous stochastic process, the optimal on-line prediction is the conditional expectation given the currently available information. We introduce the Neural Jump ODE (NJ-ODE), which provides a data-driven approach to learn, continuously in time, the conditional expectation of a stochastic process. Our approach models the conditional expectation between two observations with a neural ODE and jumps whenever a new observation is made.
We define a novel training framework, which allows us to prove theoretical convergence guarantees for the first time. In particular, we demonstrate the predictive capabilities of our model by proving that, under some regularity assumptions, the output process converges to the conditional expectation process. We provide experiments showing that the theoretical results also hold empirically. Moreover, we experimentally show that our model outperforms a state-of-the-art model on more complex learning tasks and give comparisons on a real-world dataset. \ No newline at end of file diff --git a/data/2021/iclr/Neural Learning of One-of-Many Solutions for Combinatorial Problems in Structured Output Spaces b/data/2021/iclr/Neural Learning of One-of-Many Solutions for Combinatorial Problems in Structured Output Spaces new file mode 100644 index 0000000000..aee3de2fb8 --- /dev/null +++ b/data/2021/iclr/Neural Learning of One-of-Many Solutions for Combinatorial Problems in Structured Output Spaces @@ -0,0 +1 @@ +Recent research has proposed neural architectures for solving combinatorial problems in structured output spaces. In many such problems, there may exist multiple solutions for a given input, e.g. a partially filled Sudoku puzzle may have many completions satisfying all constraints. Further, we are often interested in finding {\em any one} of the possible solutions, without any preference between them. Existing approaches completely ignore this solution multiplicity. In this paper, we argue that being oblivious to the presence of multiple solutions can severely hamper their training ability. Our contribution is twofold. First, we formally define the task of learning one-of-many solutions for combinatorial problems in structured output spaces, which is applicable to solving several problems of interest such as N-Queens and Sudoku.
Second, we present a generic learning framework that adapts an existing prediction network for a combinatorial problem to handle solution multiplicity. Our framework uses a selection module, whose goal is to dynamically determine, for every input, the solution that is most effective for training the network parameters in any given learning iteration. We propose an RL based approach to jointly train the selection module with the prediction network. Experiments on three different domains, and using two different prediction networks, demonstrate that our framework significantly improves the accuracy in our setting, obtaining up to $21$ pt gain over the baselines. \ No newline at end of file diff --git a/data/2021/iclr/Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics b/data/2021/iclr/Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics new file mode 100644 index 0000000000..82c5b050f3 --- /dev/null +++ b/data/2021/iclr/Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics @@ -0,0 +1 @@ +Predicting the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space undergoes discrete finite steps along complex stochastic gradients derived from real-world datasets. We circumvent this obstacle through a unifying theoretical framework based on intrinsic symmetries embedded in a network's architecture that are present for any dataset. We show that any such symmetry imposes stringent geometric constraints on gradients and Hessians, leading to an associated conservation law in the continuous-time limit of stochastic gradient descent (SGD), akin to Noether's theorem in physics. We further show that finite learning rates used in practice can actually break these symmetry induced conservation laws. 
We apply tools from finite difference methods to derive modified gradient flow, a differential equation that better approximates the numerical trajectory taken by SGD at finite learning rates. We combine modified gradient flow with our framework of symmetries to derive exact integral expressions for the dynamics of certain parameter combinations. We empirically validate our analytic predictions for learning dynamics on VGG-16 trained on Tiny ImageNet. Overall, by exploiting symmetry, our work demonstrates that we can analytically describe the learning dynamics of various parameter combinations at finite learning rates and batch sizes for state of the art architectures trained on any dataset. \ No newline at end of file diff --git a/data/2021/iclr/Neural Networks for Learning Counterfactual G-Invariances from Single Environments b/data/2021/iclr/Neural Networks for Learning Counterfactual G-Invariances from Single Environments new file mode 100644 index 0000000000..907fb2f2e9 --- /dev/null +++ b/data/2021/iclr/Neural Networks for Learning Counterfactual G-Invariances from Single Environments @@ -0,0 +1 @@ +Despite -- or maybe because of -- their astonishing capacity to fit data, neural networks are believed to have difficulties extrapolating beyond training data distribution. This work shows that, for extrapolations based on finite transformation groups, a model's inability to extrapolate is unrelated to its capacity. Rather, the shortcoming is inherited from a learning hypothesis: Examples not explicitly observed with infinitely many training examples have underspecified outcomes in the learner's model. In order to endow neural networks with the ability to extrapolate over group transformations, we introduce a learning framework counterfactually-guided by the learning hypothesis that any group invariance to (known) transformation groups is mandatory even without evidence, unless the learner deems it inconsistent with the training data. 
Unlike existing invariance-driven methods for (counterfactual) extrapolations, this framework allows extrapolations from a single environment. Finally, we introduce sequence and image extrapolation tasks that validate our framework and showcase the shortcomings of traditional approaches. \ No newline at end of file diff --git a/data/2021/iclr/Neural ODE Processes b/data/2021/iclr/Neural ODE Processes new file mode 100644 index 0000000000..abdf748337 --- /dev/null +++ b/data/2021/iclr/Neural ODE Processes @@ -0,0 +1 @@ +Neural Ordinary Differential Equations (NODEs) use a neural network to model the instantaneous rate of change in the state of a system. However, despite their apparent suitability for dynamics-governed time-series, NODEs present a few disadvantages. First, they are unable to adapt to incoming data points, a fundamental requirement for real-time applications imposed by the natural direction of time. Second, time series are often composed of a sparse set of measurements that could be explained by many possible underlying dynamics. NODEs do not capture this uncertainty. In contrast, Neural Processes (NPs) are a family of models providing uncertainty estimation and fast data adaptation but lack an explicit treatment of the flow of time. To address these problems, we introduce Neural ODE Processes (NDPs), a new class of stochastic processes determined by a distribution over Neural ODEs. By maintaining an adaptive data-dependent distribution over the underlying ODE, we show that our model can successfully capture the dynamics of low-dimensional systems from just a few data points. At the same time, we demonstrate that NDPs scale up to challenging high-dimensional time-series with unknown latent dynamics such as rotating MNIST digits. 
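The core mechanic of NDPs, maintaining a distribution over ODE dynamics and decoding by integration, can be caricatured in a few lines. Everything below (the Gaussian latent, the fixed-step Euler integrator, the scalar dynamics in the usage note) is an illustrative assumption of mine, not the paper's architecture:

```python
import numpy as np

def integrate_ode(x0, z, f, t0=0.0, t1=1.0, steps=100):
    """Euler-integrate dx/dt = f(x, z); the latent z parameterizes the dynamics."""
    x = np.asarray(x0, dtype=float)
    dt = (t1 - t0) / steps
    for _ in range(steps):
        x = x + dt * f(x, z)
    return x

def ndp_sample(x0, context_mu, context_sigma, f, seed=0):
    """One sample from a toy distribution over ODEs: draw a latent z from a
    context-dependent Gaussian, then roll the corresponding ODE forward."""
    rng = np.random.default_rng(seed)
    z = context_mu + context_sigma * rng.standard_normal(np.shape(context_mu))
    return integrate_ode(x0, z, f)
```

With `f = lambda x, z: z * x` and a degenerate posterior (`context_sigma=0`, `context_mu=1`), the sample from `x0=1` over the unit interval approximates `e`; widening `context_sigma` spreads the trajectories, which is how uncertainty about the underlying dynamics is expressed.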
\ No newline at end of file diff --git a/data/2021/iclr/Neural Pruning via Growing Regularization b/data/2021/iclr/Neural Pruning via Growing Regularization new file mode 100644 index 0000000000..d291f5fe62 --- /dev/null +++ b/data/2021/iclr/Neural Pruning via Growing Regularization @@ -0,0 +1 @@ +Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role has mainly been explored in the small penalty strength regime. In this work, we extend its application to a new scenario where the regularization grows large gradually to tackle two central problems of pruning: the pruning schedule and weight importance scoring. (1) The former topic is newly brought up in this work; we find it critical to pruning performance, while it receives little research attention. Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains compared with its one-shot counterpart, even when the same weights are removed. (2) The growing penalty scheme also gives us an approach to exploit Hessian information for more accurate pruning without knowing the specific Hessian values, thus avoiding the common Hessian approximation problems. Empirically, the proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning. Their effectiveness is demonstrated with modern deep neural networks on the CIFAR and ImageNet datasets, achieving competitive results compared to many state-of-the-art algorithms. Our code and trained models are publicly available at this https URL.
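A minimal caricature of the rising-penalty idea: weights slated for pruning receive an extra L2 penalty whose factor grows over iterations, shrinking them gradually before the final removal. The task gradient is zeroed out for brevity and the names and schedule are illustrative, not the authors' code:

```python
import numpy as np

def growing_l2_step(w, grad, penalty, to_prune, lr=0.1):
    """One SGD step where only weights slated for pruning get an extra,
    currently-scheduled L2 penalty pushing them toward zero."""
    reg_grad = np.where(to_prune, penalty * w, 0.0)
    return w - lr * (grad + reg_grad)

def prune_with_growing_reg(w, to_prune, reg_step=0.05, iters=100, lr=0.1):
    """Gradually raise the penalty factor, then remove the targeted weights."""
    penalty = 0.0
    for _ in range(iters):
        grad = np.zeros_like(w)          # task gradient omitted in this sketch
        w = growing_l2_step(w, grad, penalty, to_prune, lr)
        penalty += reg_step              # the "growing" schedule
    return np.where(to_prune, 0.0, w)    # removal after gradual shrinkage
```

Because the penalized weights are driven near zero before they are cut, the removal is far less abrupt than one-shot pruning with the same mask.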
\ No newline at end of file diff --git a/data/2021/iclr/Neural Spatio-Temporal Point Processes b/data/2021/iclr/Neural Spatio-Temporal Point Processes new file mode 100644 index 0000000000..56254fbb78 --- /dev/null +++ b/data/2021/iclr/Neural Spatio-Temporal Point Processes @@ -0,0 +1 @@ +We propose a new class of parameterizations for spatio-temporal point processes which leverage Neural ODEs as a computational method and enable flexible, high-fidelity models of discrete events that are localized in continuous time and space. Central to our approach is a combination of recurrent continuous-time neural networks with two novel neural architectures, i.e., Jump and Attentive Continuous-time Normalizing Flows. This approach allows us to learn complex distributions for both the spatial and temporal domain and to condition non-trivially on the observed event history. We validate our models on data sets from a wide variety of contexts such as seismology, epidemiology, urban mobility, and neuroscience. \ No newline at end of file diff --git a/data/2021/iclr/Neural Synthesis of Binaural Speech From Mono Audio b/data/2021/iclr/Neural Synthesis of Binaural Speech From Mono Audio new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Neural Thompson Sampling b/data/2021/iclr/Neural Thompson Sampling new file mode 100644 index 0000000000..0fa596d2b2 --- /dev/null +++ b/data/2021/iclr/Neural Thompson Sampling @@ -0,0 +1 @@ +We study the Combinatorial Thompson Sampling policy (CTS) for combinatorial multi-armed bandit problems (CMAB), within an approximation regret setting. Although CTS has attracted a lot of interest, it has a drawback that other usual CMAB policies do not have when considering non-exact oracles: for some oracles, CTS has a poor approximation regret (scaling linearly with the time horizon $T$) [Wang and Chen, 2018]. A study is then necessary to discriminate the oracles on which CTS could learn. This study was started by Kong et al. 
[2021]: they gave the first approximation regret analysis of CTS for the greedy oracle, obtaining an upper bound of order $\mathcal{O}(\log(T)/\Delta^2)$, where $\Delta$ is some minimal reward gap. In this paper, our objective is to push this study further than the simple case of the greedy oracle. We provide the first $\mathcal{O}(\log(T)/\Delta)$ approximation regret upper bound for CTS, obtained under a specific condition on the approximation oracle, allowing a reduction to the exact oracle analysis. We thus term this condition REDUCE2EXACT, and observe that it is satisfied in many concrete examples. Moreover, it can be extended to the probabilistically triggered arms setting, thus capturing even more problems, such as online influence maximization. \ No newline at end of file diff --git a/data/2021/iclr/Neural Topic Model via Optimal Transport b/data/2021/iclr/Neural Topic Model via Optimal Transport new file mode 100644 index 0000000000..d88a526376 --- /dev/null +++ b/data/2021/iclr/Neural Topic Model via Optimal Transport @@ -0,0 +1 @@ +Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have attracted increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, they often degrade their performance severely on short documents. The requirement of reparameterisation could also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model via the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distributions. Importantly, the cost matrix of the OT distance models the weights between topics and words, which is constructed by the distances between topics and words in an embedding space.
Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms the state-of-the-art NTMs on discovering more coherent and diverse topics and deriving better document representations for both regular and short texts. \ No newline at end of file diff --git a/data/2021/iclr/Neural gradients are near-lognormal: improved quantized and sparse training b/data/2021/iclr/Neural gradients are near-lognormal: improved quantized and sparse training new file mode 100644 index 0000000000..f1c9cfdb22 --- /dev/null +++ b/data/2021/iclr/Neural gradients are near-lognormal: improved quantized and sparse training @@ -0,0 +1 @@ +While training can mostly be accelerated by reducing the time needed to propagate neural gradients back throughout the model, most previous works focus on the quantization/pruning of weights and activations. These methods are often not applicable to neural gradients, which have very different statistical properties. Distinguished from weights and activations, we find that the distribution of neural gradients is approximately lognormal. Considering this, we suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients. The first method optimizes the floating-point format and scale of the gradients. The second method accurately sets sparsity thresholds for gradient pruning. Each method achieves state-of-the-art results on ImageNet. To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity -- in each case without accuracy degradation. Reference implementation accompanies the paper. 
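Assuming, as the gradient-statistics abstract argues, that gradient magnitudes are approximately lognormal, a pruning threshold for any target sparsity follows in closed form from the fitted log-moments. A sketch under that assumption (function names are mine, not the paper's):

```python
import numpy as np
from statistics import NormalDist

def lognormal_sparsity_threshold(grads, target_sparsity):
    """Analytic threshold assuming |gradients| are lognormal: fit (mu, sigma)
    to log|g|, then invert the Gaussian CDF at the target sparsity level."""
    log_mag = np.log(np.abs(grads) + 1e-30)   # epsilon guards exact zeros
    mu, sigma = log_mag.mean(), log_mag.std()
    z = NormalDist().inv_cdf(target_sparsity)
    return float(np.exp(mu + sigma * z))

def sparsify(grads, target_sparsity):
    """Zero out all gradients whose magnitude falls below the analytic threshold."""
    t = lognormal_sparsity_threshold(grads, target_sparsity)
    return np.where(np.abs(grads) < t, 0.0, grads)
```

The appeal of such a closed form is that no sorting or per-step percentile search over the gradient tensor is needed; two moments of `log|g|` suffice.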
\ No newline at end of file diff --git a/data/2021/iclr/Neural networks with late-phase weights b/data/2021/iclr/Neural networks with late-phase weights new file mode 100644 index 0000000000..651a7da109 --- /dev/null +++ b/data/2021/iclr/Neural networks with late-phase weights @@ -0,0 +1 @@ +The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring increased computational costs, we investigate a family of low-dimensional late-phase weight models which interact multiplicatively with the remaining parameters. Our results show that augmenting standard models with late-phase weights improves generalization in established benchmarks such as CIFAR-10/100, ImageNet and enwik8. These findings are complemented with a theoretical analysis of a noisy quadratic problem which provides a simplified picture of the late phases of neural network learning. \ No newline at end of file diff --git a/data/2021/iclr/Neural representation and generation for RNA secondary structures b/data/2021/iclr/Neural representation and generation for RNA secondary structures new file mode 100644 index 0000000000..b8fb298d52 --- /dev/null +++ b/data/2021/iclr/Neural representation and generation for RNA secondary structures @@ -0,0 +1 @@ +Our work is concerned with the generation and targeted design of RNA, a type of genetic macromolecule that can adopt complex structures which influence their cellular activities and functions. The design of large scale and complex biological structures spurs dedicated graph-based deep generative modeling techniques, which represents a key but underappreciated aspect of computational drug discovery. 
In this work, we investigate the principles behind representing and generating different RNA structural modalities, and propose a flexible framework to jointly embed and generate these molecular structures along with their sequence in a meaningful latent space. Equipped with a deep understanding of RNA molecular structures, our most sophisticated encoding and decoding methods operate on the molecular graph as well as the junction tree hierarchy, integrating strong inductive bias about RNA structural regularity and folding mechanism such that high structural validity, stability and diversity of generated RNAs are achieved. Also, we seek to adequately organize the latent space of RNA molecular embeddings with regard to the interaction with proteins, and targeted optimization is used to navigate in this latent space to search for desired novel RNA molecules. \ No newline at end of file diff --git a/data/2021/iclr/Neurally Augmented ALISTA b/data/2021/iclr/Neurally Augmented ALISTA new file mode 100644 index 0000000000..ccaff4de09 --- /dev/null +++ b/data/2021/iclr/Neurally Augmented ALISTA @@ -0,0 +1 @@ +It is well-established that many iterative sparse reconstruction algorithms can be unrolled to yield a learnable neural network for improved empirical performance. A prime example is learned ISTA (LISTA) where weights, step sizes and thresholds are learned from training data. Recently, Analytic LISTA (ALISTA) has been introduced, combining the strong empirical performance of a fully learned approach like LISTA, while retaining theoretical guarantees of classical compressed sensing algorithms and significantly reducing the number of parameters to learn. However, these parameters are trained to work in expectation, often leading to suboptimal reconstruction of individual targets. In this work we therefore introduce Neurally Augmented ALISTA, in which an LSTM network is used to compute step sizes and thresholds individually for each target vector during reconstruction. 
This adaptive approach is theoretically motivated by revisiting the recovery guarantees of ALISTA. We show that our approach further improves empirical performance in sparse reconstruction, in particular outperforming existing algorithms by an increasing margin as the compression ratio becomes more challenging. \ No newline at end of file diff --git a/data/2021/iclr/New Bounds For Distributed Mean Estimation and Variance Reduction b/data/2021/iclr/New Bounds For Distributed Mean Estimation and Variance Reduction new file mode 100644 index 0000000000..b8443a6ee3 --- /dev/null +++ b/data/2021/iclr/New Bounds For Distributed Mean Estimation and Variance Reduction @@ -0,0 +1 @@ +We consider the problem of distributed mean estimation (DME), in which n machines are each given a local d-dimensional vector x_v ∈ ℝ^d, and must cooperate to estimate the mean of their inputs, μ = (1/n) ∑_{v=1}^{n} x_v, while minimizing total communication cost. DME is a fundamental construct in distributed machine learning, and there has been considerable work on variants of this problem, especially in the context of distributed variance reduction for stochastic gradients in parallel SGD. Previous work typically assumes an upper bound on the norm of the input vectors, and achieves an error bound in terms of this norm. However, in many real applications, the input vectors are concentrated around the correct output μ, but μ itself has large norm. In such cases, previous output error bounds perform poorly. In this paper, we show that output error bounds need not depend on input norm. We provide a method of quantization which allows distributed mean estimation to be performed with solution quality dependent only on the distance between inputs, not on input norm, and show an analogous result for distributed variance reduction. The technique is based on a new connection with lattice theory.
We also provide lower bounds showing that the communication-to-error trade-off of our algorithms is asymptotically optimal. As the lattices achieving optimal bounds under the ℓ2-norm can be computationally impractical, we also present an extension which leverages easy-to-use cubic lattices, and is loose only up to a logarithmic factor in d. We show experimentally that our method yields practical improvements for common applications, relative to prior approaches. \ No newline at end of file diff --git a/data/2021/iclr/No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks b/data/2021/iclr/No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks new file mode 100644 index 0000000000..9d8cebd436 --- /dev/null +++ b/data/2021/iclr/No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks @@ -0,0 +1 @@ +There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors. The idea is to exploit the label hierarchy (e.g., the WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we find that current state-of-the-art hierarchy-aware deep classifiers do not always show practical improvement over the standard cross-entropy baseline in making better mistakes. The reason for the reduction in average mistake-severity can be attributed to the increase in low-severity mistakes, which may also explain the noticeable drop in their accuracy. To this end, we use the classical Conditional Risk Minimization (CRM) framework for hierarchy-aware classification.
Given a cost matrix and a reliable estimate of likelihoods (obtained from a trained network), CRM simply amends mistakes at inference time; it needs no extra hyperparameters and requires adding just a few lines of code to the standard cross-entropy baseline. It significantly outperforms the state-of-the-art and consistently obtains large reductions in the average hierarchical distance of top-$k$ predictions across datasets, with very little loss in accuracy. CRM, because of its simplicity, can be used with any off-the-shelf trained model that provides reliable likelihood estimates. \ No newline at end of file diff --git a/data/2021/iclr/No MCMC for me: Amortized sampling for fast and stable training of energy-based models b/data/2021/iclr/No MCMC for me: Amortized sampling for fast and stable training of energy-based models new file mode 100644 index 0000000000..cae43f695c --- /dev/null +++ b/data/2021/iclr/No MCMC for me: Amortized sampling for fast and stable training of energy-based models @@ -0,0 +1 @@ +Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty. Despite recent advances, training EBMs on high-dimensional data remains a challenging problem as the state-of-the-art approaches are costly, unstable, and require considerable tuning and domain expertise to apply successfully. In this work, we present a simple method for training EBMs at scale which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training. We improve upon prior MCMC-based entropy regularization methods with a fast variational approximation. We demonstrate the effectiveness of our approach by using it to train tractable likelihood models. Next, we apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and stable training. This allows us to extend JEM models to semi-supervised classification on tabular data from a variety of continuous domains. 
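The inference-time amendment CRM makes, as described in the hierarchy-aware classification abstract above, is just an expected-cost argmin over a cost matrix derived from the label hierarchy. A minimal sketch with a made-up 4-class hierarchy (two sibling pairs; within-pair mistakes cost 1, cross-pair mistakes cost 3):

```python
import numpy as np

def crm_predict(probs, cost):
    """Conditional Risk Minimization at inference time: pick the class with
    the lowest expected cost under the posterior, risk[k] = sum_j cost[k, j] * p(j|x)."""
    return int(np.argmin(cost @ probs))

# Hypothetical hierarchy: classes {0, 1} and {2, 3} are sibling pairs.
COST = np.array([[0, 1, 3, 3],
                 [1, 0, 3, 3],
                 [3, 3, 0, 1],
                 [3, 3, 1, 0]], dtype=float)
```

For a posterior like [0.32, 0.28, 0.35, 0.05], plain argmax picks class 2, but CRM prefers class 0 because its sibling's probability mass backs it up, so a mistake there is cheap; this is the "no extra hyperparameters, few lines of code" flavor the abstract describes.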
\ No newline at end of file diff --git a/data/2021/iclr/Noise against noise: stochastic label noise helps combat inherent label noise b/data/2021/iclr/Noise against noise: stochastic label noise helps combat inherent label noise new file mode 100644 index 0000000000..5c8b69d722 --- /dev/null +++ b/data/2021/iclr/Noise against noise: stochastic label noise helps combat inherent label noise @@ -0,0 +1 @@ +The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect, previously studied in optimization by analyzing the dynamics of parameter updates. In this paper, we are interested in learning with noisy labels, where we have a collection of samples with potential mislabeling. We show that a previously rarely discussed SGD noise, induced by stochastic label noise (SLN), mitigates the effects of inherent label noise. In contrast, the common SGD noise directly applied to model parameters does not. We formalize the differences and connections of SGD noise variants, showing that SLN induces SGD noise dependent on the sharpness of output landscape and the confidence of output probability, which may help escape from sharp minima and prevent overconfidence. SLN not only improves generalization in its simplest form but also boosts popular robust training methods, including sample selection and label correction. Specifically, we present an enhanced algorithm by applying SLN to label correction. Our code is released. \ No newline at end of file diff --git a/data/2021/iclr/Noise or Signal: The Role of Image Backgrounds in Object Recognition b/data/2021/iclr/Noise or Signal: The Role of Image Backgrounds in Object Recognition new file mode 100644 index 0000000000..513f7fe84e --- /dev/null +++ b/data/2021/iclr/Noise or Signal: The Role of Image Backgrounds in Object Recognition @@ -0,0 +1 @@ +We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds.
We create a toolkit for disentangling foreground and background signal on ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds--up to 87.5% of the time with adversarially chosen backgrounds, and (c) more accurate models tend to depend on backgrounds less. Our analysis of backgrounds brings us closer to understanding which correlations machine learning models use, and how they determine models' out of distribution performance. \ No newline at end of file diff --git a/data/2021/iclr/Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds b/data/2021/iclr/Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds new file mode 100644 index 0000000000..a78ad5b630 --- /dev/null +++ b/data/2021/iclr/Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds @@ -0,0 +1 @@ +Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. Therefore, OPE is a key step in applying reinforcement learning to real-world domains such as medical treatment, where interactive data collection is expensive or even unsafe. As the observed data tends to be noisy and limited, it is essential to provide rigorous uncertainty quantification, not just a point estimation, when applying OPE to make high stakes decisions. This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question. We develop a practical algorithm through a primal-dual optimization-based approach, which leverages the kernel Bellman loss (KBL) of Feng et al.(2019) and a new martingale concentration inequality of KBL applicable to time-dependent data with unknown mixing conditions. 
Our algorithm makes minimum assumptions on the data and the function class of the Q-function, and works for the behavior-agnostic settings where the data is collected under a mix of arbitrary unknown behavior policies. We present empirical results that clearly demonstrate the advantages of our approach over existing methods. \ No newline at end of file diff --git a/data/2021/iclr/Nonseparable Symplectic Neural Networks b/data/2021/iclr/Nonseparable Symplectic Neural Networks new file mode 100644 index 0000000000..e85e2f53fe --- /dev/null +++ b/data/2021/iclr/Nonseparable Symplectic Neural Networks @@ -0,0 +1 @@ +Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled, while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fluid dynamics and quantum mechanics were rarely explored. The main computational challenge lies in the effective embedding of symplectic priors to describe the inherently coupled evolution of position and momentum, which typically exhibits intricate dynamics with many degrees of freedom. To solve the problem, we propose a novel neural network architecture, Nonseparable Symplectic Neural Networks (NSSNNs), to uncover and embed the symplectic structure of a nonseparable Hamiltonian system from limited observation data. The enabling mechanics of our approach is an augmented symplectic time integrator to decouple the position and momentum energy terms and facilitate their evolution. We demonstrated the efficacy and versatility of our method by predicting a wide range of Hamiltonian systems, both separable and nonseparable, including vortical flow and quantum system. 
We showed the unique computational merits of our approach to yield long-term, accurate, and robust predictions for large-scale Hamiltonian systems by rigorously enforcing symplectomorphism. \ No newline at end of file diff --git a/data/2021/iclr/OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning b/data/2021/iclr/OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning new file mode 100644 index 0000000000..4facec0277 --- /dev/null +++ b/data/2021/iclr/OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning @@ -0,0 +1 @@ +Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective horizon yielding better learning in theory, and improved offline RL in practice. In addition to benefiting offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning as well as exploration and transfer in online RL on a variety of benchmark domains. 
Visualizations are available at this https URL \ No newline at end of file diff --git a/data/2021/iclr/Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers b/data/2021/iclr/Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers new file mode 100644 index 0000000000..dacf2e9ddc --- /dev/null +++ b/data/2021/iclr/Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers @@ -0,0 +1 @@ +We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we formally show that we can achieve this goal by compensating for the difference in dynamics by modifying the reward function. This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Intuitively, the modified reward function penalizes the agent for visiting states and taking actions in the source domain which are not possible in the target domain. Said another way, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain. Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics. On discrete and continuous control tasks, we illustrate the mechanics of our approach and demonstrate its scalability to high-dimensional tasks. 
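The reward modification in the "Off-Dynamics Reinforcement Learning" abstract above can be sketched with two domain classifiers. The log-ratio form below is my reading of the idea, with the assumption (not spelled out in the abstract) that each classifier exposes a logit for "this came from the target domain":

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modified_reward(r, logit_sas, logit_sa):
    """Reward correction for training in the source domain.

    logit_sas: classifier logit that (s, a, s') came from the TARGET domain.
    logit_sa:  classifier logit that (s, a) came from the TARGET domain.
    The correction is the implied log-ratio of target vs. source dynamics;
    transitions that are unlikely in the target domain get penalized.
    """
    q_sas = sigmoid(logit_sas)
    q_sa = sigmoid(logit_sa)
    # log(q / (1 - q)) recovers the logit, so delta == logit_sas - logit_sa.
    delta = (np.log(q_sas) - np.log(1.0 - q_sas)) - (np.log(q_sa) - np.log(1.0 - q_sa))
    return r + delta
```

Because log-odds of a sigmoid recover the logit exactly, the whole correction collapses to a difference of classifier logits, which is why no explicit dynamics model is needed.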
\ No newline at end of file diff --git a/data/2021/iclr/Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation b/data/2021/iclr/Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation new file mode 100644 index 0000000000..d06daba4eb --- /dev/null +++ b/data/2021/iclr/Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation @@ -0,0 +1 @@ +In this work we consider data-driven optimization problems where one must maximize a function given only queries at a fixed set of points. This problem setting emerges in many domains where function evaluation is a complex and expensive process, such as in the design of materials, vehicles, or neural network architectures. Because the available data typically only covers a small manifold of the possible space of inputs, a principal challenge is to be able to construct algorithms that can reason about uncertainty and out-of-distribution values, since a naive optimizer can easily exploit an estimated model to return adversarial inputs. We propose to tackle this problem by leveraging the normalized maximum-likelihood (NML) estimator, which provides a principled approach to handling uncertainty and out-of-distribution inputs. While in the standard formulation NML is intractable, we propose a tractable approximation that allows us to scale our method to high-capacity neural network models. We demonstrate that our method can effectively optimize high-dimensional design problems in a variety of disciplines such as chemistry, biology, and materials engineering. 
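The NML estimator in the abstract above can be illustrated in the simplest possible setting. The Bernoulli family here is my choice for illustration, not the paper's neural model: for each candidate outcome, refit the maximum-likelihood parameter on the data augmented with that outcome, score the outcome under the refit model, and normalize.

```python
def conditional_nml(d):
    """Conditional NML for the next binary outcome under the Bernoulli family.

    For each candidate outcome y, refit the MLE on d + [y], score y under
    that refit model, then normalize over candidates. Unlike the plain MLE,
    NML hedges on outcomes the data has never ruled out.
    """
    scores = []
    for y in (0, 1):
        aug = d + [y]
        theta = sum(aug) / len(aug)           # Bernoulli MLE on augmented data
        scores.append(theta if y == 1 else 1.0 - theta)
    z = sum(scores)
    return [s / z for s in scores]            # [p_NML(y=0), p_NML(y=1)]

# Four successes in a row: the plain MLE says p(next=1) = 1.0,
# while NML keeps probability on the unseen outcome.
print(conditional_nml([1, 1, 1, 1]))
```

This hedging against never-observed values is exactly the property the abstract leverages to avoid adversarial, out-of-distribution optima.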
\ No newline at end of file diff --git a/data/2021/iclr/On Data-Augmentation and Consistency-Based Semi-Supervised Learning b/data/2021/iclr/On Data-Augmentation and Consistency-Based Semi-Supervised Learning new file mode 100644 index 0000000000..7adbf5e7a7 --- /dev/null +++ b/data/2021/iclr/On Data-Augmentation and Consistency-Based Semi-Supervised Learning @@ -0,0 +1 @@ +Recently proposed consistency-based Semi-Supervised Learning (SSL) methods such as the $\Pi$-model, temporal ensembling, the mean teacher, or the virtual adversarial training, have advanced the state of the art in several SSL tasks. These methods can typically reach performances that are comparable to their fully supervised counterparts while using only a fraction of labelled examples. Despite these methodological advances, the understanding of these methods is still relatively limited. In this text, we analyse (variations of) the $\Pi$-model in settings where analytically tractable results can be obtained. We establish links with Manifold Tangent Classifiers and demonstrate that the quality of the perturbations is key to obtaining reasonable SSL performances. Importantly, we propose a simple extension of the Hidden Manifold Model that naturally incorporates data-augmentation schemes and offers a framework for understanding and experimenting with SSL methods. \ No newline at end of file diff --git a/data/2021/iclr/On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections b/data/2021/iclr/On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections new file mode 100644 index 0000000000..33f52d3ac9 --- /dev/null +++ b/data/2021/iclr/On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections @@ -0,0 +1 @@ +Disparate impact has raised serious concerns in machine learning applications and its societal impacts. In response to the need of mitigating discrimination, fairness has been regarded as a crucial property in algorithmic design. 
In this work, we study the problem of disparate impact on graph-structured data. Specifically, we focus on dyadic fairness, which articulates a fairness concept that a predictive relationship between two instances should be independent of the sensitive attributes. Based on this, we theoretically relate the graph connections to dyadic fairness on link predictive scores in learning graph neural networks, and reveal that regulating weights on existing edges in a graph contributes to dyadic fairness conditionally. Subsequently, we propose our algorithm, FairAdj, to empirically learn a fair adjacency matrix with proper graph structural constraints for fair link prediction, and meanwhile preserve predictive accuracy as much as possible. Empirical validation demonstrates that our method delivers effective dyadic fairness in terms of various statistics, and at the same time enjoys a favorable fairness-utility tradeoff. \ No newline at end of file diff --git a/data/2021/iclr/On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning b/data/2021/iclr/On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning new file mode 100644 index 0000000000..1ad17056f8 --- /dev/null +++ b/data/2021/iclr/On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning @@ -0,0 +1 @@ +Model-agnostic meta-learning (MAML) has emerged as one of the most successful meta-learning techniques in few-shot learning. It enables us to learn a meta-initialization of model parameters (that we call meta-model) to rapidly adapt to new tasks using a small amount of labeled training data. Despite the generalization power of the meta-model, it remains elusive how adversarial robustness can be maintained by MAML in few-shot learning. In addition to generalization, robustness is also desired for a meta-model to defend against adversarial examples (attacks).
Toward promoting adversarial robustness in MAML, we first study WHEN a robustness-promoting regularization should be incorporated, given the fact that MAML adopts a bi-level (fine-tuning vs. meta-update) learning procedure. We show that robustifying the meta-update stage is sufficient to make robustness adapted to the task-specific fine-tuning stage even if the latter uses a standard training protocol. We also make additional justification on the acquired robustness adaptation by peering into the interpretability of neurons' activation maps. Furthermore, we investigate HOW robust regularization can efficiently be designed in MAML. We propose a general but easily-optimized robustness-regularized meta-learning framework, which allows the use of unlabeled data augmentation, fast adversarial attack generation, and computationally-light fine-tuning. In particular, we for the first time show that the auxiliary contrastive learning task can enhance the adversarial robustness of MAML. Finally, extensive experiments are conducted to demonstrate the effectiveness of our proposed methods in robust few-shot learning. \ No newline at end of file diff --git a/data/2021/iclr/On Graph Neural Networks versus Graph-Augmented MLPs b/data/2021/iclr/On Graph Neural Networks versus Graph-Augmented MLPs new file mode 100644 index 0000000000..686dcd7124 --- /dev/null +++ b/data/2021/iclr/On Graph Neural Networks versus Graph-Augmented MLPs @@ -0,0 +1 @@ +From the perspective of expressive power, this work compares multi-layer Graph Neural Networks (GNNs) with a simplified alternative that we call Graph-Augmented Multi-Layer Perceptrons (GA-MLPs), which first augments node features with certain multi-hop operators on the graph and then applies an MLP in a node-wise fashion. 
From the perspective of graph isomorphism testing, we show both theoretically and numerically that GA-MLPs with suitable operators can distinguish almost all non-isomorphic graphs, just like the Weisfeiler-Lehman (WL) test. However, by viewing them as node-level functions and examining the equivalence classes they induce on rooted graphs, we prove a separation in expressive power between GA-MLPs and GNNs that grows exponentially in depth. In particular, unlike GNNs, GA-MLPs are unable to count the number of attributed walks. We also demonstrate via community detection experiments that GA-MLPs can be limited by their choice of operator family, as compared to GNNs with higher flexibility in learning. \ No newline at end of file diff --git a/data/2021/iclr/On InstaHide, Phase Retrieval, and Sparse Matrix Factorization b/data/2021/iclr/On InstaHide, Phase Retrieval, and Sparse Matrix Factorization new file mode 100644 index 0000000000..77f1c32b83 --- /dev/null +++ b/data/2021/iclr/On InstaHide, Phase Retrieval, and Sparse Matrix Factorization @@ -0,0 +1,2 @@ +In this work, we examine the security of InstaHide, a scheme recently proposed by [Huang, Song, Li and Arora, ICML'20] for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the distributed learners, InstaHide takes a convex combination of private feature vectors and randomly flips the sign of each entry of the resulting vector with probability 1/2. A salient question is whether this scheme is secure in any provable sense, perhaps under a plausible hardness assumption and assuming the distributions generating the public and private data satisfy certain properties. +We show that the answer to this appears to be quite subtle and closely related to the average-case complexity of a new multi-task, missing-data version of the classic problem of phase retrieval.
Motivated by this connection, we design a provable algorithm that can recover private vectors using only the public vectors and synthetic vectors generated by InstaHide, under the assumption that the private and public vectors are isotropic Gaussian. \ No newline at end of file diff --git a/data/2021/iclr/On Learning Universal Representations Across Languages b/data/2021/iclr/On Learning Universal Representations Across Languages new file mode 100644 index 0000000000..9ae4ed1fac --- /dev/null +++ b/data/2021/iclr/On Learning Universal Representations Across Languages @@ -0,0 +1 @@ +Recent studies have demonstrated the overwhelming advantage of cross-lingual pre-trained models (PTMs), such as multilingual BERT and XLM, on cross-lingual NLP tasks. However, existing approaches essentially capture the co-occurrence among tokens through involving the masked language model (MLM) objective with token-level cross entropy. In this work, we extend these approaches to learn sentence-level representations, and show the effectiveness on cross-lingual understanding and generation. We propose Hierarchical Contrastive Learning (HiCTL) to (1) learn universal representations for parallel sentences distributed in one or multiple languages and (2) distinguish the semantically-related words from a shared cross-lingual vocabulary for each sentence. We conduct evaluations on three benchmarks: language understanding tasks (QQP, QNLI, SST-2, MRPC, STS-B and MNLI) in the GLUE benchmark, cross-lingual natural language inference (XNLI) and machine translation. Experimental results show that the HiCTL obtains an absolute gain of 1.0%/2.2% accuracy on GLUE/XNLI as well as achieves substantial improvements of +1.7-+3.6 BLEU on both the high-resource and low-resource English-to-X translation tasks over strong baselines. We will release the source codes as soon as possible. 
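The HiCTL abstract above does not spell out its exact loss; as a hedged illustration, a generic sentence-level contrastive (InfoNCE-style) objective over parallel sentence pairs, where each pair's representations attract each other and other batch rows serve as negatives, looks like:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss over parallel sentence representations.

    anchors[i] and positives[i] encode a parallel pair (e.g. a sentence and
    its translation); all other rows in the batch act as negatives.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # matched pairs sit on the diagonal
```

Correctly matched pairs drive the loss toward zero, while mismatched pairs are heavily penalized.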
\ No newline at end of file diff --git a/data/2021/iclr/On Position Embeddings in BERT b/data/2021/iclr/On Position Embeddings in BERT new file mode 100644 index 0000000000..90818e09a3 --- /dev/null +++ b/data/2021/iclr/On Position Embeddings in BERT @@ -0,0 +1 @@ +relative \ No newline at end of file diff --git a/data/2021/iclr/On Self-Supervised Image Representations for GAN Evaluation b/data/2021/iclr/On Self-Supervised Image Representations for GAN Evaluation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/On Statistical Bias In Active Learning: How and When to Fix It b/data/2021/iclr/On Statistical Bias In Active Learning: How and When to Fix It new file mode 100644 index 0000000000..3a5314acf7 --- /dev/null +++ b/data/2021/iclr/On Statistical Bias In Active Learning: How and When to Fix It @@ -0,0 +1 @@ +Active learning is a powerful tool when labelling data is expensive, but it introduces a bias because the training data no longer follows the population distribution. We formalize this bias and investigate the situations in which it can be harmful and sometimes even helpful. We further introduce novel corrective weights to remove bias when doing so is beneficial. Through this, our work not only provides a useful mechanism that can improve the active learning approach, but also an explanation of the empirical successes of various existing approaches which ignore this bias. In particular, we show that this bias can be actively helpful when training overparameterized models -- like neural networks -- with relatively little data. 
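The corrective-weight idea in the "On Statistical Bias In Active Learning" abstract above can be illustrated with a toy importance-weighting scheme. The sampling-with-replacement setup and inverse-probability weights below are a textbook simplification, not the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pool of N per-point losses (in practice these come from a model).
N = 1000
losses = rng.normal(loc=1.0, scale=0.3, size=N)

# An "active" acquisition distribution that prefers high-loss points,
# mimicking the bias active learning introduces into the training sample.
q = np.exp(losses)
q /= q.sum()

M = 200
idx = rng.choice(N, size=M, replace=True, p=q)

naive = losses[idx].mean()                      # biased: overweights high-loss points
weighted = (losses[idx] / (N * q[idx])).mean()  # importance-weighted, unbiased

print(naive, weighted, losses.mean())
```

The naive average systematically overestimates the population risk, while the inverse-probability weights remove the acquisition bias; the abstract's point is that in some regimes (e.g. overparameterized models with little data) leaving the bias in can actually help.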
\ No newline at end of file diff --git a/data/2021/iclr/On the Bottleneck of Graph Neural Networks and its Practical Implications b/data/2021/iclr/On the Bottleneck of Graph Neural Networks and its Practical Implications new file mode 100644 index 0000000000..20e48c5aa8 --- /dev/null +++ b/data/2021/iclr/On the Bottleneck of Graph Neural Networks and its Practical Implications @@ -0,0 +1 @@ +Graph neural networks (GNNs) were shown to effectively learn from highly structured data containing elements (nodes) with relationships (edges) between them. GNN variants differ in how each node in the graph absorbs the information flowing from its neighbor nodes. In this paper, we highlight an inherent problem in GNNs: the mechanism of propagating information between neighbors creates a bottleneck when every node aggregates messages from its neighbors. This bottleneck causes the over-squashing of exponentially-growing information into fixed-size vectors. As a result, the graph fails to propagate messages flowing from distant nodes and performs poorly when the prediction task depends on long-range information. We demonstrate that the bottleneck hinders popular GNNs from fitting the training data. We show that GNNs that absorb incoming edges equally, like GCN and GIN, are more susceptible to over-squashing than other GNN types. We further show that existing, extensively-tuned, GNN-based models suffer from over-squashing and that breaking the bottleneck improves state-of-the-art results without any hyperparameter tuning or additional weights. 
\ No newline at end of file diff --git a/data/2021/iclr/On the Critical Role of Conventions in Adaptive Human-AI Collaboration b/data/2021/iclr/On the Critical Role of Conventions in Adaptive Human-AI Collaboration new file mode 100644 index 0000000000..8cc31fe142 --- /dev/null +++ b/data/2021/iclr/On the Critical Role of Conventions in Adaptive Human-AI Collaboration @@ -0,0 +1 @@ +Humans can quickly adapt to new partners in collaborative tasks (e.g. playing basketball), because they understand which fundamental skills of the task (e.g. how to dribble, how to shoot) carry over across new partners. Humans can also quickly adapt to similar tasks with the same partners by carrying over conventions that they have developed (e.g. raising hand signals pass the ball), without learning to coordinate from scratch. To collaborate seamlessly with humans, AI agents should adapt quickly to new partners and new tasks as well. However, current approaches have not attempted to distinguish between the complexities intrinsic to a task and the conventions used by a partner, and more generally there has been little focus on leveraging conventions for adapting to new settings. In this work, we propose a learning framework that teases apart rule-dependent representation from convention-dependent representation in a principled way. We show that, under some assumptions, our rule-dependent representation is a sufficient statistic of the distribution over best-response strategies across partners. Using this separation of representations, our agents are able to adapt quickly to new partners, and to coordinate with old partners on new tasks in a zero-shot manner. We experimentally validate our approach on three collaborative tasks varying in complexity: a contextual multi-armed bandit, a block placing task, and the card game Hanabi. 
\ No newline at end of file diff --git a/data/2021/iclr/On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis b/data/2021/iclr/On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis new file mode 100644 index 0000000000..2988b51e7e --- /dev/null +++ b/data/2021/iclr/On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis @@ -0,0 +1 @@ +We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem of such linear functionals, and characterize the approximation rate and its relation with memory. Moreover, we perform a fine-grained dynamical analysis of training linear RNNs, which further reveal the intricate interactions between memory and learning. A unifying theme uncovered is the non-trivial effect of memory, a notion that can be made precise in our framework, on approximation and optimization: when there is long term memory in the target, it takes a large number of neurons to approximate it. Moreover, the training process will suffer from slow downs. In particular, both of these effects become exponentially more pronounced with memory - a phenomenon we call the "curse of memory". These analyses represent a basic step towards a concrete mathematical understanding of new phenomenon that may arise in learning temporal relationships using recurrent architectures. 
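The continuous-time linear setting in the "Curse of Memory" abstract above has a concrete standard form (notation mine, not quoted from the paper):

```latex
% Continuous-time linear RNN with hidden state h and input signal x:
\dot{h}(t) = W h(t) + U x(t), \qquad \hat{y}(t) = c^{\top} h(t),
% which (for stable W) realizes the memory kernel \hat{\rho}(s) = c^{\top} e^{W s} U:
\hat{y}(t) = \int_{0}^{\infty} c^{\top} e^{W s} U \, x(t-s) \, ds .
% The target is a sequence of linear functionals with kernel \rho:
y(t) = \int_{0}^{\infty} \rho(s) \, x(t-s) \, ds .
% A slowly decaying \rho (long-term memory) forces many slow eigenmodes of W,
% hence many neurons and slow training dynamics: the "curse of memory".
```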
\ No newline at end of file diff --git a/data/2021/iclr/On the Dynamics of Training Attention Models b/data/2021/iclr/On the Dynamics of Training Attention Models new file mode 100644 index 0000000000..e438904c63 --- /dev/null +++ b/data/2021/iclr/On the Dynamics of Training Attention Models @@ -0,0 +1 @@ +The attention mechanism has been widely used in deep neural networks as a model component. By now, it has become a critical building block in many state-of-the-art natural language models. Despite its great success established empirically, the working mechanism of attention has not been investigated at a sufficient theoretical depth to date. In this paper, we set up a simple text classification task and study the dynamics of training a simple attention-based classification model using gradient descent. In this setting, we show that, for the discriminative words that the model should attend to, a persisting identity exists relating its embedding and the inner product of its key and the query. This allows us to prove that training must converge to attending to the discriminative words when the attention output is classified by a linear classifier. Experiments are performed, which validates our theoretical analysis and provides further insights. \ No newline at end of file diff --git a/data/2021/iclr/On the Impossibility of Global Convergence in Multi-Loss Optimization b/data/2021/iclr/On the Impossibility of Global Convergence in Multi-Loss Optimization new file mode 100644 index 0000000000..639bca6905 --- /dev/null +++ b/data/2021/iclr/On the Impossibility of Global Convergence in Multi-Loss Optimization @@ -0,0 +1 @@ +Under mild regularity conditions, gradient-based methods converge globally to a critical point in the single-loss setting. This is known to break down for vanilla gradient descent when moving to multi-loss optimization, but can we hope to build some algorithm with global guarantees? 
We negatively resolve this open problem by proving that any reasonable algorithm will exhibit limit cycles or diverge to infinite losses in some differentiable game, even in two-player games with zero-sum interactions. A reasonable algorithm is simply one which avoids strict maxima, an exceedingly weak assumption since converging to maxima would be the opposite of minimization. This impossibility theorem holds even if we impose existence of a strict minimum and no other critical points. The proof is constructive, enabling us to display explicit limit cycles for existing gradient-based methods. Nonetheless, it remains an open question whether cycles arise in high-dimensional games of interest to ML practitioners, such as GANs or multi-agent RL. \ No newline at end of file diff --git a/data/2021/iclr/On the Origin of Implicit Regularization in Stochastic Gradient Descent b/data/2021/iclr/On the Origin of Implicit Regularization in Stochastic Gradient Descent new file mode 100644 index 0000000000..c81bbc0b6b --- /dev/null +++ b/data/2021/iclr/On the Origin of Implicit Regularization in Stochastic Gradient Descent @@ -0,0 +1 @@ +For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function. However moderately large learning rates can achieve higher test accuracies, and this generalization benefit is not explained by convergence bounds, since the learning rate which maximizes test accuracy is often larger than the learning rate which minimizes training loss. To interpret this phenomenon we prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss. This modified loss is composed of the original loss function and an implicit regularizer, which penalizes the norms of the minibatch gradients. 
Under mild assumptions, when the batch size is small the scale of the implicit regularization term is proportional to the ratio of the learning rate to the batch size. We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small. \ No newline at end of file diff --git a/data/2021/iclr/On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines b/data/2021/iclr/On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines new file mode 100644 index 0000000000..e34864c365 --- /dev/null +++ b/data/2021/iclr/On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines @@ -0,0 +1 @@ +Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and a small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on three commonly used datasets from the GLUE benchmark and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. 
Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than previously proposed approaches. Code to reproduce our results is available online: this https URL . \ No newline at end of file diff --git a/data/2021/iclr/On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers b/data/2021/iclr/On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers new file mode 100644 index 0000000000..57f78ced2e --- /dev/null +++ b/data/2021/iclr/On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers @@ -0,0 +1 @@ +A deep equilibrium model uses implicit layers, which are implicitly defined through an equilibrium point of an infinite sequence of computation. It avoids any explicit computation of the infinite sequence by finding an equilibrium point directly via root-finding and by computing gradients via implicit differentiation. In this paper, we analyze the gradient dynamics of deep equilibrium models with nonlinearity only on weight matrices and non-convex objective functions of weights for regression and classification. Despite non-convexity, convergence to a global optimum at a linear rate is guaranteed without any assumption on the width of the models, allowing the width to be smaller than the output dimension and the number of data points. Moreover, we prove a relation between the gradient dynamics of the deep implicit layer and the dynamics of the trust region Newton method of a shallow explicit layer. This mathematically proven relation, along with our numerical observations, suggests the importance of understanding the implicit bias of implicit layers and an open problem on the topic. Our proofs deal with implicit layers, weight tying, and nonlinearity on weights, and differ from those in the related literature.
\ No newline at end of file diff --git a/data/2021/iclr/On the Transfer of Disentangled Representations in Realistic Settings b/data/2021/iclr/On the Transfer of Disentangled Representations in Realistic Settings new file mode 100644 index 0000000000..63e6f03504 --- /dev/null +++ b/data/2021/iclr/On the Transfer of Disentangled Representations in Realistic Settings @@ -0,0 +1 @@ +Learning meaningful representations that disentangle the underlying structure of the data generating process is considered to be of key importance in machine learning. While disentangled representations were found to be useful for diverse tasks such as abstract reasoning and fair classification, their scalability and real-world impact remain questionable. We introduce a new high-resolution dataset with 1M simulated images and over 1,800 annotated real-world images of the same robotic setup. In contrast to previous work, this new dataset exhibits correlations and a complex underlying structure, and allows us to evaluate transfer to unseen simulated and real-world settings where the encoder i) remains in distribution or ii) is out of distribution. We propose new architectures in order to scale disentangled representation learning to realistic high-resolution settings and conduct a large-scale empirical study of disentangled representations on this dataset. We observe that disentanglement is a good predictor for out-of-distribution (OOD) task performance. \ No newline at end of file diff --git a/data/2021/iclr/On the Universality of Rotation Equivariant Point Cloud Networks b/data/2021/iclr/On the Universality of Rotation Equivariant Point Cloud Networks new file mode 100644 index 0000000000..faa1e14494 --- /dev/null +++ b/data/2021/iclr/On the Universality of Rotation Equivariant Point Cloud Networks @@ -0,0 +1,2 @@ +Learning functions on point clouds has applications in many fields, including computer vision, computer graphics, physics, and chemistry.
Recently, there has been a growing interest in neural architectures that are invariant or equivariant to all three shape-preserving transformations of point clouds: translation, rotation, and permutation. +In this paper, we present a first study of the approximation power of these architectures. We first derive two sufficient conditions for an equivariant architecture to have the universal approximation property, based on a novel characterization of the space of equivariant polynomials. We then use these conditions to show that two recently suggested models are universal, and to devise two other novel universal architectures. \ No newline at end of file diff --git a/data/2021/iclr/On the Universality of the Double Descent Peak in Ridgeless Regression b/data/2021/iclr/On the Universality of the Double Descent Peak in Ridgeless Regression new file mode 100644 index 0000000000..38dbed1cc3 --- /dev/null +++ b/data/2021/iclr/On the Universality of the Double Descent Peak in Ridgeless Regression @@ -0,0 +1 @@ +We prove a non-asymptotic distribution-independent lower bound for the expected mean squared generalization error caused by label noise in ridgeless linear regression. Our lower bound generalizes a similar known result to the overparameterized (interpolating) regime. In contrast to most previous works, our analysis applies to a broad class of input distributions with almost surely full-rank feature matrices, which allows us to cover various types of deterministic or random feature maps. Our lower bound is asymptotically sharp and implies that in the presence of label noise, ridgeless linear regression does not perform well around the interpolation threshold for any of these feature maps. We analyze the imposed assumptions in detail and provide a theory for analytic (random) feature maps.
Using this theory, we can show that our assumptions are satisfied for input distributions with a (Lebesgue) density and feature maps given by random deep neural networks with analytic activation functions like sigmoid, tanh, softplus, or GELU. As further examples, we show that feature maps from random Fourier features and polynomial kernels also satisfy our assumptions. We complement our theory with further experimental and analytic results. \ No newline at end of file diff --git a/data/2021/iclr/On the geometry of generalization and memorization in deep neural networks b/data/2021/iclr/On the geometry of generalization and memorization in deep neural networks new file mode 100644 index 0000000000..bd6b47dfcc --- /dev/null +++ b/data/2021/iclr/On the geometry of generalization and memorization in deep neural networks @@ -0,0 +1 @@ +Understanding how large neural networks avoid memorizing training data is key to explaining their high generalization performance. To examine the structure of when and where memorization occurs in a deep network, we use a recently developed replica-based mean field theoretic geometric analysis method. We find that all layers preferentially learn from examples which share features, and link this behavior to generalization performance. Memorization predominantly occurs in the deeper layers, due to decreasing object manifolds' radius and dimension, whereas early layers are minimally affected. This predicts that generalization can be restored by reverting the final few layer weights to earlier epochs before significant memorization occurred, which is confirmed by our experiments. Additionally, by studying generalization under different model sizes, we reveal the connection between the double descent phenomenon and the underlying model geometry. Finally, an analytical analysis shows that networks avoid memorization early in training because, close to initialization, the gradient contribution from permuted examples is small.
These findings provide quantitative evidence for the structure of memorization across layers of a deep neural network, the drivers for such structure, and its connection to manifold geometric properties. \ No newline at end of file diff --git a/data/2021/iclr/On the mapping between Hopfield networks and Restricted Boltzmann Machines b/data/2021/iclr/On the mapping between Hopfield networks and Restricted Boltzmann Machines new file mode 100644 index 0000000000..7e42e4dc9a --- /dev/null +++ b/data/2021/iclr/On the mapping between Hopfield networks and Restricted Boltzmann Machines @@ -0,0 +1 @@ +Hopfield networks (HNs) and Restricted Boltzmann Machines (RBMs) are two important models at the interface of statistical physics, machine learning, and neuroscience. Recently, there has been interest in the relationship between HNs and RBMs, due to their similarity under the statistical mechanics formalism. An exact mapping between HNs and RBMs has been previously noted for the special case of orthogonal (uncorrelated) encoded patterns. We present here an exact mapping in the general case of correlated pattern HNs, which are more broadly applicable to existing datasets. Specifically, we show that any HN with $N$ binary variables and $p