diff --git "a/data/2024/aaai/\"Allot?\" is \"A Lot!\" Towards Developing More Generalized Speech Recognition System for Accessible Communication" "b/data/2024/aaai/\"Allot?\" is \"A Lot!\" Towards Developing More Generalized Speech Recognition System for Accessible Communication" new file mode 100644 index 0000000000..324641b552 --- /dev/null +++ "b/data/2024/aaai/\"Allot?\" is \"A Lot!\" Towards Developing More Generalized Speech Recognition System for Accessible Communication" @@ -0,0 +1 @@ +The proliferation of Automatic Speech Recognition (ASR) systems has revolutionized translation and transcription. However, challenges persist in ensuring inclusive communication for non-native English speakers. This study quantifies the gap between accented and native English speech using Wav2Vec 2.0, a state-of-the-art transformer model. Notably, we found that accented speech exhibits significantly higher word error rates of 30-50%, in contrast to native speakers’ 2-8% (Baevski et al. 2020). Our exploration extends to leveraging accessible online datasets to highlight the potential of enhancing speech recognition by fine-tuning the Wav2Vec 2.0 model. Through experimentation and analysis, we highlight the challenges with training models on accented speech. By refining models and addressing data quality issues, our work presents a pipeline for future investigations aimed at developing an integrated system capable of effectively engaging with a broader range of individuals with diverse backgrounds. Accurate recognition of accented speech is a pivotal step toward democratizing AI-driven communication products. \ No newline at end of file diff --git a/data/2024/aaai/'Why Didn't You Allocate This Task to Them?' Negotiation-Aware Task Allocation and Contrastive Explanation Generation b/data/2024/aaai/'Why Didn't You Allocate This Task to Them?' Negotiation-Aware Task Allocation and Contrastive Explanation Generation new file mode 100644 index 0000000000..b11f828630 --- /dev/null +++ b/data/2024/aaai/'Why Didn't You Allocate This Task to Them?' Negotiation-Aware Task Allocation and Contrastive Explanation Generation @@ -0,0 +1 @@ +In this work, we design an Artificially Intelligent Task Allocator (AITA) that proposes a task allocation for a team of humans. A key property of this allocation is that when an agent with imperfect knowledge (about their teammate's costs and/or the team's performance metric) contests the allocation with a counterfactual, a contrastive explanation can always be provided to showcase why the proposed allocation is better than the proposed counterfactual. For this, we consider a negotiation process that produces a negotiation-aware task allocation and, when contested, leverages a negotiation tree to provide a contrastive explanation. With human subject studies, we show that the proposed allocation indeed appears fair to a majority of participants and, when not, the explanations generated are judged as convincing and easy to comprehend. 
\ No newline at end of file diff --git a/data/2024/aaai/1 2-Approximate MMS Allocation for Separable Piecewise Linear Concave Valuations b/data/2024/aaai/1 2-Approximate MMS Allocation for Separable Piecewise Linear Concave Valuations new file mode 100644 index 0000000000..ba1f6fe236 --- /dev/null +++ b/data/2024/aaai/1 2-Approximate MMS Allocation for Separable Piecewise Linear Concave Valuations @@ -0,0 +1,6 @@ +We study fair distribution of a collection of m indivisible goods among a group of n agents, using the widely recognized fairness principles of Maximin Share (MMS) and Any Price Share (APS). These principles have undergone thorough investigation within the context of additive valuations. We explore these notions for valuations that extend beyond additivity. + +First, we study approximate MMS under the separable (piecewise-linear) concave (SPLC) valuations, an important class generalizing additive, where the best known factor was 1/3-MMS. We show that a 1/2-MMS allocation exists and can be computed in polynomial time, significantly improving the state-of-the-art. +We note that SPLC valuations introduce an elevated level of intricacy in contrast to additive. For instance, the MMS value of an agent can be as high as her value for the entire set of items. We use a relax-and-round paradigm that goes through competitive equilibrium and LP relaxation. Our result extends to give (symmetric) 1/2-APS, a stronger guarantee than MMS. + +APS is a stronger notion that generalizes MMS by allowing agents with arbitrary entitlements. We study the approximation of APS under submodular valuation functions. We design and analyze a simple greedy algorithm using concave extensions of submodular functions. We prove that the algorithm gives a 1/3-APS allocation which matches the best-known factor. Concave extensions are hard to compute in polynomial time and are, therefore, generally not used in approximation algorithms. Our approach shows a way to utilize them within the analysis (while bypassing their computation), and hence might be of independent interest. \ No newline at end of file diff --git a/data/2024/aaai/3D Visibility-Aware Generalizable Neural Radiance Fields for Interacting Hands b/data/2024/aaai/3D Visibility-Aware Generalizable Neural Radiance Fields for Interacting Hands new file mode 100644 index 0000000000..f423feec20 --- /dev/null +++ b/data/2024/aaai/3D Visibility-Aware Generalizable Neural Radiance Fields for Interacting Hands @@ -0,0 +1 @@ +Neural radiance fields (NeRFs) are promising 3D representations for scenes, objects, and humans. However, most existing methods require multi-view inputs and per-scene training, which limits their real-life applications. Moreover, current methods focus on single-subject cases, leaving scenes of interacting hands, which involve severe inter-hand occlusions and challenging view variations, unsolved. To tackle these issues, this paper proposes a generalizable visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically, given an image of interacting hands as input, our VA-NeRF first obtains a mesh-based representation of hands and extracts their corresponding geometric and textural features. Subsequently, a feature fusion module that exploits the visibility of query points and mesh vertices is introduced to adaptively merge features of both hands, enabling the recovery of features in unseen areas. Additionally, our VA-NeRF is optimized together with a novel discriminator within an adversarial learning paradigm.
In contrast to conventional discriminators that predict a single real/fake label for the synthesized image, the proposed discriminator generates a pixel-wise visibility map, providing fine-grained supervision for unseen areas and encouraging the VA-NeRF to improve the visual quality of synthesized images. Experiments on the Interhand2.6M dataset demonstrate that our proposed VA-NeRF outperforms conventional NeRFs significantly. Project Page: https://github.com/XuanHuang0/VANeRF. \ No newline at end of file diff --git a/data/2024/aaai/3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation b/data/2024/aaai/3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation new file mode 100644 index 0000000000..8fee1b9934 --- /dev/null +++ b/data/2024/aaai/3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation @@ -0,0 +1 @@ +In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions. However, this conventional paradigm encounters significant challenges, most notably in terms of the generation of lackluster initial proposals and a pronounced deceleration in inference speed. Recognizing these limitations, we introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights. One of the keystones of our model is the Superpoint-Text Matching (STM) mechanism. Unlike traditional methods that navigate through instance proposals, STM directly correlates linguistic indications with their respective superpoints, clusters of semantically related points. This architectural decision empowers our model to efficiently harness cross-modal semantic relationships, primarily leveraging densely annotated superpoint-text pairs, as opposed to the more sparse instance-text pairs. In pursuit of enhancing the role of text in guiding the segmentation process, we further incorporate the Dependency-Driven Interaction (DDI) module to deepen the network's semantic comprehension of referring expressions. Using the dependency trees as a beacon, this module discerns the intricate relationships between primary terms and their associated descriptors in expressions, thereby elevating both the localization and segmentation capacities. Comprehensive experiments on the ScanRefer benchmark reveal that our model not only sets new performance standards, registering an mIoU gain of 11.7 points but also achieves a staggering enhancement in inference speed, surpassing traditional methods by 95.7 times. The code and models are available at https://github.com/sosppxo/3D-STMN. \ No newline at end of file diff --git a/data/2024/aaai/A Brain-Inspired Way of Reducing the Network Complexity via Concept-Regularized Coding for Emotion Recognition b/data/2024/aaai/A Brain-Inspired Way of Reducing the Network Complexity via Concept-Regularized Coding for Emotion Recognition new file mode 100644 index 0000000000..841d1f34e2 --- /dev/null +++ b/data/2024/aaai/A Brain-Inspired Way of Reducing the Network Complexity via Concept-Regularized Coding for Emotion Recognition @@ -0,0 +1 @@ +The human brain can effortlessly and reliably perceive emotions, whereas existing facial emotion recognition (FER) methods suffer from drawbacks such as complex model structures, high storage requirements, and poor interpretability. 
Inspired by the role of emotion concepts in visual perception coding within the human brain, we propose a dual-pathway framework emulating the neural computation of emotion recognition. Specifically, these two pathways are designed to model the representation of emotion concepts in the brain and the visual perception process, respectively. For the former, we adopt a disentangled approach to extract emotion concepts from complex facial geometric attributes; for the latter, we employ an emotional confidence evaluation strategy to determine which concept is optimal for regularizing the perceptual coding. The proposed concept-regularized coding strategy endows the framework with flexibility and interpretability as well as good performance on several benchmark FER datasets. \ No newline at end of file diff --git a/data/2024/aaai/A Bregman Proximal Stochastic Gradient Method with Extrapolation for Nonconvex Nonsmooth Problems b/data/2024/aaai/A Bregman Proximal Stochastic Gradient Method with Extrapolation for Nonconvex Nonsmooth Problems new file mode 100644 index 0000000000..4ed4d1a961 --- /dev/null +++ b/data/2024/aaai/A Bregman Proximal Stochastic Gradient Method with Extrapolation for Nonconvex Nonsmooth Problems @@ -0,0 +1 @@ +In this paper, we explore a specific optimization problem that involves the combination of a differentiable nonconvex function and a nondifferentiable function. The differentiable component lacks a global Lipschitz continuous gradient, posing challenges for optimization. To address this issue and accelerate the convergence, we propose a Bregman proximal stochastic gradient method with extrapolation (BPSGE), which only requires smooth adaptivity of the differentiable part. Under the variance reduction framework, we not only analyze the subsequential and global convergence of the proposed algorithm under certain conditions, but also analyze the sublinear convergence rate of the subsequence, and the complexity of the algorithm, revealing that the BPSGE algorithm requires at most O(epsilon^(-2)) iterations in expectation to attain an epsilon-stationary point. To validate the effectiveness of our proposed algorithm, we conduct numerical experiments on three real-world applications: graph regularized nonnegative matrix factorization (NMF), matrix factorization with weakly-convex regularization, and NMF with nonconvex sparsity constraints. These experiments demonstrate that BPSGE is faster than the baselines without extrapolation. The code is available at: https://github.com/nothing2wang/BPSGE-Algorithm. \ No newline at end of file diff --git a/data/2024/aaai/A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science b/data/2024/aaai/A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science new file mode 100644 index 0000000000..d1b2ddf904 --- /dev/null +++ b/data/2024/aaai/A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science @@ -0,0 +1 @@ +This paper explores the use of large language models (LLMs) to score and explain short-answer assessments in K-12 science. While existing methods can score more structured math and computer science assessments, they often do not provide explanations for the scores. Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science, combining few-shot and active learning with chain-of-thought reasoning.
Using a human-in-the-loop approach, we successfully score and provide meaningful explanations for formative assessment responses. A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading for open-ended science assessments. \ No newline at end of file diff --git a/data/2024/aaai/A Class of Topological Pseudodistances for Fast Comparison of Persistence Diagrams b/data/2024/aaai/A Class of Topological Pseudodistances for Fast Comparison of Persistence Diagrams new file mode 100644 index 0000000000..3ff31a8352 --- /dev/null +++ b/data/2024/aaai/A Class of Topological Pseudodistances for Fast Comparison of Persistence Diagrams @@ -0,0 +1 @@ +Persistence diagrams (PDs) play a central role in topological data analysis, and are used in an ever-increasing variety of applications. The comparison of PD data requires computing distances among large sets of PDs, with metrics which are accurate, theoretically sound, and fast to compute. Especially for denser multi-dimensional PDs, such comparison metrics are lacking. While on the one hand, Wasserstein-type distances have high accuracy and theoretical guarantees, they incur high computational cost. On the other hand, distances between vectorizations such as Persistence Statistics (PSs) have lower computational cost, but lack the accuracy guarantees and theoretical properties of a true distance over PD space. In this work, we introduce a class of pseudodistances called Extended Topological Pseudodistances (ETDs), which have tunable complexity, and can approximate Sliced and classical Wasserstein distances at the high-complexity extreme, while being computationally lighter and close to Persistence Statistics at the lower complexity extreme, and thus allow users to interpolate between the two metrics. We build theoretical comparisons to show how to fit our new distances at an intermediate level between persistence vectorizations and Wasserstein distances. We also experimentally verify that ETDs outperform PSs in terms of accuracy and outperform Wasserstein and Sliced Wasserstein distances in terms of computational complexity. \ No newline at end of file diff --git a/data/2024/aaai/A Closer Look at Curriculum Adversarial Training: From an Online Perspective b/data/2024/aaai/A Closer Look at Curriculum Adversarial Training: From an Online Perspective new file mode 100644 index 0000000000..02a1d53d24 --- /dev/null +++ b/data/2024/aaai/A Closer Look at Curriculum Adversarial Training: From an Online Perspective @@ -0,0 +1 @@ +Curriculum adversarial training empirically finds that gradually increasing the hardness of adversarial examples can further improve the adversarial robustness of the trained model compared to conventional adversarial training. However, theoretical understanding of this strategy remains limited. In an attempt to bridge this gap, we analyze the adversarial training process from an online perspective. Specifically, we treat adversarial examples in different iterations as samples from different adversarial distributions. We then introduce the time series prediction framework and deduce novel generalization error bounds. Our theoretical results not only demonstrate the effectiveness of the conventional adversarial training algorithm but also explain why curriculum adversarial training methods can further improve adversarial generalization. We conduct comprehensive experiments to support our theory.
\ No newline at end of file diff --git a/data/2024/aaai/A Compiler for Weak Decomposable Negation Normal Form b/data/2024/aaai/A Compiler for Weak Decomposable Negation Normal Form new file mode 100644 index 0000000000..4550343659 --- /dev/null +++ b/data/2024/aaai/A Compiler for Weak Decomposable Negation Normal Form @@ -0,0 +1 @@ +This paper integrates weak decomposable negation normal form (wDNNF) circuits, introduced by Akshay et al. in 2018, into the knowledge compilation map. This circuit type generalises decomposable negation normal form (DNNF) circuits in such a way that they allow a restricted form of sharing variables among the inputs of a conjunction node. We show that wDNNF circuits have the same properties as DNNF circuits regarding the queries and transformations presented in the knowledge compilation map, whilst being strictly more succinct than DNNF circuits (that is, they can represent Boolean functions compactly). We also present and evaluate a knowledge compiler, called Bella, for converting CNF formulae into wDNNF circuits. Our experiments demonstrate that wDNNF circuits are suitable for configuration instances. \ No newline at end of file diff --git a/data/2024/aaai/A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators b/data/2024/aaai/A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators new file mode 100644 index 0000000000..82a62748b3 --- /dev/null +++ b/data/2024/aaai/A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators @@ -0,0 +1 @@ +Automatic evaluation is an integral aspect of dialogue system research. The traditional reference-based NLG metrics are generally found to be unsuitable for dialogue assessment. Consequently, recent studies have suggested various unique, reference-free neural metrics that better align with human evaluations. Notably among them, large language models (LLMs), particularly the instruction-tuned variants like ChatGPT, are shown to be promising substitutes for human judges. Yet, existing works on utilizing LLMs for automatic dialogue evaluation are limited in their scope in terms of the number of meta-evaluation datasets, mode of evaluation, coverage of LLMs, etc. Hence, it remains inconclusive how effective these LLMs are. To this end, we conduct a comprehensive study on the application of LLMs for automatic dialogue evaluation. Specifically, we analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels, using a comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels. Finally, we explore how model-level and dimension-level ensembles impact the evaluation performance. All resources are available at https://github.com/e0397123/comp-analysis. \ No newline at end of file diff --git a/data/2024/aaai/A Comprehensive Augmentation Framework for Anomaly Detection b/data/2024/aaai/A Comprehensive Augmentation Framework for Anomaly Detection new file mode 100644 index 0000000000..a6678645bb --- /dev/null +++ b/data/2024/aaai/A Comprehensive Augmentation Framework for Anomaly Detection @@ -0,0 +1,2 @@ +Data augmentation methods are commonly integrated into the training of anomaly detection models. 
+Previous approaches have primarily focused on replicating real-world anomalies or enhancing diversity, without considering that the standard of anomaly varies across different classes, potentially leading to a biased training distribution. This paper analyzes crucial traits of simulated anomalies that contribute to the training of reconstructive networks and condenses them into several methods, thus creating a comprehensive framework by selectively utilizing appropriate combinations. Furthermore, we integrate this framework with a reconstruction-based approach and concurrently propose a split training strategy that alleviates the overfitting issue while avoiding introducing interference to the reconstruction process. The evaluations conducted on the MVTec anomaly detection dataset demonstrate that our method outperforms the previous state-of-the-art approach, particularly in terms of object classes. We also generate a simulated dataset comprising anomalies with diverse characteristics, and experimental results demonstrate that our approach exhibits promising potential for generalizing effectively to various unseen anomalies encountered in real-world scenarios. \ No newline at end of file diff --git a/data/2024/aaai/A Computation-Aware Shape Loss Function for Point Cloud Completion b/data/2024/aaai/A Computation-Aware Shape Loss Function for Point Cloud Completion new file mode 100644 index 0000000000..535f1c2671 --- /dev/null +++ b/data/2024/aaai/A Computation-Aware Shape Loss Function for Point Cloud Completion @@ -0,0 +1,3 @@ +Learning-based point cloud completion tasks have shown potential in various critical tasks, such as object detection, assignment, and registration. However, accurately and efficiently quantifying the shape error between the predicted point clouds generated by networks and the ground truth remains challenging. While EMD-based loss functions excel in shape detail and perceived density distribution, their approach can only yield results with significant discrepancies from the actual EMD within a tolerable training time. +To address these challenges, we first propose the initial price based on the auction algorithm, reducing the number of iterations required for the algorithm while ensuring the correctness of the assignment results. We then introduce an algorithm to compute the initial price through a successive shortest path and the Euclidean information between its nodes. Finally, we adopt a series of optimization strategies to speed up the algorithm and offer an EMD approximation scheme for point cloud problems that balances time loss and computational accuracy based on point cloud data characteristics. +Our experimental results confirm that our algorithm achieves the smallest gap with the real EMD within an acceptable time range and yields the best results in end-to-end training. \ No newline at end of file diff --git a/data/2024/aaai/A Convolutional Neural Network Interpretable Framework for Human Ventral Visual Pathway Representation b/data/2024/aaai/A Convolutional Neural Network Interpretable Framework for Human Ventral Visual Pathway Representation new file mode 100644 index 0000000000..36e96a75ef --- /dev/null +++ b/data/2024/aaai/A Convolutional Neural Network Interpretable Framework for Human Ventral Visual Pathway Representation @@ -0,0 +1 @@ +Recently, convolutional neural networks (CNNs) have become the best quantitative encoding models for capturing neural activity and hierarchical structure in the ventral visual pathway. 
However, the weak interpretability of these black-box models hinders their ability to reveal visual representational encoding mechanisms. Here, we propose a convolutional neural network interpretable framework (CNN-IF) aimed at providing a transparent interpretable encoding model for the ventral visual pathway. First, we adapt the feature-weighted receptive field framework to train two high-performing ventral visual pathway encoding models using large-scale functional Magnetic Resonance Imaging (fMRI) in both goal-driven and data-driven approaches. We find that network layer-wise predictions align with the functional hierarchy of the ventral visual pathway. Then, we correspond feature units to voxel units in the brain and successfully quantify the alignment between voxel responses and visual concepts. Finally, we conduct Network Dissection along the ventral visual pathway including the fusiform face area (FFA), and discover variations related to the visual concept of 'person'. Our results demonstrate that the CNN-IF provides a new perspective for understanding encoding mechanisms in the human ventral visual pathway, and the combination of ante-hoc interpretable structure and post-hoc interpretable approaches can achieve fine-grained voxel-wise correspondence between model and brain. The source code is available at: https://github.com/BIT-YangLab/CNN-IF. \ No newline at end of file diff --git a/data/2024/aaai/A Cross-View Hierarchical Graph Learning Hypernetwork for Skill Demand-Supply Joint Prediction b/data/2024/aaai/A Cross-View Hierarchical Graph Learning Hypernetwork for Skill Demand-Supply Joint Prediction new file mode 100644 index 0000000000..a177ed3814 --- /dev/null +++ b/data/2024/aaai/A Cross-View Hierarchical Graph Learning Hypernetwork for Skill Demand-Supply Joint Prediction @@ -0,0 +1,2 @@ +The rapidly changing landscape of technology and industries leads to dynamic skill requirements, making it crucial for employees and employers to anticipate such shifts to maintain a competitive edge in the labor market. Existing efforts in this area either rely on domain-expert knowledge or regard the skill evolution as a simplified time series forecasting problem. However, both approaches overlook the sophisticated relationships among different skills and the inner-connection between skill demand and supply variations.
\ No newline at end of file diff --git a/data/2024/aaai/A Diffusion Model with State Estimation for Degradation-Blind Inverse Imaging b/data/2024/aaai/A Diffusion Model with State Estimation for Degradation-Blind Inverse Imaging new file mode 100644 index 0000000000..ba9f3064bb --- /dev/null +++ b/data/2024/aaai/A Diffusion Model with State Estimation for Degradation-Blind Inverse Imaging @@ -0,0 +1 @@ +Solving the task of inverse imaging problems can restore unknown clean images from input measurements that have incomplete information. Utilizing powerful generative models, such as denoising diffusion models, could better tackle the ill-posed issues of inverse problems with the distribution prior of the unknown clean images. We propose a learnable state-estimator-based diffusion model to incorporate the measurements into the reconstruction process. Our method makes efficient use of the pre-trained diffusion models with computational feasibility compared to the conditional diffusion models, which need to be trained from scratch. In addition, our pipeline does not require explicit knowledge of the image degradation operator or make the assumption of its form, unlike many other works that use the pre-trained diffusion models at the test time. The experiments on three typical inverse imaging problems (both linear and non-linear), inpainting, deblurring, and JPEG compression restoration, have comparable results with the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/A Diffusion-Based Framework for Multi-Class Anomaly Detection b/data/2024/aaai/A Diffusion-Based Framework for Multi-Class Anomaly Detection new file mode 100644 index 0000000000..94e4e689ae --- /dev/null +++ b/data/2024/aaai/A Diffusion-Based Framework for Multi-Class Anomaly Detection @@ -0,0 +1 @@ +Reconstruction-based approaches have achieved remarkable outcomes in anomaly detection. The exceptional image reconstruction capabilities of recently popular diffusion models have sparked research efforts to utilize them for enhanced reconstruction of anomalous images. Nonetheless, these methods might face challenges related to the preservation of image categories and pixel-wise structural integrity in the more practical multi-class setting. To solve the above problems, we propose a Difusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection, which consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion’s denoising network, and a feature-space pre-trained feature extractor. Firstly, The SG network is proposed for reconstructing anomalous regions while preserving the original image’s semantic information. Secondly, we introduce Spatial-aware Feature Fusion (SFF) block to maximize reconstruction accuracy when dealing with extensively reconstructed areas. Thirdly, the input and reconstructed images are processed by a pre-trained feature extractor to generate anomaly maps based on features extracted at different scales. Experiments on MVTec-AD and VisA datasets demonstrate the effectiveness of our approach which surpasses the state-of-the-art methods, e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and detection respectively on multi-class MVTec-AD dataset. Code will be available at https://lewandofskee.github.io/projects/diad. 
\ No newline at end of file diff --git a/data/2024/aaai/A Diffusion-Based Pre-training Framework for Crystal Property Prediction b/data/2024/aaai/A Diffusion-Based Pre-training Framework for Crystal Property Prediction new file mode 100644 index 0000000000..549c5309a2 --- /dev/null +++ b/data/2024/aaai/A Diffusion-Based Pre-training Framework for Crystal Property Prediction @@ -0,0 +1 @@ +Many significant problems involving crystal property prediction from 3D structures have limited labeled data due to expensive and time-consuming physical simulations or lab experiments. To overcome this challenge, we propose a pretrain-finetune framework named CrysDiff for the crystal property prediction task, based on diffusion models. In the pre-training phase, CrysDiff learns the latent marginal distribution of crystal structures via the reconstruction task. Subsequently, CrysDiff can be fine-tuned under the guidance of the new sparse labeled data, fitting the conditional distribution of the target property given the crystal structures. To better model the crystal geometry, CrysDiff notably captures the full symmetry properties of the crystals, including the invariance of reflection, rotation, and periodic translation. Extensive experiments demonstrate that CrysDiff can significantly improve the performance of the downstream crystal property prediction task on multiple target properties, outperforming all the SOTA pre-training models for crystals by good margins on the popular JARVIS-DFT dataset. \ No newline at end of file diff --git a/data/2024/aaai/A Dual Stealthy Backdoor: From Both Spatial and Frequency Perspectives b/data/2024/aaai/A Dual Stealthy Backdoor: From Both Spatial and Frequency Perspectives new file mode 100644 index 0000000000..f377013899 --- /dev/null +++ b/data/2024/aaai/A Dual Stealthy Backdoor: From Both Spatial and Frequency Perspectives @@ -0,0 +1 @@ +Backdoor attacks pose serious security threats to deep neural networks (DNNs). Backdoored models make arbitrarily (targeted) incorrect predictions on inputs containing well-designed triggers, while behaving normally on clean inputs. Prior studies have explored the invisibility of backdoor triggers to enhance attack stealthiness. However, most of them only focus on the invisibility in the spatial domain, neglecting the generation of invisible triggers in the frequency domain. This limitation renders the generated poisoned images easily detectable by recent defense methods. To address this issue, we propose a DUal stealthy BAckdoor attack method named DUBA, which simultaneously considers the invisibility of triggers in both the spatial and frequency domains, to achieve desirable attack performance, while ensuring strong stealthiness. Specifically, we first use Wavelet Transform to embed the high-frequency information of the trigger image into the clean image to ensure attack effectiveness. Then, to attain strong stealthiness, we incorporate Fourier Transform and Cosine Transform to mix the poisoned image and clean image in the frequency domain. Moreover, DUBA adopts a novel attack strategy, training the model with weak triggers and attacking with strong triggers to further enhance attack performance and stealthiness. DUBA is evaluated extensively on four datasets against popular image classifiers, showing significant superiority over state-of-the-art backdoor attacks in attack success rate and stealthiness.
\ No newline at end of file diff --git a/data/2024/aaai/A Dynamic GCN with Cross-Representation Distillation for Event-Based Learning b/data/2024/aaai/A Dynamic GCN with Cross-Representation Distillation for Event-Based Learning new file mode 100644 index 0000000000..6729953e2d --- /dev/null +++ b/data/2024/aaai/A Dynamic GCN with Cross-Representation Distillation for Event-Based Learning @@ -0,0 +1 @@ +Recent advances in event-based research prioritize sparsity and temporal precision. Approaches that learn sparse point-based representations through graph CNNs (GCNs) have become more popular. Yet, these graph techniques achieve lower performance than their frame-based counterparts due to two issues: (i) Biased graph structures that do not properly incorporate varied attributes (such as semantics, and spatial and temporal signals) for each vertex, resulting in inaccurate graph representations. (ii) A shortage of robust pretrained models. Here we solve the first problem by proposing a new event-based GCN (EDGCN), with a dynamic aggregation module to integrate all attributes of vertices adaptively. To address the second problem, we introduce a novel learning framework called cross-representation distillation (CRD), which leverages the dense representation of events as a cross-representation auxiliary to provide additional supervision and prior knowledge for the event graph. This frame-to-graph distillation allows us to benefit from the large-scale priors provided by CNNs while still retaining the advantages of graph-based models. Extensive experiments show that our model and learning framework are effective and generalize well across multiple vision tasks. \ No newline at end of file diff --git a/data/2024/aaai/A Dynamic Learning Method towards Realistic Compositional Zero-Shot Learning b/data/2024/aaai/A Dynamic Learning Method towards Realistic Compositional Zero-Shot Learning new file mode 100644 index 0000000000..a391025c76 --- /dev/null +++ b/data/2024/aaai/A Dynamic Learning Method towards Realistic Compositional Zero-Shot Learning @@ -0,0 +1 @@ +To tackle the challenge of recognizing images of unseen attribute-object compositions, Compositional Zero-Shot Learning (CZSL) methods have been previously proposed. However, test images in realistic scenarios may also incorporate other forms of unknown factors, such as novel semantic concepts or novel image styles. As previous CZSL works have overlooked this critical issue, in this research, we first propose the Realistic Compositional Zero-Shot Learning (RCZSL) task, which considers the various types of unknown factors in a unified experimental setting. To achieve this, we first conduct re-labelling on MIT-States and use the pre-trained generative models to obtain images of various domains. Then the entire dataset is split into a training set and a test set, with the latter containing images of unseen concepts, unseen compositions, unseen domains as well as their combinations. Following this, we show that the visual-semantic relationship changes on unseen images, leading us to construct two dynamic modulators to adapt the visual features and composition prototypes in accordance with the input image. We believe that such a dynamic learning method could effectively alleviate the domain shift problem caused by various types of unknown factors. We conduct extensive experiments on benchmark datasets for both the conventional CZSL setting and the proposed RCZSL setting.
Empirical results prove the effectiveness of our method, which significantly outperforms both our baseline method and state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/A Fast Exact Solver with Theoretical Analysis for the Maximum Edge-Weighted Clique Problem b/data/2024/aaai/A Fast Exact Solver with Theoretical Analysis for the Maximum Edge-Weighted Clique Problem new file mode 100644 index 0000000000..fc7cdcc55c --- /dev/null +++ b/data/2024/aaai/A Fast Exact Solver with Theoretical Analysis for the Maximum Edge-Weighted Clique Problem @@ -0,0 +1,5 @@ +The maximum vertex-weighted clique problem (MVWCP) and the maximum edge-weighted clique problem (MEWCP) are two natural extensions of the fundamental maximum clique problem. +In this paper, we systematically study MEWCP and make the following major contributions: +(1) We show that MEWCP is NP-hard even when the minimum degree of the graph is n-2, in contrast to MVWCP which is polynomial-time solvable when the minimum degree of the graph is at least n-3. This result distinguishes the complexity of the two problems for the first time. +(2) To address MEWCP, we develop an efficient branch-and-bound algorithm called MEWCat with both practical and theoretical performance guarantees. In practice, MEWCat utilizes a new upper bound tighter than existing ones, which allows for more efficient pruning of branches. In theory, we prove a running-time bound of O*(1.4423^n) for MEWCat, which breaks the trivial bound of O*(2^n) in the research line of practical exact MEWCP solvers for the first time. +(3) Empirically, we evaluate the performance of MEWCat on various benchmark instances. The experiments demonstrate that MEWCat outperforms state-of-the-art exact solvers significantly. For instance, on 16 DIMACS graphs that the state-of-the-art solver BBEWC fails to solve within 7200 seconds, MEWCat solves all of them with an average time of less than 1000 seconds. On real-world graphs, MEWCat achieves an average speedup of over 36x. \ No newline at end of file diff --git a/data/2024/aaai/A Fixed-Parameter Tractable Algorithm for Counting Markov Equivalence Classes with the Same Skeleton b/data/2024/aaai/A Fixed-Parameter Tractable Algorithm for Counting Markov Equivalence Classes with the Same Skeleton new file mode 100644 index 0000000000..af582724cd --- /dev/null +++ b/data/2024/aaai/A Fixed-Parameter Tractable Algorithm for Counting Markov Equivalence Classes with the Same Skeleton @@ -0,0 +1,11 @@ +Causal DAGs (also known as Bayesian networks) are a popular tool for encoding +conditional dependencies between random variables. In a causal DAG, the random +variables are modeled as vertices in the DAG, and it is stipulated that every +random variable is independent of its non-descendants conditioned on its parents. It +is possible, however, for two different causal DAGs on the same set of random +variables to encode exactly the same set of conditional dependencies. Such +causal DAGs are said to be Markov equivalent, and equivalence classes of +Markov equivalent DAGs are known as Markov Equivalent Classes (MECs).
+Beautiful combinatorial characterizations of MECs have been developed in the +past few decades, and it is known, in particular, that all DAGs in the same MEC +must have the same skeleton (underlying undirected graph) and v-structures (induced subgraph of the form a->b \ No newline at end of file diff --git a/data/2024/aaai/A Fixed-Point Approach to Unified Prompt-Based Counting b/data/2024/aaai/A Fixed-Point Approach to Unified Prompt-Based Counting new file mode 100644 index 0000000000..de12da1d4c --- /dev/null +++ b/data/2024/aaai/A Fixed-Point Approach to Unified Prompt-Based Counting @@ -0,0 +1 @@ +Existing class-agnostic counting models typically rely on a single type of prompt, e.g., box annotations. This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for concerned objects indicated by various prompt types, such as box, point, and text. To achieve this goal, we begin by converting prompts from different modalities into prompt masks without requiring training. These masks are then integrated into a class-agnostic counting methodology for predicting density maps. Furthermore, we introduce a fixed-point inference along with an associated loss function to improve counting accuracy, all without introducing new parameters. The effectiveness of this method is substantiated both theoretically and experimentally. Additionally, a contrastive training scheme is implemented to mitigate dataset bias inherent in current class-agnostic counting datasets, a strategy whose effectiveness is confirmed by our ablation study. Our model excels in prominent class-agnostic datasets and exhibits superior performance in cross-dataset adaptation tasks. \ No newline at end of file diff --git a/data/2024/aaai/A Framework for Approaching AI Education in Educator Preparation Programs b/data/2024/aaai/A Framework for Approaching AI Education in Educator Preparation Programs new file mode 100644 index 0000000000..185254a84a --- /dev/null +++ b/data/2024/aaai/A Framework for Approaching AI Education in Educator Preparation Programs @@ -0,0 +1 @@ +In recent years, the rapid advancement of artificial intelligence (AI) has fostered an urgent need to better prepare current and future educators to be able to integrate AI technologies in their teaching and to teach AI literacy to PreK-12 students. While many organizations have developed professional learning opportunities for inservice educators, a gap remains for resources specifically designed for those facilitating and enrolled in Educator Preparation Programs (EPPs). In response to this gap, the International Society for Technology in Education (ISTE) launched its first AI Explorations for EPPs Faculty Fellowship. As a result of the Faculty Fellows’ collaboration, this paper articulates a framework of seven critical strategies with the potential to address the urgent need EPPs have in preparing preservice teachers to effectively integrate AI-powered instructional tools and to teach this new area of content knowledge in PreK-12 classrooms. In addition, we provide a review of literature and an overview of the emerging needs for integrating AI education in EPPs. We demonstrate why support for preservice teachers’ critical examination and application of AI, including a focus on the issues of equity, ethics, and culturally responsive teaching, is essential to their later success in PreK-12 classrooms. 
Recommendations for further research and learning are also provided to promote community-wide initiatives for supporting the integration of AI in education through Educator Preparation Programs and beyond. \ No newline at end of file diff --git a/data/2024/aaai/A Framework for Data-Driven Explainability in Mathematical Optimization b/data/2024/aaai/A Framework for Data-Driven Explainability in Mathematical Optimization new file mode 100644 index 0000000000..fe7306d0ab --- /dev/null +++ b/data/2024/aaai/A Framework for Data-Driven Explainability in Mathematical Optimization @@ -0,0 +1 @@ +Advancements in mathematical programming have made it possible to efficiently tackle large-scale real-world problems that were deemed intractable just a few decades ago. However, provably optimal solutions may not be accepted due to the perception of optimization software as a black box. Although well understood by scientists, this lacks easy accessibility for practitioners. Hence, we advocate for introducing the explainability of a solution as another evaluation criterion, next to its objective value, which enables us to find trade-off solutions between these two criteria. Explainability is attained by comparing against (not necessarily optimal) solutions that were implemented in similar situations in the past. Thus, solutions are preferred that exhibit similar features. Although we prove that already in simple cases the explainable model is NP-hard, we characterize relevant polynomially solvable cases such as the explainable shortest path problem. Our numerical experiments on both artificial as well as real-world road networks show the resulting Pareto front. It turns out that the cost of enforcing explainability can be very small. \ No newline at end of file diff --git a/data/2024/aaai/A Framework for Mining Speech-to-Text Transcripts of the Customer for Automated Problem Remediation b/data/2024/aaai/A Framework for Mining Speech-to-Text Transcripts of the Customer for Automated Problem Remediation new file mode 100644 index 0000000000..9aacd43540 --- /dev/null +++ b/data/2024/aaai/A Framework for Mining Speech-to-Text Transcripts of the Customer for Automated Problem Remediation @@ -0,0 +1 @@ +Technical support services get several thousand voice calls every year. These calls vary across a range of technical issues or maintenance requests for a suite of hardware and software products. On receiving the call, a support agent creates a service request artifact that contains her interpretation of the customer’s problem. This service request goes through the life cycle of the problem remediation process with the resolution also being recorded as part of the service request. It has been empirically observed that the actual complaint voiced by the customer is often different from the recorded interpretation in the service request. The service request created by support agents runs the risk of missing key information elements present in the customer voice records. In this paper, we build a framework that taps into voice calls and uses unsupervised and supervised learning methods to enrich the service requests with additional information. The enriched data is then used for automated problem resolution.
\ No newline at end of file diff --git a/data/2024/aaai/A General Implicit Framework for Fast NeRF Composition and Rendering b/data/2024/aaai/A General Implicit Framework for Fast NeRF Composition and Rendering new file mode 100644 index 0000000000..d46e6b7844 --- /dev/null +++ b/data/2024/aaai/A General Implicit Framework for Fast NeRF Composition and Rendering @@ -0,0 +1 @@ +A variety of Neural Radiance Fields (NeRF) methods have recently achieved remarkable success in high render speed. However, current accelerating methods are specialized and incompatible with various implicit methods, preventing real-time composition over various types of NeRF works. Because NeRF relies on sampling along rays, it is possible to provide general guidance for acceleration. To that end, we propose a general implicit pipeline for composing NeRF objects quickly. Our method enables the casting of dynamic shadows within or between objects using analytical light sources while allowing multiple NeRF objects to be seamlessly placed and rendered together with any arbitrary rigid transformations. At its core, our work introduces a new surface representation known as Neural Depth Fields (NeDF) that quickly determines the spatial relationship between objects by allowing direct intersection computation between rays and implicit surfaces. It leverages an intersection neural network to query NeRF for acceleration instead of depending on an explicit spatial structure. Our proposed method is the first to enable both the progressive and interactive composition of NeRF objects. Additionally, it serves as a previewing plugin for a range of existing NeRF works. \ No newline at end of file diff --git a/data/2024/aaai/A General Model for Aggregating Annotations AcrossSimple, Complex, and Multi-object Annotation Tasks (Abstract Reprint) b/data/2024/aaai/A General Model for Aggregating Annotations AcrossSimple, Complex, and Multi-object Annotation Tasks (Abstract Reprint) new file mode 100644 index 0000000000..01f50d0c00 --- /dev/null +++ b/data/2024/aaai/A General Model for Aggregating Annotations AcrossSimple, Complex, and Multi-object Annotation Tasks (Abstract Reprint) @@ -0,0 +1,5 @@ +Human annotations are vital to supervised learning, yet annotators often disagree on the correct label, especially as annotation tasks increase in complexity. A common strategy to improve label quality is to ask multiple annotators to label the same item and then aggregate their labels. To date, many aggregation models have been proposed for simple categorical or numerical annotation tasks, but far less work has considered more complex annotation tasks, such as those involving open-ended, multivariate, or structured responses. Similarly, while a variety of bespoke models have been proposed for specific tasks, our work is the first we are aware of to introduce aggregation methods that generalize across many, diverse complex tasks, including sequence labeling, translation, syntactic parsing, ranking, bounding boxes, and keypoints. This generality is achieved by applying readily available task-specific distance functions, then devising a task-agnostic method to model these distances between labels, rather than the labels themselves. + +This article presents a unified treatment of our prior work on complex annotation modeling and extends that work with an investigation of three new research questions. First, how do complex annotation task and dataset properties impact aggregation accuracy?
Second, how should a task owner navigate the many modeling choices in order to maximize aggregation accuracy? Finally, what tests and diagnoses can verify that aggregation models are specified correctly for the given data? To understand how various factors impact accuracy and to inform model selection, we conduct large-scale simulation studies and broad experiments on real, complex datasets. Regarding testing, we introduce the concept of unit tests for aggregation models and present a suite of such tests to ensure that a given model is not mis-specified and exhibits expected behavior. + +Beyond investigating these research questions above, we discuss the foundational concept and nature of annotation complexity, present a new aggregation model as a conceptual bridge between traditional models and our own, and contribute a new general semisupervised learning method for complex label aggregation that outperforms prior work. \ No newline at end of file diff --git a/data/2024/aaai/A General Search-Based Framework for Generating Textual Counterfactual Explanations b/data/2024/aaai/A General Search-Based Framework for Generating Textual Counterfactual Explanations new file mode 100644 index 0000000000..55579d7c39 --- /dev/null +++ b/data/2024/aaai/A General Search-Based Framework for Generating Textual Counterfactual Explanations @@ -0,0 +1,5 @@ +One of the prominent methods for explaining the decision of a machine-learning classifier is by a counterfactual example. +Most current algorithms for generating such examples in the textual domain are based on generative language models. Generative models, however, are trained to minimize a specific loss function in order to fulfill certain requirements for the generated texts. Any change in the requirements may necessitate costly retraining, thus potentially limiting their applicability. +In this paper, we present a general search-based framework for generating counterfactual explanations in the textual domain. +Our framework is model-agnostic, domain-agnostic, anytime, and does not require retraining in order to adapt to changes in the user requirements. +We model the task as a search problem in a space where the initial state is the classified text, and the goal state is a text in a given target class. Our framework includes domain-independent modification operators, but can also exploit domain-specific knowledge through specialized operators. The search algorithm attempts to find a text from the target class with minimal user-specified distance from the original classified object. \ No newline at end of file diff --git a/data/2024/aaai/A General Theoretical Framework for Learning Smallest Interpretable Models b/data/2024/aaai/A General Theoretical Framework for Learning Smallest Interpretable Models new file mode 100644 index 0000000000..c82f42d2e8 --- /dev/null +++ b/data/2024/aaai/A General Theoretical Framework for Learning Smallest Interpretable Models @@ -0,0 +1 @@ +We develop a general algorithmic framework that allows us to obtain fixed-parameter tractability for computing smallest symbolic models that represent given data. Our framework applies to all ML model types that admit a certain extension property. By showing this extension property for decision trees, decision sets, decision lists, and binary decision diagrams, we obtain that minimizing these fundamental model types is fixed-parameter tractable. Our framework even applies to ensembles, which combine individual models by majority decision. 
\ No newline at end of file diff --git a/data/2024/aaai/A Generalizable Theory-Driven Agent-Based Framework to Study Conflict-Induced Forced Migration b/data/2024/aaai/A Generalizable Theory-Driven Agent-Based Framework to Study Conflict-Induced Forced Migration new file mode 100644 index 0000000000..aef6995d86 --- /dev/null +++ b/data/2024/aaai/A Generalizable Theory-Driven Agent-Based Framework to Study Conflict-Induced Forced Migration @@ -0,0 +1 @@ +Large-scale population displacements arising from conflict-induced forced migration generate uncertainty and introduce several policy challenges. Addressing these concerns requires an interdisciplinary approach that integrates knowledge from both computational modeling and social sciences. We propose a generalized computational agent-based modeling framework grounded by Theory of Planned Behavior to model conflict-induced migration outflows within Ukraine during the start of that conflict in 2022. Existing migration modeling frameworks that attempt to address policy implications primarily focus on destination while leaving absent a generalized computational framework grounded by social theory focused on the conflict-induced region. We propose an agent-based framework utilizing a spatiotemporal gravity model and a Bi-threshold model over a Graph Dynamical System to update the migration status of agents in conflict-induced regions at fine temporal and spatial granularity. This approach significantly outperforms previous work when examining the case of the Russian invasion of Ukraine. Policy implications of the proposed framework are demonstrated by modeling the migration behavior of Ukrainian civilians attempting to flee from regions encircled by Russian forces. We also showcase the generalizability of the model by simulating a past conflict in Burundi, an alternative conflict setting. Results demonstrate the utility of the framework for assessing conflict-induced migration in varied settings as well as identifying vulnerable civilian populations. \ No newline at end of file diff --git a/data/2024/aaai/A Generalized Neural Diffusion Framework on Graphs b/data/2024/aaai/A Generalized Neural Diffusion Framework on Graphs new file mode 100644 index 0000000000..c81ce2150d --- /dev/null +++ b/data/2024/aaai/A Generalized Neural Diffusion Framework on Graphs @@ -0,0 +1 @@ +Recent studies reveal the connection between GNNs and the diffusion process, which has motivated many diffusion-based GNNs to be proposed. However, since these two mechanisms are closely related, one fundamental question naturally arises: Is there a general diffusion framework that can formally unify these GNNs? The answer to this question can not only deepen our understanding of the learning process of GNNs, but may also open a new door to design a broad new class of GNNs. In this paper, we propose a general diffusion equation framework with the fidelity term, which formally establishes the relationship between the diffusion process and more GNNs. Meanwhile, with this framework, we identify one characteristic of graph diffusion networks, i.e., the current neural diffusion process only corresponds to the first-order diffusion equation. However, by an experimental investigation, we show that the labels of high-order neighbors actually exhibit the monophily property, which induces the similarity based on labels among high-order neighbors without requiring the similarity among first-order neighbors.
This discovery motivates us to design a new high-order neighbor-aware diffusion equation and to derive a new type of graph diffusion network (HiD-Net) based on the framework. With the high-order diffusion equation, HiD-Net is more robust against attacks and works on both homophily and heterophily graphs. We not only theoretically analyze the relation between HiD-Net and the high-order random walk, but also provide a theoretical convergence guarantee. Extensive experimental results demonstrate the effectiveness of HiD-Net over state-of-the-art graph diffusion networks. \ No newline at end of file diff --git a/data/2024/aaai/A Generalized Shuffle Framework for Privacy Amplification: Strengthening Privacy Guarantees and Enhancing Utility b/data/2024/aaai/A Generalized Shuffle Framework for Privacy Amplification: Strengthening Privacy Guarantees and Enhancing Utility new file mode 100644 index 0000000000..3d472df1b9 --- /dev/null +++ b/data/2024/aaai/A Generalized Shuffle Framework for Privacy Amplification: Strengthening Privacy Guarantees and Enhancing Utility @@ -0,0 +1,12 @@ +The shuffle model of local differential privacy is an advanced method of privacy amplification designed to enhance privacy protection with high utility. +It achieves this by randomly shuffling sensitive data, making it more challenging to link individual data points to specific individuals. +However, most existing studies have focused on the shuffle model based on +(ε0,0)-Locally Differentially Private (LDP) randomizers, with limited consideration for complex scenarios such as (ε0,δ0)-LDP or personalized LDP (PLDP). +This hinders a comprehensive understanding of the shuffle model's potential and limits its application in various settings. +To bridge this research gap, we propose a generalized shuffle framework that can be applied to the PLDP setting. This generalization allows for a broader exploration of the privacy-utility trade-off and facilitates the design of privacy-preserving analyses in diverse contexts. +We prove that the shuffled PLDP process approximately preserves μ-Gaussian Differential Privacy with +μ = O(1/√n). +This approach allows us to avoid the limitations and potential inaccuracies associated with inequality estimations. +To strengthen the privacy guarantee, we improve the lower bound by utilizing hypothesis testing instead of relying on rough estimations like the Chernoff bound or Hoeffding's inequality. +Furthermore, extensive comparative evaluations clearly show that our approach outperforms existing methods in achieving strong central privacy guarantees while preserving the utility of the global model. +We have also carefully designed corresponding algorithms for the average function, frequency estimation, and stochastic gradient descent. \ No newline at end of file diff --git a/data/2024/aaai/A Goal Interaction Graph Planning Framework for Conversational Recommendation b/data/2024/aaai/A Goal Interaction Graph Planning Framework for Conversational Recommendation new file mode 100644 index 0000000000..53d9783797 --- /dev/null +++ b/data/2024/aaai/A Goal Interaction Graph Planning Framework for Conversational Recommendation @@ -0,0 +1 @@ +Multi-goal conversational recommender systems (MG-CRS), which are more in line with realistic scenarios, have attracted a lot of attention. MG-CRS can dynamically capture the demands of users in conversation, continuously engage their interests, and make recommendations. 
The key to accomplishing these tasks is to plan a reasonable goal sequence that can naturally guide the user to accept the recommended goal. Previous works have demonstrated that mining the correlations of goals from the goal sequences in the dialogue corpus is helpful for recommending the goal that the user is interested in. However, they independently model correlations for each level of goal (i.e., goal type or entity) and neglect the order in which goals appear in the dialogue. In this paper, we propose a goal interaction graph planning framework which constructs a directed heterogeneous graph to flexibly model the correlations between goals at any level and retain the order of goals. We design a goal interaction graph learning module to model the goal correlations and propagate goal representations via directed edges, then use an encoder and a dual-way fusion decoder to extract the information most relevant to the current goal from the conversation and domain knowledge, making the next-goal prediction fully exploit the prior goal correlations and user feedback. Finally, we generate engaging responses based on the predicted goal sequence to complete the recommendation task. Experiments on two benchmark datasets show that our method achieves significant improvements in both the goal planning and response generation tasks. \ No newline at end of file diff --git a/data/2024/aaai/A Graph Dynamics Prior for Relational Inference b/data/2024/aaai/A Graph Dynamics Prior for Relational Inference new file mode 100644 index 0000000000..f1ca122114 --- /dev/null +++ b/data/2024/aaai/A Graph Dynamics Prior for Relational Inference @@ -0,0 +1 @@ +Relational inference aims to identify interactions between parts of a dynamical system from the observed dynamics. Current state-of-the-art methods fit the dynamics with a graph neural network (GNN) on a learnable graph. They use one-step message-passing GNNs---intuitively the right choice since non-locality of multi-step or spectral GNNs may confuse direct and indirect interactions. But the effective interaction graph depends on the sampling rate and it is rarely localized to direct neighbors, leading to poor local optima for the one-step model. In this work, we propose a graph dynamics prior (GDP) for relational inference. GDP constructively uses error amplification in non-local polynomial filters to steer the solution to the ground-truth graph. To deal with non-uniqueness, GDP simultaneously fits a ``shallow'' one-step model and a polynomial multi-step model with shared graph topology. Experiments show that GDP reconstructs graphs far more accurately than earlier methods, with remarkable robustness to under-sampling. Since appropriate sampling rates for unknown dynamical systems are not known a priori, this robustness makes GDP suitable for real applications in scientific machine learning. Reproducible code is available at https://github.com/DaDaCheng/GDP. \ No newline at end of file diff --git a/data/2024/aaai/A Hierarchical Network for Multimodal Document-Level Relation Extraction b/data/2024/aaai/A Hierarchical Network for Multimodal Document-Level Relation Extraction new file mode 100644 index 0000000000..86d3761680 --- /dev/null +++ b/data/2024/aaai/A Hierarchical Network for Multimodal Document-Level Relation Extraction @@ -0,0 +1 @@ +Document-level relation extraction aims to extract entity relations that span across multiple sentences. This task faces two critical issues: long dependency and mention selection. 
Prior works address the above problems from the textual perspective; however, it is hard to handle these problems solely based on text information. In this paper, we leverage video information to provide additional evidence for understanding long dependencies and offer a wider perspective for identifying relevant mentions, thus giving rise to a new task named Multimodal Document-level Relation Extraction (MDocRE). To tackle this new task, we construct a human-annotated dataset including documents and relevant videos, which, to the best of our knowledge, is the first document-level relation extraction dataset equipped with video clips. We also propose a hierarchical framework to learn interactions between different dependency levels and a textual-guided transformer architecture that incorporates both textual and video modalities. In addition, we utilize a mention gate module to address the mention-selection problem in both modalities. Experiments on our proposed dataset show that 1) incorporating video information greatly improves model performance; 2) our hierarchical framework achieves state-of-the-art results compared with both unimodal and multimodal baselines; 3) by leveraging video information, our model better solves the long-dependency and mention-selection problems. \ No newline at end of file diff --git a/data/2024/aaai/A Huber Loss Minimization Approach to Byzantine Robust Federated Learning b/data/2024/aaai/A Huber Loss Minimization Approach to Byzantine Robust Federated Learning new file mode 100644 index 0000000000..0d79405120 --- /dev/null +++ b/data/2024/aaai/A Huber Loss Minimization Approach to Byzantine Robust Federated Learning @@ -0,0 +1 @@ +Federated learning systems are susceptible to adversarial attacks. To combat this, we introduce a novel aggregator based on Huber loss minimization, and provide a comprehensive theoretical analysis. Under the independent and identically distributed (i.i.d.) assumption, our approach has several advantages compared to existing methods. Firstly, it has optimal dependence on epsilon, which stands for the ratio of attacked clients. Secondly, our approach does not need precise knowledge of epsilon. Thirdly, it allows different clients to have unequal data sizes. We then broaden our analysis to include non-i.i.d. data, such that clients have slightly different distributions. \ No newline at end of file diff --git a/data/2024/aaai/A Hybrid AI Framework for Sensor-Based Personal Health Monitoring towards Precision Health b/data/2024/aaai/A Hybrid AI Framework for Sensor-Based Personal Health Monitoring towards Precision Health new file mode 100644 index 0000000000..94dcaf71ed --- /dev/null +++ b/data/2024/aaai/A Hybrid AI Framework for Sensor-Based Personal Health Monitoring towards Precision Health @@ -0,0 +1 @@ +Non-communicable diseases are on the rise globally, resulting in accelerated efforts to develop personal health monitoring systems for early detection, prediction, and prevention of diseases. This is part of the vision of precision health, an emerging paradigm that focuses on preventing disease before it strikes by encouraging people to actively monitor and work towards improving their health. A key facilitator of this is the use of wearable sensors that can collect and measure physiological data. Although many sensor-based health monitoring systems have been proposed, interoperability of health data and processes, prediction of future health states, and uncertainty management remain open challenges. 
This research aims to alleviate these challenges through the development of a reusable framework integrating both data-driven and knowledge-driven AI within a hybrid AI architecture. \ No newline at end of file diff --git a/data/2024/aaai/A Hybrid Global-Local Perception Network for Lane Detection b/data/2024/aaai/A Hybrid Global-Local Perception Network for Lane Detection new file mode 100644 index 0000000000..489b325f2d --- /dev/null +++ b/data/2024/aaai/A Hybrid Global-Local Perception Network for Lane Detection @@ -0,0 +1 @@ +Lane detection is a critical task in autonomous driving, which requires accurately predicting the complex topology of lanes in various scenarios. While previous methods of lane detection have shown success, challenges still exist, especially in scenarios where lane markings are absent. In this paper, we analyze the role of global and local features in accurately detecting lanes and propose a Hybrid Global-Local Perception Network (HGLNet) to leverage them. Global and local features play distinct roles in lane detection by respectively aiding in the detection of lane instances and the localization of corresponding lanes. HGLNet extracts global semantic context by utilizing a global extraction head that aggregates information about adaptive sampling points around lanes, achieving an optimal trade-off between performance and efficiency. Moreover, we introduce a Multi-hierarchy feature aggregator (MFA) to capture feature hierarchies in both regional and local ranges, elevating the representation of local features. The proposed Hybrid architecture can simultaneously focus on global and local features at different depth levels and efficiently integrate them to sense the global presence of lanes and accurately regress their locations. Experimental results demonstrate that our proposed method improves detection accuracy in various challenging scenarios, outperforming the state-of-the-art lane detection methods. \ No newline at end of file diff --git a/data/2024/aaai/A Joint Framework with Heterogeneous-Relation-Aware Graph and Multi-Channel Label Enhancing Strategy for Event Causality Extraction b/data/2024/aaai/A Joint Framework with Heterogeneous-Relation-Aware Graph and Multi-Channel Label Enhancing Strategy for Event Causality Extraction new file mode 100644 index 0000000000..80ff306fff --- /dev/null +++ b/data/2024/aaai/A Joint Framework with Heterogeneous-Relation-Aware Graph and Multi-Channel Label Enhancing Strategy for Event Causality Extraction @@ -0,0 +1 @@ +Event Causality Extraction (ECE) aims to extract the cause-effect event pairs with their structured event information from plain texts. As far as we know, the existing ECE methods mainly focus on the correlation between arguments, without explicitly modeling the causal relationship between events, and usually design two independent frameworks to extract cause events and effect events, respectively, which cannot effectively capture the dependency between the subtasks. Therefore, we propose a joint multi-label extraction framework for ECE to alleviate the above limitations. In particular, 1) we design a heterogeneous-relation-aware graph module to learn the potential relationships between events and arguments, in which we construct the heterogeneous graph by taking the predefined event types and all the words in the sentence as nodes, and modeling three relationships of "event-event", "event-argument" and "argument-argument" as edges. 
2) We also design a multi-channel label enhancing module to better learn the distributed representation of each label in the multi-label extraction framework, and further enhance the interaction between the subtasks by considering the preliminary results of cause-effect type identification and event argument extraction. The experimental results on the benchmark dataset ECE-CCKS show that our approach outperforms previous state-of-the-art methods, and that our model also performs well on the complex samples with multiple cause-effect event pairs. \ No newline at end of file diff --git a/data/2024/aaai/A Label Disambiguation-Based Multimodal Massive Multiple Instance Learning Approach for Immune Repertoire Classification b/data/2024/aaai/A Label Disambiguation-Based Multimodal Massive Multiple Instance Learning Approach for Immune Repertoire Classification new file mode 100644 index 0000000000..45227d4b00 --- /dev/null +++ b/data/2024/aaai/A Label Disambiguation-Based Multimodal Massive Multiple Instance Learning Approach for Immune Repertoire Classification @@ -0,0 +1 @@ +One individual human’s immune repertoire consists of a huge set of adaptive immune receptors at a certain time point, representing the individual's adaptive immune state. Immune repertoire classification and associated receptor identification have the potential to make a transformative contribution to the development of novel vaccines and therapies. The vast number of instances and exceedingly low witness rate pose a great challenge to the immune repertoire classification, which can be formulated as a Massive Multiple Instance Learning (MMIL) problem. Traditional MIL methods, at both bag-level and instance-level, confront the issues of substantial computational burden or supervision ambiguity when handling massive instances. To address these issues, we propose a novel label disambiguation-based multimodal massive multiple instance learning approach (LaDM³IL) for immune repertoire classification. LaDM³IL adapts the instance-level MIL paradigm to deal with the issue of high computational cost and employs a specially-designed label disambiguation module for label correction, mitigating the impact of misleading supervision. To achieve a more comprehensive representation of each receptor, LaDM³IL leverages a multimodal fusion module with gating-based attention and tensor-fusion to integrate the information from gene segments and amino acid (AA) sequences of each immune receptor. Extensive experiments on the Cytomegalovirus (CMV) and Cancer datasets demonstrate the superior performance of the proposed LaDM³IL for both immune repertoire classification and associated receptor identification tasks. The code is publicly available at https://github.com/Josie-xufan/LaDM3IL. \ No newline at end of file diff --git a/data/2024/aaai/A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis b/data/2024/aaai/A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis new file mode 100644 index 0000000000..3ba90751cc --- /dev/null +++ b/data/2024/aaai/A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis @@ -0,0 +1 @@ +The actual collection of tabular data for sharing involves confidentiality and privacy constraints, leaving the potential risks of machine learning for interventional data analysis unsafely averted. Synthetic data has emerged recently as a privacy-protecting solution to address this challenge. 
However, existing approaches regard discrete and continuous modal features as separate entities, thus falling short in properly capturing their inherent correlations. In this paper, we propose a novel contrastive learning guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realistic multimodal tabular data for scientific research. Our approach introduces a transformer-based fusion module that seamlessly integrates multimodal features, permitting the mining of more informative latent representations. The attention within the fusion module directs the integrated output features to focus on critical components that facilitate the task of generating latent embeddings. Moreover, we formulate a contrastive learning strategy to implicitly constrain the embeddings from discrete features in the latent feature space by pulling similar discrete feature distributions closer while pushing dissimilar ones further away, in order to better enhance the representation of the latent embedding. Experimental results indicate that GTCoder is effective in generating photo-realistic synthetic data, with interactive interpretation of the latent embedding, and performs favorably against some baselines on most real-world and simulated datasets. \ No newline at end of file diff --git a/data/2024/aaai/A Local-Ascending-Global Learning Strategy for Brain-Computer Interface b/data/2024/aaai/A Local-Ascending-Global Learning Strategy for Brain-Computer Interface new file mode 100644 index 0000000000..bb736c4dd3 --- /dev/null +++ b/data/2024/aaai/A Local-Ascending-Global Learning Strategy for Brain-Computer Interface @@ -0,0 +1 @@ +Neuroscience research indicates that the interaction among different functional regions of the brain plays a crucial role in driving various cognitive tasks. Existing studies have primarily focused on constructing either local or global functional connectivity maps within the brain, often lacking an adaptive approach to fuse functional brain regions and explore latent relationships between localized regions during different cognitive tasks. This paper introduces a novel approach called the Local-Ascending-Global Learning Strategy (LAG) to uncover higher-level latent topological patterns among functional brain regions. The strategy initiates from the local connectivity of individual brain functional regions and develops a K-Level Self-Adaptive Ascending Network (SALK) to dynamically capture strong connectivity patterns among brain regions during different cognitive tasks. Through the step-by-step fusion of brain regions, this approach captures higher-level latent patterns, shedding light on the progressively adaptive fusion of various brain functional regions under different cognitive tasks. Notably, this study represents the first exploration of higher-level latent patterns through progressively adaptive fusion of diverse brain functional regions under different cognitive tasks. The proposed LAG strategy is validated using datasets related to fatigue (SEED-VIG), emotion (SEED-IV), and motor imagery (BCI_C_IV_2a). The results demonstrate the generalizability of LAG, achieving satisfactory outcomes in independent-subject experiments across all three datasets. This suggests that LAG effectively characterizes higher-level latent patterns associated with different cognitive tasks, presenting a novel approach to understanding brain patterns in varying cognitive contexts. 
\ No newline at end of file diff --git a/data/2024/aaai/A Model for Estimating the Economic Costs of Computer Vision Systems That Use Deep Learning b/data/2024/aaai/A Model for Estimating the Economic Costs of Computer Vision Systems That Use Deep Learning new file mode 100644 index 0000000000..24c2f6fe01 --- /dev/null +++ b/data/2024/aaai/A Model for Estimating the Economic Costs of Computer Vision Systems That Use Deep Learning @@ -0,0 +1 @@ +Deep learning, the most important subfield of machine learning and artificial intelligence (AI) over the last decade, is considered one of the fundamental technologies underpinning the Fourth Industrial Revolution. But despite its record-breaking history, deep learning’s enormous appetite for compute and data means that sometimes it can be too costly to practically use. In this paper, we connect technical insights from deep learning scaling laws and transfer learning with the economics of IT to propose a framework for estimating the cost of deep learning computer vision systems to achieve a desired level of accuracy. Our tool can be of practical use to AI practitioners in industry or academia to guide investment decisions. \ No newline at end of file diff --git a/data/2024/aaai/A New Benchmark and Model for Challenging Image Manipulation Detection b/data/2024/aaai/A New Benchmark and Model for Challenging Image Manipulation Detection new file mode 100644 index 0000000000..a866560af5 --- /dev/null +++ b/data/2024/aaai/A New Benchmark and Model for Challenging Image Manipulation Detection @@ -0,0 +1 @@ +The ability to detect manipulation in multimedia data is vital in digital forensics. Existing Image Manipulation Detection (IMD) methods are mainly based on detecting anomalous features arising from image editing or double compression artifacts. All existing IMD techniques encounter challenges when it comes to detecting small tampered regions in a large image. Moreover, compression-based IMD approaches face difficulties in cases of double compression with identical quality factors. To investigate the State-of-The-Art (SoTA) IMD methods in those challenging conditions, we introduce a new Challenging Image Manipulation Detection (CIMD) benchmark dataset, which consists of two subsets, for evaluating editing-based and compression-based IMD methods, respectively. The dataset images were manually captured and tampered with, and are accompanied by high-quality annotations. In addition, we propose a new two-branch network model based on HRNet that can better detect both the image-editing and compression artifacts in those challenging conditions. Extensive experiments on the CIMD benchmark show that our model significantly outperforms SoTA IMD methods on CIMD. The dataset is available at: https://github.com/ZhenfeiZ/CIMD. \ No newline at end of file diff --git a/data/2024/aaai/A New Mechanism for Eliminating Implicit Conflict in Graph Contrastive Learning b/data/2024/aaai/A New Mechanism for Eliminating Implicit Conflict in Graph Contrastive Learning new file mode 100644 index 0000000000..f15326b134 --- /dev/null +++ b/data/2024/aaai/A New Mechanism for Eliminating Implicit Conflict in Graph Contrastive Learning @@ -0,0 +1 @@ +Graph contrastive learning (GCL) has attracted considerable attention because it can extract low-dimensional representations of graph data in a self-supervised manner. 
The InfoNCE-based loss function is widely used in graph contrastive learning; it pulls the representations of positive pairs close to each other and pushes the representations of negative pairs away from each other. Recent works mainly focus on designing new augmentation methods or sampling strategies. However, we argue that the widely used InfoNCE-based methods may contain an implicit conflict which seriously confuses models when learning from negative pairs. This conflict is engendered by the encoder's message-passing mechanism and the InfoNCE loss function. As a result, the learned representations of negative samples cannot be pushed far away from each other, compromising the model performance. To the best of our knowledge, this is the first work to report and analyze this conflict in GCL. To address this problem, we propose a simple but effective method called Partial ignored Graph Contrastive Learning (PiGCL). Specifically, PiGCL first dynamically captures the conflicts during training by detecting the gradient of representation similarities. It then enables the loss function to ignore the conflict, allowing the encoder to adaptively learn the ignored information without self-supervised samples. Extensive experiments demonstrate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/A Non-parametric Graph Clustering Framework for Multi-View Data b/data/2024/aaai/A Non-parametric Graph Clustering Framework for Multi-View Data new file mode 100644 index 0000000000..e6ca2723f9 --- /dev/null +++ b/data/2024/aaai/A Non-parametric Graph Clustering Framework for Multi-View Data @@ -0,0 +1,3 @@ +Multi-view graph clustering (MVGC) derives encouraging grouping results by seamlessly integrating abundant information inside heterogeneous data, and has recently attracted surging attention. +Nevertheless, the majority of current MVGC works involve at least one hyper-parameter, which not only requires additional efforts for tuning, but also leads to a complicated solving procedure, +largely harming the flexibility and scalability of the corresponding algorithms. To this end, in this article we are devoted to getting rid of hyper-parameters, and devise a non-parametric graph clustering (NpGC) framework to more practically partition multi-view data. To be specific, we hold that hyper-parameters play a role in balancing the error term and the regularization term so as to form high-quality clustering representations. Therefore, without the assistance of hyper-parameters, how to acquire high-quality representations becomes the key. Inspired by this, we adopt two types of anchors, view-related and view-unrelated, to concurrently mine exclusive characteristics and common characteristics among views. Then, all anchors' information is gathered together via a consensus bipartite graph. In this way, NpGC extracts both complementary and consistent multi-view features, thereby obtaining superior clustering results. Also, its linear complexity enables it to handle datasets with over 120,000 samples. Numerous experiments reveal NpGC's strong points compared to many classical approaches. 
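As a rough, generic illustration of the anchor-plus-consensus-bipartite-graph idea mentioned above, the Python sketch below builds per-view sample-to-anchor affinity graphs, averages them into a consensus bipartite graph, and clusters its singular vectors. Note the caveats: the Gaussian bandwidth, the random anchor choice, and the plain averaging are assumptions made only for this demonstration, and unlike NpGC this sketch is not hyper-parameter-free.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_graph(X, anchors, sigma=1.0):
    # Sample-to-anchor affinities (rows sum to 1); sigma is an illustrative bandwidth.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.exp(-d2 / (2 * sigma ** 2))
    return Z / Z.sum(axis=1, keepdims=True)

def bipartite_spectral_clustering(Z, k):
    # Spectral clustering of the samples via the SVD of the column-normalized bipartite graph.
    d_col = Z.sum(axis=0)
    U, _, _ = np.linalg.svd(Z / np.sqrt(d_col + 1e-12), full_matrices=False)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U[:, :k])

# toy multi-view data: average the per-view anchor graphs into one consensus graph
rng = np.random.default_rng(0)
views = [rng.normal(size=(200, 5)), rng.normal(size=(200, 8))]
anchors = [v[rng.choice(len(v), 10, replace=False)] for v in views]
Z_consensus = np.mean([anchor_graph(v, a) for v, a in zip(views, anchors)], axis=0)
labels = bipartite_spectral_clustering(Z_consensus, k=3)
```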
\ No newline at end of file diff --git a/data/2024/aaai/A Novel Approach for Longitudinal Modeling of Aging Health and Predicting Mortality Rates b/data/2024/aaai/A Novel Approach for Longitudinal Modeling of Aging Health and Predicting Mortality Rates new file mode 100644 index 0000000000..6c8bdfb4be --- /dev/null +++ b/data/2024/aaai/A Novel Approach for Longitudinal Modeling of Aging Health and Predicting Mortality Rates @@ -0,0 +1 @@ +Aging is a complex stochastic process that affects healthy functioning through various pathways. In contrast to the more commonly used cross-sectional methods, our research focuses on longitudinal modeling of aging, a less explored but crucial area. We have developed a Stochastic Differential Equation (SDE) model, at the forefront of aging research, designed to accurately forecast the health trajectories and survival rates of individuals. This model adeptly delineates the connections between different health indicators and provides clear, interpretable results. Our approach utilizes the SDE framework to encapsulate the inherent uncertainty in the aging process. Moreover, it incorporates a Recurrent Neural Network (RNN) to integrate past health data into future health projections. We plan to train and test our model using a comprehensive dataset tailored for aging studies. This model is not only computationally cost-effective but also highly relevant in assessing health risks in older populations, particularly for those at high risk. It can serve as an essential tool in anticipating and preparing for challenges like infectious disease outbreaks. Overall, our research aims to improve health equity and global health security significantly, offering substantial benefits to public health and deepening our understanding of the aging process. \ No newline at end of file diff --git a/data/2024/aaai/A Novel Energy Based Model Mechanism for Multi-Modal Aspect-Based Sentiment Analysis b/data/2024/aaai/A Novel Energy Based Model Mechanism for Multi-Modal Aspect-Based Sentiment Analysis new file mode 100644 index 0000000000..e54a23068a --- /dev/null +++ b/data/2024/aaai/A Novel Energy Based Model Mechanism for Multi-Modal Aspect-Based Sentiment Analysis @@ -0,0 +1 @@ +Multi-modal aspect-based sentiment analysis (MABSA) has recently attracted increasing attention. The span-based extraction methods, such as FSUIE, demonstrate strong performance in sentiment analysis due to their joint modeling of input sequences and target labels. However, previous methods still have certain limitations: (i) They ignore the difference in the focus of visual information between different analysis targets (aspect or sentiment). (ii) Combining features from uni-modal encoders directly may not be sufficient to eliminate the modal gap and can cause difficulties in capturing the image-text pairwise relevance. (iii) Existing span-based methods for MABSA ignore the pairwise relevance of target span boundaries. To tackle these limitations, we propose a novel framework called DQPSA. Specifically, our model contains a Prompt as Dual Query (PDQ) module that uses the prompt as both a visual query and a language query to extract prompt-aware visual information and strengthen the pairwise relevance between visual information and the analysis target. Additionally, we introduce an Energy-based Pairwise Expert (EPE) module that models the boundaries pairing of the analysis target from the perspective of an Energy-based Model. This expert predicts aspect or sentiment span based on pairwise stability. 
Experiments on three widely used benchmarks demonstrate that DQPSA outperforms previous approaches and achieves a new state-of-the-art performance. The code will be released at https://github.com/pengts/DQPSA. \ No newline at end of file diff --git a/data/2024/aaai/A Novel Skip Orthogonal List for Dynamic Optimal Transport Problem b/data/2024/aaai/A Novel Skip Orthogonal List for Dynamic Optimal Transport Problem new file mode 100644 index 0000000000..61e1167e62 --- /dev/null +++ b/data/2024/aaai/A Novel Skip Orthogonal List for Dynamic Optimal Transport Problem @@ -0,0 +1 @@ +Optimal transport is a fundamental topic that has attracted a great amount of attention from the optimization community in the past decades. In this paper, we consider an interesting discrete dynamic optimal transport problem: can we efficiently update the optimal transport plan when the weights or the locations of the data points change? This problem is naturally motivated by several applications in machine learning. For example, we often need to compute the optimal transport cost between two different data sets; if some changes happen to a few data points, should we re-compute the high complexity cost function or update the cost by some efficient dynamic data structure? We are aware that several dynamic maximum flow algorithms have been proposed before, however, the research on dynamic minimum cost flow problem is still quite limited, to the best of our knowledge. We propose a novel 2D Skip Orthogonal List together with some dynamic tree techniques. Although our algorithm is based on the conventional simplex method, it can efficiently find the variable to pivot within expected O(1) time, and complete each pivoting operation within expected O(|V|) time where V is the set of all supply and demand nodes. Since dynamic modifications typically do not introduce significant changes, our algorithm requires only a few simplex iterations in practice. So our algorithm is more efficient than re-computing the optimal transport cost that needs at least one traversal over all |E|=O(|V|^2) variables, where |E| denotes the number of edges in the network. Our experiments demonstrate that our algorithm significantly outperforms existing algorithms in the dynamic scenarios. \ No newline at end of file diff --git a/data/2024/aaai/A PAC Learning Algorithm for LTL and Omega-Regular Objectives in MDPs b/data/2024/aaai/A PAC Learning Algorithm for LTL and Omega-Regular Objectives in MDPs new file mode 100644 index 0000000000..e2205166e5 --- /dev/null +++ b/data/2024/aaai/A PAC Learning Algorithm for LTL and Omega-Regular Objectives in MDPs @@ -0,0 +1 @@ +Linear temporal logic (LTL) and omega-regular objectives---a superset of LTL---have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory. 
\ No newline at end of file diff --git a/data/2024/aaai/A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning b/data/2024/aaai/A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning new file mode 100644 index 0000000000..56d9e2061a --- /dev/null +++ b/data/2024/aaai/A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning @@ -0,0 +1 @@ +Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between RL objective and pessimism, or the utilization of offline and online samples. In this paper, from a novel perspective, we systematically study the challenges that remain in O2O RL and identify that the reason behind the slow improvement of the performance and the instability of online finetuning lies in the inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that the estimation bias and the inaccurate rank of Q-value cause a misleading signal for the policy update, making the standard offline RL algorithms, such as CQL and TD3-BC, ineffective in the online finetuning. Based on this observation, we address the problem of Q-value estimation by two techniques: (1) perturbed value update and (2) increased frequency of Q-value updates. The first technique smooths out biased Q-value estimation with sharp peaks, preventing early-stage policy exploitation of sub-optimal actions. The second one alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoco and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues, and consistently improves the performance against the state-of-the-art methods by up to 83.1%. \ No newline at end of file diff --git a/data/2024/aaai/A Picture Is Worth a Thousand Words: Co-designing Text-to-Image Generation Learning Materials for K-12 with Educators b/data/2024/aaai/A Picture Is Worth a Thousand Words: Co-designing Text-to-Image Generation Learning Materials for K-12 with Educators new file mode 100644 index 0000000000..13e2659dda --- /dev/null +++ b/data/2024/aaai/A Picture Is Worth a Thousand Words: Co-designing Text-to-Image Generation Learning Materials for K-12 with Educators @@ -0,0 +1 @@ +Text-to-image generation (TTIG) technologies are Artificial Intelligence (AI) algorithms that use natural language algorithms in combination with visual generative algorithms. TTIG tools have gained popularity in recent months, garnering interest from non-AI experts, including educators and K-12 students. While they have exciting creative potential when used by K-12 learners and educators for creative learning, they are also accompanied by serious ethical implications, such as data privacy, spreading misinformation, and algorithmic bias. Given the potential learning applications, social implications, and ethical concerns, we designed 6-hour learning materials to teach K-12 teachers from diverse subject expertise about the technical implementation, classroom applications, and ethical implications of TTIG algorithms. We piloted the learning materials titled “Demystify text-to-image generative tools for K-12 educators" with 30 teachers across two workshops with the goal of preparing them to teach about and use TTIG tools in their classrooms. 
We found that teachers demonstrated a technical, applied and ethical understanding of TTIG algorithms and successfully designed prototypes of teaching materials for their classrooms. \ No newline at end of file diff --git a/data/2024/aaai/A Plug-and-Play Quaternion Message-Passing Module for Molecular Conformation Representation b/data/2024/aaai/A Plug-and-Play Quaternion Message-Passing Module for Molecular Conformation Representation new file mode 100644 index 0000000000..eccd22bf2f --- /dev/null +++ b/data/2024/aaai/A Plug-and-Play Quaternion Message-Passing Module for Molecular Conformation Representation @@ -0,0 +1,8 @@ +Graph neural networks have been widely used to represent 3D molecules, which capture molecular attributes and geometric information through various message-passing mechanisms. +This study proposes a novel quaternion message-passing (QMP) module that can be plugged into many existing 3D molecular representation models and enhance their power for distinguishing molecular conformations. +In particular, our QMP module represents the 3D rotations between one chemical bond and its neighbor bonds as a quaternion sequence. +Then, it aggregates the rotations by the chained Hamilton product of the quaternions. +The real part of the output quaternion is invariant to the global 3D rotations of molecules but sensitive to the local torsions caused by twisting bonds, providing discriminative information for training molecular conformation representation models. +In theory, we prove that considering these features enables invariant GNNs to distinguish the conformations caused by bond torsions. +We encapsulate the QMP module with acceleration, so combining existing models with the QMP requires merely one-line code and little computational cost. +Experiments on various molecular datasets show that plugging our QMP module into existing invariant GNNs leads to consistent and significant improvements in molecular conformation representation and downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/A Positive-Unlabeled Metric Learning Framework for Document-Level Relation Extraction with Incomplete Labeling b/data/2024/aaai/A Positive-Unlabeled Metric Learning Framework for Document-Level Relation Extraction with Incomplete Labeling new file mode 100644 index 0000000000..0c64f565be --- /dev/null +++ b/data/2024/aaai/A Positive-Unlabeled Metric Learning Framework for Document-Level Relation Extraction with Incomplete Labeling @@ -0,0 +1 @@ +The goal of document-level relation extraction (RE) is to identify relations between entities that span multiple sentences. Recently, incomplete labeling in document-level RE has received increasing attention, and some studies have used methods such as positive-unlabeled learning to tackle this issue, but there is still a lot of room for improvement. Motivated by this, we propose a positive-augmentation and positive-mixup positive-unlabeled metric learning framework (P3M). Specifically, we formulate document-level RE as a metric learning problem. We aim to pull the distance closer between entity pair embedding and their corresponding relation embedding, while pushing it farther away from the none-class relation embedding. Additionally, we adapt the positive-unlabeled learning to this loss objective. In order to improve the generalizability of the model, we use dropout to augment positive samples and propose a positive-none-class mixup method. 
Extensive experiments show that P3M improves the F1 score by approximately 4-10 points in document-level RE with incomplete labeling, and achieves state-of-the-art results in fully labeled scenarios. Furthermore, P3M has also demonstrated robustness to prior estimation bias in incomplete labeled scenarios. \ No newline at end of file diff --git a/data/2024/aaai/A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields b/data/2024/aaai/A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields new file mode 100644 index 0000000000..b64c701edd --- /dev/null +++ b/data/2024/aaai/A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields @@ -0,0 +1 @@ +Recent advances in implicit neural representation have demonstrated the ability to recover detailed geometry and material from multi-view images. However, the use of simplified lighting models such as environment maps to represent non-distant illumination, or using a network to fit indirect light modeling without a solid basis, can lead to an undesirable decomposition between lighting and material. To address this, we propose a fully differentiable framework named Neural Illumination Fields (NeIF) that uses radiance fields as a lighting model to handle complex lighting in a physically based way. Together with integral lobe encoding for roughness-adaptive specular lobe and leveraging the pre-convolved background for accurate decomposition, the proposed method represents a significant step towards integrating physically based rendering into the NeRF representation. The experiments demonstrate the superior performance of novel-view rendering compared to previous works, and the capability to re-render objects under arbitrary NeRF-style environments opens up exciting possibilities for bridging the gap between virtual and real-world scenes. \ No newline at end of file diff --git a/data/2024/aaai/A Primal-Dual Algorithm for Hybrid Federated Learning b/data/2024/aaai/A Primal-Dual Algorithm for Hybrid Federated Learning new file mode 100644 index 0000000000..fea9c7cc1f --- /dev/null +++ b/data/2024/aaai/A Primal-Dual Algorithm for Hybrid Federated Learning @@ -0,0 +1 @@ +Very few methods for hybrid federated learning, where clients only hold subsets of both features and samples, exist. Yet, this scenario is very important in practical settings. We provide a fast, robust algorithm for hybrid federated learning that hinges on Fenchel Duality. We prove the convergence of the algorithm to the same solution as if the model was trained centrally in a variety of practical regimes. Furthermore, we provide experimental results that demonstrate the performance improvements of the algorithm over a commonly used method in federated learning, FedAvg, and an existing hybrid FL algorithm, HyFEM. We also provide privacy considerations and necessary steps to protect client data. \ No newline at end of file diff --git a/data/2024/aaai/A Privacy Preserving Federated Learning (PPFL) Based Cognitive Digital Twin (CDT) Framework for Smart Cities b/data/2024/aaai/A Privacy Preserving Federated Learning (PPFL) Based Cognitive Digital Twin (CDT) Framework for Smart Cities new file mode 100644 index 0000000000..7227747514 --- /dev/null +++ b/data/2024/aaai/A Privacy Preserving Federated Learning (PPFL) Based Cognitive Digital Twin (CDT) Framework for Smart Cities @@ -0,0 +1 @@ +A Smart City is one that makes better use of city data to make our communities better places to live. 
Typically, this has 3 components: sensing (data collection), analysis and actuation. Privacy, particularly as it relates to citizen's data, is a cross-cutting theme. A Digital Twin (DT) is a virtual replica of a real-world physical entity. Cognitive Digital Twins (CDT) are DTs enhanced with cognitive AI capabilities. Both DTs and CDTs have seen adoption in the manufacturing and industrial sectors however cities are slow to adopt these because of privacy concerns. This work attempts to address these concerns by proposing a Privacy Preserving Federated Learning (PPFL) based Cognitive Digital Twin framework for Smart Cities. \ No newline at end of file diff --git a/data/2024/aaai/A Provably Accurate Randomized Sampling Algorithm for Logistic Regression b/data/2024/aaai/A Provably Accurate Randomized Sampling Algorithm for Logistic Regression new file mode 100644 index 0000000000..f463041ef6 --- /dev/null +++ b/data/2024/aaai/A Provably Accurate Randomized Sampling Algorithm for Logistic Regression @@ -0,0 +1 @@ +In statistics and machine learning, logistic regression is a widely-used supervised learning technique primarily employed for binary classification tasks. When the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of using randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets. \ No newline at end of file diff --git a/data/2024/aaai/A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation b/data/2024/aaai/A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation new file mode 100644 index 0000000000..81dd6e64b4 --- /dev/null +++ b/data/2024/aaai/A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation @@ -0,0 +1 @@ +Column generation (CG) is one of the most successful approaches for solving large-scale linear programming (LP) problems. Given an LP with a prohibitively large number of variables (i.e., columns), the idea of CG is to explicitly consider only a subset of columns and iteratively add potential columns to improve the objective value. While adding the column with the most negative reduced cost can guarantee the convergence of CG, it has been shown that adding multiple columns per iteration rather than a single column can lead to faster convergence. However, it remains a challenge to design a multiple-column selection strategy to select the most promising columns from a large number of candidate columns. In this paper, we propose a novel reinforcement-learning-based (RL) multiple-column selection strategy. 
To the best of our knowledge, it is the first RL-based multiple-column selection strategy for CG. The effectiveness of our approach is evaluated on two sets of problems: the cutting stock problem and the graph coloring problem. Compared to several widely used single-column and multiple-column selection strategies, our RL-based multiple-column selection strategy leads to faster convergence and achieves remarkable reductions in the number of CG iterations and runtime. \ No newline at end of file diff --git a/data/2024/aaai/A Robust Mutual-Reinforcing Framework for 3D Multi-Modal Medical Image Fusion Based on Visual-Semantic Consistency b/data/2024/aaai/A Robust Mutual-Reinforcing Framework for 3D Multi-Modal Medical Image Fusion Based on Visual-Semantic Consistency new file mode 100644 index 0000000000..8b1cee3816 --- /dev/null +++ b/data/2024/aaai/A Robust Mutual-Reinforcing Framework for 3D Multi-Modal Medical Image Fusion Based on Visual-Semantic Consistency @@ -0,0 +1 @@ +This work proposes a robust 3D medical image fusion framework to establish a mutual-reinforcing mechanism between visual fusion and lesion segmentation, achieving their double improvement. Specifically, we explore the consistency between vision and semantics by sharing feature fusion modules. Through the coupled optimization of the visual fusion loss and the lesion segmentation loss, visual-related and semantic-related features will be pulled into the same domain, effectively promoting accuracy improvement in a mutual-reinforcing manner. Further, we establish the robustness guarantees by constructing a two-level refinement constraint in the process of feature extraction and reconstruction. Benefiting from full consideration for common degradations in medical images, our framework can not only provide clear visual fusion results for doctor's observation, but also enhance the defense ability of lesion segmentation against these negatives. Extensive evaluations of visual fusion and lesion segmentation scenarios demonstrate the advantages of our method in terms of accuracy and robustness. Moreover, our proposed framework is generic, which can be well-compatible with existing lesion segmentation algorithms and improve their performance. The code is publicly available at https://github.com/HaoZhang1018/RMR-Fusion. \ No newline at end of file diff --git a/data/2024/aaai/A SAT + Computer Algebra System Verification of the Ramsey Problem R(3, 8) (Student Abstract) b/data/2024/aaai/A SAT + Computer Algebra System Verification of the Ramsey Problem R(3, 8) (Student Abstract) new file mode 100644 index 0000000000..d27590f476 --- /dev/null +++ b/data/2024/aaai/A SAT + Computer Algebra System Verification of the Ramsey Problem R(3, 8) (Student Abstract) @@ -0,0 +1 @@ +The Ramsey problem R(3,8) asks for the smallest n such that every red/blue coloring of the complete graph on n vertices must contain either a blue triangle or a red 8-clique. We provide the first certifiable proof that R(3,8) = 28, automatically generated by a combination of Boolean satisfiability (SAT) solver and a computer algebra system (CAS). This SAT+CAS combination is significantly faster than a SAT-only approach. While the R(3,8) problem was first computationally solved by McKay and Min in 1992, it was not a verifiable proof. The SAT+CAS method that we use for our proof is very general and can be applied to a wide variety of combinatorial problems. 
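To make the flavor of such Ramsey statements concrete, here is a small, self-contained Python sketch that brute-forces the classical fact R(3, 3) = 6 by enumerating all two-colorings; this is purely an illustration of the combinatorial statement, not the SAT+CAS pipeline, and enumeration of this kind is hopeless at the scale of R(3, 8).

```python
from itertools import combinations, product

def has_mono_triangle(coloring, n):
    # coloring maps each edge (i, j) with i < j to 0 (red) or 1 (blue)
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_coloring_has_mono_triangle(n):
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(dict(zip(edges, colors)), n)
        for colors in product((0, 1), repeat=len(edges))
    )

print(every_coloring_has_mono_triangle(5))  # False: K5 admits a triangle-free 2-coloring
print(every_coloring_has_mono_triangle(6))  # True: every 2-coloring of K6 has one, so R(3, 3) = 6
```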
\ No newline at end of file diff --git a/data/2024/aaai/A SAT Solver and Computer Algebra Attack on the Minimum Kochen-Specker Problem (Student Abstract) b/data/2024/aaai/A SAT Solver and Computer Algebra Attack on the Minimum Kochen-Specker Problem (Student Abstract) new file mode 100644 index 0000000000..f5bd9bc099 --- /dev/null +++ b/data/2024/aaai/A SAT Solver and Computer Algebra Attack on the Minimum Kochen-Specker Problem (Student Abstract) @@ -0,0 +1 @@ +The problem of finding the minimum three-dimensional Kochen–Specker (KS) vector system, an important problem in quantum foundations, has remained open for over 55 years. We present a new method to address this problem based on a combination of a Boolean satisfiability (SAT) solver and a computer algebra system (CAS). Our approach improved the lower bound on the size of a KS system from 22 to 24. More importantly, we provide the first computer-verifiable proof certificate of a lower bound to the KS problem with a proof size of 41.6 TiB for order 23. The efficiency is due to the powerful combination of SAT solvers and CAS-based orderly generation. \ No newline at end of file diff --git a/data/2024/aaai/A Score-Based Deterministic Diffusion Algorithm with Smooth Scores for General Distributions b/data/2024/aaai/A Score-Based Deterministic Diffusion Algorithm with Smooth Scores for General Distributions new file mode 100644 index 0000000000..69b8abfbf5 --- /dev/null +++ b/data/2024/aaai/A Score-Based Deterministic Diffusion Algorithm with Smooth Scores for General Distributions @@ -0,0 +1 @@ +Score-matching-based diffusion has been shown to achieve state-of-the-art results in generative modeling. In the original score-matching-based diffusion algorithm, the forward equation is a differential equation for which the probability density evolves according to a linear partial differential equation, the Fokker-Planck equation. A drawback of this approach is that one needs the data distribution to have a Lipschitz logarithmic gradient. This excludes a large class of data distributions that have a compact support. We present a deterministic diffusion process for which the vector fields are always Lipschitz and hence the score does not explode for probability measures with compact support. This deterministic diffusion process can be seen as a regularization of the porous media equation, which enables one to guarantee long-term convergence of the forward process to the noise distribution. Though the porous media equation is itself not always guaranteed to have a Lipschitz vector field, it can be used to understand the closeness of the output of the algorithm to the data distribution as a function of the time horizon and the score matching error. This analysis enables us to show that the algorithm has better dependence on the score matching error than approaches based on stochastic diffusions. Using numerical experiments, we verify our theoretical results on example one- and two-dimensional data distributions which are compactly supported. Additionally, we validate the approach on a modified MNIST data set for which the distribution is concentrated on a compact set. In each of the experiments, the approach using deterministic diffusion performs better than the diffusion algorithm with a stochastic forward process, when considering the FID scores of the generated samples. 
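For reference, the textbook objects invoked in the preceding abstract (this is standard background, not notation taken from the paper): for a forward SDE the marginal density follows the linear Fokker-Planck equation, and the score is its logarithmic gradient, which blows up at the boundary of a compactly supported distribution.

```latex
% For the SDE  dX_t = f(X_t, t)\,dt + g(t)\,dW_t,  the marginal density p_t satisfies
\frac{\partial p_t(x)}{\partial t}
  = -\nabla \cdot \bigl(f(x, t)\, p_t(x)\bigr) + \tfrac{1}{2}\, g(t)^2 \, \Delta p_t(x),
\qquad
s(x, t) = \nabla_x \log p_t(x).
% If p_0 is supported on a compact set, \log p_0 \to -\infty at the boundary of the support,
% so the score s(\cdot, 0) is unbounded there -- the "Lipschitz logarithmic gradient"
% obstruction mentioned above.
```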
\ No newline at end of file diff --git a/data/2024/aaai/A Separation and Alignment Framework for Black-Box Domain Adaptation b/data/2024/aaai/A Separation and Alignment Framework for Black-Box Domain Adaptation new file mode 100644 index 0000000000..d2e516dbf9 --- /dev/null +++ b/data/2024/aaai/A Separation and Alignment Framework for Black-Box Domain Adaptation @@ -0,0 +1 @@ +Black-box domain adaptation (BDA) aims to learn a classifier on an unsupervised target domain while assuming only access to black-box predictors trained from unseen source data. Although a few BDA approaches have demonstrated promise by manipulating the transferred labels, they largely overlook the rich underlying structure in the target domain. To address this problem, we introduce a novel separation and alignment framework for BDA. Firstly, we locate those well-adapted samples via loss ranking and a flexible confidence-thresholding procedure. Then, we introduce a novel graph contrastive learning objective that aligns under-adapted samples to their local neighbors and well-adapted samples. Lastly, the adaptation is achieved by a nearest-centroid-augmented objective that exploits the clustering effect in the feature space. Extensive experiments demonstrate that our proposed method outperforms the best baselines on benchmark datasets, e.g., improving the average per-class accuracy by 4.1% on the VisDA dataset. The source code is available at: https://github.com/MingxuanXia/SEAL. \ No newline at end of file diff --git a/data/2024/aaai/A Sequentially Fair Mechanism for Multiple Sensitive Attributes b/data/2024/aaai/A Sequentially Fair Mechanism for Multiple Sensitive Attributes new file mode 100644 index 0000000000..3a7824663f --- /dev/null +++ b/data/2024/aaai/A Sequentially Fair Mechanism for Multiple Sensitive Attributes @@ -0,0 +1 @@ +In the standard use case of Algorithmic Fairness, the goal is to eliminate the relationship between a sensitive variable and a corresponding score. Throughout recent years, the scientific community has developed a host of definitions and tools to solve this task, which work well in many practical applications. However, the applicability and effectiveness of these tools and definitions become less straightforward in the case of multiple sensitive attributes. To tackle this issue, we propose a sequential framework, which allows us to progressively achieve fairness across a set of sensitive features. We accomplish this by leveraging multi-marginal Wasserstein barycenters, which extend the standard notion of Strong Demographic Parity to the case with multiple sensitive characteristics. This method also provides a closed-form solution for the optimal, sequentially fair predictor, permitting a clear interpretation of inter-sensitive feature correlations. Our approach seamlessly extends to approximate fairness, yielding a framework that accommodates the trade-off between risk and unfairness. This extension permits a targeted prioritization of fairness improvements for a specific attribute within a set of sensitive attributes, allowing for case-specific adaptation. A data-driven estimation procedure for the derived solution is developed, and comprehensive numerical experiments are conducted on both synthetic and real datasets. Our empirical findings decisively underscore the practical efficacy of our post-processing approach in fostering fair decision-making. 
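The closed-form barycenter solution mentioned above reduces, for a single discrete sensitive attribute and a one-dimensional score, to a weighted averaging of group quantile functions. The numpy sketch below is a hedged illustration of that single-attribute building block only; the naive one-attribute-at-a-time loop in the usage example is an assumption for demonstration and is not the paper's multi-marginal, sequentially fair construction.

```python
import numpy as np

def dp_repair(scores, groups):
    """Map scores so that every group of one sensitive attribute shares
    (approximately) the same distribution: the 1-D Wasserstein-barycenter repair."""
    scores, groups = np.asarray(scores, float), np.asarray(groups)
    values, counts = np.unique(groups, return_counts=True)
    weights = counts / counts.sum()
    repaired = np.empty_like(scores)
    for g in values:
        mask = groups == g
        # empirical CDF rank of each score within its own group (midpoint convention)
        ranks = (np.argsort(np.argsort(scores[mask])) + 0.5) / mask.sum()
        # barycenter quantile function = weighted average of the group quantile functions
        repaired[mask] = sum(
            w * np.quantile(scores[groups == h], ranks) for h, w in zip(values, weights)
        )
    return repaired

# toy usage with two hypothetical binary attributes, repaired one after the other
rng = np.random.default_rng(0)
s1, s2 = rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)
raw = rng.normal(size=1000) + 0.8 * s1 + 0.4 * s2
step1 = dp_repair(raw, s1)
step2 = dp_repair(step1, s2)
```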
\ No newline at end of file diff --git a/data/2024/aaai/A Simple and Yet Fairly Effective Defense for Graph Neural Networks b/data/2024/aaai/A Simple and Yet Fairly Effective Defense for Graph Neural Networks new file mode 100644 index 0000000000..fff13b35f1 --- /dev/null +++ b/data/2024/aaai/A Simple and Yet Fairly Effective Defense for Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have emerged as the dominant approach for machine learning on graph-structured data. However, concerns have arisen regarding the vulnerability of GNNs to small adversarial perturbations. Existing defense methods against such perturbations suffer from high time complexity and can negatively impact the model's performance on clean graphs. To address these challenges, this paper introduces NoisyGNNs, a novel defense method that incorporates noise into the underlying model's architecture. We establish a theoretical connection between noise injection and the enhancement of GNN robustness, highlighting the effectiveness of our approach. We further conduct extensive empirical evaluations on the node classification task to validate our theoretical findings, focusing on two popular GNNs: the GCN and GIN. The results demonstrate that NoisyGNN achieves superior or comparable defense performance to existing methods while minimizing added time complexity. The NoisyGNN approach is model-agnostic, allowing it to be integrated with different GNN architectures. Successful combinations of our NoisyGNN approach with existing defense techniques demonstrate even further improved adversarial defense results. Our code is publicly available at: https://github.com/Sennadir/NoisyGNN. \ No newline at end of file diff --git a/data/2024/aaai/A Submodular Optimization Approach to Accountable Loan Approval b/data/2024/aaai/A Submodular Optimization Approach to Accountable Loan Approval new file mode 100644 index 0000000000..2a09c55d0a --- /dev/null +++ b/data/2024/aaai/A Submodular Optimization Approach to Accountable Loan Approval @@ -0,0 +1,3 @@ +In the field of finance, the underwriting process is an essential step in evaluating every loan application. During this stage, the borrowers' creditworthiness and ability to repay the loan are assessed to ultimately decide whether to approve the loan application. One of the core components of underwriting is credit scoring, in which the probability of default is estimated. +As such, there has been significant progress in enhancing the predictive accuracy of credit scoring models through the use of machine learning, but there still exists a need to ultimately construct an approval rule that takes into consideration additional criteria beyond the score itself. This construction process is traditionally done manually to ensure that the approval rule remains interpretable to humans. +In this paper, we outline an automated system for optimizing a rule-based system for approving loan applications, which has been deployed at Hyundai Capital Services (HCS). The main challenge lay in creating a high-quality rule base that is simultaneously simple enough to be interpretable by risk analysts as well as customers, since the approval decision should be accountable. We addressed this challenge through principled submodular optimization. The deployment of our system has led to a 14% annual growth in the volume of loan services at HCS, while maintaining the target bad rate, and has resulted in the approval of customers who might have otherwise been rejected. 
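The abstract does not disclose the exact objective deployed at HCS, so the sketch below only illustrates the kind of principled submodular optimization it mentions: greedily selecting a small, interpretable set of approval rules to maximize a monotone submodular coverage objective. The applicant data, the rule predicates, and the absence of any bad-rate constraint are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical applicants and a pool of simple one-feature threshold rules;
# a real underwriting system's rule language and risk constraints are richer.
applicants = rng.normal(size=(5000, 4))          # e.g. income, tenure, utilization, score
good = rng.random(5000) < 0.8                    # hypothetical "repays the loan" labels
candidate_rules = [(lambda X, j=j, t=t: X[:, j] > t)
                   for j in range(4) for t in (-0.5, 0.0, 0.5, 1.0)]

def coverage(selected):
    """Good applicants approved by at least one selected rule. Coverage of a
    union of sets is monotone submodular, so greedy has a (1 - 1/e) guarantee."""
    if not selected:
        return 0
    approved = np.any([rule(applicants) for rule in selected], axis=0)
    return int(np.sum(approved & good))

def greedy_select(rules, budget):
    chosen, remaining = [], list(rules)
    for _ in range(budget):
        base = coverage(chosen)
        gains = [coverage(chosen + [r]) - base for r in remaining]
        best = int(np.argmax(gains))
        if gains[best] <= 0:            # no rule adds marginal value
            break
        chosen.append(remaining.pop(best))
    return chosen

rule_set = greedy_select(candidate_rules, budget=3)
print(len(rule_set), coverage(rule_set))
```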
\ No newline at end of file diff --git a/data/2024/aaai/A Surprisingly Simple Continuous-Action POMDP Solver: Lazy Cross-Entropy Search Over Policy Trees b/data/2024/aaai/A Surprisingly Simple Continuous-Action POMDP Solver: Lazy Cross-Entropy Search Over Policy Trees new file mode 100644 index 0000000000..9c80f98ad6 --- /dev/null +++ b/data/2024/aaai/A Surprisingly Simple Continuous-Action POMDP Solver: Lazy Cross-Entropy Search Over Policy Trees @@ -0,0 +1 @@ +The Partially Observable Markov Decision Process (POMDP) provides a principled framework for decision making in stochastic partially observable environments. However, computing good solutions for problems with continuous action spaces remains challenging. To ease this challenge, we propose a simple online POMDP solver, called Lazy Cross-Entropy Search Over Policy Trees (LCEOPT). At each planning step, our method uses a novel lazy Cross-Entropy method to search the space of policy trees, which provide a simple policy representation. Specifically, we maintain a distribution on promising finite-horizon policy trees. The distribution is iteratively updated by sampling policies, evaluating them via Monte Carlo simulation, and refitting them to the top-performing ones. Our method is lazy in the sense that it exploits the policy tree representation to avoid redundant computations in policy sampling, evaluation, and distribution update. This leads to computational savings of up to two orders of magnitude. Our LCEOPT is surprisingly simple as compared to existing state-of-the-art methods, yet empirically outperforms them on several continuous-action POMDP problems, particularly for problems with higher-dimensional action spaces. \ No newline at end of file diff --git a/data/2024/aaai/A Survey of Learning Criteria Going beyond the Usual Risk (Abstract Reprint) b/data/2024/aaai/A Survey of Learning Criteria Going beyond the Usual Risk (Abstract Reprint) new file mode 100644 index 0000000000..ec86761448 --- /dev/null +++ b/data/2024/aaai/A Survey of Learning Criteria Going beyond the Usual Risk (Abstract Reprint) @@ -0,0 +1 @@ +Virtually all machine learning tasks are characterized using some form of loss function, and “good performance” is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of “what makes for a desirable loss distribution?” in place of tacit use of the expected loss. \ No newline at end of file diff --git a/data/2024/aaai/A Theory of Non-acyclic Generative Flow Networks b/data/2024/aaai/A Theory of Non-acyclic Generative Flow Networks new file mode 100644 index 0000000000..ae5422ab2f --- /dev/null +++ b/data/2024/aaai/A Theory of Non-acyclic Generative Flow Networks @@ -0,0 +1 @@ +GFlowNets is a novel flow-based method for learning a stochastic policy to generate objects via a sequence of actions and with probability proportional to a given positive reward. We contribute to relaxing hypotheses limiting the application range of GFlowNets, in particular: acyclicity (or lack thereof). 
To this end, we extend the theory of GFlowNets to measurable spaces, which include continuous state spaces without cycle restrictions, and provide a generalization of cycles in this generalized context. We show that the losses used so far push flows to get stuck in cycles, and we define a family of losses that solves this issue. Experiments on graphs and continuous tasks validate these principles. \ No newline at end of file diff --git a/data/2024/aaai/A Toolbox for Modelling Engagement with Educational Videos b/data/2024/aaai/A Toolbox for Modelling Engagement with Educational Videos new file mode 100644 index 0000000000..35210ed8f5 --- /dev/null +++ b/data/2024/aaai/A Toolbox for Modelling Engagement with Educational Videos @@ -0,0 +1 @@ +With the advancement and utility of Artificial Intelligence (AI), personalising education to a global population could be a cornerstone of new educational systems in the future. This work presents the PEEKC dataset and the TrueLearn Python library, which together provide a dataset and a series of online learner state models that are essential for facilitating research on learner engagement modelling. The TrueLearn family of models was designed following the "open learner" concept, using humanly intuitive user representations. This family of scalable, online models also helps end-users visualise the learner models, which may in the future facilitate user interaction with their models/recommenders. The extensive documentation and coding examples make the library highly accessible to both machine learning developers and educational data mining and learning analytics practitioners. The experiments show the utility of both the dataset and the library, with predictive performance significantly exceeding comparative baseline models. The dataset contains a large number of AI-related educational videos, which are of interest for building and validating AI-specific educational recommenders. \ No newline at end of file diff --git a/data/2024/aaai/A Transfer Approach Using Graph Neural Networks in Deep Reinforcement Learning b/data/2024/aaai/A Transfer Approach Using Graph Neural Networks in Deep Reinforcement Learning new file mode 100644 index 0000000000..bdf57df275 --- /dev/null +++ b/data/2024/aaai/A Transfer Approach Using Graph Neural Networks in Deep Reinforcement Learning @@ -0,0 +1 @@ +Transfer learning (TL) has shown great potential to improve Reinforcement Learning (RL) efficiency by leveraging prior knowledge in new tasks. However, much of the existing TL research focuses on transferring knowledge between tasks that share the same state-action spaces. Further, transfer from multiple source tasks that have different state-action spaces is more challenging and urgently needs to be addressed to improve the generalization and practicality of the method in real-world scenarios. This paper proposes TURRET (Transfer Using gRaph neuRal nETworks), which utilizes the generalization capabilities of Graph Neural Networks (GNNs) to facilitate efficient and effective multi-source policy transfer learning in the state-action mismatch setting. TURRET learns a semantic representation by accounting for the intrinsic property of the agent through GNNs, which leads to a unified state embedding space for all tasks. As a result, TURRET achieves more efficient transfer with strong generalization ability between different tasks and can be easily combined with existing Deep RL algorithms.
Experimental results show that TURRET significantly outperforms other TL methods on multiple continuous-action control tasks, successfully transferring across robots with different state-action spaces. \ No newline at end of file diff --git a/data/2024/aaai/A Twist for Graph Classification: Optimizing Causal Information Flow in Graph Neural Networks b/data/2024/aaai/A Twist for Graph Classification: Optimizing Causal Information Flow in Graph Neural Networks new file mode 100644 index 0000000000..58c23bf57f --- /dev/null +++ b/data/2024/aaai/A Twist for Graph Classification: Optimizing Causal Information Flow in Graph Neural Networks @@ -0,0 +1 @@ +Graph neural networks (GNNs) have achieved state-of-the-art results on many graph representation learning tasks by exploiting statistical correlations. However, numerous observations have shown that such correlations may not reflect the true causal mechanisms underlying the data and thus may hamper the ability of the model to generalize beyond the observed distribution. To address this problem, we propose an Information-based Causal Learning (ICL) framework that combines information theory and causality to analyze and improve graph representation learning, transforming informational relevance into causal dependence. Specifically, we first introduce a multi-objective mutual information optimization objective, derived from information-theoretic analysis and causal learning principles, to simultaneously extract invariant and interpretable causal information and reduce reliance on non-causal information in correlations. To optimize these multiple objectives, we introduce a causal disentanglement layer that effectively decouples the causal and non-causal information in the graph representations. Moreover, due to the intractability of mutual information estimation, we derive variational bounds that enable us to transform the above objective into a tractable loss function. To balance the multiple information objectives and avoid optimization conflicts, we leverage multi-objective gradient descent to achieve a stable and efficient transformation from informational correlation to causal dependency. Our approach provides important insights into modulating the information flow in GNNs to enhance their reliability and generalization. Extensive experiments demonstrate that our approach significantly improves the robustness and interpretability of GNNs across different distribution shifts. Visual analysis demonstrates how our method converts informative dependencies in representations into causal dependencies. \ No newline at end of file diff --git a/data/2024/aaai/A Two-Stage Information Extraction Network for Incomplete Multi-View Multi-Label Classification b/data/2024/aaai/A Two-Stage Information Extraction Network for Incomplete Multi-View Multi-Label Classification new file mode 100644 index 0000000000..9d29f121ba --- /dev/null +++ b/data/2024/aaai/A Two-Stage Information Extraction Network for Incomplete Multi-View Multi-Label Classification @@ -0,0 +1 @@ +Recently, multi-view multi-label classification (MvMLC) has received a significant amount of research interest, and many methods have been proposed based on the assumptions of view completeness and label completeness. However, in real-world scenarios, multi-view multi-label data tends to be incomplete due to various uncertainties involved in data collection and manual annotation. As a result, conventional MvMLC methods fail.
In this paper, we propose a new two-stage MvMLC network to solve this incomplete MvMLC problem with partially missing views and missing labels. Different from existing works, our method leverages the diverse information in the partially missing data based on information theory. Specifically, our method aims to minimize task-irrelevant information while maximizing task-relevant information through the principles of information bottleneck theory and mutual information extraction. The first stage of our network involves training view-specific classifiers to concentrate the task-relevant information. Subsequently, in the second stage, the hidden states of these classifiers serve as input for an alignment model, an autoencoder-based mutual information extraction framework, and a weighted fusion classifier to make the final prediction. Extensive experiments performed on five datasets validate that our method outperforms other state-of-the-art methods. Code is available at https://github.com/KevinTan10/TSIEN. \ No newline at end of file diff --git a/data/2024/aaai/A Unified Environmental Network for Pedestrian Trajectory Prediction b/data/2024/aaai/A Unified Environmental Network for Pedestrian Trajectory Prediction new file mode 100644 index 0000000000..6b35e34ab0 --- /dev/null +++ b/data/2024/aaai/A Unified Environmental Network for Pedestrian Trajectory Prediction @@ -0,0 +1 @@ +Accurately predicting pedestrian movements in complex environments is challenging due to social interactions, scene constraints, and pedestrians' multimodal behaviors. Sequential models like long short-term memory fail to effectively integrate scene features to make predicted trajectories comply with scene constraints, due to the disparate feature modalities of scene and trajectory. Though existing convolutional neural network (CNN) models can extract scene features, they are ineffective in mapping these features into scene constraints for pedestrians and struggle to model pedestrian interactions due to the loss of target pedestrian information. To address these issues, we propose a unified environmental network based on CNNs for pedestrian trajectory prediction. We introduce a polar-based method to reflect the distance and direction relationship between any position in the environment and the target pedestrian. This enables us to simultaneously model scene constraints and pedestrian social interactions in the form of feature maps. Additionally, we capture essential local features in the feature map, characterizing potential multimodal movements of pedestrians at each time step to prevent redundant predicted trajectories. We verify the performance of our proposed model on four trajectory prediction datasets, encompassing both short-term and long-term predictions. The experimental results demonstrate the superiority of our approach over existing methods. \ No newline at end of file diff --git a/data/2024/aaai/A Unified Knowledge Transfer Network for Generalized Category Discovery b/data/2024/aaai/A Unified Knowledge Transfer Network for Generalized Category Discovery new file mode 100644 index 0000000000..f6092b9827 --- /dev/null +++ b/data/2024/aaai/A Unified Knowledge Transfer Network for Generalized Category Discovery @@ -0,0 +1 @@ +Generalized Category Discovery (GCD) aims to recognize both known and novel categories in an unlabeled dataset by leveraging another labeled dataset with only known categories.
Without considering knowledge transfer from known to novel categories, current methods usually perform poorly on novel categories due to the lack of corresponding supervision. To mitigate this issue, we propose a unified Knowledge Transfer Network (KTN), which solves two obstacles to knowledge transfer in GCD. First, the mixture of known and novel categories in unlabeled data makes it difficult to identify transfer candidates (i.e., samples with novel categories). For this, we propose an entropy-based method that leverages knowledge in the pre-trained classifier to differentiate known and novel categories without requiring extra data or parameters. Second, the lack of prior knowledge of novel categories presents challenges in quantifying semantic relationships between categories to decide the transfer weights. For this, we model different categories with prototypes and treat their similarities as transfer weights to measure the semantic similarities between categories. On the basis of two treatments, we transfer knowledge from known to novel categories by conducting pre-adjustment of logits and post-adjustment of labels for transfer candidates based on the transfer weights between different categories. With the weighted adjustment, KTN can generate more accurate pseudo-labels for unlabeled data, which helps to learn more discriminative features and boost model performance on novel categories. Extensive experiments show that our method outperforms state-of-the-art models on all evaluation metrics across multiple benchmark datasets. Furthermore, different from previous clustering-based methods that can only work offline with abundant data, KTN can be deployed online conveniently with faster inference speed. Code and data are available at https://github.com/yibai-shi/KTN. \ No newline at end of file diff --git a/data/2024/aaai/A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis b/data/2024/aaai/A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis new file mode 100644 index 0000000000..c549667a03 --- /dev/null +++ b/data/2024/aaai/A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis @@ -0,0 +1 @@ +The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains comparable or better performance than the state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset while achieving state-of-the-art results in motion inbetweening on the LaFAN1 dataset for long transition periods. 
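As a rough illustration of the "patchified skeletons" and masked-reconstruction ideas mentioned above, the sketch below groups joints into body-part tokens and hides a random subset of (frame, part) tokens, which is the input-side step of a masked autoencoder. The 24-joint skeleton, the grouping, and the 50% mask ratio are hypothetical; the actual UNIMASK-M tokenization and its task-specific masking patterns may differ.

```python
import torch

# Hypothetical grouping of a 24-joint skeleton into six four-joint body parts.
BODY_PARTS = [list(range(i, i + 4)) for i in range(0, 24, 4)]

def patchify(motion):
    """motion: (T, 24, 3) joint positions -> (T, 6, 12) body-part tokens."""
    return torch.stack([motion[:, idx, :].flatten(1) for idx in BODY_PARTS], dim=1)

def mask_tokens(tokens, ratio=0.5, mask_value=0.0):
    """Randomly hide a fraction of (frame, body-part) tokens, as in masked
    autoencoding; the boolean mask records which tokens must be reconstructed."""
    T, P, _ = tokens.shape
    hide = torch.rand(T, P) < ratio
    masked = tokens.clone()
    masked[hide] = mask_value
    return masked, hide

motion = torch.randn(60, 24, 3)                  # 60 frames of a toy skeleton
tokens = patchify(motion)                        # (60, 6, 12)
masked, hide = mask_tokens(tokens, ratio=0.5)
print(tokens.shape, hide.float().mean().item())  # roughly half the tokens hidden
```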
\ No newline at end of file diff --git a/data/2024/aaai/A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities b/data/2024/aaai/A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities new file mode 100644 index 0000000000..4f3b0ff84b --- /dev/null +++ b/data/2024/aaai/A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities @@ -0,0 +1 @@ +Multimodal Sentiment Analysis (MSA) has attracted widespread research attention recently. Most MSA studies are based on the assumption of modality completeness. However, many inevitable factors in real-world scenarios lead to uncertain missing modalities, which invalidate the fixed multimodal fusion approaches. To this end, we propose a Unified multimodal Missing modality self-Distillation Framework (UMDF) to handle the problem of uncertain missing modalities in MSA. Specifically, a unified self-distillation mechanism in UMDF drives a single network to automatically learn robust inherent representations from the consistent distribution of multimodal data. Moreover, we present a multi-grained crossmodal interaction module to deeply mine the complementary semantics among modalities through coarse- and fine-grained crossmodal attention. Eventually, a dynamic feature integration module is introduced to enhance the beneficial semantics in incomplete modalities while filtering the redundant information therein to obtain a refined and robust multimodal representation. Comprehensive experiments on three datasets demonstrate that our framework significantly improves MSA performance under both uncertain missing-modality and complete-modality testing conditions. \ No newline at end of file diff --git a/data/2024/aaai/A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis b/data/2024/aaai/A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis new file mode 100644 index 0000000000..47d083d84e --- /dev/null +++ b/data/2024/aaai/A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis @@ -0,0 +1 @@ +Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images. Although existing prompt engineering methods can provide high-level guidance, it is challenging for novice users to achieve the desired results by manually entering prompts due to a discrepancy between novice-user-input prompts and the model-preferred prompts. To bridge the distribution gap between user input behavior and model training datasets, we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and propose a novel User-Friendly Fine-Grained Text Generation framework (UF-FGTG) for automated prompt optimization. For CFP, we construct a novel dataset for text-to-image tasks that combines coarse and fine-grained prompts to facilitate the development of automated prompt generation methods. For UF-FGTG, we propose a novel framework that automatically translates user-input prompts into model-preferred prompts. Specifically, we propose a prompt refiner that continually rewrites prompts to empower users to select results that align with their unique needs. Meanwhile, we integrate image-related loss functions from the text-to-image model into the training process of text generation to generate model-preferred prompts. 
Additionally, we propose an adaptive feature extraction module to ensure diversity in the generated results. Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics. Data and code are available at https://github.com/Naylenv/UF-FGTG. \ No newline at end of file diff --git a/data/2024/aaai/A Variational Autoencoder for Neural Temporal Point Processes with Dynamic Latent Graphs b/data/2024/aaai/A Variational Autoencoder for Neural Temporal Point Processes with Dynamic Latent Graphs new file mode 100644 index 0000000000..2adfacedb7 --- /dev/null +++ b/data/2024/aaai/A Variational Autoencoder for Neural Temporal Point Processes with Dynamic Latent Graphs @@ -0,0 +1 @@ +Continuously observed event occurrences often exhibit self- and mutually exciting effects, which can be well modeled using temporal point processes. Beyond that, these event dynamics may also change over time, with certain periodic trends. We propose a novel variational autoencoder to capture such a mixture of temporal dynamics. More specifically, the whole time interval of the input sequence is partitioned into a set of subintervals. The event dynamics are assumed to be stationary within each subinterval, but may change across subintervals. In particular, we use a sequential latent variable model to learn a dependency graph between the observed dimensions for each subinterval. The model predicts future event times by using the learned dependency graph to remove the non-contributing influences of past events. By doing so, the proposed model achieves higher accuracy in predicting inter-event times and event types on several real-world event sequences, compared with existing state-of-the-art neural point processes. \ No newline at end of file diff --git a/data/2024/aaai/A Virtual Driving Instructor That Generates Personalized Driving Lessons Based on Student Skill Level b/data/2024/aaai/A Virtual Driving Instructor That Generates Personalized Driving Lessons Based on Student Skill Level new file mode 100644 index 0000000000..8a58eea496 --- /dev/null +++ b/data/2024/aaai/A Virtual Driving Instructor That Generates Personalized Driving Lessons Based on Student Skill Level @@ -0,0 +1 @@ +Currently, students acquire driving skills by practicing in actual traffic conditions and through direct interactions with an instructor. While one-on-one interactions can be tailored to a student’s learning style and skill level, making them effective for learning, they are also inefficient, potentially costly, and not standardized, with limitations on which traffic situations can be safely taught. For these exact reasons, Way AS has developed and commercially deployed a virtual driving instructor that educates students in high-fidelity simulators. In this paper, we present a module, the Lesson generator, that extends the virtual driving instructor to generate personalized lessons for individual students, with the goal of practicing, in a focused and deliberate fashion, the skills that students need to become proficient drivers. A case study is presented, and the path to deployment is discussed.
\ No newline at end of file diff --git a/data/2024/aaai/A Wireframe-Based Approach for Classifying and Acquiring Proficiency in the American Sign Language (Student Abstract) b/data/2024/aaai/A Wireframe-Based Approach for Classifying and Acquiring Proficiency in the American Sign Language (Student Abstract) new file mode 100644 index 0000000000..f25ef285db --- /dev/null +++ b/data/2024/aaai/A Wireframe-Based Approach for Classifying and Acquiring Proficiency in the American Sign Language (Student Abstract) @@ -0,0 +1 @@ +We describe our methodology for classifying ASL (American Sign Language) gestures. Rather than operate directly on raw images of hand gestures, we extract coordinates and render wireframes from individual images to construct a curated training dataset. This dataset is then used in a classifier that is memory efficient and provides effective performance (94% accuracy). Because we construct wireframes that contain information about several angles in the joints that comprise hands, our methodology is amenable to training those interested in learning ASL by identifying targeted errors in their hand gestures. \ No newline at end of file diff --git a/data/2024/aaai/AACP: Aesthetics Assessment of Children's Paintings Based on Self-Supervised Learning b/data/2024/aaai/AACP: Aesthetics Assessment of Children's Paintings Based on Self-Supervised Learning new file mode 100644 index 0000000000..eb92a4accf --- /dev/null +++ b/data/2024/aaai/AACP: Aesthetics Assessment of Children's Paintings Based on Self-Supervised Learning @@ -0,0 +1 @@ +The Aesthetics Assessment of Children's Paintings (AACP) is an important branch of image aesthetics assessment (IAA), playing a significant role in children's education. This task presents unique challenges, such as limited available data and the requirement for evaluation metrics from multiple perspectives. However, previous approaches have relied on training on large datasets and subsequently assigning an aesthetics score to the image, which is not applicable to AACP. To solve this problem, we construct an aesthetics assessment dataset of children's paintings and a model based on self-supervised learning. 1) We build a novel dataset composed of two parts: the first part contains more than 20k unlabeled images of children's paintings; the second part contains 1.2k images of children's paintings, and each image contains eight attributes labeled by multiple design experts. 2) We design a pipeline that includes a feature extraction module, perception modules, and a disentangled evaluation module. 3) We conduct both qualitative and quantitative experiments to compare our model's performance with five other methods using the AACP dataset. Our experiments reveal that our method can accurately capture aesthetic features and achieve state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/ACAMDA: Improving Data Efficiency in Reinforcement Learning through Guided Counterfactual Data Augmentation b/data/2024/aaai/ACAMDA: Improving Data Efficiency in Reinforcement Learning through Guided Counterfactual Data Augmentation new file mode 100644 index 0000000000..19df27fb3f --- /dev/null +++ b/data/2024/aaai/ACAMDA: Improving Data Efficiency in Reinforcement Learning through Guided Counterfactual Data Augmentation @@ -0,0 +1 @@ +Data augmentation plays a crucial role in improving the data efficiency of reinforcement learning (RL). However, the generation of high-quality augmented data remains a significant challenge.
To overcome this, we introduce ACAMDA (Adversarial Causal Modeling for Data Augmentation), a novel framework that integrates two causality-based tasks: causal structure recovery and counterfactual estimation. The unique aspect of ACAMDA lies in its ability to recover temporal causal relationships from limited non-expert datasets. The identification of the sequential cause-and-effect allows the creation of realistic yet unobserved scenarios. We utilize this characteristic to generate guided counterfactual datasets, which, in turn, substantially reduces the need for extensive data collection. By simulating various state-action pairs under hypothetical actions, ACAMDA enriches the training dataset for diverse and heterogeneous conditions. Our experimental evaluation shows that ACAMDA outperforms existing methods, particularly when applied to novel and unseen domains. \ No newline at end of file diff --git a/data/2024/aaai/ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning b/data/2024/aaai/ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning new file mode 100644 index 0000000000..bebf51c639 --- /dev/null +++ b/data/2024/aaai/ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning @@ -0,0 +1 @@ +Decision Transformer (DT), which employs expressive sequence modeling techniques to perform action generation, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, which is known to bear some weaknesses such as the susceptibility to environmental stochasticity. To overcome DT's weaknesses, we propose to empower DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximated value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context with estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which are suitable for different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. Finally, during testing, ACT generates actions conditioned on a desired advantage. Our evaluation results validate that, by leveraging the power of dynamic programming, ACT demonstrates effective trajectory stitching and robust action generation in spite of the environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we conduct an in-depth analysis of ACT's various design choices through ablation studies. Our code is available at https://github.com/LAMDA-RL/ACT. \ No newline at end of file diff --git a/data/2024/aaai/ADA-GAD: Anomaly-Denoised Autoencoders for Graph Anomaly Detection b/data/2024/aaai/ADA-GAD: Anomaly-Denoised Autoencoders for Graph Anomaly Detection new file mode 100644 index 0000000000..8013ed5490 --- /dev/null +++ b/data/2024/aaai/ADA-GAD: Anomaly-Denoised Autoencoders for Graph Anomaly Detection @@ -0,0 +1 @@ +Graph anomaly detection is crucial for identifying nodes that deviate from regular behavior within graphs, benefiting various domains such as fraud detection and social network. 
Although existing reconstruction-based methods have achieved considerable success, they may face the Anomaly Overfitting and Homophily Trap problems caused by the abnormal patterns in the graph, breaking the assumption that normal nodes are often better reconstructed than abnormal ones. Our observations indicate that models trained on graphs with fewer anomalies exhibit higher detection performance. Based on this insight, we introduce a novel two-stage framework called Anomaly-Denoised Autoencoders for Graph Anomaly Detection (ADA-GAD). In the first stage, we design a learning-free anomaly-denoised augmentation method to generate graphs with reduced anomaly levels. We pretrain graph autoencoders on these augmented graphs at multiple levels, which enables the graph autoencoders to capture normal patterns. In the next stage, the decoders are retrained for detection on the original graph, benefiting from the multi-level representations learned in the previous stage. Meanwhile, we propose the node anomaly distribution regularization to further alleviate Anomaly Overfitting. We validate the effectiveness of our approach through extensive experiments on both synthetic and real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis b/data/2024/aaai/AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis new file mode 100644 index 0000000000..f141f71fca --- /dev/null +++ b/data/2024/aaai/AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis @@ -0,0 +1 @@ +Audio-driven talking head synthesis is a promising topic with wide applications in digital humans, filmmaking, and virtual reality. Recent NeRF-based approaches have shown superiority in quality and fidelity compared to previous studies. However, when it comes to few-shot talking head generation, a practical scenario where only a few seconds of talking video are available for one identity, two limitations emerge: 1) they either have no base model, which serves as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., the mouth is audio-related, while the ears are audio-independent. In this paper, we present the Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle the above issues, which can generate realistic portraits of a new speaker from a few-shot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weight is determined by the similarity of the audio between the reference and target images. Then, an Audio-Aligned Face Generation strategy is proposed to model the audio-related and audio-independent regions respectively, with a dual-NeRF framework. Extensive experiments have shown that AE-NeRF surpasses the state-of-the-art in image fidelity, audio-lip synchronization, and generalization ability, even with a limited training set or limited training iterations.
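A minimal sketch of the audio-similarity-weighted fusion idea described for the Audio Aware Aggregation module: reference-frame features are combined with weights derived from how close each reference's audio embedding is to the target audio. The feature dimensions, the cosine-similarity/softmax form, and the temperature are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def audio_aware_aggregate(ref_feats, ref_audio, tgt_audio, temperature=0.1):
    """Fuse per-reference image features with weights given by the cosine
    similarity between each reference's audio embedding and the target audio.
    ref_feats: (N, C), ref_audio: (N, A), tgt_audio: (A,)."""
    sim = F.cosine_similarity(ref_audio, tgt_audio.unsqueeze(0), dim=-1)  # (N,)
    weights = torch.softmax(sim / temperature, dim=0)                     # sums to 1
    return (weights.unsqueeze(-1) * ref_feats).sum(dim=0)                 # (C,)

ref_feats = torch.randn(4, 256)   # features from 4 hypothetical reference frames
ref_audio = torch.randn(4, 64)    # their audio embeddings
tgt_audio = torch.randn(64)       # audio embedding at the target time step
fused = audio_aware_aggregate(ref_feats, ref_audio, tgt_audio)
print(fused.shape)                # torch.Size([256])
```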
\ No newline at end of file diff --git a/data/2024/aaai/AGS: Affordable and Generalizable Substitute Training for Transferable Adversarial Attack b/data/2024/aaai/AGS: Affordable and Generalizable Substitute Training for Transferable Adversarial Attack new file mode 100644 index 0000000000..c1357e9b6c --- /dev/null +++ b/data/2024/aaai/AGS: Affordable and Generalizable Substitute Training for Transferable Adversarial Attack @@ -0,0 +1 @@ +In practical black-box attack scenarios, most of the existing transfer-based attacks employ pretrained models (e.g., ResNet50) as the substitute models. Unfortunately, these substitute models are not always appropriate for transfer-based attacks. Firstly, these models are usually trained on a large-scale annotated dataset, which is extremely expensive and time-consuming to construct. Secondly, the primary goal of these models is to perform a specific task, such as image classification; they are not developed for adversarial attacks. To tackle the above issues, i.e., high cost and overfitting to task-specific models, we propose an Affordable and Generalizable Substitute (AGS) training framework tailored for transfer-based adversarial attacks. Specifically, we train the substitute model from scratch via our proposed adversary-centric contrastive learning. This learning mechanism introduces another sample with slight adversarial perturbations as an additional positive view of the input image, and then encourages the adversarial view and the two benign views to interact comprehensively with each other. To further boost the generalizability of the substitute model, we propose adversarial invariant learning to keep the representations of adversarial examples invariant under augmentations of various strengths. Our AGS model can be trained solely with unlabeled and out-of-domain data and avoids overfitting to any task-specific model, because of its inherently self-supervised nature. Extensive experiments demonstrate that our AGS achieves comparable or superior performance to substitute models pretrained on the complete ImageNet training set, when executing attacks across a diverse range of target models, including ViTs, robustly trained models, and object detection and segmentation models. Our source code is available at https://github.com/lwmming/AGS. \ No newline at end of file diff --git a/data/2024/aaai/AI Evaluation Authorities: A Case Study Mapping Model Audits to Persistent Standards b/data/2024/aaai/AI Evaluation Authorities: A Case Study Mapping Model Audits to Persistent Standards new file mode 100644 index 0000000000..b657dd6be7 --- /dev/null +++ b/data/2024/aaai/AI Evaluation Authorities: A Case Study Mapping Model Audits to Persistent Standards @@ -0,0 +1 @@ +Intelligent system audits are labor-intensive assurance activities that are typically performed once and discarded, along with the opportunity to programmatically test all similar products for the market. This study illustrates how several incidents (i.e., harms) involving Named Entity Recognition (NER) can be prevented by scaling up a previously performed audit of NER systems. The audit instrument's diagnostic capacity is maintained through a security model that protects the underlying data (i.e., addresses Goodhart's Law). An open-source evaluation infrastructure is released along with an example derived from a real-world audit that reports aggregated findings without exposing the underlying data.
\ No newline at end of file diff --git a/data/2024/aaai/AI Risk Profiles: A Standards Proposal for Pre-deployment AI Risk Disclosures b/data/2024/aaai/AI Risk Profiles: A Standards Proposal for Pre-deployment AI Risk Disclosures new file mode 100644 index 0000000000..e2912bbcf5 --- /dev/null +++ b/data/2024/aaai/AI Risk Profiles: A Standards Proposal for Pre-deployment AI Risk Disclosures @@ -0,0 +1 @@ +As AI systems’ sophistication and proliferation have increased, awareness of the risks has grown proportionally. The AI industry is increasingly emphasizing the need for transparency, with proposals ranging from standardizing use of technical disclosures, like model cards, to regulatory licensing regimes. Since the AI value chain is complicated, with actors bringing varied expertise, perspectives, and values, it is crucial that consumers of transparency disclosures be able to understand the risks of the AI system in question. In this paper we propose a risk profiling standard which can guide downstream decision-making, including triaging further risk assessment, informing procurement and deployment, and directing regulatory frameworks. The standard is built on our proposed taxonomy of AI risks, which distills the wide variety of risks proposed in the literature into a high-level categorization. We outline the myriad data sources needed to construct informative Risk Profiles and propose a template and methodology for collating risk information into a standard, yet flexible, structure. We apply this methodology to a number of prominent AI systems using publicly available information. To conclude, we discuss design decisions for the profiles and future work. \ No newline at end of file diff --git a/data/2024/aaai/AI, Ethics, and Education: The Pioneering Path of Sidekick Academy b/data/2024/aaai/AI, Ethics, and Education: The Pioneering Path of Sidekick Academy new file mode 100644 index 0000000000..e8fa9c173e --- /dev/null +++ b/data/2024/aaai/AI, Ethics, and Education: The Pioneering Path of Sidekick Academy @@ -0,0 +1 @@ +Generative artificial intelligence (AI) is swiftly cementing its role as an indispensable tool for students transitioning from K-12 to higher education and professional spheres. Yet, harnessing its full potential requires more than mere familiarity. Students must be equipped with the skills to engage with AI both productively and ethically. Left unchecked, AI usage can pose risks, especially if students lack proper guidance or understanding of their actions. Moreover, effective interaction with AI necessitates skills in prompt engineering to yield desired outcomes. Sidekick Academy is a digital online platform where students can safely experiment with and learn about AI. This article delves into the genesis of Sidekick Academy, offering a glimpse into its lessons on how to use AI and complex debate on ethical use. It also sheds light on the academy's "sandbox" - a secure space for students to explore AI without jeopardizing their safety or privacy. \ No newline at end of file diff --git a/data/2024/aaai/AI-Assisted Human Teamwork b/data/2024/aaai/AI-Assisted Human Teamwork new file mode 100644 index 0000000000..689c95a38c --- /dev/null +++ b/data/2024/aaai/AI-Assisted Human Teamwork @@ -0,0 +1 @@ +Effective teamwork translates to fewer preventable errors and higher task performance in collaborative tasks. However, in time-critical tasks, successful teamwork becomes highly challenging to attain. 
In such settings, often, team members have partial observability of their surroundings, incur high cost of communication, and have trouble estimating the state and intent of their teammates. To assist a team in improving teamwork at task time, my doctoral research proposes an automated task-time team intervention system. Grounded in the notion of shared mental models, the system first detects whether the team is on the same page or not. It then generates effective interventions to improve teamwork. Additionally, by leveraging past demonstrations to learn a model of team behavior, this system minimizes the need for domain experts to specify teamwork models and rules. \ No newline at end of file diff --git a/data/2024/aaai/AI-Based Energy Transportation Safety: Pipeline Radial Threat Estimation Using Intelligent Sensing System b/data/2024/aaai/AI-Based Energy Transportation Safety: Pipeline Radial Threat Estimation Using Intelligent Sensing System new file mode 100644 index 0000000000..656980a164 --- /dev/null +++ b/data/2024/aaai/AI-Based Energy Transportation Safety: Pipeline Radial Threat Estimation Using Intelligent Sensing System @@ -0,0 +1 @@ +The application of artificial intelligence technology has greatly enhanced and fortified the safety of energy pipelines, particularly in safeguarding against external threats. The predominant methods involve the integration of intelligent sensors to detect external vibration, enabling the identification of event types and locations, thereby replacing manual detection methods. However, practical implementation has exposed a limitation in current methods - their constrained ability to accurately discern the spatial dimensions of external signals, which complicates the authentication of threat events. Our research endeavors to overcome the above issues by harnessing deep learning techniques to achieve a more fine-grained recognition and localization process. This refinement is crucial in effectively identifying genuine threats to pipelines, thus enhancing the safety of energy transportation. This paper proposes a radial threat estimation method for energy pipelines based on distributed optical fiber sensing technology. Specifically, we introduce a continuous multi-view and multi-domain feature fusion methodology to extract comprehensive signal features and construct a threat estimation and recognition network. The utilization of collected acoustic signal data is optimized, and the underlying principle is elucidated. Moreover, we incorporate the concept of transfer learning through a pre-trained model, enhancing both recognition accuracy and training efficiency. Empirical evidence gathered from real-world scenarios underscores the efficacy of our method, notably in its substantial reduction of false alarms and remarkable gains in recognition accuracy. More generally, our method exhibits versatility and can be extrapolated to a broader spectrum of recognition tasks and scenarios. 
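As a minimal sketch of the transfer-learning ingredient mentioned above (not the paper's full multi-view, multi-domain fusion network), one can start from an ImageNet-pretrained backbone, freeze it, and train only a new head for the threat classes. Treating the fiber-sensing signals as 3-channel spectrogram "images" and the choice of ResNet-18 are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # hypothetical number of radial threat categories

# Load an ImageNet-pretrained backbone and freeze its feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)  # new trainable head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy mini-batch of "spectrogram" inputs.
spectrograms = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(backbone(spectrograms), labels)
loss.backward()
optimizer.step()
print(loss.item())
```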
\ No newline at end of file diff --git a/data/2024/aaai/AI-Enhanced Art Appreciation: Generating Text from Artwork to Promote Inclusivity b/data/2024/aaai/AI-Enhanced Art Appreciation: Generating Text from Artwork to Promote Inclusivity new file mode 100644 index 0000000000..8b897d25f2 --- /dev/null +++ b/data/2024/aaai/AI-Enhanced Art Appreciation: Generating Text from Artwork to Promote Inclusivity @@ -0,0 +1 @@ +Visual art facilitates expression, communication, and connection, yet it remains inaccessible to those who are visually-impaired and those who lack the resources to understand the techniques and history of art. In this work, I propose the development of a generative AI model that generates a description and interpretation of a given artwork. Such research can make art more accessible, support art education, and improve the ability of AI to understand and translate between creative media. Development will begin with a formative study to assess the needs and preferences of blind and low vision people and art experts. Following the formative study, the basic approach is to train the model on a database of artworks and their accompanying descriptions, predict sentiments from extracted visual data, and generate a paragraph closely resembling training textual data and incorporating sentiment analysis. The model will then be evaluated quantitatively through metrics like METEOR and qualitatively through Turing tests in an iterative process. \ No newline at end of file diff --git a/data/2024/aaai/ALISON: Fast and Effective Stylometric Authorship Obfuscation b/data/2024/aaai/ALISON: Fast and Effective Stylometric Authorship Obfuscation new file mode 100644 index 0000000000..245c9963c7 --- /dev/null +++ b/data/2024/aaai/ALISON: Fast and Effective Stylometric Authorship Obfuscation @@ -0,0 +1,3 @@ +Authorship Attribution (AA) and Authorship Obfuscation (AO) are two competing tasks of increasing importance in privacy research. Modern AA leverages an author's consistent writing style to match a text to its author using an AA classifier. AO is the corresponding adversarial task, aiming to modify a text in such a way that its semantics are preserved, yet an AA model cannot correctly infer its authorship. To address privacy concerns raised by state-of-the-art (SOTA) AA methods, +new AO methods have been proposed but remain largely impractical to use due to their prohibitively slow training and obfuscation speed, often taking hours. +To this challenge, we propose a practical AO method, ALISON, that (1) dramatically reduces training/obfuscation time, demonstrating more than 10x faster obfuscation than SOTA AO methods, (2) achieves better obfuscation success through attacking three transformer-based AA methods on two benchmark datasets, typically performing 15% better than competing methods, (3) does not require direct signals from a target AA classifier during obfuscation, and (4) utilizes unique stylometric features, allowing sound model interpretation for explainable obfuscation. We also demonstrate that ALISON can effectively prevent four SOTA AA methods from accurately determining the authorship of ChatGPT-generated texts, all while minimally changing the original text semantics. To ensure the reproducibility of our findings, our code and data are available at: https://github.com/EricX003/ALISON. 
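For context on what stylometric authorship attribution typically looks like, the sketch below builds a generic character n-gram baseline with scikit-learn; ALISON's actual feature set and obfuscation procedure are more involved, and the texts and labels here are toy examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: a few short texts with hypothetical author labels.
texts = ["I reckon the results will hold up nicely.",
         "The results, as demonstrated herein, are robust.",
         "honestly the results r fine imo",
         "We posit that the findings remain stable."]
authors = ["a", "b", "c", "b"]

# Character n-grams carry style (punctuation habits, function words, casing),
# which is exactly the signal authorship obfuscation tries to disguise.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, authors)
print(clf.predict(["The findings, as shown herein, hold."]))
```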
\ No newline at end of file diff --git a/data/2024/aaai/AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion b/data/2024/aaai/AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion new file mode 100644 index 0000000000..17b22b15fb --- /dev/null +++ b/data/2024/aaai/AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion @@ -0,0 +1 @@ +Generating realistic human motion sequences from text descriptions is a challenging task that requires capturing the rich expressiveness of both natural language and human motion. Recent advances in diffusion models have enabled significant progress in human motion synthesis. However, existing methods struggle to handle text inputs that describe complex or long motions. In this paper, we propose the Adaptable Motion Diffusion (AMD) model, which leverages a Large Language Model (LLM) to parse the input text into a sequence of concise and interpretable anatomical scripts that correspond to the target motion. This process exploits the LLM’s ability to provide anatomical guidance for complex motion synthesis. We then devise a two-branch fusion scheme that balances the influence of the input text and the anatomical scripts on the inverse diffusion process, which adaptively ensures the semantic fidelity and diversity of the synthesized motion. Our method can effectively handle texts with complex or long motion descriptions, where existing methods often fail. Experiments on datasets with relatively more complex motions, such as CLCD1 and CLCD2, demonstrate that our AMD significantly outperforms existing state-of-the-art models. \ No newline at end of file diff --git a/data/2024/aaai/AMD: Autoregressive Motion Diffusion b/data/2024/aaai/AMD: Autoregressive Motion Diffusion new file mode 100644 index 0000000000..5fed5817fb --- /dev/null +++ b/data/2024/aaai/AMD: Autoregressive Motion Diffusion @@ -0,0 +1,4 @@ +Human motion generation aims to produce plausible human motion sequences according to various conditional inputs, such as text or audio. Despite the feasibility of existing methods in generating motion based on short prompts and simple motion patterns, they encounter difficulties when dealing with long prompts or complex motions. +The challenges are two-fold: 1) the scarcity of human motion-captured data for long prompts and complex motions. 2) the high diversity of human motions in the temporal domain and the substantial divergence of distributions from conditional modalities, leading to a many-to-many mapping problem when generating motion with complex and long texts. +In this work, we address these gaps by 1) elaborating the first dataset pairing long textual descriptions and 3D complex motions (HumanLong3D), and 2) proposing an autoregressive motion diffusion model (AMD). Specifically, AMD integrates the text prompt at the current timestep with the text prompt and action sequences at the previous timestep as conditional information to predict the current action sequences in an iterative manner. +Furthermore, we present its generalization for X-to-Motion with “No Modality Left Behind”, enabling for the first time the generation of high-definition and high-fidelity human motions based on user-defined modality input. 
\ No newline at end of file diff --git a/data/2024/aaai/AMSP-UOD: When Vortex Convolution and Stochastic Perturbation Meet Underwater Object Detection b/data/2024/aaai/AMSP-UOD: When Vortex Convolution and Stochastic Perturbation Meet Underwater Object Detection new file mode 100644 index 0000000000..e8bba779f4 --- /dev/null +++ b/data/2024/aaai/AMSP-UOD: When Vortex Convolution and Stochastic Perturbation Meet Underwater Object Detection @@ -0,0 +1 @@ +In this paper, we present a novel Amplitude-Modulated Stochastic Perturbation and Vortex Convolutional Network, AMSP-UOD, designed for underwater object detection. AMSP-UOD specifically addresses the impact of non-ideal imaging factors on detection accuracy in complex underwater environments. To mitigate the influence of noise on object detection performance, we propose AMSP Vortex Convolution (AMSP-VConv) to disrupt the noise distribution, enhance feature extraction capabilities, effectively reduce parameters, and improve network robustness. We design the Feature Association Decoupling Cross Stage Partial (FAD-CSP) module, which strengthens the association of long- and short-range features, improving the network performance in complex underwater environments. Additionally, our sophisticated post-processing method, based on non-maximum suppression with aspect-ratio similarity thresholds, optimizes detection in dense scenes, such as waterweed and schools of fish, improving object detection accuracy. Extensive experiments on the URPC and RUOD datasets demonstrate that our method outperforms existing state-of-the-art methods in terms of accuracy and noise immunity. AMSP-UOD proposes an innovative solution with the potential for real-world applications. Our code is available at https://github.com/zhoujingchun03/AMSP-UOD. \ No newline at end of file diff --git a/data/2024/aaai/ANEDL: Adaptive Negative Evidential Deep Learning for Open-Set Semi-supervised Learning b/data/2024/aaai/ANEDL: Adaptive Negative Evidential Deep Learning for Open-Set Semi-supervised Learning new file mode 100644 index 0000000000..4bcf5fbeab --- /dev/null +++ b/data/2024/aaai/ANEDL: Adaptive Negative Evidential Deep Learning for Open-Set Semi-supervised Learning @@ -0,0 +1 @@ +Semi-supervised learning (SSL) methods assume that labeled data, unlabeled data and test data are from the same distribution. Open-set semi-supervised learning (Open-set SSL) considers a more practical scenario, where unlabeled data and test data contain new categories (outliers) not observed in labeled data (inliers). Most previous works focused on outlier detection via binary classifiers, which suffer from insufficient scalability and an inability to distinguish different types of uncertainty. In this paper, we propose a novel framework, Adaptive Negative Evidential Deep Learning (ANEDL), to tackle these limitations. Concretely, we first introduce evidential deep learning (EDL) as an outlier detector to quantify different types of uncertainty, and design different uncertainty metrics for self-training and inference. Furthermore, we propose a novel adaptive negative optimization strategy, making EDL more tailored to the unlabeled dataset containing both inliers and outliers. As demonstrated empirically, our proposed method outperforms existing state-of-the-art methods across four datasets.
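The evidential-deep-learning outlier detector mentioned above builds on standard subjective-logic quantities: non-negative evidence is mapped to Dirichlet parameters, and low total evidence yields high uncertainty. The sketch below computes those standard quantities from raw logits; ANEDL's adaptive negative optimization and its specific uncertainty metrics are not shown, and the softplus evidence function is an assumption.

```python
import torch
import torch.nn.functional as F

def edl_uncertainty(logits):
    """Standard evidential-deep-learning quantities from raw logits:
    evidence -> Dirichlet parameters alpha -> per-class belief masses and a
    single vacuity-style uncertainty u = K / sum(alpha)."""
    evidence = F.softplus(logits)             # non-negative evidence per class
    alpha = evidence + 1.0                    # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    belief = evidence / strength              # subjective-logic belief masses
    uncertainty = logits.size(-1) / strength  # high when total evidence is low
    prob = alpha / strength                   # expected class probabilities
    return belief, uncertainty.squeeze(-1), prob

logits = torch.tensor([[8.0, 0.1, 0.2],       # confident, inlier-like output
                       [0.1, 0.2, 0.1]])      # low evidence: likely outlier
_, u, p = edl_uncertainty(logits)
print(u)   # the second sample has much higher uncertainty
```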
\ No newline at end of file diff --git a/data/2024/aaai/AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries b/data/2024/aaai/AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries new file mode 100644 index 0000000000..4bb2eee1f1 --- /dev/null +++ b/data/2024/aaai/AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries @@ -0,0 +1 @@ +DEtection TRansformer (DETR)-based models have achieved remarkable performance. However, they are accompanied by a large computation overhead, which significantly hinders their application on resource-limited devices. Prior arts attempt to reduce the computational burden of DETR using low-bit quantization, but these methods suffer a severe performance drop under weight-activation-attention low-bit quantization. We observe that the number of matching queries and positive samples strongly affects the representation capacity of queries in DETR, and quantizing the queries of DETR further reduces this representational capacity, leading to a severe performance drop. We introduce a new quantization strategy based on Auxiliary Queries for DETR (AQ-DETR), aiming to enhance the capacity of quantized queries. In addition, a layer-by-layer distillation is proposed to reduce the quantization error between quantized attention and its full-precision counterpart. Through our extensive experiments on large-scale open datasets, the performance of the 4-bit quantization of DETR and Deformable DETR models is comparable to their full-precision counterparts. \ No newline at end of file diff --git a/data/2024/aaai/ASWT-SGNN: Adaptive Spectral Wavelet Transform-Based Self-Supervised Graph Neural Network b/data/2024/aaai/ASWT-SGNN: Adaptive Spectral Wavelet Transform-Based Self-Supervised Graph Neural Network new file mode 100644 index 0000000000..a316f43054 --- /dev/null +++ b/data/2024/aaai/ASWT-SGNN: Adaptive Spectral Wavelet Transform-Based Self-Supervised Graph Neural Network @@ -0,0 +1 @@ +Graph Contrastive Learning (GCL) is a self-supervised method that combines the advantages of Graph Convolutional Networks (GCNs) and contrastive learning, making it promising for learning node representations. However, the GCN encoders used in these methods rely on the Fourier transform to learn fixed graph representations, which is inherently limited by the uncertainty principle involving spatial and spectral localization trade-offs. To overcome the inflexibility of existing methods and the computationally expensive eigen-decomposition and dense matrix multiplication, this paper proposes an Adaptive Spectral Wavelet Transform-based Self-Supervised Graph Neural Network (ASWT-SGNN). The proposed method employs spectral adaptive polynomials to approximate the filter function and optimizes the wavelet using a contrastive loss. This design enables the creation of local filters in both spectral and spatial domains, allowing flexible aggregation of neighborhood information at various scales and facilitating controlled transformation between local and global information. Compared to existing methods, the proposed approach reduces computational complexity and addresses the limitation of graph convolutional neural networks, which are constrained by graph size and lack flexible control over the neighborhood aspect. Extensive experiments on eight benchmark datasets demonstrate that ASWT-SGNN accurately approximates the filter function in high-density spectral regions, avoiding costly eigen-decomposition.
Furthermore, ASWT-SGNN achieves comparable performance to state-of-the-art models in node classification tasks. \ No newline at end of file diff --git a/data/2024/aaai/AT4CTR: Auxiliary Match Tasks for Enhancing Click-Through Rate Prediction b/data/2024/aaai/AT4CTR: Auxiliary Match Tasks for Enhancing Click-Through Rate Prediction new file mode 100644 index 0000000000..d7044b8a82 --- /dev/null +++ b/data/2024/aaai/AT4CTR: Auxiliary Match Tasks for Enhancing Click-Through Rate Prediction @@ -0,0 +1 @@ +Click-through rate (CTR) prediction is a vital task in industrial recommendation systems. Most existing methods focus on the network architecture design of the CTR model for better accuracy and suffer from the data sparsity problem. Especially in industrial recommendation systems, the widely applied negative sample down-sampling technique due to resource limitation worsens the problem, resulting in a decline in performance. In this paper, we propose Auxiliary Match Tasks for enhancing Click-Through Rate (AT4CTR) prediction accuracy by alleviating the data sparsity problem. Specifically, we design two match tasks inspired by collaborative filtering to enhance the relevance modeling between user and item. As the "click" action is a strong signal which indicates the user's preference towards the item directly, we make the first match task aim at pulling closer the representation between the user and the item regarding the positive samples. Since the user's past click behaviors can also be treated as the user him/herself, we apply the next item prediction as the second match task. For both the match tasks, we choose the InfoNCE as their loss function. The two match tasks can provide meaningful training signals to speed up the model's convergence and alleviate the data sparsity. We conduct extensive experiments on one public dataset and one large-scale industrial recommendation dataset. The result demonstrates the effectiveness of the proposed auxiliary match tasks. AT4CTR has been deployed in the real industrial advertising system and has gained remarkable revenue. \ No newline at end of file diff --git a/data/2024/aaai/Abstract Action Scheduling for Optimal Temporal Planning via OMT b/data/2024/aaai/Abstract Action Scheduling for Optimal Temporal Planning via OMT new file mode 100644 index 0000000000..817242d687 --- /dev/null +++ b/data/2024/aaai/Abstract Action Scheduling for Optimal Temporal Planning via OMT @@ -0,0 +1,2 @@ +Given the model of a system with explicit temporal constraints, optimal temporal planning is the problem of finding a schedule of actions that achieves a certain goal while optimizing an objective function. Recent approaches for optimal planning reduce the problem to a series of queries to an Optimization Modulo Theory (OMT) solver: each query encodes a bounded version of the problem, with additional abstract actions representing an over-approximation of the plans beyond the bound. This technique suffers from performance issues, mainly due to the looseness of the over-approximation, which can include many non-executable plans. +In this paper, we propose a refined abstraction for solving optimal temporal planning via OMT by introducing abstract scheduling constraints, which have a double purpose. First, they enforce a partial ordering of abstract actions based on mutual dependencies between them, which leads to a better makespan estimation and allows to prove optimality sooner. 
Second, they implicitly forbid circular self-enabling of abstract actions, which is a common cause of spurious models that severely affects performance in existing approaches. We prove the soundness and completeness of the resulting approach and empirically demonstrate its superiority with respect to the state of the art. \ No newline at end of file diff --git a/data/2024/aaai/Abstract and Explore: A Novel Behavioral Metric with Cyclic Dynamics in Reinforcement Learning b/data/2024/aaai/Abstract and Explore: A Novel Behavioral Metric with Cyclic Dynamics in Reinforcement Learning new file mode 100644 index 0000000000..e889d2fde2 --- /dev/null +++ b/data/2024/aaai/Abstract and Explore: A Novel Behavioral Metric with Cyclic Dynamics in Reinforcement Learning @@ -0,0 +1 @@ +Intrinsic motivation lies at the heart of the exploration of reinforcement learning, which is primarily driven by the agent's inherent satisfaction rather than external feedback from the environment. However, in recent more challenging procedurally-generated environments with high stochasticity and uninformative extrinsic rewards, we identify two significant issues of applying intrinsic motivation. (1) State representation collapse: In existing methods, the learned representations within intrinsic motivation have a high probability to neglect the distinction among different states and be distracted by the task-irrelevant information brought by the stochasticity. (2) Insufficient interrelation among dynamics: Unsuccessful guidance provided by the uninformative extrinsic reward makes the dynamics learning in intrinsic motivation less effective. In light of the above observations, a novel Behavioral metric with Cyclic Dynamics (BCD) is proposed, which considers both cumulative and immediate effects and facilitates the abstraction and exploration of the agent. For the behavioral metric, the successor feature is utilized to reveal the expected future rewards and alleviate the heavy reliance of previous methods on extrinsic rewards. Moreover, the latent variable and vector quantization techniques are employed to enable an accurate measurement of the transition function in a discrete and interpretable manner. In addition, cyclic dynamics is established to capture the interrelations between state and action, thereby providing a thorough awareness of environmental dynamics. Extensive experiments conducted on procedurally-generated environments demonstrate the state-of-the-art performance of our proposed BCD. \ No newline at end of file diff --git a/data/2024/aaai/Abstraction of Situation Calculus Concurrent Game Structures b/data/2024/aaai/Abstraction of Situation Calculus Concurrent Game Structures new file mode 100644 index 0000000000..504ce122ce --- /dev/null +++ b/data/2024/aaai/Abstraction of Situation Calculus Concurrent Game Structures @@ -0,0 +1 @@ +We present a general framework for abstracting agent behavior in multi-agent synchronous games in the situation calculus, which provides a first-order representation of the state and allows us to model how plays depend on the data and objects involved. We represent such games as action theories of a special form called situation calculus synchronous game structures (SCSGSs), in which we have a single action "tick" whose effects depend on the combination of moves selected by the players. 
In our framework, one specifies both an abstract SCSGS and a concrete SCSGS, as well as a refinement mapping that specifies how each abstract move is implemented by a Golog program defined over the concrete SCSGS. We define notions of sound and complete abstraction with respect to a mapping over such SCSGS. To express strategic properties on the abstract and concrete games we adopt a first-order variant of alternating-time mu-calculus mu-ATL-FO. We show that we can exploit abstraction in verifying mu-ATL-FO properties of SCSGSs under the assumption that agents can always execute abstract moves to completion even if not fully controlling their outcomes. \ No newline at end of file diff --git a/data/2024/aaai/Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning b/data/2024/aaai/Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning new file mode 100644 index 0000000000..67ed1d5cf8 --- /dev/null +++ b/data/2024/aaai/Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning @@ -0,0 +1 @@ +Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames –games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to some previously visited states where they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which is a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm such as MAPPO. Experiments in the particle-world environment and Google Research Football environment show SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-neurips. \ No newline at end of file diff --git a/data/2024/aaai/Accelerating Adversarially Robust Model Selection for Deep Neural Networks via Racing b/data/2024/aaai/Accelerating Adversarially Robust Model Selection for Deep Neural Networks via Racing new file mode 100644 index 0000000000..10bd5d96c8 --- /dev/null +++ b/data/2024/aaai/Accelerating Adversarially Robust Model Selection for Deep Neural Networks via Racing @@ -0,0 +1,2 @@ +Recent research has introduced several approaches to formally verify the robustness of neural network models against perturbations in their inputs, such as the ones that occur in adversarial attacks. At the same time, this particular verification task is known to be computationally challenging. More specifically, assessing the robustness of a neural network against input perturbations can easily take several hours of compute time per input vector, even when using state-of-the-art verification approaches. 
In light of this, it becomes challenging to select from a given set of neural network models the one that is best in terms of robust accuracy, i.e., the fraction of instances for which the model is known to be robust against adversarial perturbations, especially when given limited computing resources. +To tackle this problem, we propose a racing method specifically adapted to the domain of robustness verification. This racing method utilises Delta-values, which can be seen as an efficiently computable proxy for the distance of a given input to the decision boundary of a neural network model. We present statistical evidence indicating significant differences in the empirical cumulative distribution between robust and non-robust inputs as a function of Delta-values. Using this information, we show that it is possible to reliably expose vulnerabilities in the model with relatively few input iterations. Overall, when applied to selecting the most robust network from sets of 31 MNIST and 27 CIFAR-10 networks, our proposed method achieves speedups of a factor of 108 and 42, respectively, in terms of cumulative running time compared to standard local robustness verification on the complete testing sets. \ No newline at end of file diff --git a/data/2024/aaai/Accelerating Cutting-Plane Algorithms via Reinforcement Learning Surrogates b/data/2024/aaai/Accelerating Cutting-Plane Algorithms via Reinforcement Learning Surrogates new file mode 100644 index 0000000000..e53a5f03ec --- /dev/null +++ b/data/2024/aaai/Accelerating Cutting-Plane Algorithms via Reinforcement Learning Surrogates @@ -0,0 +1,22 @@ +Discrete optimization belongs to the set of NP-hard +problems, spanning fields such as mixed-integer +programming and combinatorial optimization. A current +standard approach to solving convex discrete optimization +problems is the use of cutting-plane algorithms, which +reach optimal solutions by iteratively adding inequalities +known as cuts to refine a feasible set. Despite the existence +of a number of general-purpose cut-generating algorithms, +large-scale discrete optimization problems continue to suffer +from intractability. In this work, we propose a method for +accelerating cutting-plane algorithms via reinforcement +learning. Our approach uses learned policies as surrogates +for NP-hard elements of the cut generating procedure +in a way that (i) accelerates convergence, and (ii) retains +guarantees of optimality. We apply our method on two types +of problems where cutting-plane algorithms are commonly +used: stochastic optimization, and mixed-integer quadratic +programming. We observe the benefits of our method when +applied to Benders decomposition (stochastic optimization) +and iterative loss approximation (quadratic programming), +achieving up to 45% faster average convergence when +compared to modern alternative algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference b/data/2024/aaai/Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference new file mode 100644 index 0000000000..3ee0b3e315 --- /dev/null +++ b/data/2024/aaai/Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference @@ -0,0 +1,3 @@ +Due to the recent success of diffusion models, text-to-image generation is becoming increasingly popular and achieves a wide range of applications.
Among them, text-to-image editing, or continuous text-to-image generation, attracts lots of attention and can potentially improve the quality of generated images. It's common to see that users may want to slightly edit the generated image by making minor modifications to their input textual descriptions for several rounds of diffusion inference. However, such an image editing process suffers from the low inference efficiency of many existing diffusion models even when using GPU accelerators. + +To solve this problem, we introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for efficient text-to-image editing. The key intuition behind our approach is to utilize the semantic mapping between the minor modifications on the input text and the affected regions on the output image. For each text editing step, FISEdit can 1) automatically identify the affected image regions and 2) utilize the cached feature map of the unchanged regions to accelerate the inference process. For the former, we measure the differences between cached and ad hoc feature maps given the modified textual description, extract the region with significant differences, and capture the affected region by masks. For the latter, we develop an efficient sparse diffusion inference engine that only computes the feature maps for the affected region while reusing the cached statistics for the rest of the image. Finally, extensive empirical results show that FISEdit can be 3.4 times and 4.4 times faster than existing methods on NVIDIA TITAN RTX and A100 GPUs respectively, and even generates more satisfactory images. \ No newline at end of file diff --git a/data/2024/aaai/Accelerating the Global Aggregation of Local Explanations b/data/2024/aaai/Accelerating the Global Aggregation of Local Explanations new file mode 100644 index 0000000000..d20e33c3c1 --- /dev/null +++ b/data/2024/aaai/Accelerating the Global Aggregation of Local Explanations @@ -0,0 +1,7 @@ +Local explanation methods highlight the input tokens that have a considerable impact on the outcome of classifying the document at hand. For example, the Anchor algorithm applies a statistical analysis of the sensitivity of the classifier to changes in the token. Aggregating local explanations over a dataset provides a global explanation of the model. +Such aggregation aims to detect words with the most impact, giving valuable insights about the model, like what it has learned in training and which adversarial examples expose its weaknesses. +However, standard aggregation methods bear a high computational cost: +a naive implementation applies a costly algorithm to each token of each document, and hence, it is infeasible for a typical user operating within a short analysis session. + +We devise techniques for accelerating the global aggregation of the Anchor algorithm. Specifically, our goal is to compute a set of top-k words with the highest global impact according to different aggregation functions. Some of our techniques are lossless and some are lossy. +We show that for a very mild loss of quality, we are able to accelerate the computation by up to 30 times, reducing the computation from hours to minutes. We also devise and study a probabilistic model that accounts for noise in the Anchor algorithm and diminishes the bias toward words that are frequent yet low in impact.
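To make the aggregation step in the "Accelerating the Global Aggregation of Local Explanations" abstract above concrete, here is a naive, hypothetical sketch of global aggregation: each document contributes a token-to-impact map (as an Anchor-style explainer might produce), and the global explanation is the top-k tokens under a chosen aggregation function. The paper's acceleration techniques and probabilistic noise model are not shown; all names are illustrative.

```python
from collections import defaultdict
from heapq import nlargest

def aggregate_global_explanation(local_explanations, k=3, agg="mean"):
    """Naive global aggregation: combine per-document token impact scores into one
    top-k list. `local_explanations` is a list of dicts mapping token -> local score."""
    sums, counts = defaultdict(float), defaultdict(int)
    for doc_scores in local_explanations:
        for token, score in doc_scores.items():
            sums[token] += score
            counts[token] += 1
    if agg == "mean":                      # average impact per occurrence
        global_scores = {t: sums[t] / counts[t] for t in sums}
    else:                                  # total impact over the corpus
        global_scores = dict(sums)
    return nlargest(k, global_scores.items(), key=lambda kv: kv[1])

# Toy corpus of three local explanations; the costly part in practice is producing these.
docs = [
    {"refund": 0.9, "great": 0.1, "slow": 0.4},
    {"refund": 0.8, "slow": 0.5},
    {"great": 0.7, "fast": 0.6},
]
print(aggregate_global_explanation(docs, k=2))
```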
\ No newline at end of file diff --git a/data/2024/aaai/Accurate Parameter Estimation for Safety-Critical Systems with Unmodeled Dynamics (Abstract Reprint) b/data/2024/aaai/Accurate Parameter Estimation for Safety-Critical Systems with Unmodeled Dynamics (Abstract Reprint) new file mode 100644 index 0000000000..a8c0ad28ac --- /dev/null +++ b/data/2024/aaai/Accurate Parameter Estimation for Safety-Critical Systems with Unmodeled Dynamics (Abstract Reprint) @@ -0,0 +1 @@ +Analysis and synthesis of safety-critical autonomous systems are carried out using models which are often dynamic. Two central features of these dynamic systems are parameters and unmodeled dynamics. Much of feedback control design is parametric in nature and as such, accurate and fast estimation of the parameters in the modeled part of the dynamic system is a crucial property for designing risk-aware autonomous systems. This paper addresses the use of a spectral lines-based approach for estimating parameters of the dynamic model of an autonomous system. Existing literature has treated all unmodeled components of the dynamic system as sub-Gaussian noise and proposed parameter estimation using Gaussian noise-based exogenous signals. In contrast, we allow the unmodeled part to have deterministic unmodeled dynamics, which are almost always present in physical systems, in addition to sub-Gaussian noise. In addition, we propose a deterministic construction of the exogenous signal in order to carry out parameter estimation. We introduce a new tool kit which employs the theory of spectral lines, retains the stochastic setting, and leads to non-asymptotic bounds on the parameter estimation error. Unlike the existing stochastic approach, these bounds are tunable through an optimal choice of the spectrum of the exogenous signal leading to accurate parameter estimation. We also show that this estimation is robust to unmodeled dynamics, a property that is not assured by the existing approach. Finally, we show that under ideal conditions with no deterministic unmodeled dynamics, the proposed approach can ensure a Õ(√t) Regret, matching existing literature. Experiments are provided to support all theoretical derivations, which show that the spectral lines-based approach outperforms the Gaussian noise-based method when unmodeled dynamics are present, in terms of both parameter estimation error and Regret obtained using the parameter estimates with a Linear Quadratic Regulator in feedback. \ No newline at end of file diff --git a/data/2024/aaai/Active Learning Guided by Efficient Surrogate Learners b/data/2024/aaai/Active Learning Guided by Efficient Surrogate Learners new file mode 100644 index 0000000000..40c6884fee --- /dev/null +++ b/data/2024/aaai/Active Learning Guided by Efficient Surrogate Learners @@ -0,0 +1 @@ +Re-training a deep learning model each time a single data point receives a new label is impractical due to the inherent complexity of the training process. Consequently, existing active learning (AL) algorithms tend to adopt a batch-based approach where, during each AL iteration, a set of data points is collectively chosen for annotation. However, this strategy frequently leads to redundant sampling, ultimately eroding the efficacy of the labeling procedure. In this paper, we introduce a new AL algorithm that harnesses the power of a Gaussian process surrogate in conjunction with the neural network principal learner. 
Our proposed model adeptly updates the surrogate learner for every new data instance, enabling it to emulate and capitalize on the continuous learning dynamics of the neural network without necessitating a complete retraining of the principal model for each individual label. Experiments on four benchmark datasets demonstrate that this approach yields significant enhancements, either rivaling or aligning with the performance of state-of-the-art techniques. \ No newline at end of file diff --git a/data/2024/aaai/Active Reinforcement Learning for Robust Building Control b/data/2024/aaai/Active Reinforcement Learning for Robust Building Control new file mode 100644 index 0000000000..0609cbe82b --- /dev/null +++ b/data/2024/aaai/Active Reinforcement Learning for Robust Building Control @@ -0,0 +1 @@ +Reinforcement learning (RL) is a powerful tool for optimal control that has found great success in Atari games, the game of Go, robotic control, and building optimization. RL is also very brittle; agents often overfit to their training environment and fail to generalize to new settings. Unsupervised environment design (UED) has been proposed as a solution to this problem, in which the agent trains in environments that have been specially selected to help it learn. Previous UED algorithms focus on trying to train an RL agent that generalizes across a large distribution of environments. This is not necessarily desirable when we wish to prioritize performance in one environment over others. In this work, we will be examining the setting of robust RL building control, where we wish to train an RL agent that prioritizes performing well in normal weather while still being robust to extreme weather conditions. We demonstrate a novel UED algorithm, ActivePLR, that uses uncertainty-aware neural network architectures to generate new training environments at the limit of the RL agent's ability while being able to prioritize performance in a desired base environment. We show that ActivePLR is able to outperform state-of-the-art UED algorithms in minimizing energy usage while maximizing occupant comfort in the setting of building control. \ No newline at end of file diff --git a/data/2024/aaai/Actor Prioritized Experience Replay (Abstract Reprint) b/data/2024/aaai/Actor Prioritized Experience Replay (Abstract Reprint) new file mode 100644 index 0000000000..3706841fd1 --- /dev/null +++ b/data/2024/aaai/Actor Prioritized Experience Replay (Abstract Reprint) @@ -0,0 +1 @@ +A widely-studied deep reinforcement learning (RL) technique known as Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error. Although it has been shown that PER is one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it considerably underperforms off-policy actor-critic algorithms. We theoretically show that actor networks cannot be effectively trained with transitions that have large TD errors. As a result, the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a novel experience replay sampling framework for actor-critic methods, which also regards issues with stability and recent findings behind the poor empirical performance of PER. 
The introduced algorithm suggests a new branch of improvements to PER and schedules effective and efficient training for both actor and critic networks. An extensive set of experiments verifies our theoretical findings, showing that our method outperforms competing approaches and achieves state-of-the-art results over the standard off-policy actor-critic algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations b/data/2024/aaai/Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations new file mode 100644 index 0000000000..d4dc4db5d7 --- /dev/null +++ b/data/2024/aaai/Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations @@ -0,0 +1 @@ +Retrieval models aim at selecting a small set of item candidates which match the preference of a given user. They play a vital role in large-scale recommender systems since subsequent models such as rankers highly depend on the quality of item candidates. However, most existing retrieval models employ a single-round inference paradigm, which may not adequately capture the dynamic nature of user preferences and can get stuck in one area of the item space. In this paper, we propose Ada-Retrieval, an adaptive multi-round retrieval paradigm for recommender systems that iteratively refines user representations to better capture potential candidates in the full item space. Ada-Retrieval comprises two key modules: the item representation adapter and the user representation adapter, designed to inject context information into items' and users' representations. The framework maintains a model-agnostic design, allowing seamless integration with various backbone models such as RNNs or Transformers. We perform experiments on three widely used public datasets, incorporating five powerful sequential recommenders as backbone models. Our results demonstrate that Ada-Retrieval significantly enhances the performance of various base models, with consistent improvements observed across different datasets. Our code and data are publicly available at: https://github.com/ll0ruc/Ada-Retrieval. \ No newline at end of file diff --git a/data/2024/aaai/AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection b/data/2024/aaai/AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection new file mode 100644 index 0000000000..ca638f42e9 --- /dev/null +++ b/data/2024/aaai/AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection @@ -0,0 +1 @@ +Code Clone Detection, which aims to retrieve functionally similar programs from large code bases, has been attracting increasing attention. Modern software often involves a diverse range of programming languages. However, current code clone detection methods are generally limited to only a few popular programming languages due to insufficient annotated data as well as their own model design constraints. To address these issues, we present AdaCCD, a novel cross-lingual adaptation method that can detect cloned code in a new language without annotations in that language. AdaCCD leverages language-agnostic code representations from pre-trained programming language models and proposes an Adaptively Refined Contrastive Learning framework to transfer knowledge from resource-rich languages to resource-poor languages.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages. AdaCCD achieves significant improvements over other baselines, and achieve comparable performance to supervised fine-tuning. \ No newline at end of file diff --git a/data/2024/aaai/AdaFormer: Efficient Transformer with Adaptive Token Sparsification for Image Super-resolution b/data/2024/aaai/AdaFormer: Efficient Transformer with Adaptive Token Sparsification for Image Super-resolution new file mode 100644 index 0000000000..8b78070942 --- /dev/null +++ b/data/2024/aaai/AdaFormer: Efficient Transformer with Adaptive Token Sparsification for Image Super-resolution @@ -0,0 +1 @@ +Efficient transformer-based models have made remarkable progress in image super-resolution (SR). Most of these works mainly design elaborate structures to accelerate the inference of the transformer, where all feature tokens are propagated equally. However, they ignore the underlying characteristic of image content, i.e., various image regions have distinct restoration difficulties, especially for large images (2K-8K), failing to achieve adaptive inference. In this work, we propose an adaptive token sparsification transformer (AdaFormer) to speed up the model inference for image SR. Specifically, a texture-relevant sparse attention block with parallel global and local branches is introduced, aiming to integrate informative tokens from the global view instead of only in fixed local windows. Then, an early-exit strategy is designed to progressively halt tokens according to the token importance. To estimate the plausibility of each token, we adopt a lightweight confidence estimator, which is constrained by an uncertainty-guided loss to obtain a binary halting mask about the tokens. Experiments on large images have illustrated that our proposal reduces nearly 90% latency against SwinIR on Test8K, while maintaining a comparable performance. \ No newline at end of file diff --git a/data/2024/aaai/AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing b/data/2024/aaai/AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing new file mode 100644 index 0000000000..b8a9a7a713 --- /dev/null +++ b/data/2024/aaai/AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing @@ -0,0 +1 @@ +With the great success of text-conditioned diffusion models in creative text-to-image generation, various text-driven image editing approaches have attracted the attentions of many researchers. However, previous works mainly focus on discreteness-sensitive instructions such as adding, removing or replacing specific objects, background elements or global styles (i.e., “hard editing”), while generally ignoring subject-binding but semantically fine-changing continuity-sensitive instructions such as actions, poses or adjectives, and so on (i.e., “soft editing”), which hampers generative AI from generating user-customized visual contents. To mitigate this predicament, we propose a spatio-temporal guided adaptive editing algorithm AdapEdit, which realizes adaptive image editing by introducing a soft-attention strategy to dynamically vary the guiding degree from the editing conditions to visual pixels from both temporal and spatial perspectives. 
Note our approach has a significant advantage in preserving model priors and does not require model training, fine-tuning, extra data, or optimization. We present our results over a wide variety of raw images and editing instructions, demonstrating competitive performance and showing it significantly outperforms the previous approaches. Code is available: https://github.com/AnonymousPony/adap-edit. \ No newline at end of file diff --git a/data/2024/aaai/Adapted Weighted Aggregation in Federated Learning b/data/2024/aaai/Adapted Weighted Aggregation in Federated Learning new file mode 100644 index 0000000000..37a24e350e --- /dev/null +++ b/data/2024/aaai/Adapted Weighted Aggregation in Federated Learning @@ -0,0 +1,2 @@ +This study introduces FedAW, a novel federated learning algorithm that uses a weighted aggregation mechanism sensitive to the quality of client datasets, leading to better model +performance and faster convergence on diverse datasets, validated using Colored MNIST. \ No newline at end of file diff --git a/data/2024/aaai/AdapterGNN: Parameter-Efficient Fine-Tuning Improves Generalization in GNNs b/data/2024/aaai/AdapterGNN: Parameter-Efficient Fine-Tuning Improves Generalization in GNNs new file mode 100644 index 0000000000..95d0fedc1c --- /dev/null +++ b/data/2024/aaai/AdapterGNN: Parameter-Efficient Fine-Tuning Improves Generalization in GNNs @@ -0,0 +1 @@ +Fine-tuning pre-trained models has recently yielded remarkable performance gains in graph neural networks (GNNs). In addition to pre-training techniques, inspired by the latest work in the natural language fields, more recent work has shifted towards applying effective fine-tuning approaches, such as parameter-efficient fine-tuning (PEFT). However, given the substantial differences between GNNs and transformer-based models, applying such approaches directly to GNNs proved to be less effective. In this paper, we present a comprehensive comparison of PEFT techniques for GNNs and propose a novel PEFT method specifically designed for GNNs, called AdapterGNN. AdapterGNN preserves the knowledge of the large pre-trained model and leverages highly expressive adapters for GNNs, which can adapt to downstream tasks effectively with only a few parameters, while also improving the model's generalization ability. Extensive experiments show that AdapterGNN achieves higher performance than other PEFT methods and is the only one consistently surpassing full fine-tuning (outperforming it by 1.6% and 5.7% in the chemistry and biology domains respectively, with only 5% and 4% of its parameters tuned) with lower generalization gaps. Moreover, we empirically show that a larger GNN model can have a worse generalization ability, which differs from the trend observed in large transformer-based models. Building upon this, we provide a theoretical justification for PEFT can improve generalization of GNNs by applying generalization bounds. Our code is available at https://github.com/Lucius-lsr/AdapterGNN. 
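The AdapterGNN abstract above describes inserting lightweight adapters into a frozen pre-trained GNN so that only a few parameters are tuned downstream. The sketch below shows the generic bottleneck-adapter pattern (down-projection, nonlinearity, up-projection, residual) wrapped around a stand-in message-passing layer; the exact adapter design and placement used by AdapterGNN may differ, so treat the names and shapes as illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class Adapter:
    """Generic bottleneck adapter: down-project -> ReLU -> up-project, plus a residual.
    This is the standard PEFT adapter pattern, not necessarily AdapterGNN's exact module."""
    def __init__(self, dim, bottleneck):
        self.W_down = rng.standard_normal((dim, bottleneck)) * 0.01
        self.W_up = np.zeros((bottleneck, dim))   # zero-init so the adapter starts as identity

    def __call__(self, h):
        return h + np.maximum(h @ self.W_down, 0.0) @ self.W_up

def frozen_gnn_layer(h, adj):
    # Stand-in for a pre-trained (frozen) message-passing layer: mean-aggregate neighbours.
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    return (adj @ h) / deg

# Toy graph with 4 nodes and 8-dim features; only the adapters would be trained downstream.
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
h = rng.standard_normal((4, 8))
adapters = [Adapter(8, 2) for _ in range(2)]
for adapter in adapters:                 # an adapter is inserted after each frozen GNN layer
    h = adapter(frozen_gnn_layer(h, adj))
print(h.shape)  # (4, 8)
```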
\ No newline at end of file diff --git a/data/2024/aaai/Adapting Animal Models to Assess Sufficiency of Fluid Resuscitation in Humans (Student Abstract) b/data/2024/aaai/Adapting Animal Models to Assess Sufficiency of Fluid Resuscitation in Humans (Student Abstract) new file mode 100644 index 0000000000..7d7f24ba3d --- /dev/null +++ b/data/2024/aaai/Adapting Animal Models to Assess Sufficiency of Fluid Resuscitation in Humans (Student Abstract) @@ -0,0 +1 @@ +Fluid resuscitation is an initial treatment frequently employed to treat shock, restore lost blood, protect tissues from injury, and prevent organ dysfunction in critically ill patients. However, it is not without risk (e.g., overly aggressive resuscitation may cause organ damage and even death). We leverage machine learning models trained to assess sufficiency of resuscitation in laboratory animals subjected to induced hemorrhage and transfer them to use with human trauma patients. Our key takeaway is that animal experiments and models can inform human healthcare, especially when human data is limited or when collecting relevant human data via potentially harmful protocols is unfeasible. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Discovering and Merging for Incremental Novel Class Discovery b/data/2024/aaai/Adaptive Discovering and Merging for Incremental Novel Class Discovery new file mode 100644 index 0000000000..467e1d58a0 --- /dev/null +++ b/data/2024/aaai/Adaptive Discovering and Merging for Incremental Novel Class Discovery @@ -0,0 +1 @@ +One important desideratum of lifelong learning aims to discover novel classes from unlabelled data in a continuous manner. The central challenge is twofold: discovering and learning novel classes while mitigating the issue of catastrophic forgetting of established knowledge. To this end, we introduce a new paradigm called Adaptive Discovering and Merging (ADM) to discover novel categories adaptively in the incremental stage and integrate novel knowledge into the model without affecting the original knowledge. To discover novel classes adaptively, we decouple representation learning and novel class discovery, and use Triple Comparison (TC) and Probability Regularization (PR) to constrain the probability discrepancy and diversity for adaptive category assignment. To merge the learned novel knowledge adaptively, we propose a hybrid structure with base and novel branches named Adaptive Model Merging (AMM), which reduces the interference of the novel branch on the old classes to preserve the previous knowledge, and merges the novel branch to the base model without performance loss and parameter growth. Extensive experiments on several datasets show that ADM significantly outperforms existing class-incremental Novel Class Discovery (class-iNCD) approaches. Moreover, our AMM also benefits the class-incremental Learning (class-IL) task by alleviating the catastrophic forgetting problem. The source code is included in the supplementary materials. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive FSS: A Novel Few-Shot Segmentation Framework via Prototype Enhancement b/data/2024/aaai/Adaptive FSS: A Novel Few-Shot Segmentation Framework via Prototype Enhancement new file mode 100644 index 0000000000..6c0b918004 --- /dev/null +++ b/data/2024/aaai/Adaptive FSS: A Novel Few-Shot Segmentation Framework via Prototype Enhancement @@ -0,0 +1 @@ +The Few-Shot Segmentation (FSS) aims to accomplish the novel class segmentation task with a few annotated images. 
Current FSS research based on meta-learning focuses on designing a complex interaction mechanism between the query and support feature. However, unlike humans who can rapidly learn new things from limited samples, the existing approach relies solely on fixed feature matching to tackle new tasks, lacking adaptability. In this paper, we propose a novel framework based on the adapter mechanism, namely Adaptive FSS, which can efficiently adapt the existing FSS model to the novel classes. In detail, we design the Prototype Adaptive Module (PAM), which utilizes accurate category information provided by the support set to derive class prototypes, enhancing class-specific information in the multi-stage representation. In addition, our approach is compatible with diverse FSS methods with different backbones by simply inserting PAM between the layers of the encoder. Experiments demonstrate that our method effectively improves the performance of the FSS models (e.g., MSANet, HDMNet, FPTrans, and DCAMA) and achieves new state-of-the-art (SOTA) results (i.e., 72.4% and 79.1% mIoU on PASCAL-5i 1-shot and 5-shot settings, 52.7% and 60.0% mIoU on COCO-20i 1-shot and 5-shot settings). Our code is available at https://github.com/jingw193/AdaptiveFSS. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Feature Imputation with Latent Graph for Deep Incomplete Multi-View Clustering b/data/2024/aaai/Adaptive Feature Imputation with Latent Graph for Deep Incomplete Multi-View Clustering new file mode 100644 index 0000000000..4965569c7d --- /dev/null +++ b/data/2024/aaai/Adaptive Feature Imputation with Latent Graph for Deep Incomplete Multi-View Clustering @@ -0,0 +1 @@ +In recent years, incomplete multi-view clustering (IMVC), which studies the challenging multi-view clustering problem on missing views, has received growing research interests. Previous IMVC methods suffer from the following issues: (1) the inaccurate imputation for missing data, which leads to suboptimal clustering performance, and (2) most existing IMVC models merely consider the explicit presence of graph structure in data, ignoring the fact that latent graphs of different views also provide valuable information for the clustering task. To overcome such challenges, we present a novel method, termed Adaptive feature imputation with latent graph for incomplete multi-view clustering (AGDIMC). Specifically, it captures the embbedded features of each view by incorporating the view-specific deep encoders. Then, we construct partial latent graphs on complete data, which can consolidate the intrinsic relationships within each view while preserving the topological information. With the aim of estimating the missing sample based on the available information, we utilize an adaptive imputation layer to impute the embedded feature of missing data by using cross-view soft cluster assignments and global cluster centroids. As the imputation progresses, the portion of complete data increases, contributing to enhancing the discriminative information contained in global pseudo-labels. Meanwhile, to alleviate the negative impact caused by inferior impute samples and the discrepancy of cluster structures, we further design an adaptive imputation strategy based on the global pseudo-label and the local cluster assignment. Experimental results on multiple real-world datasets demonstrate the effectiveness of our method over existing approaches. 
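The AGDIMC abstract above imputes a missing view's embedded feature from cross-view soft cluster assignments and global cluster centroids. A minimal, hypothetical sketch of that idea is given below: the soft assignment computed in an observed view is used to mix the missing view's centroids into an imputed feature. The paper's actual adaptive imputation layer and strategy are more involved, and all names here are illustrative.

```python
import numpy as np

def soft_assign(z, centroids, temperature=1.0):
    """Soft cluster assignment from distances to global centroids (softmax over -distance)."""
    d = ((z[None, :] - centroids) ** 2).sum(axis=1)
    logits = -d / temperature
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def impute_missing_view(z_observed, centroids_observed, centroids_missing):
    """Use the assignment from an observed view to mix the missing view's centroids."""
    q = soft_assign(z_observed, centroids_observed)
    return q @ centroids_missing          # convex combination of the missing view's centroids

rng = np.random.default_rng(0)
centroids_v1 = rng.standard_normal((3, 4))   # 3 shared clusters, 4-dim embeddings in view 1
centroids_v2 = rng.standard_normal((3, 6))   # same clusters, 6-dim embeddings in view 2
z_v1 = centroids_v1[1] + 0.05 * rng.standard_normal(4)   # sample observed only in view 1
print(impute_missing_view(z_v1, centroids_v1, centroids_v2).shape)  # (6,) imputed view-2 feature
```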
\ No newline at end of file diff --git a/data/2024/aaai/Adaptive Graph Learning for Multimodal Conversational Emotion Detection b/data/2024/aaai/Adaptive Graph Learning for Multimodal Conversational Emotion Detection new file mode 100644 index 0000000000..5dc3dec37c --- /dev/null +++ b/data/2024/aaai/Adaptive Graph Learning for Multimodal Conversational Emotion Detection @@ -0,0 +1 @@ +Multimodal Emotion Recognition in Conversations (ERC) aims to identify the emotions conveyed by each utterance in a conversational video. Current efforts encounter challenges in balancing intra- and inter-speaker context dependencies when tackling intra-modal interactions. This balance is vital as it encompasses modeling self-dependency (emotional inertia) where speakers' own emotions affect them and modeling interpersonal dependencies (empathy) where counterparts' emotions influence a speaker. Furthermore, challenges arise in addressing cross-modal interactions that involve content with conflicting emotions across different modalities. To address this issue, we introduce an adaptive interactive graph network (IGN) called AdaIGN that employs the Gumbel Softmax trick to adaptively select nodes and edges, enhancing intra- and cross-modal interactions. Unlike undirected graphs, we use a directed IGN to prevent future utterances from impacting the current one. Next, we propose Node- and Edge-level Selection Policies (NESP) to guide node and edge selection, along with a Graph-Level Selection Policy (GSP) to integrate the utterance representation from original IGN and NESP-enhanced IGN. Moreover, we design a task-specific loss function that prioritizes text modality and intra-speaker context selection. To reduce computational complexity, we use pre-defined pseudo labels through self-supervised methods to mask unnecessary utterance nodes for selection. Experimental results show that AdaIGN outperforms state-of-the-art methods on two popular datasets. Our code will be available at https://github.com/TuGengs/AdaIGN. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Hardness Negative Sampling for Collaborative Filtering b/data/2024/aaai/Adaptive Hardness Negative Sampling for Collaborative Filtering new file mode 100644 index 0000000000..f2c101c4b6 --- /dev/null +++ b/data/2024/aaai/Adaptive Hardness Negative Sampling for Collaborative Filtering @@ -0,0 +1 @@ +Negative sampling is essential for implicit collaborative filtering to provide proper negative training signals so as to achieve desirable performance. We experimentally unveil a common limitation of all existing negative sampling methods that they can only select negative samples of a fixed hardness level, leading to the false positive problem (FPP) and false negative problem (FNP). We then propose a new paradigm called adaptive hardness negative sampling (AHNS) and discuss its three key criteria. By adaptively selecting negative samples with appropriate hardnesses during the training process, AHNS can well mitigate the impacts of FPP and FNP. 
Next, we present a concrete instantiation of AHNS called AHNS_{p \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Integration of Partial Label Learning and Negative Learning for Enhanced Noisy Label Learning b/data/2024/aaai/Adaptive Integration of Partial Label Learning and Negative Learning for Enhanced Noisy Label Learning new file mode 100644 index 0000000000..a5a59f2052 --- /dev/null +++ b/data/2024/aaai/Adaptive Integration of Partial Label Learning and Negative Learning for Enhanced Noisy Label Learning @@ -0,0 +1 @@ +There has been significant attention devoted to the effectiveness of various domains, such as semi-supervised learning, contrastive learning, and meta-learning, in enhancing the performance of methods for noisy label learning (NLL) tasks. However, most existing methods still depend on prior assumptions regarding clean samples amidst different sources of noise (e.g., a pre-defined drop rate or a small subset of clean samples). In this paper, we propose a simple yet powerful idea called NPN, which revolutionizes Noisy label learning by integrating Partial label learning (PLL) and Negative learning (NL). Toward this goal, we initially decompose the given label space adaptively into the candidate and complementary labels, thereby establishing the conditions for PLL and NL. We propose two adaptive data-driven paradigms of label disambiguation for PLL: hard disambiguation and soft disambiguation. Furthermore, we generate reliable complementary labels using all non-candidate labels for NL to enhance model robustness through indirect supervision. To maintain label reliability during the later stage of model training, we introduce a consistency regularization term that encourages agreement between the outputs of multiple augmentations. Experiments conducted on both synthetically corrupted and real-world noisy datasets demonstrate the superiority of NPN compared to other state-of-the-art (SOTA) methods. The source code has been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/NPN. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Meta-Learning Probabilistic Inference Framework for Long Sequence Prediction b/data/2024/aaai/Adaptive Meta-Learning Probabilistic Inference Framework for Long Sequence Prediction new file mode 100644 index 0000000000..ccbb47b2ca --- /dev/null +++ b/data/2024/aaai/Adaptive Meta-Learning Probabilistic Inference Framework for Long Sequence Prediction @@ -0,0 +1 @@ +Long sequence prediction has broad and significant application value in fields such as finance, wind power, and weather. However, the complex long-term dependencies of long sequence data and the potential domain shift problems limit the effectiveness of traditional models in practical scenarios. To this end, we propose an Adaptive Meta-Learning Probabilistic Inference Framework (AMPIF) based on sequence decomposition, which can effectively enhance the long sequence prediction ability of various basic models. Specifically, first, we decouple complex sequences into seasonal and trend components through a frequency domain decomposition module. Then, we design an adaptive meta-learning task construction strategy, which divides the seasonal and trend components into different tasks through a clustering-matching approach. 
Finally, we design a dual-stream amortized network (ST-DAN) to capture shared information between seasonal-trend tasks and use the support set to generate task-specific parameters for rapid generalization learning on the query set. We conducted extensive experiments on six datasets, including wind power and finance scenarios, and the results show that our method significantly outperforms baseline methods in prediction accuracy, interpretability, and algorithm stability and can effectively enhance the long sequence prediction capabilities of base models. The source code is publicly available at https://github.com/Zhu-JP/AMPIF. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Prompt Routing for Arbitrary Text Style Transfer with Pre-trained Language Models b/data/2024/aaai/Adaptive Prompt Routing for Arbitrary Text Style Transfer with Pre-trained Language Models new file mode 100644 index 0000000000..e31b7b4e36 --- /dev/null +++ b/data/2024/aaai/Adaptive Prompt Routing for Arbitrary Text Style Transfer with Pre-trained Language Models @@ -0,0 +1 @@ +Recently, arbitrary text style transfer (TST) has made significant progress with the paradigm of prompt learning. In this paradigm, researchers often design or search for a fixed prompt for any input. However, existing evidence shows that large language models (LLMs) are prompt-sensitive and it is sub-optimal to apply the same prompt to any input for downstream TST tasks. Besides, the prompts obtained by searching are often unreadable and unexplainable to humans. To address these issues, we propose an Adaptive Prompt Routing (APR) framework to adaptively route prompts from a human-readable prompt set for various input texts and given styles. Specifically, we first construct a candidate prompt set of diverse and human-readable prompts for the target style. This set consists of several seed prompts and their variants paraphrased by an LLM. Subsequently, we train a prompt routing model to select the optimal prompts efficiently according to inputs. The adaptively selected prompt can guide the LLMs to perform a precise style transfer for each input sentence while maintaining readability for humans. Extensive experiments on 4 public TST benchmarks over 3 popular LLMs (with parameter sizes ranging from 1.5B to 175B) demonstrate that our APR achieves superior style transfer performances, compared to the state-of-the-art prompt-based and fine-tuning methods. The source code is available at https://github.com/DwyaneLQY/APR \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Reactive Synthesis for LTL and LTLf Modulo Theories b/data/2024/aaai/Adaptive Reactive Synthesis for LTL and LTLf Modulo Theories new file mode 100644 index 0000000000..0a52ec1721 --- /dev/null +++ b/data/2024/aaai/Adaptive Reactive Synthesis for LTL and LTLf Modulo Theories @@ -0,0 +1 @@ +Reactive synthesis is the process of generating correct controllers from temporal logic specifications. Typically, synthesis is restricted to Boolean specifications in LTL. Recently, a Boolean abstraction technique has made it possible to translate LTLT specifications that contain literals in theories into equi-realizable LTL specifications, but no full synthesis procedure exists yet. In synthesis modulo theories, the system receives valuations of environment variables (from a first-order theory T) and outputs valuations of system variables from T.
In this paper, we address how to synthesize a full controller by combining the static Boolean controller obtained from the Booleanized LTL specification with on-the-fly queries to a solver that produces models of satisfiable existential T formulae. This is the first synthesis method for LTL modulo theories. Additionally, our method can produce adaptive responses, which increases explainability and can improve runtime properties like performance. Our approach is applicable to both LTL modulo theories and LTLf modulo theories. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Shortcut Debiasing for Online Continual Learning b/data/2024/aaai/Adaptive Shortcut Debiasing for Online Continual Learning new file mode 100644 index 0000000000..1103f558e5 --- /dev/null +++ b/data/2024/aaai/Adaptive Shortcut Debiasing for Online Continual Learning @@ -0,0 +1 @@ +We propose a novel framework DropTop that suppresses the shortcut bias in online continual learning (OCL) while being adaptive to the varying degree of the shortcut bias incurred by a continuously changing environment. Based on the observed high-attention property of the shortcut bias, highly-activated features are considered candidates for debiasing. More importantly, resolving the limitation of the online environment where prior knowledge and auxiliary data are not readily available, two novel techniques---feature map fusion and adaptive intensity shifting---enable us to automatically determine the appropriate level and proportion of the candidate shortcut features to be dropped. Extensive experiments on five benchmark datasets demonstrate that, when combined with various OCL algorithms, DropTop increases the average accuracy by up to 10.4% and decreases the forgetting by up to 63.2%. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval b/data/2024/aaai/Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval new file mode 100644 index 0000000000..73c45905df --- /dev/null +++ b/data/2024/aaai/Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval @@ -0,0 +1 @@ +Text-based person retrieval aims at retrieving a specific pedestrian image from a gallery based on textual descriptions. The primary challenge is how to overcome the inherent heterogeneous modality gap in the situation of significant intra-class variation and minimal inter-class variation. Existing approaches commonly employ vision-language pre-training or attention mechanisms to learn appropriate cross-modal alignments from noisy inputs. Despite commendable progress, current methods inevitably suffer from two defects: 1) Matching ambiguity, which mainly derives from unreliable matching pairs; 2) One-sided cross-modal alignments, stemming from the absence of exploring one-to-many correspondence, i.e., coarse-grained semantic alignment. These critical issues significantly deteriorate retrieval performance. To this end, we propose a novel framework termed Adaptive Uncertainty-based Learning (AUL) for text-based person retrieval from the uncertainty perspective.
Specifically, our AUL framework consists of three key components: 1) Uncertainty-aware Matching Filtration that leverages Subjective Logic to effectively mitigate the disturbance of unreliable matching pairs and select high-confidence cross-modal matches for training; 2) Uncertainty-based Alignment Refinement, which not only simulates coarse-grained alignments by constructing uncertainty representations but also performs progressive learning to incorporate coarse- and fine-grained alignments properly; 3) Cross-modal Masked Modeling that aims at exploring more comprehensive relations between vision and language. Extensive experiments demonstrate that our AUL method consistently achieves state-of-the-art performance on three benchmark datasets in supervised, weakly supervised, and domain generalization settings. Our code is available at https://github.com/CFM-MSG/Code-AUL. \ No newline at end of file diff --git a/data/2024/aaai/Addressing Digital and AI Skills Gaps in European Living Areas: A Comparative Analysis of Small and Large Communities b/data/2024/aaai/Addressing Digital and AI Skills Gaps in European Living Areas: A Comparative Analysis of Small and Large Communities new file mode 100644 index 0000000000..1743a0fa99 --- /dev/null +++ b/data/2024/aaai/Addressing Digital and AI Skills Gaps in European Living Areas: A Comparative Analysis of Small and Large Communities @@ -0,0 +1 @@ +As Artificial Intelligence (AI) continues to permeate various aspects of societies, understanding the disparities in AI knowledge and skills across different living areas becomes imperative. Small living areas have emerged as significant contributors to Europe's economy, offering an alternative to the bustling environment of larger cities for those seeking an improved quality of life. Nonetheless, they often encounter challenges related to digital infrastructure, access to financial resources, and digital skills gaps, limiting their economic and social growth prospects. This study investigates the digital and AI skills gaps in the context of small and large European living areas, shedding light on the potential hindrances to unleashing the full economic and social potentials of these regions in an AI-enabled economy. Drawing from a comprehensive dataset encompassing 4,006 respondents across eight EU countries, this research examines the current perceptions and understandings of AI and digital skills within two distinct population groups: residents of smaller living areas and their counterparts in larger communities. Through bivariate analysis, notable insights are revealed concerning trust in AI solutions and entities, self-assessed digital skills, AI Awareness, AI Attitudes, and demographic variables in both population groups. These insights point to the significance of addressing digital and AI skills gaps in fostering growth and preparedness for the AI-driven future. As AI becomes increasingly integral to various aspects of society, targeted interventions and policies are essential to bridge these gaps and enable individuals and communities to harness the transformative potential of AI-enabled economies.
\ No newline at end of file diff --git a/data/2024/aaai/Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model b/data/2024/aaai/Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model new file mode 100644 index 0000000000..969d7cd8f6 --- /dev/null +++ b/data/2024/aaai/Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model @@ -0,0 +1 @@ +Adversarial attacks involve adding perturbations to the source image to cause misclassification by the target model, which demonstrates the potential of attacking face recognition models. Existing adversarial face image generation methods still cannot achieve satisfactory performance because of low transferability and high detectability. In this paper, we propose a unified framework Adv-Diffusion that can generate imperceptible adversarial identity perturbations in the latent space but not the raw pixel space, which utilizes strong inpainting capabilities of the latent diffusion model to generate realistic adversarial images. Specifically, we propose the identity-sensitive conditioned diffusion generative model to generate semantic perturbations in the surroundings. The designed adaptive strength-based adversarial perturbation algorithm can ensure both attack transferability and stealthiness. Extensive qualitative and quantitative experiments on the public FFHQ and CelebA-HQ datasets prove the proposed method achieves superior performance compared with the state-of-the-art methods without an extra generative model training process. The source code is available at https://github.com/kopper-xdu/Adv-Diffusion. \ No newline at end of file diff --git a/data/2024/aaai/AdvST: Revisiting Data Augmentations for Single Domain Generalization b/data/2024/aaai/AdvST: Revisiting Data Augmentations for Single Domain Generalization new file mode 100644 index 0000000000..4a2308b34c --- /dev/null +++ b/data/2024/aaai/AdvST: Revisiting Data Augmentations for Single Domain Generalization @@ -0,0 +1 @@ +Single domain generalization (SDG) aims to train a robust model against unknown target domain shifts using data from a single source domain. Data augmentation has been proven an effective approach to SDG. However, the utility of standard augmentations, such as translation or inversion, has not been fully exploited in SDG; practically, these augmentations are used as a part of a data preprocessing procedure. Although it is intuitive to use many such augmentations to boost the robustness of a model to out-of-distribution domain shifts, we lack a principled approach to harvest the benefit brought by multiple such augmentations. Here, we conceptualize standard data augmentations with learnable parameters as semantics transformations that can manipulate certain semantics of a sample, such as the geometry or color of an image. Then, we propose Adversarial learning with Semantics Transformations (AdvST) that augments the source domain data with semantics transformations and learns a robust model with the augmented data. We theoretically show that AdvST essentially optimizes a distributionally robust optimization objective defined on a set of semantics distributions induced by the parameters of semantics transformations. We demonstrate that AdvST can produce samples that expand the coverage on target domain data.
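(The following is a schematic sketch of the adversarial-augmentation pattern just described, written in PyTorch style; the transformation interface, loss, and step sizes are illustrative assumptions rather than AdvST's actual implementation.)

    # Schematic min-max sketch: adapt learnable augmentation parameters to be
    # adversarial, then train the model on the augmented batch. Illustrative only.
    import torch

    def adversarial_augment(model, x, y, transform, omega, ascent_steps=5, lr_omega=0.1):
        # Inner maximization over the semantics-transformation parameters omega.
        omega = omega.clone().requires_grad_(True)
        for _ in range(ascent_steps):
            loss = torch.nn.functional.cross_entropy(model(transform(x, omega)), y)
            (grad,) = torch.autograd.grad(loss, omega)
            omega = (omega + lr_omega * grad).detach().requires_grad_(True)
        return transform(x, omega).detach()

    def training_step(model, optimizer, x, y, transform, omega_init):
        # Outer minimization: fit the model on the adversarially augmented samples.
        x_aug = adversarial_augment(model, x, y, transform, omega_init)
        loss = torch.nn.functional.cross_entropy(model(x_aug), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()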
Compared with the state-of-the-art methods, AdvST, despite being a simple method, is surprisingly competitive and achieves the best average SDG performance on the Digits, PACS, and DomainNet datasets. Our code is available at https://github.com/gtzheng/AdvST. \ No newline at end of file diff --git a/data/2024/aaai/Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark b/data/2024/aaai/Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark new file mode 100644 index 0000000000..a46bb3fa7a --- /dev/null +++ b/data/2024/aaai/Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark @@ -0,0 +1 @@ +Artificial intelligence (AI) has made remarkable progress across various domains, with large language models like ChatGPT gaining substantial attention for their human-like text-generation capabilities. Despite these achievements, improving spatial reasoning remains a significant challenge for these models. Benchmarks like StepGame evaluate AI spatial reasoning, where ChatGPT has shown unsatisfactory performance. However, the presence of template errors in the benchmark has an impact on the evaluation results. Thus there is potential for ChatGPT to perform better if these template errors are addressed, leading to more accurate assessments of its spatial reasoning capabilities. In this study, we refine the StepGame benchmark, providing a more accurate dataset for model evaluation. We analyze GPT’s spatial reasoning performance on the rectified benchmark, identifying proficiency in mapping natural language text to spatial relations but limitations in multi-hop reasoning. We provide a flawless solution to the benchmark by combining template-to-relation mapping with logic-based reasoning. This combination demonstrates proficiency in performing qualitative reasoning on StepGame without encountering any errors. We then address the limitations of GPT models in spatial reasoning. To improve spatial reasoning, we deploy Chain-of-Thought and Tree-of-thoughts prompting strategies, offering insights into GPT’s cognitive process. Our investigation not only sheds light on model deficiencies but also proposes enhancements, contributing to the advancement of AI with more robust spatial reasoning capabilities. \ No newline at end of file diff --git a/data/2024/aaai/Advancing Video Synchronization with Fractional Frame Analysis: Introducing a Novel Dataset and Model b/data/2024/aaai/Advancing Video Synchronization with Fractional Frame Analysis: Introducing a Novel Dataset and Model new file mode 100644 index 0000000000..3735dfc773 --- /dev/null +++ b/data/2024/aaai/Advancing Video Synchronization with Fractional Frame Analysis: Introducing a Novel Dataset and Model @@ -0,0 +1 @@ +Multiple views play a vital role in 3D pose estimation tasks. Ideally, multi-view 3D pose estimation tasks should directly utilize naturally collected videos for pose estimation. However, due to the constraints of video synchronization, existing methods often use expensive hardware devices to synchronize the initiation of cameras, which restricts most 3D pose collection scenarios to indoor settings. Some recent works learn deep neural networks to align desynchronized datasets derived from synchronized cameras and can only produce frame-level accuracy. 
For fractional frame video synchronization, this work proposes an Inter-Frame and Intra-Frame Desynchronized Dataset (IFID), which labels fractional time intervals between two video clips. IFID is the first dataset that annotates inter-frame and intra-frame intervals, with a total of 382,500 video clips annotated, making it the largest dataset to date. We also develop a novel model based on the Transformer architecture, named InSynFormer, for inter-frame and intra-frame synchronization. Extensive experimental evaluations demonstrate its promising performance. The dataset and source code of the model are available at https://github.com/yuxuan-cser/InSynFormer. \ No newline at end of file diff --git a/data/2024/aaai/Adversarial Attacks on Federated-Learned Adaptive Bitrate Algorithms b/data/2024/aaai/Adversarial Attacks on Federated-Learned Adaptive Bitrate Algorithms new file mode 100644 index 0000000000..3bd9139b2f --- /dev/null +++ b/data/2024/aaai/Adversarial Attacks on Federated-Learned Adaptive Bitrate Algorithms @@ -0,0 +1 @@ +Learning-based adaptive bitrate (ABR) algorithms have revolutionized video streaming solutions. With the growing demand for data privacy and the rapid development of mobile devices, federated learning (FL) has emerged as a popular training method for neural ABR algorithms in both academia and industry. However, we have discovered that FL-based ABR models are vulnerable to model-poisoning attacks as local updates remain unseen during global aggregation. In response, we propose MAFL (Malicious ABR model based on Federated Learning) to prove that backdooring the learning-based ABR model via FL is practical. Instead of attacking the global policy, MAFL only targets a single "target client". Moreover, the unique challenges brought by deep reinforcement learning (DRL) make the attack even more challenging. To address these challenges, MAFL is designed with a two-stage attacking mechanism. Using two representative attack cases with real-world traces, we show that MAFL significantly degrades the model performance on the target client (i.e., increasing rebuffering penalty by 2x and 5x) with a minimal negative impact on benign clients. \ No newline at end of file diff --git a/data/2024/aaai/Adversarial Attacks on the Interpretation of Neuron Activation Maximization b/data/2024/aaai/Adversarial Attacks on the Interpretation of Neuron Activation Maximization new file mode 100644 index 0000000000..6256e049ad --- /dev/null +++ b/data/2024/aaai/Adversarial Attacks on the Interpretation of Neuron Activation Maximization @@ -0,0 +1 @@ +Feature visualization is one of the most popular techniques used to interpret the internal behavior of individual units of trained deep neural networks. Based on activation maximization, it consists of finding synthetic or natural inputs that maximize neuron activations. This paper introduces an optimization framework that aims to deceive feature visualization through adversarial model manipulation. It consists of finetuning a pre-trained model with a specifically introduced loss that aims to maintain model performance, while also significantly changing feature visualization. We provide evidence of the success of this manipulation on several pre-trained models for the classification task with ImageNet.
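(One plausible instantiation of such a combined fine-tuning objective is sketched below; it is illustrative and not the paper's exact loss. Here unit_activation and ref_img are assumed helpers: the scalar activation of the targeted unit, and a fixed image that originally maximized it, e.g., its feature visualization.)

    # Illustrative combined objective: preserve task accuracy while suppressing the
    # targeted unit's response to its original maximizing image. Not the paper's loss.
    import torch

    def manipulation_loss(model, x, y, ref_img, unit_activation, lam=1.0):
        task_loss = torch.nn.functional.cross_entropy(model(x), y)   # keep classification performance
        vis_term = unit_activation(model, ref_img)                   # activation on the old visualization image
        return task_loss + lam * vis_term                            # trade off utility preservation vs. manipulation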
\ No newline at end of file diff --git a/data/2024/aaai/Adversarial Fairness Network b/data/2024/aaai/Adversarial Fairness Network new file mode 100644 index 0000000000..3f59ced2e8 --- /dev/null +++ b/data/2024/aaai/Adversarial Fairness Network @@ -0,0 +1 @@ +Fairness is a rising concern in machine learning. Recent research has discovered that state-of-the-art models are amplifying social bias by making biased predictions toward some population groups (characterized by sensitive features like race or gender). Such unfair predictions across groups raise trust issues and ethical concerns in machine learning, especially for sensitive fields such as employment, criminal justice, and trust score assessment. In this paper, we introduce a new framework to improve machine learning fairness. The goal of our model is to minimize the influence of the sensitive feature from the perspectives of both the data input and the predictive model. To achieve this goal, we reformulate the data input by eliminating the sensitive information and strengthen model fairness by minimizing the marginal contribution of the sensitive feature. We propose to learn the sensitive-irrelevant input via sampling among features and design an adversarial network to minimize the dependence between the reformulated input and the sensitive information. Empirical results validate that our model achieves comparable or better results than related state-of-the-art methods w.r.t. both fairness metrics and prediction performance. \ No newline at end of file diff --git a/data/2024/aaai/Adversarial Initialization with Universal Adversarial Perturbation: A New Approach to Fast Adversarial Training b/data/2024/aaai/Adversarial Initialization with Universal Adversarial Perturbation: A New Approach to Fast Adversarial Training new file mode 100644 index 0000000000..7f989495a9 --- /dev/null +++ b/data/2024/aaai/Adversarial Initialization with Universal Adversarial Perturbation: A New Approach to Fast Adversarial Training @@ -0,0 +1 @@ +Traditional adversarial training, while effective at improving machine learning model robustness, is computationally intensive. Fast Adversarial Training (FAT) addresses this by using a single-step attack to generate adversarial examples more efficiently. Nonetheless, FAT is susceptible to a phenomenon known as catastrophic overfitting, wherein the model's adversarial robustness abruptly collapses to zero during the training phase. To address this challenge, recent studies have suggested adopting adversarial initialization with Fast Gradient Sign Method Adversarial Training (FGSM-AT), which recycles adversarial perturbations from prior epochs by computing gradient momentum. However, our research has uncovered a flaw in this approach. Given that data augmentation is employed during the training phase, the samples in each epoch are not identical. Consequently, the method essentially yields not the adversarial perturbation of a singular sample, but rather the Universal Adversarial Perturbation (UAP) of a sample and its data augmentation. This insight has led us to explore the potential of using UAPs for adversarial initialization within the context of FGSM-AT. We have devised various strategies for adversarial initialization utilizing UAPs, including single, class-based, and feature-based UAPs. Experiments conducted on three distinct datasets demonstrate that our method achieves an improved trade-off among robustness, computational cost, and memory footprint. Code is available at https://github.com/fzjcdt/fgsm-uap.
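(A minimal single-UAP variant of this idea is sketched below; the paper's exact update rule and its class-based and feature-based variants are not reproduced, and the step sizes are illustrative assumptions.)

    # Minimal sketch of UAP-initialized single-step adversarial training. The running
    # universal perturbation uap (shape 1 x C x H x W) initializes each FGSM step and
    # is refreshed with momentum from the latest perturbation. Illustrative only.
    import torch

    def fgsm_uap_step(model, optimizer, x, y, uap, eps=8/255, momentum=0.9):
        delta = uap.clone().detach().requires_grad_(True)             # adversarial initialization with the UAP
        loss = torch.nn.functional.cross_entropy(model(x + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        delta = torch.clamp(delta + eps * grad.sign(), -eps, eps)     # one FGSM step
        adv_loss = torch.nn.functional.cross_entropy(model((x + delta).detach()), y)
        optimizer.zero_grad()
        adv_loss.backward()
        optimizer.step()
        new_uap = torch.clamp(momentum * uap + (1 - momentum) * delta.detach(), -eps, eps)
        return new_uap, adv_loss.item()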
\ No newline at end of file diff --git a/data/2024/aaai/Adversarial Purification with the Manifold Hypothesis b/data/2024/aaai/Adversarial Purification with the Manifold Hypothesis new file mode 100644 index 0000000000..3a662335f5 --- /dev/null +++ b/data/2024/aaai/Adversarial Purification with the Manifold Hypothesis @@ -0,0 +1 @@ +In this work, we formulate a novel framework for adversarial robustness using the manifold hypothesis. This framework provides sufficient conditions for defending against adversarial examples. We develop an adversarial purification method with this framework. Our method combines manifold learning with variational inference to provide adversarial robustness without the need for expensive adversarial training. Experimentally, our approach can provide adversarial robustness even if attackers are aware of the existence of the defense. In addition, our method can also serve as a test-time defense mechanism for variational autoencoders. \ No newline at end of file diff --git a/data/2024/aaai/Adversarial Socialbots Modeling Based on Structural Information Principles b/data/2024/aaai/Adversarial Socialbots Modeling Based on Structural Information Principles new file mode 100644 index 0000000000..cfa36a9139 --- /dev/null +++ b/data/2024/aaai/Adversarial Socialbots Modeling Based on Structural Information Principles @@ -0,0 +1 @@ +The importance of effective detection is underscored by the fact that socialbots imitate human behavior to propagate misinformation, leading to an ongoing competition between socialbots and detectors. Despite the rapid advancement of reactive detectors, the exploration of adversarial socialbot modeling remains incomplete, significantly hindering the development of proactive detectors. To address this issue, we propose a mathematical Structural Information principles-based Adversarial Socialbots Modeling framework, namely SIASM, to enable more accurate and effective modeling of adversarial behaviors. First, a heterogeneous graph is presented to integrate various users and rich activities in the original social network and measure its dynamic uncertainty as structural entropy. By minimizing the high-dimensional structural entropy, a hierarchical community structure of the social network is generated and referred to as the optimal encoding tree. Secondly, a novel method is designed to quantify influence by utilizing the assigned structural entropy, which helps reduce the computational cost of SIASM by filtering out uninfluential users. Besides, a new conditional structural entropy is defined between the socialbot and other users to guide the follower selection for network influence maximization. Extensive and comparative experiments on both homogeneous and heterogeneous social networks demonstrate that, compared with state-of-the-art baselines, the proposed SIASM framework yields substantial performance improvements in terms of network influence (up to 16.32%) and sustainable stealthiness (up to 16.29%) when evaluated against a robust detector with 90% accuracy. 
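(For reference, the structural entropy referred to above is presumably the standard quantity from structural information theory; with an encoding tree \mathcal{T} of a graph G whose root is \lambda, one common definition is

    \[
    H^{\mathcal{T}}(G) \;=\; -\sum_{\alpha \in \mathcal{T},\ \alpha \neq \lambda}
    \frac{g_\alpha}{\mathrm{vol}(G)} \log_2 \frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)},
    \]

where \alpha^- is the parent of node \alpha, g_\alpha is the total weight of edges with exactly one endpoint inside the module of \alpha, and \mathrm{vol}(\cdot) sums vertex degrees. Minimizing H^{\mathcal{T}}(G) over encoding trees then yields the hierarchical community structure, i.e., the optimal encoding tree mentioned above; the notation here is ours, not necessarily the paper's.)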
\ No newline at end of file diff --git a/data/2024/aaai/Adversarially Balanced Representation for Continuous Treatment Effect Estimation b/data/2024/aaai/Adversarially Balanced Representation for Continuous Treatment Effect Estimation new file mode 100644 index 0000000000..3e6a84e42f --- /dev/null +++ b/data/2024/aaai/Adversarially Balanced Representation for Continuous Treatment Effect Estimation @@ -0,0 +1,3 @@ +Individual treatment effect (ITE) estimation requires adjusting for the covariate shift between populations with different treatments, and deep representation learning has shown great promise in learning a balanced representation of covariates. However, the existing methods mostly consider the scenario of binary treatments. In this paper, we consider the more practical and challenging scenario in which the treatment is a continuous variable (e.g., dosage of a medication), and we address the two main challenges of this setup. We propose the adversarial counterfactual regression network (ACFR) that adversarially minimizes the representation imbalance in terms of KL divergence, and also maintains the impact of the treatment value on the outcome prediction by leveraging an attention mechanism. +Theoretically, we demonstrate that the ACFR objective function is grounded in an upper bound on the counterfactual outcome prediction error. +Our experimental evaluation on semi-synthetic datasets demonstrates the empirical superiority of ACFR over a range of state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/AesFA: An Aesthetic Feature-Aware Arbitrary Neural Style Transfer b/data/2024/aaai/AesFA: An Aesthetic Feature-Aware Arbitrary Neural Style Transfer new file mode 100644 index 0000000000..b946ed0f58 --- /dev/null +++ b/data/2024/aaai/AesFA: An Aesthetic Feature-Aware Arbitrary Neural Style Transfer @@ -0,0 +1 @@ +Neural style transfer (NST) has evolved significantly in recent years. Yet, despite its rapid progress and advancement, existing NST methods either struggle to transfer aesthetic information from a style effectively or suffer from high computational costs and inefficiencies in feature disentanglement due to using pre-trained models. This work proposes a lightweight but effective model, AesFA---Aesthetic Feature-Aware NST. The primary idea is to decompose the image via its frequencies to better disentangle aesthetic styles from the reference image while training the entire model in an end-to-end manner to exclude pre-trained models at inference completely. To improve the network's ability to extract more distinct representations and further enhance the stylization quality, this work introduces a new aesthetic feature: contrastive loss. Extensive experiments and ablations show the approach not only outperforms recent NST methods in terms of stylization quality, but it also achieves faster inference. Codes are available at https://github.com/Sooyyoungg/AesFA. \ No newline at end of file diff --git a/data/2024/aaai/Agile Multi-Source-Free Domain Adaptation b/data/2024/aaai/Agile Multi-Source-Free Domain Adaptation new file mode 100644 index 0000000000..3c0611173e --- /dev/null +++ b/data/2024/aaai/Agile Multi-Source-Free Domain Adaptation @@ -0,0 +1 @@ +Efficiently utilizing rich knowledge in pretrained models has become a critical topic in the era of large models. This work focuses on adaptively transferring knowledge from multiple source-pretrained models to an unlabeled target domain without accessing the source data.
Despite being a practically useful setting, existing methods require extensive parameter tuning over each source model, which is computationally expensive when facing abundant source domains or larger source models. To address this challenge, we propose a novel approach which is free of the parameter tuning over source backbones. Our technical contribution lies in the Bi-level ATtention ENsemble (Bi-ATEN) module, which learns both intra-domain weights and inter-domain ensemble weights to achieve a fine balance between instance specificity and domain consistency. By slightly tuning source bottlenecks, we achieve comparable or even superior performance on the challenging DomainNet benchmark with less than 3% of the trained parameters and 8 times the throughput of the SOTA method. Furthermore, with minor modifications, the proposed module can be easily incorporated into existing methods to gain a performance boost of more than 4%. Code is available at https://github.com/TL-UESTC/Bi-ATEN. \ No newline at end of file diff --git a/data/2024/aaai/Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound b/data/2024/aaai/Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound new file mode 100644 index 0000000000..947818a2e7 --- /dev/null +++ b/data/2024/aaai/Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound @@ -0,0 +1 @@ +In this paper, we study the mistake bound of online kernel learning on a budget. We propose a new budgeted online kernel learning model, called Ahpatron, which significantly improves the mistake bound of previous work and resolves an open problem related to upper bounds of hypothesis space constraints. We first present an aggressive variant of Perceptron, named AVP, a model without a budget, which uses an active updating rule. Then we design a new budget maintenance mechanism, which removes half of the examples and projects the removed examples onto a hypothesis space spanned by the remaining examples. Ahpatron adopts the above mechanism to approximate AVP. Theoretical analyses prove that Ahpatron has tighter mistake bounds, and experimental results show that Ahpatron outperforms the state-of-the-art algorithms on the same or a smaller budget. \ No newline at end of file diff --git a/data/2024/aaai/Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption b/data/2024/aaai/Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption new file mode 100644 index 0000000000..4e7e59054d --- /dev/null +++ b/data/2024/aaai/Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption @@ -0,0 +1 @@ +The standard Neural Radiance Fields (NeRF) paradigm employs a viewer-centered methodology, entangling the aspects of illumination and material reflectance into emission solely from 3D points. This simplified rendering approach presents challenges in accurately modeling images captured under adverse lighting conditions, such as low light or over-exposure. Motivated by the ancient Greek emission theory that posits visual perception as a result of rays emanating from the eyes, we slightly refine the conventional NeRF framework to train NeRF under challenging light conditions and generate normal-light-condition novel views in an unsupervised manner. We introduce the concept of a "Concealing Field," which assigns transmittance values to the surrounding air to account for illumination effects.
In dark scenarios, we assume that object emissions maintain a standard lighting level but are attenuated as they traverse the air during the rendering process. The Concealing Field thus compels NeRF to learn reasonable density and colour estimations for objects even in dimly lit situations. Similarly, the Concealing Field can mitigate over-exposed emissions during the rendering stage. Furthermore, we present a comprehensive multi-view dataset captured under challenging illumination conditions for evaluation. Our code and proposed dataset are available at https://github.com/cuiziteng/Aleth-NeRF. \ No newline at end of file diff --git a/data/2024/aaai/Algorithmic Foundation of Federated Learning with Sequential Data b/data/2024/aaai/Algorithmic Foundation of Federated Learning with Sequential Data new file mode 100644 index 0000000000..cb8b073e24 --- /dev/null +++ b/data/2024/aaai/Algorithmic Foundation of Federated Learning with Sequential Data @@ -0,0 +1,4 @@ +The current analysis of federated optimization algorithms for training deep neural networks assumes that the data is non-sequential (e.g., images), which incurs a smooth loss objective. In contrast, edge devices generate lots of sequential data every day, where these sequences exhibit significant sequential correlation at different time stamps (e.g., text messages). In order to learn from such sequential data, people typically use a class of neural networks that is inherently nonsmooth, with a potentially unbounded smoothness parameter. Examples include recurrent neural networks, long short-term memory networks, and transformers. It remains unclear how to design provably efficient algorithms for training these neural networks to learn from sequential data. My goal is to lay the algorithmic foundation of federated learning with sequential data, which contributes novel algorithms for learning from a range of real-world sequential data (e.g., natural language, electronic health records, transportation, time series, etc.) using state-of-the-art deep neural networks. + + +In this talk, I will first motivate the problem by showing that the transformer, which is widely used for sequential data learning, has a loss landscape with unbounded smoothness. Then, I will introduce provably efficient federated deep learning algorithms in the presence of unbounded smoothness. In particular, I will introduce a few efficient algorithms for various settings of federated learning, including homogeneous data, heterogeneous data, and partial client participation. The main result is twofold. First, we show that the designed algorithms provably achieve small computational and communication complexities. Second, we establish fundamental hardness results in the unbounded smoothness setting. Ultimately, I will discuss the future challenges of extending our research framework from small-scale neural networks to large language models.
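(A common formalization of the "unbounded smoothness" referred to in this talk abstract is the relaxed (L_0, L_1)-smoothness condition; the talk may use a variant, so this is given only for orientation:

    \[
    \|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \,\|\nabla f(x)\|,
    \]

i.e., the local smoothness constant may grow with the gradient norm instead of being bounded by a fixed L, which is the regime in which standard analyses of federated optimization break down.)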
\ No newline at end of file diff --git "a/data/2024/aaai/Aligner\302\262: Enhancing Joint Multiple Intent Detection and Slot Filling via Adjustive and Forced Cross-Task Alignment" "b/data/2024/aaai/Aligner\302\262: Enhancing Joint Multiple Intent Detection and Slot Filling via Adjustive and Forced Cross-Task Alignment" new file mode 100644 index 0000000000..105effe8e7 --- /dev/null +++ "b/data/2024/aaai/Aligner\302\262: Enhancing Joint Multiple Intent Detection and Slot Filling via Adjustive and Forced Cross-Task Alignment" @@ -0,0 +1 @@ +Multi-intent spoken language understanding (SLU) has garnered growing attention due to its ability to handle multiple intent utterances, which closely mirrors practical scenarios. Unlike traditional SLU, each intent in multi-intent SLU corresponds to its designated scope for slots, which occurs in certain fragments within the utterance. As a result, establishing precise scope alignment to mitigate noise impact emerges as a key challenge in multi-intent SLU. More seriously, existing methods lack alignment between the predictions of the two sub-tasks due to task-independent decoding, resulting in a limitation on the overall performance. To address these challenges, we propose a novel framework termed Aligner² for multi-intent SLU, which contains an Adjustive Cross-task Aligner (ACA) and a Forced Cross-task Aligner (FCA). ACA utilizes the information conveyed by joint label embeddings to accurately align the scope of intent and corresponding slots, before the interaction of the two subtasks. FCA introduces reinforcement learning to enforce the alignment of the task-specific hidden states after the interaction, which is explicitly guided by the prediction. Extensive experiments on two public multi-intent SLU datasets demonstrate the superiority of our Aligner² over state-of-the-art methods. More encouragingly, the proposed method Aligner² can be easily integrated into existing multi-intent SLU frameworks to further boost performance. \ No newline at end of file diff --git a/data/2024/aaai/Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination b/data/2024/aaai/Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination new file mode 100644 index 0000000000..43f0866b79 --- /dev/null +++ b/data/2024/aaai/Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination @@ -0,0 +1 @@ +Cross-view geo-localization holds significant potential for various applications, but drastic differences in viewpoints and visual appearances between cross-view images make this task extremely challenging. Recent works have made notable progress in cross-view geo-localization. However, existing methods either ignore the correspondence between geometric spatial layout in cross-view images or require high costs or strict constraints to achieve such alignment. In response to these challenges, we propose a Feature Recombination Module (FRM) that explicitly establishes the geometric spatial layout correspondences between two views. Unlike existing methods, FRM aligns geometric spatial layout by directly recombining features, avoiding image preprocessing, and introducing no additional computational and parameter costs. This effectively reduces ambiguities caused by geometric misalignments between ground-level and aerial-level images. Furthermore, it is not sensitive to frameworks and applies to both CNN-based and Transformer-based architectures.
Additionally, as part of the training procedure, we also introduce a novel weighted (B+1)-tuple loss (WBL) as the optimization objective. Compared to the widely used weighted soft margin ranking loss, this innovative loss enhances convergence speed and final performance. Based on the two core components (FRM and WBL), we develop an end-to-end network architecture (FRGeo) to address these limitations from a different perspective. Extensive experiments show that our proposed FRGeo not only achieves state-of-the-art performance on cross-view geo-localization benchmarks, including CVUSA, CVACT, and VIGOR, but is also significantly superior or competitive in terms of computational complexity and trainable parameters. Our project homepage is at https://zqwlearning.github.io/FRGeo. \ No newline at end of file diff --git a/data/2024/aaai/All Beings Are Equal in Open Set Recognition b/data/2024/aaai/All Beings Are Equal in Open Set Recognition new file mode 100644 index 0000000000..668591463f --- /dev/null +++ b/data/2024/aaai/All Beings Are Equal in Open Set Recognition @@ -0,0 +1 @@ +In open-set recognition (OSR), a promising strategy is exploiting pseudo-unknown data outside the given K known classes as an additional K+1-th class to explicitly model potential open space. However, treating unknown classes without distinction is unequal for them relative to known classes due to the category-agnostic and scale-agnostic nature of the unknowns. This inevitably not only disrupts the inherent distributions of unknown classes but also incurs both class-wise and instance-wise imbalances between known and unknown classes. Ideally, the OSR problem should model the whole class space as K+∞, but enumerating all unknowns is impractical. Since the core of OSR is to effectively model the boundaries of known classes, this means just focusing on the unknowns nearing the boundaries of targeted known classes seems sufficient. Thus, as a compromise, we convert the open classes from infinite to K, with a novel concept, Target-Aware Universum (TAU), and propose a simple yet effective framework Dual Contrastive Learning with Target-Aware Universum (DCTAU). In detail, guided by the targeted known classes, TAU automatically expands the unknown classes from the previous 1 to K, effectively alleviating the distribution disruption and the imbalance issues mentioned above. Then, a novel Dual Contrastive (DC) loss is designed, where all instances, whether known or TAU, are considered positives to contrast with their respective negatives. Experimental results indicate DCTAU sets a new state-of-the-art. \ No newline at end of file diff --git a/data/2024/aaai/All Should Be Equal in the Eyes of LMs: Counterfactually Aware Fair Text Generation b/data/2024/aaai/All Should Be Equal in the Eyes of LMs: Counterfactually Aware Fair Text Generation new file mode 100644 index 0000000000..625569a95b --- /dev/null +++ b/data/2024/aaai/All Should Be Equal in the Eyes of LMs: Counterfactually Aware Fair Text Generation @@ -0,0 +1 @@ +Fairness in Language Models (LMs) remains a long-standing challenge, given the inherent biases in training data that can be perpetuated by models and affect the downstream tasks. Recent methods employ expensive retraining or attempt debiasing during inference by constraining model outputs to contrast from a reference set of biased templates/exemplars. Regardless, they do not address the primary goal of fairness to maintain equitability across different demographic groups.
In this work, we posit that, for an LM to generate unbiased output for one demographic under a given context, it must be aware of its outputs for other demographics under the same context. To this end, we propose Counterfactually Aware Fair InferencE (CAFIE), a framework that dynamically compares the model’s understanding of diverse demographics to generate more equitable sentences. We conduct an extensive empirical evaluation using base LMs of varying sizes and across three diverse datasets and find that CAFIE outperforms strong baselines. CAFIE produces fairer text and strikes the best balance between fairness and language modeling capability. \ No newline at end of file diff --git a/data/2024/aaai/All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models b/data/2024/aaai/All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models new file mode 100644 index 0000000000..6a757c1de8 --- /dev/null +++ b/data/2024/aaai/All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models @@ -0,0 +1 @@ +Text-to-Image models such as Stable Diffusion have shown impressive image synthesis, thanks to the utilization of large-scale datasets. However, these datasets may contain sexually explicit, copyrighted, or undesirable content, which the model can then generate directly. Given that retraining these large models on individual concept deletion requests is infeasible, fine-tuning algorithms have been developed to tackle concept erasing in diffusion models. While these algorithms yield good concept erasure, they all present one of the following issues: 1) the corrupted feature space yields synthesis of disintegrated objects, 2) the initially synthesized content undergoes a divergence in both spatial structure and semantics in the generated images, and 3) sub-optimal training updates heighten the model's susceptibility to utility harm. These issues severely degrade the original utility of generative models. In this work, we present a new approach that solves all of these challenges. We take inspiration from the concept of classifier guidance and propose a surgical update on the classifier guidance term while constraining the drift of the unconditional score term. Furthermore, our algorithm empowers the user to select an alternative to the erasing concept, allowing for more controllability. Our experimental results show that our algorithm not only erases the target concept effectively but also preserves the model’s generation capability. \ No newline at end of file diff --git a/data/2024/aaai/AltDiffusion: A Multilingual Text-to-Image Diffusion Model b/data/2024/aaai/AltDiffusion: A Multilingual Text-to-Image Diffusion Model new file mode 100644 index 0000000000..6db0f0466e --- /dev/null +++ b/data/2024/aaai/AltDiffusion: A Multilingual Text-to-Image Diffusion Model @@ -0,0 +1 @@ +Large Text-to-Image (T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on knowledge distillation.
Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, including concept alignment and quality improvement stages on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes Multilingual-General-18 (MG-18) and Multilingual-Cultural-18 (MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints can be found at https://github.com/superhero-7/AltDiffuson. \ No newline at end of file diff --git a/data/2024/aaai/AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization b/data/2024/aaai/AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization new file mode 100644 index 0000000000..4218f36eac --- /dev/null +++ b/data/2024/aaai/AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF---a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistency-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality. \ No newline at end of file diff --git a/data/2024/aaai/Amalgamating Multi-Task Models with Heterogeneous Architectures b/data/2024/aaai/Amalgamating Multi-Task Models with Heterogeneous Architectures new file mode 100644 index 0000000000..49f204c5a8 --- /dev/null +++ b/data/2024/aaai/Amalgamating Multi-Task Models with Heterogeneous Architectures @@ -0,0 +1 @@ +Multi-task learning (MTL) is essential for real-world applications that handle multiple tasks simultaneously, such as self-driving cars. MTL methods improve the performance of all tasks by utilizing information across tasks to learn a robust shared representation. However, acquiring sufficient labeled data tends to be extremely expensive, especially when having to support many tasks. Recently, Knowledge Amalgamation (KA) has emerged as an effective strategy for addressing the lack of labels by instead learning directly from pretrained models (teachers).
KA learns one unified multi-task student that masters all tasks across all teachers. Existing KA works for MTL are limited to teachers with identical architectures, and thus propose layer-to-layer based approaches. Unfortunately, in practice, teachers may have heterogeneous architectures; their layers may not be aligned and their dimensionalities or scales may be incompatible. Amalgamating multi-task teachers with heterogeneous architectures remains an open problem. For this, we design Versatile Common Feature Consolidator (VENUS), the first solution to this problem. VENUS fuses knowledge from the shared representations of each teacher into one unified generalized representation for all tasks. Specifically, we design the Feature Consolidator network that leverages an array of teacher-specific trainable adaptors. These adaptors enable the student to learn from multiple teachers, even if they have incompatible learned representations. We demonstrate that VENUS outperforms five alternative methods on numerous benchmark datasets across a broad spectrum of experiments. \ No newline at end of file diff --git a/data/2024/aaai/Amodal Scene Analysis via Holistic Occlusion Relation Inference and Generative Mask Completion b/data/2024/aaai/Amodal Scene Analysis via Holistic Occlusion Relation Inference and Generative Mask Completion new file mode 100644 index 0000000000..f3c7df6a2b --- /dev/null +++ b/data/2024/aaai/Amodal Scene Analysis via Holistic Occlusion Relation Inference and Generative Mask Completion @@ -0,0 +1,4 @@ +Amodal scene analysis entails interpreting the occlusion relationship among scene elements and inferring the possible shapes of the invisible parts. Existing methods typically frame this task as an extended instance segmentation or a pair-wise object de-occlusion problem. In this work, we propose a new framework, which comprises a Holistic Occlusion Relation Inference (HORI) module followed by an instance-level Generative Mask Completion (GMC) module. + Unlike previous approaches, which rely on mask completion results for occlusion reasoning, our HORI module directly predicts an occlusion relation matrix in a single pass. This approach is much more efficient than the pair-wise de-occlusion process and it naturally handles mutual occlusion, a common but often neglected situation. + Moreover, we formulate the mask completion task as a generative process and use a diffusion-based GMC module for instance-level mask completion. This improves mask completion quality and provides multiple plausible solutions. + We further introduce a large-scale amodal segmentation dataset with high-quality human annotations, including mutual occlusions. Experiments on our dataset and two public benchmarks demonstrate the advantages of our method. Code is publicly available at https://github.com/zbwxp/Amodal-AAAI.
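(The abstract does not specify the HORI architecture; purely as an illustration of what predicting an N x N occlusion relation matrix in a single pass can look like, a generic pairwise relation head is sketched below.)

    # Generic single-pass pairwise occlusion-relation head (illustrative assumption,
    # not the actual HORI module): entry (i, j) is the logit that instance i occludes j.
    import torch

    class PairwiseOcclusionHead(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(2 * dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, 1)
            )

        def forward(self, feats):                      # feats: (N, dim) per-instance embeddings
            n = feats.size(0)
            a = feats.unsqueeze(1).expand(n, n, -1)    # occluder candidates (rows)
            b = feats.unsqueeze(0).expand(n, n, -1)    # occludee candidates (columns)
            return self.mlp(torch.cat([a, b], dim=-1)).squeeze(-1)   # (N, N) relation logits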
\ No newline at end of file diff --git a/data/2024/aaai/Amplifying Diversity and Quality in Commonsense Knowledge Graph Completion (Student Abstract) b/data/2024/aaai/Amplifying Diversity and Quality in Commonsense Knowledge Graph Completion (Student Abstract) new file mode 100644 index 0000000000..5fd1580479 --- /dev/null +++ b/data/2024/aaai/Amplifying Diversity and Quality in Commonsense Knowledge Graph Completion (Student Abstract) @@ -0,0 +1 @@ +Conventional commonsense knowledge graph completion (CKGC) methods provide inadequate sequences in the fine-tuning and generation stages and incorporate full fine-tuning, both of which fail to align with the autoregressive model's pre-training patterns and offer insufficient parameter efficiency. Moreover, decoding through beam or greedy search produces low diversity and high similarity in generated tail entities. Hence, we resort to prefix-tuning and propose a lightweight, effective pipeline to enhance the quality and diversity of extracted commonsense knowledge. Precisely, we measure head entity similarity to retrieve the top-k tuples and concatenate them before each target tuple for prefix-tuning the source LM, thereby improving the efficiency and speed for pretrained models; then, we design a penalty-tailored diverse beam search (p-DBS) for decoding tail entities, producing a greater quantity and diversity of generated commonsense tuples; besides, a filter strategy is utilized to filter out invalid commonsense knowledge. Through extensive automatic evaluations, including ChatGPT scoring, our method can extract diverse, novel, and accurate commonsense knowledge (CK). \ No newline at end of file diff --git a/data/2024/aaai/An Approximate Skolem Function Counter b/data/2024/aaai/An Approximate Skolem Function Counter new file mode 100644 index 0000000000..087b1db786 --- /dev/null +++ b/data/2024/aaai/An Approximate Skolem Function Counter @@ -0,0 +1,5 @@ +One approach to probabilistic inference involves counting the number of models of a given Boolean formula. Here, we are interested in inferences involving higher-order objects, i.e., functions. We study the following task: Given a Boolean specification between a set of inputs and outputs, count the number of functions of inputs such that the specification is met. Such functions are called Skolem functions. + +We are motivated by the recent development of scalable approaches to Boolean function synthesis. This stands in relation to our problem analogously to the relationship between Boolean satisfiability and the model counting problem. Yet, counting Skolem functions poses considerable new challenges. From the complexity-theoretic standpoint, counting Skolem functions is not only #P-hard; it is quite unlikely to have an FPRAS (Fully Polynomial Randomized Approximation Scheme) as the problem of synthesizing a Skolem function remains challenging, even given access to an NP oracle. + +The primary contribution of this work is the first algorithm, SkolemFC, that computes the number of Skolem functions. SkolemFC relies on technical connections between counting functions and propositional model counting: our algorithm makes a linear number of calls to an approximate model counter and computes an estimate of the number of Skolem functions with theoretical guarantees. Our prototype displays impressive scalability, handling benchmarks comparably to state-of-the-art Skolem function synthesis engines, even though counting all such functions ostensibly poses a greater challenge than synthesizing a single function.
\ No newline at end of file diff --git a/data/2024/aaai/An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention b/data/2024/aaai/An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention new file mode 100644 index 0000000000..995d88f6c6 --- /dev/null +++ b/data/2024/aaai/An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention @@ -0,0 +1 @@ +Sequential recommendation (SR) models based on Transformers have achieved remarkable successes. The self-attention mechanism of Transformers for computer vision and natural language processing suffers from the oversmoothing problem, i.e., the hidden representations of different tokens becoming similar to one another. In the SR domain, we, for the first time, show that the same problem occurs. We present pioneering investigations that reveal the low-pass filtering nature of self-attention in SR, which causes oversmoothing. To this end, we propose a novel method called Beyond Self-Attention for Sequential Recommendation (BSARec), which leverages the Fourier transform to i) inject an inductive bias by considering fine-grained sequential patterns and ii) integrate low- and high-frequency information to mitigate oversmoothing. Our discovery shows significant advancements in the SR domain and is expected to bridge the gap for existing Transformer-based SR models. We test our proposed approach through extensive experiments on 6 benchmark datasets. The experimental results demonstrate that our model outperforms 7 baseline methods in terms of recommendation performance. Our code is available at https://github.com/yehjin-shin/BSARec. \ No newline at end of file diff --git a/data/2024/aaai/An Autoregressive Text-to-Graph Framework for Joint Entity and Relation Extraction b/data/2024/aaai/An Autoregressive Text-to-Graph Framework for Joint Entity and Relation Extraction new file mode 100644 index 0000000000..b56e31117d --- /dev/null +++ b/data/2024/aaai/An Autoregressive Text-to-Graph Framework for Joint Entity and Relation Extraction @@ -0,0 +1 @@ +In this paper, we propose a novel method for joint entity and relation extraction from unstructured text by framing it as a conditional sequence generation problem. In contrast to conventional generative information extraction models that are left-to-right token-level generators, our approach is span-based. It generates a linearized graph where nodes represent text spans and edges represent relation triplets. Our method employs a transformer encoder-decoder architecture with a pointing mechanism on a dynamic vocabulary of spans and relation types. Our model can capture the structural characteristics and boundaries of entities and relations through span representations while simultaneously grounding the generated output in the original text thanks to the pointing mechanism. Evaluation on benchmark datasets validates the effectiveness of our approach, demonstrating competitive results. Code is available at https://github.com/urchade/ATG.
\ No newline at end of file diff --git a/data/2024/aaai/An Eager Satisfiability Modulo Theories Solver for Algebraic Datatypes b/data/2024/aaai/An Eager Satisfiability Modulo Theories Solver for Algebraic Datatypes new file mode 100644 index 0000000000..e6bb65dc18 --- /dev/null +++ b/data/2024/aaai/An Eager Satisfiability Modulo Theories Solver for Algebraic Datatypes @@ -0,0 +1 @@ +Algebraic data types (ADTs) are a construct classically found in functional programming languages that capture data structures like enumerated types, lists, and trees. In recent years, interest in ADTs has increased. For example, popular programming languages, like Python, have added support for ADTs. Automated reasoning about ADTs can be done using satisfiability modulo theories (SMT) solving, an extension of the Boolean satisfiability problem with first-order logic and associated background theories. Unfortunately, SMT solvers that support ADTs do not scale as state-of-the-art approaches all use variations of the same lazy approach. In this paper, we present an SMT solver that takes a fundamentally different approach, an eager approach. Specifically, our solver reduces ADT queries to a simpler logical theory, uninterpreted functions (UF), and then uses an existing solver on the reduced query. We prove the soundness and completeness of our approach and demonstrate that it outperforms the state of the art on existing benchmarks, as well as a new, more challenging benchmark set from the planning domain. \ No newline at end of file diff --git a/data/2024/aaai/An Effective Augmented Lagrangian Method for Fine-Grained Multi-View Optimization b/data/2024/aaai/An Effective Augmented Lagrangian Method for Fine-Grained Multi-View Optimization new file mode 100644 index 0000000000..0489f7f758 --- /dev/null +++ b/data/2024/aaai/An Effective Augmented Lagrangian Method for Fine-Grained Multi-View Optimization @@ -0,0 +1,2 @@ +The significance of multi-view learning in effectively mitigating the intricate intricacies entrenched within heterogeneous data has garnered substantial attention in recent years. Notwithstanding the favorable achievements showcased by recent strides in this area, a confluence of noteworthy challenges endures. To be specific, a majority of extant methodologies unceremoniously assign weights to data points view-wisely. This ineluctably disregards the intrinsic reality that disparate views confer diverse contributions to each individual sample, consequently neglecting the rich wellspring of sample-level structural insights harbored within the dataset. In this paper, we proposed an effective Augmented Lagrangian MethOd for fiNe-graineD (ALMOND) multi-view optimization. +This innovative approach scrutinizes the interplay among multiple views at the granularity of individual samples, thereby fostering the enhanced preservation of local structural coherence. The Augmented Lagrangian Method (ALM) is elaborately incorporated into our framework, which enables us to achieve an optimal solution without involving an inexplicable intermediate variable as previous methods do. Empirical experiments on multi-view clustering tasks across heterogeneous datasets serve to incontrovertibly showcase the effectiveness of our proposed methodology, corroborating its preeminence over incumbent state-of-the-art alternatives. 
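(For reference, augmented Lagrangian methods such as the ALM used above optimize, for an equality-constrained problem min f(x) s.t. h(x) = 0, the generic objective

    \[
    \mathcal{L}_\rho(x, \lambda) \;=\; f(x) + \lambda^{\top} h(x) + \frac{\rho}{2}\,\|h(x)\|_2^2,
    \]

alternating minimization over x with the multiplier update \lambda \leftarrow \lambda + \rho\, h(x); how the paper's fine-grained multi-view objective instantiates f and h is not specified in the abstract.)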
\ No newline at end of file diff --git a/data/2024/aaai/An Effective Polynomial Technique for Compiling Conditional Effects Away b/data/2024/aaai/An Effective Polynomial Technique for Compiling Conditional Effects Away new file mode 100644 index 0000000000..818a7da189 --- /dev/null +++ b/data/2024/aaai/An Effective Polynomial Technique for Compiling Conditional Effects Away @@ -0,0 +1 @@ +The paper introduces a novel polynomial compilation technique for the sound and complete removal of conditional effects in classical planning problems. Similar to Nebel's polynomial compilation of conditional effects, our solution also decomposes each action with conditional effects into several simpler actions. However, it does so more effectively by exploiting the actual structure of the given conditional effects. We characterise such a structure using a directed graph and leverage it to significantly reduce the number of additional atoms required, thereby shortening the size of valid plans. Our experimental analysis indicates that this approach enables the effective use of polynomial compilations, offering benefits in terms of modularity and reusability of existing planners. It also demonstrates that a compilation-based approach can be more efficient, either independently or in synergy with state-of-the-art optimal planners that directly support conditional effects. \ No newline at end of file diff --git a/data/2024/aaai/An Effectiveness Study of Teacher-Led AI Literacy Curriculum in K-12 Classrooms b/data/2024/aaai/An Effectiveness Study of Teacher-Led AI Literacy Curriculum in K-12 Classrooms new file mode 100644 index 0000000000..457c2f0dd9 --- /dev/null +++ b/data/2024/aaai/An Effectiveness Study of Teacher-Led AI Literacy Curriculum in K-12 Classrooms @@ -0,0 +1 @@ +Artificial intelligence (AI) has rapidly pervaded and reshaped almost all walks of life, but efforts to promote AI literacy in K-12 schools remain limited. There is a knowledge gap in how to prepare teachers to teach AI literacy in inclusive classrooms and how teacher-led classroom implementations can impact students. This paper reports a comparison study to investigate the effectiveness of an AI literacy curriculum when taught by classroom teachers. The experimental group included 89 middle school students who learned an AI literacy curriculum during regular school hours. The comparison group consisted of 69 students who did not learn the curriculum. Both groups completed the same pre and post-test. The results show that students in the experimental group developed a deeper understanding of AI concepts and more positive attitudes toward AI and its impact on future careers after the curriculum than those in the comparison group. This shows that the teacher-led classroom implementation successfully equipped students with a conceptual understanding of AI. Students achieved significant gains in recognizing how AI is relevant to their lives and felt empowered to thrive in the age of AI. Overall this study confirms the potential of preparing K-12 classroom teachers to offer AI education in classrooms in order to reach learners of diverse backgrounds and broaden participation in AI literacy education among young learners. 
\ No newline at end of file diff --git a/data/2024/aaai/An Efficient Knowledge Transfer Strategy for Spiking Neural Networks from Static to Event Domain b/data/2024/aaai/An Efficient Knowledge Transfer Strategy for Spiking Neural Networks from Static to Event Domain new file mode 100644 index 0000000000..106a9dc482 --- /dev/null +++ b/data/2024/aaai/An Efficient Knowledge Transfer Strategy for Spiking Neural Networks from Static to Event Domain @@ -0,0 +1,2 @@ +Spiking neural networks (SNNs) are rich in spatio-temporal dynamics and are suitable for processing event-based neuromorphic data. However, event-based datasets are usually less annotated than static datasets. This small data scale makes SNNs prone to overfitting and limits their performance. In order to improve the generalization ability of SNNs on event-based datasets, we use static images to assist SNN training on event data. In this paper, we first discuss the domain mismatch problem encountered when directly transferring networks trained on static datasets to event data. We argue that the inconsistency of feature distributions becomes a major factor hindering the effective transfer of knowledge from static images to event data. To address this problem, we propose solutions in terms of two aspects: feature distribution and training strategy. Firstly, we propose a knowledge transfer loss, which consists of domain alignment loss and spatio-temporal regularization. The domain alignment loss learns domain-invariant spatial features by reducing the marginal distribution distance between the static image and the event data. Spatio-temporal regularization provides dynamically learnable coefficients for domain alignment loss by using the output features of the event data at each time step as a regularization term. In addition, we propose a sliding training strategy, which gradually replaces static image inputs probabilistically with event data, resulting in smoother and more stable training for the network. We validate our method on neuromorphic datasets, including N-Caltech101, CEP-DVS, and N-Omniglot. The experimental results show that our proposed method achieves better performance on all datasets compared to the current state-of-the-art methods. +Code is available at https://github.com/Brain-Cog-Lab/Transfer-for-DVS. \ No newline at end of file diff --git a/data/2024/aaai/An Efficient Subgraph-Inferring Framework for Large-Scale Heterogeneous Graphs b/data/2024/aaai/An Efficient Subgraph-Inferring Framework for Large-Scale Heterogeneous Graphs new file mode 100644 index 0000000000..d6db6590a2 --- /dev/null +++ b/data/2024/aaai/An Efficient Subgraph-Inferring Framework for Large-Scale Heterogeneous Graphs @@ -0,0 +1 @@ +Heterogeneous Graph Neural Networks (HGNNs) play a vital role in advancing the field of graph representation learning by addressing the complexities arising from diverse data types and interconnected relationships in real-world scenarios. However, traditional HGNNs face challenges when applied to large-scale graphs due to the necessity of training or inferring on the entire graph. As the size of the heterogeneous graphs increases, the time and memory overhead required by these models escalates rapidly, even reaching unacceptable levels. To address this issue, in this paper, we present a novel framework named SubInfer, which conducts training and inference on subgraphs instead of the entire graph, hence efficiently handling large-scale heterogeneous graphs.
The proposed framework comprises three main steps: 1) partitioning the heterogeneous graph from multiple perspectives to preserve various semantic information, 2) completing the subgraphs to improve the convergence speed of subgraph training and the performance of subgraph inference, and 3) training and running inference with the HGNN model on distributed clusters to further reduce the time overhead. The framework is applicable to the vast majority of HGNN models. Experiments on five benchmark datasets demonstrate that SubInfer effectively optimizes the training and inference phases, delivering comparable performance to traditional HGNN models while significantly reducing time and memory overhead. \ No newline at end of file diff --git a/data/2024/aaai/An Embedding-Unleashing Video Polyp Segmentation Framework via Region Linking and Scale Alignment b/data/2024/aaai/An Embedding-Unleashing Video Polyp Segmentation Framework via Region Linking and Scale Alignment new file mode 100644 index 0000000000..309c908bfd --- /dev/null +++ b/data/2024/aaai/An Embedding-Unleashing Video Polyp Segmentation Framework via Region Linking and Scale Alignment @@ -0,0 +1 @@ +Automatic polyp segmentation from colonoscopy videos is a critical task for the development of computer-aided screening and diagnosis systems. However, accurate and real-time video polyp segmentation (VPS) is a very challenging task due to low contrast between background and polyps and dramatic frame-to-frame variations in colonoscopy videos. We propose a novel embedding-unleashing framework consisting of a proposal-generative network (PGN) and an appearance-embedding network (AEN) to comprehensively address these challenges. Our framework, for the first time, models VPS as an appearance-level semantic embedding process to facilitate generating more global information to counteract background disturbances and dramatic variations. Specifically, PGN is a video segmentation network to obtain segmentation mask proposals, while AEN is a network we specially designed to produce appearance-level embedding semantics for PGN, thereby unleashing the capability of PGN in VPS. Our AEN consists of a cross-scale region linking (CRL) module and a cross-wise scale alignment (CSA) module. The former screens reliable background information against background disturbances by constructing links among region semantics, while the latter performs the scale alignment to resist dramatic variations by modeling the center-perceived motion dependence in a cross-wise manner. We further introduce a parameter-free semantic interaction to embed the semantics of AEN into PGN to obtain the segmentation results. Extensive experiments on CVC-612 and SUN-SEG demonstrate that our approach achieves better performance than other state-of-the-art methods. Codes are available at https://github.com/zhixue-fang/EUVPS. \ No newline at end of file diff --git a/data/2024/aaai/An Empirical Study of CLIP for Text-Based Person Search b/data/2024/aaai/An Empirical Study of CLIP for Text-Based Person Search new file mode 100644 index 0000000000..c260ecdf69 --- /dev/null +++ b/data/2024/aaai/An Empirical Study of CLIP for Text-Based Person Search @@ -0,0 +1 @@ +Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions.
Recently, Contrastive Language Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably on various cross-modal downstream tasks due to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, is also seeing a rise in research on CLIP-based TBPS. In order to explore the potential of the visual-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contribute a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss functions. The model, with the aforementioned designs and practical training tricks, can attain satisfactory performance without any sophisticated modules. Also, we conduct probing experiments on TBPS-CLIP in terms of model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and highlight future CLIP-based TBPS research. \ No newline at end of file diff --git a/data/2024/aaai/An Empirical Study of Distributed Deep Learning Training on Edge (Student Abstract) b/data/2024/aaai/An Empirical Study of Distributed Deep Learning Training on Edge (Student Abstract) new file mode 100644 index 0000000000..c04ecf0a47 --- /dev/null +++ b/data/2024/aaai/An Empirical Study of Distributed Deep Learning Training on Edge (Student Abstract) @@ -0,0 +1 @@ +Deep learning (DL), despite its success in various fields, remains expensive and inaccessible to many due to its need for powerful supercomputing and high-end GPUs. This study explores alternative computing infrastructure and methods for distributed DL on low-energy, low-cost devices. We experiment on Raspberry Pi 4 devices with ARM Cortex-A72 processors and train a ResNet-18 model on the CIFAR-10 dataset. Our findings reveal limitations and opportunities for future optimizations, paving the way for a DL toolset for low-energy edge devices. \ No newline at end of file diff --git a/data/2024/aaai/An Exercise in Tournament Design: When Some Matches Must Be Scheduled b/data/2024/aaai/An Exercise in Tournament Design: When Some Matches Must Be Scheduled new file mode 100644 index 0000000000..03f1012425 --- /dev/null +++ b/data/2024/aaai/An Exercise in Tournament Design: When Some Matches Must Be Scheduled @@ -0,0 +1 @@ +Single-elimination (SE) tournaments are a popular format used in competitive environments and decision making. Algorithms for SE tournament manipulation have been an active topic of research in recent years. In this paper, we initiate the algorithmic study of a novel variant of SE tournament manipulation that aims to model the fact that certain matchups are highly desired in a sporting context, incentivizing an organizer to manipulate the bracket to make such matchups take place. We obtain both hardness and tractability results. We show that while the problem of computing a bracket enforcing a given set of matches in an SE tournament is NP-hard, there are natural restrictions that lead to polynomial-time solvability. In particular, we show polynomial-time solvability if there is a linear ordering on the ability of players with only a constant number of exceptions where a player with lower ability beats a player with higher ability.
\ No newline at end of file diff --git a/data/2024/aaai/An Implicit Trust Region Approach to Behavior Regularized Offline Reinforcement Learning b/data/2024/aaai/An Implicit Trust Region Approach to Behavior Regularized Offline Reinforcement Learning new file mode 100644 index 0000000000..651b32ba0f --- /dev/null +++ b/data/2024/aaai/An Implicit Trust Region Approach to Behavior Regularized Offline Reinforcement Learning @@ -0,0 +1 @@ +We revisit behavior regularization, a popular approach to mitigate the extrapolation error in offline reinforcement learning (RL), showing that current behavior regularization may suffer from unstable learning and hinder policy improvement. Motivated by this, a novel reward shaping-based behavior regularization method is proposed, where the log-probability ratio between the learned policy and the behavior policy is monitored during learning. We show that this is equivalent to an implicit but computationally lightweight trust region mechanism, which is beneficial to mitigate the influence of estimation errors of the value function, leading to more stable performance improvement. Empirical results on the popular D4RL benchmark verify the effectiveness of the presented method with promising performance compared with some state-of-the-art offline RL algorithms. \ No newline at end of file diff --git a/data/2024/aaai/An Information-Flow Perspective on Algorithmic Fairness b/data/2024/aaai/An Information-Flow Perspective on Algorithmic Fairness new file mode 100644 index 0000000000..6d62ca7778 --- /dev/null +++ b/data/2024/aaai/An Information-Flow Perspective on Algorithmic Fairness @@ -0,0 +1,5 @@ +This work presents insights gained by investigating the relationship between algorithmic fairness and the concept of secure information flow. The problem of enforcing secure information flow is well-studied in the context of information security: If secret information may "flow" through an algorithm or program in such a way that it can influence the program’s output, then that is considered insecure information flow as attackers could potentially observe (parts of) the secret. + +There is a strong correspondence between secure information flow and algorithmic fairness: if protected attributes such as race, gender, or age are treated as secret program inputs, then secure information flow means that these "secret" attributes cannot influence the result of a program. While most research in algorithmic fairness evaluation concentrates on studying the impact of algorithms (often treating the algorithm as a black-box), the concepts derived from information flow can be used both for the analysis of disparate treatment as well as disparate impact w.r.t. a structural causal model. + +In this paper, we examine the relationship between quantitative as well as qualitative information-flow properties and fairness. Moreover, based on this duality, we derive a new quantitative notion of fairness called fairness spread, which can be easily analyzed using quantitative information flow and which strongly relates to counterfactual fairness. We demonstrate that off-the-shelf tools for information-flow properties can be used in order to formally analyze a program's algorithmic fairness properties, including the new notion of fairness spread as well as established notions such as demographic parity. 
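The implicit trust region abstract above describes monitoring the log-probability ratio between the learned policy and the behavior policy as a reward-shaping term. As a rough, hedged sketch of that general idea only (the shaping coefficient alpha, the clipping range, and the behavior-policy estimate below are assumptions for illustration, not the paper's exact formulation):

import numpy as np

def shaped_rewards(rewards, logp_pi, logp_behavior, alpha=0.1, ratio_clip=5.0):
    # rewards:       environment rewards from the offline dataset
    # logp_pi:       log pi_theta(a|s) under the learned policy
    # logp_behavior: log pi_beta(a|s) under (an estimate of) the behavior policy
    # Penalizing the log-ratio discourages actions the behavior policy would
    # rarely take; clipping it acts as a crude, implicit trust region.
    log_ratio = np.clip(logp_pi - logp_behavior, -ratio_clip, ratio_clip)
    return rewards - alpha * log_ratio

# Example: the further an action lies outside the data distribution,
# the more its shaped reward is reduced.
r = np.array([1.0, 1.0, 1.0])
print(shaped_rewards(r, logp_pi=np.array([-0.5, -0.5, -0.5]),
                     logp_behavior=np.array([-0.6, -2.0, -8.0])))
# -> [0.99 0.85 0.5 ]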
\ No newline at end of file diff --git a/data/2024/aaai/An Interpretable Approach to the Solutions of High-Dimensional Partial Differential Equations b/data/2024/aaai/An Interpretable Approach to the Solutions of High-Dimensional Partial Differential Equations new file mode 100644 index 0000000000..69a7acaad5 --- /dev/null +++ b/data/2024/aaai/An Interpretable Approach to the Solutions of High-Dimensional Partial Differential Equations @@ -0,0 +1 @@ +In recent years, machine learning algorithms, especially deep learning, have shown promising prospects in solving Partial Differential Equations (PDEs). However, as the dimension increases, the relationship and interaction between variables become more complex, and existing methods struggle to provide fast and interpretable solutions for high-dimensional PDEs. To address this issue, we propose a genetic programming symbolic regression algorithm based on transfer learning and automatic differentiation to solve PDEs. This method uses genetic programming to search for a mathematically understandable expression and combines automatic differentiation to determine whether the search result satisfies the PDE and boundary conditions to be solved. To overcome the problem of slow solution speed caused by the large search space, we propose a transfer learning mechanism that transfers the structure of one-dimensional PDE analytical solutions to the form of high-dimensional PDE solutions. We tested three representative types of PDEs, and the results showed that our proposed method can obtain reliable and human-understandable real solutions or algebraically equivalent solutions of PDEs, and converges faster than the compared methods. Code of this project is at https://github.com/grassdeerdeer/HD-TLGP. \ No newline at end of file diff --git a/data/2024/aaai/An Optimal Transport View for Subspace Clustering and Spectral Clustering b/data/2024/aaai/An Optimal Transport View for Subspace Clustering and Spectral Clustering new file mode 100644 index 0000000000..07e78a938d --- /dev/null +++ b/data/2024/aaai/An Optimal Transport View for Subspace Clustering and Spectral Clustering @@ -0,0 +1 @@ +Clustering is one of the most fundamental problems in machine learning and data mining, and many algorithms have been proposed in the past decades. Among them, subspace clustering and spectral clustering are the most famous approaches. In this paper, we provide an explanation for subspace clustering and spectral clustering from the perspective of optimal transport. Optimal transport studies how to move samples from one distribution to another distribution with minimal transport cost, and has shown a powerful ability to extract geometric information. By considering a self optimal transport model with only one group of samples, we observe that both subspace clustering and spectral clustering can be explained in the framework of optimal transport, and the optimal transport matrix bridges the spaces of features and spectral embeddings. Inspired by this connection, we propose a spectral optimal transport barycenter model, which learns spectral embeddings by solving a barycenter problem equipped with an optimal transport discrepancy and guidance of data. Based on our proposed model, we take advantage of optimal transport to exploit both feature and metric information involved in data for learning coupled spectral embeddings and affinity matrix in a unified model.
We develop an alternating optimization algorithm to solve the resultant problems, and conduct experiments in different settings to evaluate the performance of our proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/Analysis of Differentially Private Synthetic Data: A Measurement Error Approach b/data/2024/aaai/Analysis of Differentially Private Synthetic Data: A Measurement Error Approach new file mode 100644 index 0000000000..7cd1ee22cd --- /dev/null +++ b/data/2024/aaai/Analysis of Differentially Private Synthetic Data: A Measurement Error Approach @@ -0,0 +1 @@ +Differentially private (DP) synthetic datasets have been receiving significant attention from academia, industry, and government. However, little is known about how to perform statistical inference using DP synthetic datasets. Naive approaches that do not take into account the induced uncertainty due to the DP mechanism will result in biased estimators and invalid inferences. In this paper, we present a class of maximum likelihood estimator (MLE)-based easy-to-implement bias-corrected DP estimators with valid asymptotic confidence intervals (CI) for parameters in regression settings, by establishing the connection between additive DP mechanisms and measurement error models. Our simulation shows that our estimator has comparable performance to the widely used sufficient statistic perturbation (SSP) algorithm in some scenarios but with the advantage of releasing a synthetic dataset and obtaining statistically valid asymptotic CIs, which can achieve better coverage when compared to the naive CIs obtained by ignoring the DP mechanism. \ No newline at end of file diff --git a/data/2024/aaai/Analytically Tractable Models for Decision Making under Present Bias b/data/2024/aaai/Analytically Tractable Models for Decision Making under Present Bias new file mode 100644 index 0000000000..94364aeaf7 --- /dev/null +++ b/data/2024/aaai/Analytically Tractable Models for Decision Making under Present Bias @@ -0,0 +1 @@ +Time-inconsistency is a characteristic of human behavior in which people plan for long-term benefits but take actions that differ from the plan due to conflicts with short-term benefits. Such time-inconsistent behavior is believed to be caused by present bias, a tendency to overestimate immediate rewards and underestimate future rewards. It is essential in behavioral economics to investigate the relationship between present bias and time-inconsistency. In this paper, we propose a model for analyzing agent behavior with present bias in tasks to make progress toward a goal over a specific period. Unlike previous models, the state sequence of the agent can be described analytically in our model. Based on this property, we analyze three crucial problems related to agents under present bias: task abandonment, optimal goal setting, and optimal reward scheduling. Extensive analysis reveals how present bias affects the condition under which task abandonment occurs and optimal intervention strategies. Our findings are meaningful for preventing task abandonment and intervening through incentives in the real world. 
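The present-bias abstract above turns on agents overweighting immediate rewards. A common textbook way to illustrate the resulting time inconsistency is the quasi-hyperbolic (beta-delta) discounting model; the sketch below uses that standard form as an assumption for illustration, not necessarily the paper's own analytical model.

def present_biased_value(stream, beta=0.5, delta=1.0):
    # stream[t] is the reward received t steps from *now*; the immediate reward is
    # undiscounted, while every future reward is scaled by beta (present bias)
    # and by delta**t (standard exponential discounting).
    return stream[0] + sum(beta * (delta ** t) * r
                           for t, r in enumerate(stream[1:], start=1))

# Plan made at time 0: pay an effort cost of 6 tomorrow to earn 10 the day after.
print(present_biased_value([0, -6, 10]))   # 2.0  -> the agent plans to do the task
# Re-evaluated at time 1, the cost has become immediate:
print(present_biased_value([-6, 10]))      # -1.0 -> the agent abandons the task

The flip from a positive to a negative evaluation of the same plan is exactly the task-abandonment phenomenon the abstract analyzes.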
\ No newline at end of file diff --git a/data/2024/aaai/Analyzing Generalization in Policy Networks: A Case Study with the Double-Integrator System b/data/2024/aaai/Analyzing Generalization in Policy Networks: A Case Study with the Double-Integrator System new file mode 100644 index 0000000000..d7edacbddd --- /dev/null +++ b/data/2024/aaai/Analyzing Generalization in Policy Networks: A Case Study with the Double-Integrator System @@ -0,0 +1 @@ +Extensive utilization of deep reinforcement learning (DRL) policy networks in diverse continuous control tasks has raised questions regarding performance degradation in expansive state spaces where the input state norm is larger than that in the training environment. This paper aims to uncover the underlying factors contributing to such performance deterioration when dealing with expanded state spaces, using a novel analysis technique known as state division. In contrast to prior approaches that employ state division merely as a post-hoc explanatory tool, our methodology delves into the intrinsic characteristics of DRL policy networks. Specifically, we demonstrate that the expansion of the state space induces the activation function $\tanh$ to exhibit saturation, resulting in the transformation of the state division boundary from nonlinear to linear. Our analysis centers on the paradigm of the double-integrator system, revealing that this gradual shift towards linearity imparts a control behavior reminiscent of bang-bang control. However, the inherent linearity of the division boundary prevents the attainment of an ideal bang-bang control, thereby introducing unavoidable overshooting. Our experimental investigations, employing diverse RL algorithms, establish that this performance phenomenon stems from inherent attributes of the DRL policy network, remaining consistent across various optimization algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Anchoring Path for Inductive Relation Prediction in Knowledge Graphs b/data/2024/aaai/Anchoring Path for Inductive Relation Prediction in Knowledge Graphs new file mode 100644 index 0000000000..0e87c5eac3 --- /dev/null +++ b/data/2024/aaai/Anchoring Path for Inductive Relation Prediction in Knowledge Graphs @@ -0,0 +1 @@ +Aiming to accurately predict missing edges representing relations between entities, which are pervasive in real-world Knowledge Graphs (KGs), relation prediction plays a critical role in enhancing the comprehensiveness and utility of KGs. Recent research focuses on path-based methods due to their inductive and explainable properties. However, these methods face a great challenge when many reasoning paths do not form Closed Paths (CPs) in the KG. To address this challenge, we propose Anchoring Path Sentence Transformer (APST) by introducing Anchoring Paths (APs) to alleviate the reliance on CPs. Specifically, we develop a search-based description retrieval method to enrich entity descriptions and an assessment mechanism to evaluate the rationality of APs. APST takes both APs and CPs as the inputs of a unified Sentence Transformer architecture, enabling comprehensive predictions and high-quality explanations. We evaluate APST on three public datasets and achieve state-of-the-art (SOTA) performance in 30 of 36 transductive, inductive, and few-shot experimental settings.
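To make the Closed Path (CP) terminology in the APST abstract above concrete, the toy sketch below enumerates relation paths between two entities in a hand-made knowledge graph; when no such closed path exists, purely path-based reasoning has nothing to work with, which is the gap Anchoring Paths are meant to fill. The triples and helper function are invented for illustration and are not part of APST.

from collections import defaultdict

# Toy knowledge graph as (head, relation, tail) triples.
triples = [("alice", "works_at", "acme"),
           ("acme", "located_in", "paris"),
           ("bob", "born_in", "paris")]

adj = defaultdict(list)
for h, r, t in triples:
    adj[h].append((r, t))

def closed_paths(head, tail, max_len=3):
    # Enumerate relation paths (closed paths) from head to tail, up to max_len hops.
    paths, stack = [], [(head, [])]
    while stack:
        node, rels = stack.pop()
        if node == tail and rels:
            paths.append(rels)
            continue
        if len(rels) < max_len:
            for r, nxt in adj[node]:
                stack.append((nxt, rels + [r]))
    return paths

print(closed_paths("alice", "paris"))  # [['works_at', 'located_in']]
print(closed_paths("alice", "bob"))    # [] -- no closed path; the case APs are designed to handle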
\ No newline at end of file diff --git a/data/2024/aaai/Angle Robustness Unmanned Aerial Vehicle Navigation in GNSS-Denied Scenarios b/data/2024/aaai/Angle Robustness Unmanned Aerial Vehicle Navigation in GNSS-Denied Scenarios new file mode 100644 index 0000000000..ccc364ee4f --- /dev/null +++ b/data/2024/aaai/Angle Robustness Unmanned Aerial Vehicle Navigation in GNSS-Denied Scenarios @@ -0,0 +1 @@ +Due to the inability to receive signals from the Global Navigation Satellite System (GNSS) in extreme conditions, achieving accurate and robust navigation for Unmanned Aerial Vehicles (UAVs) is a challenging task. Recently, vision-based navigation has emerged as a promising and feasible alternative to GNSS-based navigation. However, existing vision-based techniques are inadequate in addressing flight deviation caused by environmental disturbances and inaccurate position predictions in practical settings. In this paper, we present a novel angle robustness navigation paradigm to deal with flight deviation in point-to-point navigation tasks. Additionally, we propose a model that includes the Adaptive Feature Enhance Module, Cross-knowledge Attention-guided Module and Robust Task-oriented Head Module to accurately predict direction angles for high-precision navigation. To evaluate the vision-based navigation methods, we collect a new dataset termed UAV_AR368. Furthermore, we design the Simulation Flight Testing Instrument (SFTI) using Google Earth to simulate different flight environments, thereby reducing the expenses associated with real flight testing. Experimental results demonstrate that the proposed model outperforms the state-of-the-art by achieving improvements of 26.0% and 45.6% in the success rate of arrival under ideal and disturbed circumstances, respectively. \ No newline at end of file diff --git a/data/2024/aaai/AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model b/data/2024/aaai/AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model new file mode 100644 index 0000000000..7f8125dda4 --- /dev/null +++ b/data/2024/aaai/AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model @@ -0,0 +1 @@ +Anomaly inspection plays an important role in industrial manufacturing. Existing anomaly inspection methods are limited in their performance due to insufficient anomaly data. Although anomaly generation methods have been proposed to augment the anomaly data, they either suffer from poor generation authenticity or inaccurate alignment between the generated anomalies and masks. To address the above problems, we propose AnomalyDiffusion, a novel diffusion-based few-shot anomaly generation model, which utilizes the strong prior information of a latent diffusion model learned from a large-scale dataset to enhance the generation authenticity under few-shot training data. Firstly, we propose Spatial Anomaly Embedding, which consists of a learnable anomaly embedding and a spatial embedding encoded from an anomaly mask, disentangling the anomaly information into anomaly appearance and location information. Moreover, to improve the alignment between the generated anomalies and the anomaly masks, we introduce a novel Adaptive Attention Re-weighting Mechanism. Based on the disparities between the generated anomaly image and normal sample, it dynamically guides the model to focus more on the areas with less noticeable generated anomalies, enabling generation of accurately-matched anomalous image-mask pairs.
Extensive experiments demonstrate that our model significantly outperforms the state-of-the-art methods in generation authenticity and diversity, and effectively improves the performance of downstream anomaly inspection tasks. The code and data are available at https://github.com/sjtuplayer/anomalydiffusion. \ No newline at end of file diff --git a/data/2024/aaai/AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models b/data/2024/aaai/AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models new file mode 100644 index 0000000000..60772a7744 --- /dev/null +++ b/data/2024/aaai/AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models @@ -0,0 +1 @@ +Large Vision-Language Models (LVLMs) such as MiniGPT-4 and LLaVA have demonstrated the capability of understanding images and achieved remarkable performance in various visual tasks. Despite their strong abilities in recognizing common objects due to extensive training datasets, they lack specific domain knowledge and have a weaker understanding of localized details within objects, which hinders their effectiveness in the Industrial Anomaly Detection (IAD) task. On the other hand, most existing IAD methods only provide anomaly scores and necessitate the manual setting of thresholds to distinguish between normal and abnormal samples, which restricts their practical implementation. In this paper, we explore the utilization of LVLM to address the IAD problem and propose AnomalyGPT, a novel IAD approach based on LVLM. We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image. We also employ an image decoder to provide fine-grained semantics and design a prompt learner to fine-tune the LVLM using prompt embeddings. Our AnomalyGPT eliminates the need for manual threshold adjustments and directly assesses the presence and locations of anomalies. Additionally, AnomalyGPT supports multi-turn dialogues and exhibits impressive few-shot in-context learning capabilities. With only one normal shot, AnomalyGPT achieves state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset. \ No newline at end of file diff --git a/data/2024/aaai/Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding b/data/2024/aaai/Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding new file mode 100644 index 0000000000..73477de6fc --- /dev/null +++ b/data/2024/aaai/Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding @@ -0,0 +1 @@ +While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC) which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. As seen, CLIC expands the receptive field into the entire image for intra-cluster feature aggregation. Afterward, features are reordered to their original spatial positions to pass through the local attention units for inter-cluster embedding. Additionally, we introduce the Guided Post-Quantization Filtering (GuidedPQF) into CLIC, effectively mitigating the propagation and accumulation of quantization errors at the initial decoding stage.
Extensive experiments demonstrate the superior performance of CLIC over state-of-the-art works: when optimized using MSE, it outperforms VVC by about 10% BD-Rate in three widely-used benchmark datasets; when optimized using MS-SSIM, it saves more than 50% BD-Rate over VVC. Our CLIC offers a new way to generate compact representations for image compression, which also provides a novel direction along the line of LIC development. \ No newline at end of file diff --git a/data/2024/aaai/Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images b/data/2024/aaai/Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images new file mode 100644 index 0000000000..9143bd65dc --- /dev/null +++ b/data/2024/aaai/Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images @@ -0,0 +1 @@ +Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is unfeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed HD images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2X compared to the traditional tiled algorithm. The source code is available at https://github.com/ProAirVerse/Any-Size-Diffusion. \ No newline at end of file diff --git a/data/2024/aaai/Any-Stereo: Arbitrary Scale Disparity Estimation for Iterative Stereo Matching b/data/2024/aaai/Any-Stereo: Arbitrary Scale Disparity Estimation for Iterative Stereo Matching new file mode 100644 index 0000000000..f8daacf3d9 --- /dev/null +++ b/data/2024/aaai/Any-Stereo: Arbitrary Scale Disparity Estimation for Iterative Stereo Matching @@ -0,0 +1 @@ +Due to unaffordable computational costs, the regularized disparity in iterative stereo matching is typically maintained at a lower resolution than the input. To regress the full resolution disparity, most stereo methods resort to convolutions to decode a fixed-scale output. However, they are inadequate for recovering vital high-frequency information lost during downsampling, limiting their performance on full-resolution prediction. In this paper, we introduce AnyStereo, an accurate and efficient disparity upsampling module with implicit neural representation for the iterative stereo pipeline. 
By modeling the disparity as a continuous representation over 2D spatial coordinates, subtle details can emerge from the latent space at arbitrary resolution. To further complement the missing information and details in the latent code, we propose two strategies: intra-scale similarity unfolding and cross-scale feature alignment. The former unfolds the neighbor relationships, while the latter introduces the context in high-resolution feature maps. The proposed AnyStereo can seamlessly replace the upsampling module in most iterative stereo models, improving their ability to capture fine details and generate arbitrary-scale disparities even with fewer parameters. With our method, the iterative stereo pipeline establishes new state-of-the-art performance. The code is available at https://github.com/Zhaohuai-L/Any-Stereo. \ No newline at end of file diff --git a/data/2024/aaai/Any-Way Meta Learning b/data/2024/aaai/Any-Way Meta Learning new file mode 100644 index 0000000000..1ff0fc0a08 --- /dev/null +++ b/data/2024/aaai/Any-Way Meta Learning @@ -0,0 +1,8 @@ +Although meta-learning shows promising performance in the realm of rapid adaptability, it is constrained by +fixed cardinality. When faced with tasks of varying cardinalities that were unseen during training, +the model fails to adapt. In this paper, we address and resolve this challenge +by harnessing `label equivalence', which emerges from stochastic numeric label assignments during episodic task sampling. Questioning what defines ``true" meta-learning, we introduce the ``any-way" learning paradigm, an innovative model training approach that liberates the model from +fixed cardinality constraints. Surprisingly, this model not only matches but often outperforms traditional fixed-way models in terms of performance, convergence speed, and stability. This disrupts established notions +about domain generalization. Furthermore, we argue that the inherent +label equivalence naturally lacks semantic information. To bridge this +semantic information gap arising from label equivalence, we further propose a mechanism for infusing semantic class information into the model. This would enhance the model's comprehension and functionality. Experiments conducted on renowned architectures like MAML and ProtoNet affirm the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/Approval-Based Committee Voting in Practice: A Case Study of (over-)Representation in the Polkadot Blockchain b/data/2024/aaai/Approval-Based Committee Voting in Practice: A Case Study of (over-)Representation in the Polkadot Blockchain new file mode 100644 index 0000000000..09e76d1112 --- /dev/null +++ b/data/2024/aaai/Approval-Based Committee Voting in Practice: A Case Study of (over-)Representation in the Polkadot Blockchain @@ -0,0 +1 @@ +We provide the first large-scale data collection of real-world approval-based committee elections. These elections have been conducted on the Polkadot blockchain as part of their Nominated Proof-of-Stake mechanism and contain around one thousand candidates and tens of thousands of (weighted) voters each. We conduct an in-depth study of application-relevant questions, including a quantitative and qualitative analysis of the outcomes returned by different voting rules. Besides considering proportionality measures that are standard in the multiwinner voting literature, we pay particular attention to less-studied measures of overrepresentation, as these are closely related to the security of the Polkadot network.
We also analyze how different design decisions such as the committee size affect the examined measures. \ No newline at end of file diff --git a/data/2024/aaai/Approximate Distance Oracle for Fault-Tolerant Geometric Spanners b/data/2024/aaai/Approximate Distance Oracle for Fault-Tolerant Geometric Spanners new file mode 100644 index 0000000000..3dce889761 --- /dev/null +++ b/data/2024/aaai/Approximate Distance Oracle for Fault-Tolerant Geometric Spanners @@ -0,0 +1,7 @@ +In this paper, we present approximate distance and shortest-path oracles for fault-tolerant Euclidean spanners motivated by the routing problem in real-world road networks. +A fault-tolerant Euclidean spanner for a set of points in Euclidean space is a graph +in which, despite the deletion of a small number of points, the distance between any two points in the damaged graph is an approximation of their Euclidean distance. +Given a fault-tolerant Euclidean spanner and a small approximation factor, +our data structure allows us to compute an approximate distance between two points in the damaged spanner in constant time when a query involves any two points and a small set of failed points. +Additionally, by incorporating auxiliary data structures, we can return a path itself in time almost linear in the length of the returned path. +Both data structures require near-linear space. \ No newline at end of file diff --git a/data/2024/aaai/Approximate Integer Solution Counts over Linear Arithmetic Constraints b/data/2024/aaai/Approximate Integer Solution Counts over Linear Arithmetic Constraints new file mode 100644 index 0000000000..a38430cfb1 --- /dev/null +++ b/data/2024/aaai/Approximate Integer Solution Counts over Linear Arithmetic Constraints @@ -0,0 +1 @@ +Counting integer solutions of linear constraints has found interesting applications in various fields. It is equivalent to the problem of counting lattice points inside a polytope. However, state-of-the-art algorithms for this problem become too slow for even a modest number of variables. In this paper, we propose a new framework to approximate the lattice counts inside a polytope with a new random-walk sampling method. The counts computed by our approach have been proven to be approximately bounded by an (epsilon, delta)-bound. Experiments on extensive benchmarks show that our algorithm can solve polytopes with dozens of dimensions, which significantly outperforms state-of-the-art counters. \ No newline at end of file diff --git a/data/2024/aaai/Approximation Algorithms for Preference Aggregation Using CP-Nets b/data/2024/aaai/Approximation Algorithms for Preference Aggregation Using CP-Nets new file mode 100644 index 0000000000..42e6f220ac --- /dev/null +++ b/data/2024/aaai/Approximation Algorithms for Preference Aggregation Using CP-Nets @@ -0,0 +1 @@ +This paper studies the design and analysis of approximation algorithms for aggregating preferences over combinatorial domains, represented using Conditional Preference Networks (CP-nets). Its focus is on aggregating preferences over so-called swaps, for which optimal solutions in general are already known to be of exponential size. We first analyze a trivial 2-approximation algorithm that simply outputs the best of the given input preferences, and establish a structural condition under which the approximation ratio of this algorithm is improved to 4/3. We then propose a polynomial-time approximation algorithm whose outputs are provably no worse than those of the trivial algorithm, but often substantially better.
A family of problem instances is presented for which our improved algorithm produces optimal solutions, while, for any ε, the trivial algorithm cannot attain a (2- ε)-approximation. These results may lead to the first polynomial-time approximation algorithm that solves the CP-net aggregation problem for swaps with an approximation ratio substantially better than 2. \ No newline at end of file diff --git a/data/2024/aaai/Approximation Scheme for Weighted Metric Clustering via Sherali-Adams b/data/2024/aaai/Approximation Scheme for Weighted Metric Clustering via Sherali-Adams new file mode 100644 index 0000000000..de5c7fb911 --- /dev/null +++ b/data/2024/aaai/Approximation Scheme for Weighted Metric Clustering via Sherali-Adams @@ -0,0 +1,3 @@ +Motivated by applications to classification problems on metric data, we study the Weighted Metric Clustering problem: given a metric d over n points and a k x k symmetric matrix A with non-negative entries, the goal is to find a k-partition of these points into clusters C1,...,Ck, while minimizing the sum of A[i,j] * d(u,v) over all pairs of clusters Ci and Cj and all pairs of points u from Ci and v from Cj. Specific choices of A lead to Weighted Metric Clustering capturing well-studied graph partitioning problems in metric spaces, such as Min-Uncut, Min-k-Sum, Min-k-Cut, and more. + +Our main result is that Weighted Metric Clustering admits a polynomial-time approximation scheme (PTAS). Our algorithm handles all the above problems using the Sherali-Adams linear programming relaxation. This subsumes several prior works, unifies many of the techniques for various metric clustering objectives, and yields a PTAS for several new problems, including metric clustering on manifolds and a new family of hierarchical clustering objectives. Our experiments on the hierarchical clustering objective show that it better captures the ground-truth structural information compared to Dasgupta's popular objective. \ No newline at end of file diff --git a/data/2024/aaai/Arbitrariness and Social Prediction: The Confounding Role of Variance in Fair Classification b/data/2024/aaai/Arbitrariness and Social Prediction: The Confounding Role of Variance in Fair Classification new file mode 100644 index 0000000000..4416010711 --- /dev/null +++ b/data/2024/aaai/Arbitrariness and Social Prediction: The Confounding Role of Variance in Fair Classification @@ -0,0 +1 @@ +Variance in predictions across different trained models is a significant, under-explored source of error in fair binary classification. In practice, the variance on some data examples is so large that decisions can be effectively arbitrary. To investigate this problem, we take an experimental approach and make four overarching contributions. We: 1) Define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; 2) Develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; 3) Conduct the largest to-date empirical study of the role of variance (vis-a-vis self-consistency and arbitrariness) in fair binary classification; and, 4) Release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily usable for future research. Altogether, our experiments reveal shocking insights about the reliability of conclusions on benchmark datasets.
Most fair binary classification benchmarks are close-to-fair when taking into account the amount of arbitrariness present in predictions -- before we even try to apply any fairness interventions. This finding calls into question the practical utility of common algorithmic fairness methods, and in turn suggests that we should reconsider how we choose to measure fairness in binary classification. \ No newline at end of file diff --git a/data/2024/aaai/Arbitrary-Scale Point Cloud Upsampling by Voxel-Based Network with Latent Geometric-Consistent Learning b/data/2024/aaai/Arbitrary-Scale Point Cloud Upsampling by Voxel-Based Network with Latent Geometric-Consistent Learning new file mode 100644 index 0000000000..e62dcf39a0 --- /dev/null +++ b/data/2024/aaai/Arbitrary-Scale Point Cloud Upsampling by Voxel-Based Network with Latent Geometric-Consistent Learning @@ -0,0 +1 @@ +Recently, arbitrary-scale point cloud upsampling has become increasingly popular due to its efficiency and convenience for practical applications. To achieve this, most previous approaches formulate it as a problem of surface approximation and employ point-based networks to learn surface representations. However, learning surfaces from sparse point clouds is more challenging, and thus they often suffer from low-fidelity geometry approximation. To address it, we propose an arbitrary-scale Point cloud Upsampling framework using Voxel-based Network (PU-VoxelNet). Thanks to the completeness and regularity inherited from the voxel representation, voxel-based networks are capable of providing a predefined grid space to approximate the 3D surface, and an arbitrary number of points can be reconstructed according to the predicted density distribution within each grid cell. However, we observe inaccurate grid sampling caused by imprecise density predictions. To address this issue, a density-guided grid resampling method is developed to generate high-fidelity points while effectively avoiding sampling outliers. Further, to improve the fine-grained details, we present an auxiliary training supervision to enforce the latent geometric consistency among local surface patches. Extensive experiments indicate the proposed approach outperforms the state-of-the-art approaches not only in terms of fixed upsampling rates but also for arbitrary-scale upsampling. The code is available at https://github.com/hikvision-research/3DVision \ No newline at end of file diff --git a/data/2024/aaai/Arbitrary-Scale Video Super-resolution Guided by Dynamic Context b/data/2024/aaai/Arbitrary-Scale Video Super-resolution Guided by Dynamic Context new file mode 100644 index 0000000000..b8ab134b54 --- /dev/null +++ b/data/2024/aaai/Arbitrary-Scale Video Super-resolution Guided by Dynamic Context @@ -0,0 +1 @@ +We propose a Dynamic Context-Guided Upsampling (DCGU) module for video super-resolution (VSR) that leverages temporal context guidance to achieve efficient and effective arbitrary-scale VSR. While most VSR research focuses on backbone design, the importance of the upsampling part is often overlooked. Existing methods rely on pixelshuffle-based upsampling, which has limited capabilities in handling arbitrary upsampling scales. Recent attempts to replace pixelshuffle-based modules with implicit neural function-based and filter-based approaches suffer from slow inference speeds and limited representation capacity, respectively.
To overcome these limitations, our DCGU module predicts non-local sampling locations and content-dependent filter weights, enabling efficient and effective arbitrary-scale VSR. Our proposed multi-granularity location search module efficiently identifies non-local sampling locations across the entire low-resolution grid, and the temporal bilateral filter modulation module integrates content information with the filter weight to enhance textual details. Extensive experiments demonstrate the superiority of our method in terms of performance and speed on arbitrary-scale VSR. \ No newline at end of file diff --git a/data/2024/aaai/Are You Concerned about Limited Function Evaluations: Data-Augmented Pareto Set Learning for Expensive Multi-Objective Optimization b/data/2024/aaai/Are You Concerned about Limited Function Evaluations: Data-Augmented Pareto Set Learning for Expensive Multi-Objective Optimization new file mode 100644 index 0000000000..05dcc3f391 --- /dev/null +++ b/data/2024/aaai/Are You Concerned about Limited Function Evaluations: Data-Augmented Pareto Set Learning for Expensive Multi-Objective Optimization @@ -0,0 +1 @@ +Optimizing multiple conflicting black-box objectives simultaneously is a prevalent occurrence in many real-world applications, such as neural architecture search, and machine learning. These problems are known as expensive multi-objective optimization problems (EMOPs) when the function evaluations are computationally or financially costly. Multi-objective Bayesian optimization (MOBO) offers an efficient approach to discovering a set of Pareto optimal solutions. However, the data deficiency issue caused by limited function evaluations has posed a great challenge to current optimization methods. Moreover, most current methods tend to prioritize the quality of candidate solutions, while ignoring the quantity of promising samples. In order to tackle these issues, our paper proposes a novel multi-objective Bayesian optimization algorithm with a data augmentation strategy that provides ample high-quality samples for Pareto set learning (PSL). Specifically, it utilizes Generative Adversarial Networks (GANs) to enrich data and a dominance prediction model to screen out high-quality samples, mitigating the predicament of limited function evaluations in EMOPs. Additionally, we adopt the regularity model to expensive multi-objective Bayesian optimization for PSL. Experimental results on both synthetic and real-world problems demonstrate that our algorithm outperforms several state-of-the-art and classical algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning b/data/2024/aaai/Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning new file mode 100644 index 0000000000..9a611f22e7 --- /dev/null +++ b/data/2024/aaai/Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning @@ -0,0 +1 @@ +Until recently, the question of the effective inductive bias of deep models on tabular data has remained unanswered. This paper investigates the hypothesis that arithmetic feature interaction is necessary for deep tabular learning. To test this point, we create a synthetic tabular dataset with a mild feature interaction assumption and examine a modified transformer architecture enabling arithmetical feature interactions, referred to as AMFormer. 
Results show that AMFormer outperforms strong counterparts in fine-grained tabular data modeling, data efficiency in training, and generalization. This is attributed to its parallel additive and multiplicative attention operators and prompt-based optimization, which facilitate the separation of tabular samples in an extended space with arithmetically-engineered features. Our extensive experiments on real-world data also validate the consistent effectiveness, efficiency, and rationale of AMFormer, suggesting it has established a strong inductive bias for deep learning on tabular data. Code is available at https://github.com/aigc-apps/AMFormer. \ No newline at end of file diff --git a/data/2024/aaai/ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank b/data/2024/aaai/ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank new file mode 100644 index 0000000000..b777917bea --- /dev/null +++ b/data/2024/aaai/ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank @@ -0,0 +1 @@ +Artistic style transfer aims to repaint the content image with the learned artistic style. Existing artistic style transfer methods can be divided into two categories: small model-based approaches and pre-trained large-scale model-based approaches. Small model-based approaches can preserve the content structure, but fail to produce highly realistic stylized images and introduce artifacts and disharmonious patterns; pre-trained large-scale model-based approaches can generate highly realistic stylized images but struggle with preserving the content structure. To address the above issues, we propose ArtBank, a novel artistic style transfer framework, to generate highly realistic stylized images while preserving the content structure of the content images. Specifically, to fully exploit the knowledge embedded in pre-trained large-scale models, an Implicit Style Prompt Bank (ISPB), a set of trainable parameter matrices, is designed to learn and store knowledge from the collection of artworks and behave as a visual prompt to guide pre-trained large-scale models to generate highly realistic stylized images while preserving content structure. Besides, to accelerate the training of the above ISPB, we propose a novel Spatial-Statistical-based self-Attention Module (SSAM). The qualitative and quantitative experiments demonstrate the superiority of our proposed method over state-of-the-art artistic style transfer methods. Code is available at https://github.com/Jamie-Cheung/ArtBank. \ No newline at end of file diff --git a/data/2024/aaai/Artificial Intelligence in the CS2023 Undergraduate Computer Science Curriculum: Rationale and Challenges b/data/2024/aaai/Artificial Intelligence in the CS2023 Undergraduate Computer Science Curriculum: Rationale and Challenges new file mode 100644 index 0000000000..2438252604 --- /dev/null +++ b/data/2024/aaai/Artificial Intelligence in the CS2023 Undergraduate Computer Science Curriculum: Rationale and Challenges @@ -0,0 +1 @@ +Roughly every decade, the ACM and IEEE professional organizations have produced recommendations for the education of undergraduate computer science students. These guidelines are used worldwide by research universities, liberal arts colleges, and community colleges.
For the latest 2023 revision of the curriculum, AAAI has collaborated with ACM and IEEE to integrate artificial intelligence more broadly into this new curriculum and to address the issues it raises for students, instructors, practitioners, policy makers, and the general public. This paper describes the development process and rationale that underlie the artificial intelligence components of the CS2023 curriculum, discusses the challenges in curriculum design for such a rapidly advancing field, and examines lessons learned during this three-year process. \ No newline at end of file diff --git a/data/2024/aaai/Aspect-Based Sentiment Analysis with Explicit Sentiment Augmentations b/data/2024/aaai/Aspect-Based Sentiment Analysis with Explicit Sentiment Augmentations new file mode 100644 index 0000000000..2c6035ea7c --- /dev/null +++ b/data/2024/aaai/Aspect-Based Sentiment Analysis with Explicit Sentiment Augmentations @@ -0,0 +1 @@ +Aspect-based sentiment analysis (ABSA), a fine-grained sentiment classification task, has received much attention recently. Many works investigate sentiment information through opinion words, such as "good'' and "bad''. However, implicit sentiment data widely exists in the ABSA dataset, whose sentiment polarity is hard to determine due to the lack of distinct opinion words. To deal with implicit sentiment, this paper proposes an ABSA method that integrates explicit sentiment augmentations (ABSA-ESA) to add more sentiment clues. We propose an ABSA-specific explicit sentiment generation method to create such augmentations. Specifically, we post-train T5 by rule-based data and employ three strategies to constrain the sentiment polarity and aspect term of the generated augmentations. We employ Syntax Distance Weighting and Unlikelihood Contrastive Regularization in the training procedure to guide the model to generate the explicit opinion words with the same polarity as the input sentence. Meanwhile, we utilize the Constrained Beam Search to ensure the augmentations are aspect-related. We test ABSA-ESA on two ABSA benchmarks. The results show that ABSA-ESA outperforms the SOTA baselines on implicit and explicit sentiment accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Assume-Guarantee Reinforcement Learning b/data/2024/aaai/Assume-Guarantee Reinforcement Learning new file mode 100644 index 0000000000..5ce95fc823 --- /dev/null +++ b/data/2024/aaai/Assume-Guarantee Reinforcement Learning @@ -0,0 +1 @@ +We present a modular approach to reinforcement learning (RL) in environments consisting of simpler components evolving in parallel. A monolithic view of such modular environments may be prohibitively large to learn, or may require unrealizable communication between the components in the form of a centralized controller. Our proposed approach is based on the assume-guarantee paradigm where the optimal control for the individual components is synthesized in isolation by making assumptions about the behaviors of neighboring components, and providing guarantees about their own behavior. We express these assume-guarantee contracts as regular languages and provide automatic translations to scalar rewards to be used in RL. By combining local probabilities of satisfaction for each component, we provide a lower bound on the probability of satisfaction of the complete system. By solving a Markov game for each component, RL can produce a controller for each component that maximizes this lower bound. 
The controller utilizes the information it receives through communication, observations, and any knowledge of a coarse model of other agents. We experimentally demonstrate the efficiency of the proposed approach on a variety of case studies. \ No newline at end of file diff --git a/data/2024/aaai/Asymmetric Mutual Alignment for Unsupervised Zero-Shot Sketch-Based Image Retrieval b/data/2024/aaai/Asymmetric Mutual Alignment for Unsupervised Zero-Shot Sketch-Based Image Retrieval new file mode 100644 index 0000000000..366657e047 --- /dev/null +++ b/data/2024/aaai/Asymmetric Mutual Alignment for Unsupervised Zero-Shot Sketch-Based Image Retrieval @@ -0,0 +1 @@ +In recent years, many methods have been proposed to address the zero-shot sketch-based image retrieval (ZS-SBIR) task, which is a practical problem in many applications. However, in real-world scenarios, on the one hand, we cannot obtain training data with the same distribution as the test data, and on the other hand, the labels of the training data are usually unavailable. To tackle this issue, we focus on a new problem, namely unsupervised zero-shot sketch-based image retrieval (UZS-SBIR), where the available training data does not have labels while the training and testing categories are not overlapping. In this paper, we introduce a new asymmetric mutual alignment method (AMA) that includes a self-distillation module and a cross-modality mutual alignment module. First, we conduct self-distillation to extract the feature embeddings from unlabeled data. Due to the lack of label information in the unsupervised setting, we employ the cross-modality mutual alignment module to further excavate underlying intra-modality and inter-modality relationships from unlabeled data, and take full advantage of these correlations to align the feature embeddings in image and sketch domains. Meanwhile, the feature representations are enhanced by the intra-modality clustering relations, leading to better generalization ability to unseen classes. Moreover, we adopt an asymmetric strategy to update the teacher and student networks. Extensive experimental results on several benchmark datasets demonstrate the superiority of our method. \ No newline at end of file diff --git a/data/2024/aaai/Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation b/data/2024/aaai/Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation new file mode 100644 index 0000000000..e9ce1d1e6c --- /dev/null +++ b/data/2024/aaai/Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation @@ -0,0 +1 @@ +Existing generative adversarial network (GAN) based conditional image generative models typically produce fixed output for the same conditional input, which is unreasonable for highly subjective tasks, such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generative methods require retraining/fine-tuning the network or designing complex noise injection functions, which is computationally expensive or task-specific, and such methods often struggle to generate high-quality results. Given that many deterministic conditional image generative models have been able to produce high-quality yet fixed results, we raise an intriguing question: is it possible for pre-trained deterministic conditional image generative models to generate diverse results without changing network structures or parameters?
To answer this question, we re-examine the conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in projected gradient descent (PGD)-like method for diverse and controllable image generation. The key idea is to attack the pre-trained deterministic generative models by adding a micro perturbation to the input condition. In this way, diverse results can be generated without any adjustment of network structures or fine-tuning of the pre-trained models. In addition, we can also control the diverse results to be generated by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attacks to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Attacking CNNs in Histopathology with SNAP: Sporadic and Naturalistic Adversarial Patches (Student Abstract) b/data/2024/aaai/Attacking CNNs in Histopathology with SNAP: Sporadic and Naturalistic Adversarial Patches (Student Abstract) new file mode 100644 index 0000000000..78b358b8ca --- /dev/null +++ b/data/2024/aaai/Attacking CNNs in Histopathology with SNAP: Sporadic and Naturalistic Adversarial Patches (Student Abstract) @@ -0,0 +1,11 @@ +Convolutional neural networks (CNNs) are being increasingly +adopted in medical imaging. However, in the race for +developing accurate models, their robustness is often overlooked. +This elicits a significant concern given the safety-critical +nature of the healthcare system. Here, we highlight +the vulnerability of CNNs against a sporadic and naturalistic +adversarial patch attack (SNAP). We train SNAP to mislead +the ResNet50 model predicting metastasis in histopathological +scans of lymph node sections, lowering the accuracy by +27%. This work emphasizes the need for defense strategies +before deploying CNNs in critical healthcare settings. \ No newline at end of file diff --git a/data/2024/aaai/Attacking Transformers with Feature Diversity Adversarial Perturbation b/data/2024/aaai/Attacking Transformers with Feature Diversity Adversarial Perturbation new file mode 100644 index 0000000000..fcb82c5748 --- /dev/null +++ b/data/2024/aaai/Attacking Transformers with Feature Diversity Adversarial Perturbation @@ -0,0 +1 @@ +Understanding the mechanisms behind Vision Transformer (ViT), particularly its vulnerability to adversarial perturbations, is crucial for addressing challenges in its real-world applications. Existing ViT adversarial attackers rely on labels to calculate the gradient for perturbation, and exhibit low transferability to other structures and tasks. In this paper, we present a label-free white-box attack approach for ViT-based models that exhibits strong transferability to various black-box models, including most ViT variants, CNNs, and MLPs, even for models developed for other modalities. Our inspiration comes from the feature collapse phenomenon in ViTs, where the critical attention mechanism overly depends on the low-frequency component of features, causing the features in middle-to-end layers to become increasingly similar and eventually collapse. We propose the feature diversity attacker to naturally accelerate this process and achieve remarkable performance and transferability.
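The abstract does not spell out the attack objective, but the idea of accelerating feature collapse can be pictured with a short, hedged sketch: a PGD-style loop that perturbs the input so as to reduce the diversity of intermediate ViT token features. The variance-based diversity loss, the feature_fn hook, and all hyperparameters below are illustrative assumptions rather than the authors' implementation.

import torch

def feature_diversity(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, num_tokens, dim) features from a middle transformer block
    centered = tokens - tokens.mean(dim=1, keepdim=True)
    return centered.pow(2).mean()  # higher value = more diverse token features

def feature_collapse_attack(x, feature_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    # Label-free, PGD-like attack: minimize feature diversity w.r.t. the input.
    # x: clean images in [0, 1]; feature_fn: maps images -> token features,
    # e.g. a forward hook on a ViT block (an assumed setup, not prescribed here).
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = feature_diversity(feature_fn(x_adv))
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                    # push features toward collapse
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project to the L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

Under this sketch the attack needs no labels; transferability would then be assessed by feeding x_adv to unseen black-box models.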
\ No newline at end of file diff --git a/data/2024/aaai/Attacks on Continual Semantic Segmentation by Perturbing Incremental Samples b/data/2024/aaai/Attacks on Continual Semantic Segmentation by Perturbing Incremental Samples new file mode 100644 index 0000000000..31b87bad39 --- /dev/null +++ b/data/2024/aaai/Attacks on Continual Semantic Segmentation by Perturbing Incremental Samples @@ -0,0 +1 @@ +As an essential computer vision task, Continual Semantic Segmentation (CSS) has received a lot of attention. However, security issues regarding this task have not been fully studied. To bridge this gap, we study the problem of attacks in CSS in this paper. We first propose a new task, namely, attacks on incremental samples in CSS, and reveal that the attacks on incremental samples corrupt the performance of CSS in both old and new classes. Moreover, we present an adversarial sample generation method based on class shift, namely Class Shift Attack (CS-Attack), which is an offline and easy-to-implement approach for CSS. CS-Attack is able to significantly degrade the performance of models on both old and new classes without knowledge of the incremental learning approach, which undermines the original purpose of the incremental learning, i.e., learning new classes while retaining old knowledge. Experiments show that on the popular datasets Pascal VOC, ADE20k, and Cityscapes, our approach easily degrades the performance of currently popular CSS methods, which reveals the importance of security in CSS. \ No newline at end of file diff --git a/data/2024/aaai/Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention b/data/2024/aaai/Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention new file mode 100644 index 0000000000..3f9b993acd --- /dev/null +++ b/data/2024/aaai/Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention @@ -0,0 +1 @@ +Vision Transformer (ViT) is one of the most widely used models in computer vision owing to its strong performance on various tasks. In order to fully utilize ViT-based architectures in various applications, proper visualization methods with decent localization performance are necessary, but the methods employed for CNN-based models are not directly applicable to ViT due to its unique structure. In this work, we propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision. Our method selectively aggregates the gradients directly propagated from the classification output to each self-attention, collecting the contribution of image features extracted from each location of the input image. These gradients are additionally guided by the normalized self-attention scores, which are the pairwise patch correlation scores. They are used to supplement the gradients on the patch-level context information efficiently detected by the self-attention mechanism. This approach provides elaborate high-level semantic explanations with great localization performance using only the class labels. As a result, our method outperforms the previous leading explainability methods of ViT in the weakly-supervised localization task and presents great capability in capturing the full instances of the target class object. Meanwhile, our method provides a visualization that faithfully explains the model, which is demonstrated in the perturbation comparison test.
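The aggregation described here can be illustrated with a small, hedged sketch. The recipe below (ReLU of gradient times attention, averaged over heads and summed over layers, read off the CLS row) is a common way to combine class-output gradients with normalized self-attention scores, assumed for illustration rather than taken from the paper.

import torch

def attention_guided_cam(attentions, attention_grads, grid_size):
    # attentions / attention_grads: lists of (1, heads, tokens, tokens) tensors,
    # collected per layer with forward/backward hooks for one image and one class.
    # Assumes token 0 is the [CLS] token and the remaining tokens are image patches.
    cam = None
    for attn, grad in zip(attentions, attention_grads):
        weighted = torch.relu(grad * attn).mean(dim=1)  # head-averaged relevance per layer
        cls_to_patches = weighted[0, 0, 1:]             # [CLS] row, patch columns
        cam = cls_to_patches if cam is None else cam + cls_to_patches
    cam = cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
    return cam.reshape(grid_size, grid_size)            # e.g. 14 x 14 for ViT-B/16

Upsampling the returned map to the input resolution gives the kind of heatmap used for weakly-supervised localization.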
\ No newline at end of file diff --git a/data/2024/aaai/Attention-Based Models for Snow-Water Equivalent Prediction b/data/2024/aaai/Attention-Based Models for Snow-Water Equivalent Prediction new file mode 100644 index 0000000000..769456f379 --- /dev/null +++ b/data/2024/aaai/Attention-Based Models for Snow-Water Equivalent Prediction @@ -0,0 +1 @@ +Snow Water-Equivalent (SWE)—the amount of water available if snowpack is melted—is a key decision variable used by water management agencies to make irrigation, flood control, power generation, and drought management decisions. SWE values vary spatiotemporally—affected by weather, topography, and other environmental factors. While daily SWE can be measured by Snow Telemetry (SNOTEL) stations with requisite instrumentation, such stations are spatially sparse, requiring interpolation techniques to create spatiotemporally complete data. While recent efforts have explored machine learning (ML) for SWE prediction, a number of recent ML advances have yet to be considered. The main contribution of this paper is to explore one such ML advance, attention mechanisms, for SWE prediction. Our hypothesis is that attention has a unique ability to capture and exploit correlations that may exist across locations or the temporal spectrum (or both). We present a generic attention-based modeling framework for SWE prediction and adapt it to capture spatial attention and temporal attention. Our experimental results on 323 SNOTEL stations in the Western U.S. demonstrate that our attention-based models outperform other machine-learning approaches. We also provide key results highlighting the differences between spatial and temporal attention in this context and a roadmap toward deployment for generating spatially-complete SWE maps. \ No newline at end of file diff --git a/data/2024/aaai/Attention-Induced Embedding Imputation for Incomplete Multi-View Partial Multi-Label Classification b/data/2024/aaai/Attention-Induced Embedding Imputation for Incomplete Multi-View Partial Multi-Label Classification new file mode 100644 index 0000000000..e5cb6f20e1 --- /dev/null +++ b/data/2024/aaai/Attention-Induced Embedding Imputation for Incomplete Multi-View Partial Multi-Label Classification @@ -0,0 +1 @@ +As a combination of emerging multi-view learning methods and traditional multi-label classification tasks, multi-view multi-label classification has shown broad application prospects. The diverse semantic information contained in heterogeneous data effectively enables the further development of multi-label classification. However, the widespread incompleteness problem on multi-view features and labels greatly hinders the practical application of multi-view multi-label classification. Therefore, in this paper, we propose an attention-induced missing instances imputation technique to enhance the generalization ability of the model. Different from existing incomplete multi-view completion methods, we attempt to approximate the latent features of missing instances in embedding space according to cross-view joint attention, instead of recovering missing views in kernel space or original feature space. Accordingly, multi-view completed features are dynamically weighted by the confidence derived from joint attention in the late fusion phase.
In addition, we propose a multi-view multi-label classification framework based on label-semantic feature learning, utilizing the statistical weak label correlation matrix and graph attention network to guide the learning process of label-specific features. Finally, our model is compatible with missing multi-view and partial multi-label data simultaneously and extensive experiments on five datasets confirm the advancement and effectiveness of our embedding imputation method and multi-view multi-label classification model. \ No newline at end of file diff --git a/data/2024/aaai/Attribute-Missing Graph Clustering Network b/data/2024/aaai/Attribute-Missing Graph Clustering Network new file mode 100644 index 0000000000..0f01a2d7d8 --- /dev/null +++ b/data/2024/aaai/Attribute-Missing Graph Clustering Network @@ -0,0 +1 @@ +Deep clustering with attribute-missing graphs, where only a subset of nodes possesses complete attributes while those of others are missing, is an important yet challenging topic in various practical applications. It has become a prevalent learning paradigm in existing studies to perform data imputation first and subsequently conduct clustering using the imputed information. However, these ``two-stage" methods disconnect the clustering and imputation processes, preventing the model from effectively learning clustering-friendly graph embedding. Furthermore, they are not tailored for clustering tasks, leading to inferior clustering results. To solve these issues, we propose a novel Attribute-Missing Graph Clustering (AMGC) method to alternately promote clustering and imputation in a unified framework, where we iteratively produce the clustering-enhanced nearest neighbor information to conduct the data imputation process and utilize the imputed information to implicitly refine the clustering distribution through model optimization. Specifically, in the imputation step, we take the learned clustering information as imputation prompts to help each attribute-missing sample gather highly correlated features within its clusters for data completion, such that the intra-class compactness can be improved. Moreover, to support reliable clustering, we maximize inter-class separability by conducting cost-efficient dual non-contrastive learning over the imputed latent features, which in turn promotes greater graph encoding capability for clustering sub-network. Extensive experiments on five datasets have verified the superiority of AMGC against competitors. \ No newline at end of file diff --git a/data/2024/aaai/Audio Generation with Multiple Conditional Diffusion Model b/data/2024/aaai/Audio Generation with Multiple Conditional Diffusion Model new file mode 100644 index 0000000000..707f36e0ab --- /dev/null +++ b/data/2024/aaai/Audio Generation with Multiple Conditional Diffusion Model @@ -0,0 +1 @@ +Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. 
To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. \ No newline at end of file diff --git a/data/2024/aaai/Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification b/data/2024/aaai/Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification new file mode 100644 index 0000000000..9307370366 --- /dev/null +++ b/data/2024/aaai/Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification @@ -0,0 +1 @@ +With the rapid growth of audio data, there's a pressing need for automatic audio classification. As a type of time-series data, audio exhibits waveform fluctuations in both the time and frequency domains that evolve over time, with similar instances sharing consistent patterns. This study introduces the Audio Scanning Network (ASNet), designed to leverage abundant information for achieving stable and effective audio classification. ASNet captures real-time changes in audio waveforms across both time and frequency domains through reservoir computing, supported by Reservoir Kernel Canonical Correlation Analysis (RKCCA) to explore correlations between time-domain and frequency-domain waveform fluctuations. This innovative approach empowers ASNet to comprehensively capture the changes and inherent correlations within the audio waveform, and without the need for time-consuming iterative training. Instead of converting audio into spectrograms, ASNet directly utilizes audio feature sequences to uncover associations between time and frequency fluctuations. Experiments on environmental sound and music genre classification tasks demonstrate ASNet's comparable performance to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head b/data/2024/aaai/AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head new file mode 100644 index 0000000000..6724ab9392 --- /dev/null +++ b/data/2024/aaai/AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head @@ -0,0 +1 @@ +Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. 
With an increasing demand to evaluate multi-modal LLMs on human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving 16 AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Code can be found at https://github.com/AIGC-Audio/AudioGPT \ No newline at end of file diff --git a/data/2024/aaai/Auditable Algorithms for Approximate Model Counting b/data/2024/aaai/Auditable Algorithms for Approximate Model Counting new file mode 100644 index 0000000000..6aea2b0c4f --- /dev/null +++ b/data/2024/aaai/Auditable Algorithms for Approximate Model Counting @@ -0,0 +1,5 @@ +The problem of model counting, i.e., counting satisfying assignments of a Boolean formula, is a fundamental problem in computer science, with diverse applications. Given #P-hardness of the problem, many algorithms have been developed over the years to provide an approximate model count. Recently, building on the practical success of SAT-solvers used as NP oracles, the focus has shifted from theory to practical implementations of such algorithms. This has brought new challenges into focus. In this paper, we consider one such challenge – that of auditable deterministic approximate model counters wherein a counter should also generate a certificate, which allows a user (often with limited computational power) to independently audit whether the count returned by an invocation of the algorithm is indeed within the promised bounds. + +We start by examining a celebrated approximate model counting algorithm due to Stockmeyer that uses polynomially many calls to a \Sigma^2_P oracle, and show that it can be audited via a \Pi^2_P formula on (n^2 log^2 n) variables, where n is the number of variables in the original formula. Since n is often large (10’s to 100’s of thousands) for typical instances, we ask if the count of variables in the certificate formula can be reduced – a critical question towards potential implementation. We show that this improvement in certification can be achieved with a tradeoff in the counting algorithm’s complexity. Specifically, we develop new deterministic approximate model counting algorithms that invoke a \Sigma^3_P oracle, but can be certified using a \Pi^2_P formula on fewer variables: our final algorithm uses just (n log n) variables. + +Our study demonstrates that one can simplify certificate checking significantly if we allow the counting algorithm to access a slightly more powerful oracle. We believe this shows for the first time how the audit complexity can be traded for the complexity of approximate counting. \ No newline at end of file diff --git a/data/2024/aaai/Augmented Commonsense Knowledge for Remote Object Grounding b/data/2024/aaai/Augmented Commonsense Knowledge for Remote Object Grounding new file mode 100644 index 0000000000..21d18fe150 --- /dev/null +++ b/data/2024/aaai/Augmented Commonsense Knowledge for Remote Object Grounding @@ -0,0 +1 @@ +The vision-and-language navigation (VLN) task requires an agent to perceive the surroundings, follow natural language instructions, and act in photo-realistic unseen environments. Most of the existing methods employ the entire image or object features to represent navigable viewpoints.
However, these representations are insufficient for proper action prediction, especially for the REVERIE task, which uses concise high-level instructions, such as “Bring me the blue cushion in the master bedroom”. To enhance these representations, we propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as a spatio-temporal knowledge graph for improving agent navigation. Specifically, the proposed approach involves constructing a knowledge base by retrieving commonsense information from ConceptNet, followed by a refinement module to remove noisy and irrelevant knowledge. We further present ACK which consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment by integrating visible objects, commonsense knowledge, and concept history, which includes object and knowledge temporal information. Moreover, we add a new pipeline for the commonsense-based decision-making process which leads to more accurate local action prediction. Experimental results demonstrate our proposed model noticeably outperforms the baseline and achieves state-of-the-art performance on the REVERIE benchmark. The source code is available at https://github.com/Bahram-Mohammadi/ACK. \ No newline at end of file diff --git a/data/2024/aaai/Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery b/data/2024/aaai/Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery new file mode 100644 index 0000000000..b5f09f3071 --- /dev/null +++ b/data/2024/aaai/Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery @@ -0,0 +1 @@ +The substantial success of Vision Transformer (ViT) in computer vision tasks is largely attributed to the architecture design. This underscores the necessity of efficient architecture search for designing better ViTs automatically. As training-based architecture search methods are computationally intensive, there’s a growing interest in training-free methods that use zero-cost proxies to score ViTs. However, existing training-free approaches require expert knowledge to manually design specific zero-cost proxies. Moreover, these zero-cost proxies exhibit limited ability to generalize across diverse domains. In this paper, we introduce Auto-Prox, an automatic proxy discovery framework, to address the problem. First, we build the ViT-Bench-101, which involves different ViT candidates and their actual performance on multiple datasets. Utilizing ViT-Bench-101, we can evaluate zero-cost proxies based on their score-accuracy correlation. Then, we represent zero-cost proxies with computation graphs and organize the zero-cost proxy search space with ViT statistics and primitive operations. To discover generic zero-cost proxies, we propose a joint correlation metric to evolve and mutate different zero-cost proxy candidates. We introduce an elitism-preserve strategy for search efficiency to achieve a better trade-off between exploitation and exploration. Based on the discovered zero-cost proxy, we conduct a ViT architecture search in a training-free manner. Extensive experiments demonstrate that our method generalizes well to different datasets and achieves state-of-the-art results both in ranking correlation and final accuracy. Codes can be found at https://github.com/lilujunai/Auto-Prox-AAAI24.
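The score-accuracy correlation used to judge proxies on ViT-Bench-101 is a standard ranking statistic; a hedged sketch of the evaluation loop, with a simple gradient-norm score standing in for a discovered proxy, might look as follows (the proxy itself is an assumption, not Auto-Prox's output).

import torch
from scipy.stats import kendalltau, spearmanr

def gradient_norm_proxy(model, images, labels):
    # Illustrative zero-cost proxy: total absolute gradient after one mini-batch.
    model.zero_grad()
    torch.nn.functional.cross_entropy(model(images), labels).backward()
    return sum(p.grad.abs().sum().item() for p in model.parameters() if p.grad is not None)

def proxy_quality(proxy_scores, accuracies):
    # Ranking correlation between proxy scores and measured accuracies;
    # higher Kendall tau / Spearman rho means a better zero-cost proxy.
    tau, _ = kendalltau(proxy_scores, accuracies)
    rho, _ = spearmanr(proxy_scores, accuracies)
    return tau, rho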
\ No newline at end of file diff --git a/data/2024/aaai/Auto311: A Confidence-Guided Automated System for Non-emergency Calls b/data/2024/aaai/Auto311: A Confidence-Guided Automated System for Non-emergency Calls new file mode 100644 index 0000000000..cd329dd4c0 --- /dev/null +++ b/data/2024/aaai/Auto311: A Confidence-Guided Automated System for Non-emergency Calls @@ -0,0 +1 @@ +Emergency and non-emergency response systems are essential services provided by local governments and critical to protecting lives, the environment, and property. The effective handling of (non-)emergency calls is critical for public safety and well-being. By reducing the burden created by non-emergency callers, residents in critical need of assistance through 911 will receive a fast and effective response. Collaborating with the Department of Emergency Communications (DEC) in Nashville, we analyzed 11,796 non-emergency call recordings and developed Auto311, the first automated system to handle 311 non-emergency calls, which (1) effectively and dynamically predicts ongoing non-emergency incident types to generate tailored case reports during the call; (2) itemizes essential information from dialogue contexts to complete the generated reports; and (3) strategically structures system-caller dialogues with optimized confidence. We used real-world data to evaluate the system's effectiveness and deployability. The experimental results indicate that the system effectively predicts incident type with an average F-1 score of 92.54%. Moreover, the system successfully itemizes critical information from relevant contexts to complete reports, evincing a 0.93 average consistency score compared to the ground truth. Additionally, emulations demonstrate that the system effectively decreases conversation turns as the utterance size gets more extensive and categorizes the ongoing call with 94.49% mean accuracy. \ No newline at end of file diff --git a/data/2024/aaai/AutoLTS: Automating Cycling Stress Assessment via Contrastive Learning and Spatial Post-processing b/data/2024/aaai/AutoLTS: Automating Cycling Stress Assessment via Contrastive Learning and Spatial Post-processing new file mode 100644 index 0000000000..6e2cbdd09b --- /dev/null +++ b/data/2024/aaai/AutoLTS: Automating Cycling Stress Assessment via Contrastive Learning and Spatial Post-processing @@ -0,0 +1 @@ +Cycling stress assessment, which quantifies cyclists' perceived stress imposed by the built environment and motor traffic, increasingly informs cycling infrastructure planning and cycling route recommendation. However, currently calculating cycling stress is slow and data-intensive, which hinders its broader application. In this paper, we propose a deep learning framework to support accurate, fast, and large-scale cycling stress assessments for urban road networks based on street-view images. Our framework features i) a contrastive learning approach that leverages the ordinal relationship among cycling stress labels, and ii) a post-processing technique that enforces spatial smoothness into our predictions. On a dataset of 39,153 road segments collected in Toronto, Canada, our results demonstrate the effectiveness of our deep learning framework and the value of using image data for cycling stress assessment in the absence of high-quality road geometry and motor traffic data.
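The spatial post-processing step can be pictured with a deliberately simple sketch: one pass of neighborhood voting over the road graph, so that each segment's predicted stress level is nudged toward agreement with adjacent segments. The actual AutoLTS technique is more principled; the voting rule and weights below are assumptions for illustration only.

from collections import Counter

def smooth_lts(pred, adjacency, keep_weight=2):
    # pred: {segment_id: predicted LTS class}; adjacency: {segment_id: [neighbor ids]}.
    # Each segment keeps `keep_weight` votes for its own prediction; each neighbor adds one vote.
    smoothed = {}
    for seg, label in pred.items():
        votes = Counter({label: keep_weight})
        votes.update(pred[n] for n in adjacency.get(seg, ()) if n in pred)
        smoothed[seg] = votes.most_common(1)[0][0]
    return smoothed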
\ No newline at end of file diff --git a/data/2024/aaai/AutoMixer for Improved Multivariate Time-Series Forecasting on Business and IT Observability Data b/data/2024/aaai/AutoMixer for Improved Multivariate Time-Series Forecasting on Business and IT Observability Data new file mode 100644 index 0000000000..7f48b65a34 --- /dev/null +++ b/data/2024/aaai/AutoMixer for Improved Multivariate Time-Series Forecasting on Business and IT Observability Data @@ -0,0 +1 @@ +The efficiency of business processes relies on business key performance indicators (Biz-KPIs), that can be negatively impacted by IT failures. Business and IT Observability (BizITObs) data fuses both Biz-KPIs and IT event channels together as multivariate time series data. Forecasting Biz-KPIs in advance can enhance efficiency and revenue through proactive corrective measures. However, BizITObs data generally exhibit both useful and noisy inter-channel interactions between Biz-KPIs and IT events that need to be effectively decoupled. This leads to suboptimal forecasting performance when existing multivariate forecasting models are employed. To address this, we introduce AutoMixer, a time-series Foundation Model (FM) approach, grounded on the novel technique of channel-compressed pretrain and finetune workflows. AutoMixer leverages an AutoEncoder for channel-compressed pretraining and integrates it with the advanced TSMixer model for multivariate time series forecasting. This fusion greatly enhances the potency of TSMixer for accurate forecasts and also generalizes well across several downstream tasks. Through detailed experiments and dashboard analytics, we show AutoMixer's capability to consistently improve the Biz-KPI's forecasting accuracy (by 11-15%) which directly translates to actionable business insights. \ No newline at end of file diff --git a/data/2024/aaai/Automated Assessment of Fidelity and Interpretability: An Evaluation Framework for Large Language Models' Explanations (Student Abstract) b/data/2024/aaai/Automated Assessment of Fidelity and Interpretability: An Evaluation Framework for Large Language Models' Explanations (Student Abstract) new file mode 100644 index 0000000000..201d5883af --- /dev/null +++ b/data/2024/aaai/Automated Assessment of Fidelity and Interpretability: An Evaluation Framework for Large Language Models' Explanations (Student Abstract) @@ -0,0 +1 @@ +As Large Language Models (LLMs) become more prevalent in various fields, it is crucial to rigorously assess the quality of their explanations. Our research introduces a task-agnostic framework for evaluating free-text rationales, drawing on insights from both linguistics and machine learning. We evaluate two dimensions of explainability: fidelity and interpretability. For fidelity, we propose methods suitable for proprietary LLMs where direct introspection of internal features is unattainable. For interpretability, we use language models instead of human evaluators, addressing concerns about subjectivity and scalability in evaluations. We apply our framework to evaluate GPT-3.5 and the impact of prompts on the quality of its explanations. In conclusion, our framework streamlines the evaluation of explanations from LLMs, promoting the development of safer models. 
\ No newline at end of file diff --git a/data/2024/aaai/Automated Defect Report Generation for Enhanced Industrial Quality Control b/data/2024/aaai/Automated Defect Report Generation for Enhanced Industrial Quality Control new file mode 100644 index 0000000000..3147059739 --- /dev/null +++ b/data/2024/aaai/Automated Defect Report Generation for Enhanced Industrial Quality Control @@ -0,0 +1 @@ +Defect detection is a pivotal aspect of ensuring product quality and production efficiency in industrial manufacturing. Existing studies on defect detection predominantly focus on locating defects through bounding boxes and classifying defect types. However, their methods can only provide limited information and fail to meet the requirements for further processing after detecting defects. To this end, we propose a novel task called defect detection report generation, which aims to provide more comprehensive and informative insights into detected defects in the form of text reports. For this task, we propose new datasets that cover 16 different materials, where each defect is accompanied by a detailed human-written report. In addition, we propose a knowledge-aware report generation model as a baseline for future research, which aims to incorporate additional knowledge to generate detailed analysis and subsequent processing related to defects in images. By constructing defect report datasets and proposing corresponding baselines, we chart new directions for future research and practical applications of this task. \ No newline at end of file diff --git a/data/2024/aaai/Automated Design of Affine Maximizer Mechanisms in Dynamic Settings b/data/2024/aaai/Automated Design of Affine Maximizer Mechanisms in Dynamic Settings new file mode 100644 index 0000000000..b4f09becb4 --- /dev/null +++ b/data/2024/aaai/Automated Design of Affine Maximizer Mechanisms in Dynamic Settings @@ -0,0 +1,4 @@ +Dynamic mechanism design is a challenging extension to ordinary mechanism design in which the mechanism designer must make a sequence of decisions over time in the face of possibly untruthful reports of participating agents. +Optimizing dynamic mechanisms for welfare is relatively well understood. However, there has been less work on optimizing for other goals (e.g., revenue), and without restrictive assumptions on valuations, it is remarkably challenging to characterize good mechanisms. Instead, we turn to automated mechanism design to find mechanisms with good performance in specific problem instances. +We extend the class of affine maximizer mechanisms to MDPs where agents may untruthfully report their rewards. This extension results in a challenging bilevel optimization problem in which the upper problem involves choosing optimal mechanism parameters, and the lower problem involves solving the resulting MDP. +Our approach can find truthful dynamic mechanisms that achieve strong performance on goals other than welfare, and can be applied to essentially any problem setting---without restrictions on valuations---for which RL can learn optimal policies.
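For readers unfamiliar with the term, an affine maximizer in the static setting is the standard weighted-VCG generalization written below; the paper learns analogous weights and boosts for the dynamic (MDP) setting, so the exact parameterization there may differ.

o^{*}(\hat{v}) \in \arg\max_{o \in O} \sum_{i} w_i \hat{v}_i(o) + \lambda(o),
\qquad
p_i(\hat{v}) = \frac{1}{w_i}\Big[ \max_{o \in O} \Big( \sum_{j \neq i} w_j \hat{v}_j(o) + \lambda(o) \Big) - \Big( \sum_{j \neq i} w_j \hat{v}_j(o^{*}) + \lambda(o^{*}) \Big) \Big]

where w_i > 0 are agent weights, \lambda(o) are outcome boosts, and \hat{v} are the (possibly untruthful) reports; with w_i = 1 and \lambda = 0 this reduces to the familiar VCG mechanism.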
\ No newline at end of file diff --git a/data/2024/aaai/Automated Natural Language Explanation of Deep Visual Neurons with Large Models (Student Abstract) b/data/2024/aaai/Automated Natural Language Explanation of Deep Visual Neurons with Large Models (Student Abstract) new file mode 100644 index 0000000000..64b4fc22c1 --- /dev/null +++ b/data/2024/aaai/Automated Natural Language Explanation of Deep Visual Neurons with Large Models (Student Abstract) @@ -0,0 +1 @@ +Interpreting deep neural networks through examining neurons offers distinct advantages when it comes to exploring the inner workings of Deep Neural Networks. Previous research has indicated that specific neurons within deep vision networks possess semantic meaning and play pivotal roles in model performance. Nonetheless, the current methods for generating neuron semantics heavily rely on human intervention, which hampers their scalability and applicability. To address this limitation, this paper proposes a novel post-hoc framework for generating semantic explanations of neurons with large foundation models, without requiring human intervention or prior knowledge. Experiments are conducted with both qualitative and quantitative analysis to verify the effectiveness of our proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Automated State Estimation for Summarizing the Dynamics of Complex Urban Systems Using Representation Learning b/data/2024/aaai/Automated State Estimation for Summarizing the Dynamics of Complex Urban Systems Using Representation Learning new file mode 100644 index 0000000000..8d731b1629 --- /dev/null +++ b/data/2024/aaai/Automated State Estimation for Summarizing the Dynamics of Complex Urban Systems Using Representation Learning @@ -0,0 +1,4 @@ +Complex urban systems can be difficult to monitor, diagnose and manage because the complete states of such systems are only partially observable with sensors. State estimation techniques can be used to determine the +underlying dynamic behavior of such complex systems with their highly non-linear processes and external time-variant influences. +States can be estimated by clustering observed sensor readings. However, +clustering performance degrades as the number of sensors and readings (i.e. feature dimension) increases. To address this problem, we propose a framework that learns a feature-centric lower dimensional representation of data for clustering to support analysis of system dynamics. We propose Unsupervised Feature Attention with Compact Representation (UFACR) to rank features contributing to a cluster assignment. These weighted features are then used to learn a reduced-dimension temporal representation of the data with a deep-learning model. The resulting low-dimensional representation can be effectively clustered into states. UFACR is evaluated on real-world and synthetic wastewater treatment plant data sets, and feature ranking outcomes were validated by Wastewater treatment domain experts. Our quantitative and qualitative experimental analyses demonstrate the effectiveness of UFACR for uncovering system dynamics in an automated and unsupervised manner to offer guidance to wastewater engineers to enhance industrial productivity and treatment efficiency. 
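As a rough illustration of the pipeline described above, the sketch below gates each sensor feature with a learnable attention weight, compresses the gated readings with a small autoencoder, and clusters the latent codes into states. The real UFACR model is temporal and considerably richer; the layer sizes, gating form, and clustering choice here are all assumptions.

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class GatedAutoencoder(nn.Module):
    def __init__(self, n_features, latent_dim=8):
        super().__init__()
        self.feature_logits = nn.Parameter(torch.zeros(n_features))  # per-feature attention scores
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))

    def forward(self, x):
        gate = torch.softmax(self.feature_logits, dim=0)  # normalized feature importances
        z = self.encoder(x * gate)
        return self.decoder(z), z

def estimate_states(readings, n_states=4, epochs=200, lr=1e-3):
    # readings: (num_samples, n_features) float tensor of sensor data
    model = GatedAutoencoder(readings.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, _ = model(readings)
        loss = nn.functional.mse_loss(recon, readings)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        _, z = model(readings)
    return KMeans(n_clusters=n_states, n_init=10).fit_predict(z.numpy())

Inspecting the softmax of feature_logits after training would give the kind of feature ranking the abstract refers to.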
\ No newline at end of file diff --git a/data/2024/aaai/Automatic Core-Guided Reformulation via Constraint Explanation and Condition Learning b/data/2024/aaai/Automatic Core-Guided Reformulation via Constraint Explanation and Condition Learning new file mode 100644 index 0000000000..59b67de4f7 --- /dev/null +++ b/data/2024/aaai/Automatic Core-Guided Reformulation via Constraint Explanation and Condition Learning @@ -0,0 +1,10 @@ +SAT and propagation solvers often underperform for optimisation models whose objective sums many single-variable terms. +MaxSAT solvers avoid this by detecting and exploiting cores: subsets of these terms that cannot collectively take their lower bounds. +Previous work has shown manual analysis of cores can help define model reformulations likely to speed up solving for many model instances. +This paper presents a method to automate this process. +For each selected core the method identifies the instance constraints that caused it; +infers the model constraints and parameters that explain how these instance constraints were formed; +and learns the conditions that made those model constraint instances generate cores, while others did not. +It then uses this information to reformulate the objective. +The empirical evaluation shows this method can produce useful reformulations. +Importantly, the method can be useful in many other situations that require explaining a set of constraints. \ No newline at end of file diff --git a/data/2024/aaai/Automatic Interpretation of Line Probe Assay Test for Tuberculosis b/data/2024/aaai/Automatic Interpretation of Line Probe Assay Test for Tuberculosis new file mode 100644 index 0000000000..6928e1e916 --- /dev/null +++ b/data/2024/aaai/Automatic Interpretation of Line Probe Assay Test for Tuberculosis @@ -0,0 +1 @@ +Line Probe Assay (LPA) is a widely used method for diagnosing drug-resistant tuberculosis (DRTB), but it is a time-consuming and labor-intensive process that requires expert interpretation. DRTB is a significant threat to global TB control efforts and its prompt diagnosis is critical for initiating appropriate treatment. In this paper, we present an automated LPA test interpretation solution that uses computer vision techniques to extract and analyze strips from LPA sheets and uses machine learning algorithms to produce drug sensitivity and resistivity outcomes with extremely high precision and recall. We also develop OCR models to eliminate manual data entry to further reduce the overall time. Our solution comprises a rejection module that flags ambiguous and novel samples that are then referred to experienced lab technicians. This results in increased trust in the solution. To evaluate our solution, we curate an extensive and diverse dataset of LPA strips annotated by multiple microbiologists across India. Our solution achieves more than 95% accuracy for all drugs on this dataset. The proposed solution has the potential to increase the efficiency, standardization of LPA test interpretation, and fast-tracking the dissemination of results to end-users via a designated Management Information System (MIS). 
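The rejection module mentioned above can be approximated, for intuition only, by a confidence-based triage rule: strips whose top class probability or top-2 margin falls below a threshold are referred to a technician instead of being auto-reported. The thresholds and the softmax-confidence criterion are assumptions; the deployed system may use a different ambiguity or novelty test.

import numpy as np

def triage(probabilities, min_confidence=0.9, min_margin=0.2):
    # probabilities: (num_strips, num_classes) softmax outputs of the strip classifier.
    # Returns indices of strips to auto-report and indices to refer to an expert.
    top2 = np.sort(probabilities, axis=1)[:, -2:]
    confident = (top2[:, 1] >= min_confidence) & (top2[:, 1] - top2[:, 0] >= min_margin)
    idx = np.arange(len(probabilities))
    return idx[confident], idx[~confident]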
\ No newline at end of file diff --git a/data/2024/aaai/Automatic Radiology Reports Generation via Memory Alignment Network b/data/2024/aaai/Automatic Radiology Reports Generation via Memory Alignment Network new file mode 100644 index 0000000000..a87e84227b --- /dev/null +++ b/data/2024/aaai/Automatic Radiology Reports Generation via Memory Alignment Network @@ -0,0 +1 @@ +The automatic generation of radiology reports is of great significance, which can reduce the workload of doctors and improve the accuracy and reliability of medical diagnosis and treatment, and has attracted wide attention in recent years. Cross-modal mapping between images and text, a key component of generating high-quality reports, is challenging due to the lack of corresponding annotations. Despite its importance, previous studies have often overlooked it or lacked adequate designs for this crucial component. In this paper, we propose a method with memory alignment embedding to assist the model in aligning visual and textual features to generate a coherent and informative report. Specifically, we first get the memory alignment embedding by querying the memory matrix, where the query is derived from a combination of the visual features and their corresponding positional embeddings. Then the alignment between the visual and textual features can be guided by the memory alignment embedding during the generation process. The comparison experiments with other alignment methods show that the proposed alignment method is less costly and more effective. The proposed approach achieves better performance than state-of-the-art approaches on two public datasets IU X-Ray and MIMIC-CXR, which further demonstrates the effectiveness of the proposed alignment method. \ No newline at end of file diff --git a/data/2024/aaai/Automatic Short Answer Grading for Finnish with ChatGPT b/data/2024/aaai/Automatic Short Answer Grading for Finnish with ChatGPT new file mode 100644 index 0000000000..70c1ec0176 --- /dev/null +++ b/data/2024/aaai/Automatic Short Answer Grading for Finnish with ChatGPT @@ -0,0 +1 @@ +Automatic short answer grading (ASAG) seeks to mitigate the burden on teachers by leveraging computational methods to evaluate student-constructed text responses. Large language models (LLMs) have recently gained prominence across diverse applications, with educational contexts being no exception. The sudden rise of ChatGPT has raised expectations that LLMs can handle numerous tasks, including ASAG. This paper aims to shed some light on this expectation by evaluating two LLM-based chatbots, namely ChatGPT built on GPT-3.5 and GPT-4, on scoring short-question answers under zero-shot and one-shot settings. Our data consists of 2000 student answers in Finnish from ten undergraduate courses. Multiple perspectives are taken into account during this assessment, encompassing those of grading system developers, teachers, and students. On our dataset, GPT-4 achieves a good QWK score (0.6+) in 44% of one-shot settings, clearly outperforming GPT-3.5 at 21%. We observe a negative association between student answer length and model performance, as well as a correlation between a smaller standard deviation among a set of predictions and lower performance. We conclude that while GPT-4 exhibits signs of being a capable grader, additional research is essential before considering its deployment as a reliable autograder. 
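The QWK figures quoted above are quadratic weighted kappa scores; for reference, they can be reproduced with scikit-learn as in the minimal sketch below (the grade vectors are made-up examples, not data from the paper).

from sklearn.metrics import cohen_kappa_score

teacher_grades = [0, 2, 3, 1, 4, 2, 0, 3]
model_grades = [0, 2, 2, 1, 4, 3, 1, 3]
qwk = cohen_kappa_score(teacher_grades, model_grades, weights="quadratic")
print(f"QWK = {qwk:.2f}")  # values above roughly 0.6 are often read as good agreement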
\ No newline at end of file diff --git a/data/2024/aaai/Automatically Testing Functional Properties of Code Translation Models b/data/2024/aaai/Automatically Testing Functional Properties of Code Translation Models new file mode 100644 index 0000000000..34862f3c52 --- /dev/null +++ b/data/2024/aaai/Automatically Testing Functional Properties of Code Translation Models @@ -0,0 +1 @@ +Large language models are becoming increasingly practical for translating code across programming languages, a process known as transpiling. Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations. \ No newline at end of file diff --git a/data/2024/aaai/Autonomous Policy Explanations for Effective Human-Machine Teaming b/data/2024/aaai/Autonomous Policy Explanations for Effective Human-Machine Teaming new file mode 100644 index 0000000000..3db650eede --- /dev/null +++ b/data/2024/aaai/Autonomous Policy Explanations for Effective Human-Machine Teaming @@ -0,0 +1 @@ +Policy explanation, a process for describing the behavior of an autonomous system, plays a crucial role in effectively conveying an agent's decision-making rationale to human collaborators and is essential for safe real-world deployments. It becomes even more critical in effective human-robot teaming, where good communication allows teams to adapt and improvise successfully during uncertain situations by enabling value alignment within the teams. This thesis proposal focuses on improving human-machine teaming by developing novel human-centered explainable AI (xAI) techniques that empower autonomous agents to communicate their capabilities and limitations via multiple modalities, teach and influence human teammates' behavior as decision-support systems, and effectively build and manage trust in HRI systems. 
\ No newline at end of file diff --git a/data/2024/aaai/Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation b/data/2024/aaai/Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation new file mode 100644 index 0000000000..9f41ae0635 --- /dev/null +++ b/data/2024/aaai/Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation @@ -0,0 +1 @@ +A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been an increasing interest in synthesising 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, for providing immersive experiences in various scenarios such as virtual reality. Yet, existing methods typically fall short in synthesizing intricate visual details or ensuring that the generated images align consistently with user-provided prompts. In this study, autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by outpainting an incomplete 360-degree image progressively with NFoV and text guidances jointly or individually. This autoregressive scheme not only allows for deriving finer-grained and text-consistent patterns by dynamically generating and adjusting the process but also offers users greater flexibility to edit their conditions throughout the generation process. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance in each autoregressive step. Text guidances, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention based transformers into a global stream and a local stream that feed into a conditioned generative backbone model. As AOG-Net can leverage large-scale models for the conditional encoder and the generative prior, it enables the generation to use extensive open-vocabulary text guidances. Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method. Our code is available at https://github.com/zhuqiangLu/AOG-NET-360. \ No newline at end of file diff --git a/data/2024/aaai/AvatarVerse: High-Quality & Stable 3D Avatar Creation from Text and Pose b/data/2024/aaai/AvatarVerse: High-Quality & Stable 3D Avatar Creation from Text and Pose new file mode 100644 index 0000000000..04d4388d99 --- /dev/null +++ b/data/2024/aaai/AvatarVerse: High-Quality & Stable 3D Avatar Creation from Text and Pose @@ -0,0 +1 @@ +Creating expressive, diverse and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task, due to the intricacy of modeling and texturing in 3D needed to ensure details and various styles (realistic, fictional, etc.). We present AvatarVerse, a stable pipeline for generating expressive high-quality 3D avatars from nothing but text descriptions and pose guidance. Specifically, we introduce a 2D diffusion model conditioned on DensePose signal to establish 3D pose control of avatars through 2D images, which enhances view consistency from partially observed scenarios. It addresses the infamous Janus Problem and significantly stabilizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy, which obtains substantial improvement over the quality of the created 3D avatars.
To this end, the proposed AvatarVerse pipeline achieves zero-shot 3D modeling of 3D avatars that are not only more expressive but also of higher quality and fidelity than previous works. Rigorous qualitative evaluations and user studies showcase AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, leading to a new standard in high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io/ . \ No newline at end of file diff --git a/data/2024/aaai/Axiomatic Aggregations of Abductive Explanations b/data/2024/aaai/Axiomatic Aggregations of Abductive Explanations new file mode 100644 index 0000000000..ebddbfeaf2 --- /dev/null +++ b/data/2024/aaai/Axiomatic Aggregations of Abductive Explanations @@ -0,0 +1 @@ +The recent criticisms of the robustness of post hoc model approximation explanation methods (like LIME and SHAP) have led to the rise of model-precise abductive explanations. For each data point, abductive explanations provide a minimal subset of features that are sufficient to generate the outcome. While theoretically sound and rigorous, abductive explanations suffer from a major issue --- there can be several valid abductive explanations for the same data point. In such cases, providing a single abductive explanation can be insufficient; on the other hand, providing all valid abductive explanations can be incomprehensible due to their size. In this work, we solve this issue by aggregating the many possible abductive explanations into feature importance scores. We propose three aggregation methods: two based on power indices from cooperative game theory and a third based on a well-known measure of causal strength. We characterize these three methods axiomatically, showing that each of them uniquely satisfies a set of desirable properties. We also evaluate them on multiple datasets and show that these explanations are robust to the attacks that fool SHAP and LIME. \ No newline at end of file diff --git a/data/2024/aaai/B-spine: Learning B-spline Curve Representation for Robust and Interpretable Spinal Curvature Estimation b/data/2024/aaai/B-spine: Learning B-spline Curve Representation for Robust and Interpretable Spinal Curvature Estimation new file mode 100644 index 0000000000..278d529de9 --- /dev/null +++ b/data/2024/aaai/B-spine: Learning B-spline Curve Representation for Robust and Interpretable Spinal Curvature Estimation @@ -0,0 +1 @@ +Spinal curvature estimation is important to the diagnosis and treatment of scoliosis. Existing methods face several issues such as the need for expensive annotations on the vertebral landmarks and being sensitive to the image quality. It is challenging to achieve robust estimation and obtain interpretable results, especially for low-quality images which are blurry and hazy. In this paper, we propose B-Spine, a novel deep learning pipeline to learn B-spline curve representation of the spine and estimate the Cobb angles for spinal curvature estimation from low-quality X-ray images. Given a low-quality input, a novel SegRefine network which employs the unpaired image-to-image translation is proposed to generate a high-quality spine mask from the initial segmentation result. Next, a novel mask-based B-spline prediction model is proposed to predict the B-spline curve for the spine centerline. Finally, the Cobb angles are estimated by a hybrid approach which combines the curve slope analysis and a curve-based regression model.
We conduct quantitative and qualitative comparisons with the representative and SOTA learning-based methods on the public AASCE2019 dataset and our new proposed JLU-CJUH dataset which contains more challenging low-quality images. The superior performance on both datasets shows our method can achieve both robustness and interpretability for spinal curvature estimation. \ No newline at end of file diff --git a/data/2024/aaai/BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving b/data/2024/aaai/BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving new file mode 100644 index 0000000000..d750b9c9fb --- /dev/null +++ b/data/2024/aaai/BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving @@ -0,0 +1,16 @@ +Artificial Intelligence for Theorem Proving (AITP) has given +rise to a plethora of benchmarks and methodologies, particularly in Interactive Theorem Proving (ITP). Research in the +area is fragmented, with a diverse set of approaches being +spread across several ITP systems. This presents a significant challenge to the comparison of methods, which are often +complex and difficult to replicate. +Addressing this, we present BAIT, a framework for the fair +and streamlined comparison of learning approaches in ITP. +We demonstrate BAIT’s capabilities with an in-depth comparison, across several ITP benchmarks, of state-of-the-art +architectures applied to the problem of formula embedding. +We find that Structure Aware Transformers perform particularly well, improving on techniques associated with the original problem sets. BAIT also allows us to assess the end-to-end proving performance of systems built on interactive +environments. This unified perspective reveals a novel end-to-end system that improves on prior work. We also provide +a qualitative analysis, illustrating that improved performance +is associated with more semantically-aware embeddings. By +streamlining the implementation and comparison of Machine +Learning algorithms in the ITP context, we anticipate BAIT +will be a springboard for future research. \ No newline at end of file diff --git a/data/2024/aaai/BAND: Biomedical Alert News Dataset b/data/2024/aaai/BAND: Biomedical Alert News Dataset new file mode 100644 index 0000000000..fe821f47a9 --- /dev/null +++ b/data/2024/aaai/BAND: Biomedical Alert News Dataset @@ -0,0 +1 @@ +Infectious disease outbreaks continue to pose a significant threat to human health and well-being. To improve disease surveillance and understanding of disease spread, several surveillance systems have been developed to monitor daily news alerts and social media. However, existing systems lack thorough epidemiological analysis in relation to corresponding alerts or news, largely due to the scarcity of well-annotated reports data. To address this gap, we introduce the Biomedical Alert News Dataset (BAND), which includes 1,508 samples from existing reported news articles, open emails, and alerts, as well as 30 epidemiology-related questions. These questions necessitate the model's expert reasoning abilities, thereby offering valuable insights into the outbreak of the disease. The BAND dataset brings new challenges to the NLP world, requiring better inference capability of the content and the ability to infer important information. 
We provide several benchmark tasks, including Named Entity Recognition (NER), Question Answering (QA), and Event Extraction (EE), to demonstrate existing models' capabilities and limitations in handling epidemiology-specific tasks. It is worth noting that some models may lack the human-like inference capability required to fully utilize the corpus. To the best of our knowledge, the BAND corpus is the largest corpus of well-annotated biomedical outbreak alert news with elaborately designed questions, making it a valuable resource for epidemiologists and NLP researchers alike. \ No newline at end of file diff --git a/data/2024/aaai/BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion b/data/2024/aaai/BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion new file mode 100644 index 0000000000..b284d48422 --- /dev/null +++ b/data/2024/aaai/BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion @@ -0,0 +1 @@ +Image editing approaches with diffusion models have been rapidly developed, yet their applicability is subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple conditions (e.g., mask, sketch, caption), and time-consuming fine-tuning of diffusion models. To alleviate these limitations and realize efficient real image editing, we propose a novel editing technique that only requires an input image and target text for various editing types, including non-rigid edits, without fine-tuning the diffusion model. Our method introduces three novelties: (I) A Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding to achieve fast image reconstruction without an image caption and to accelerate convergence. (II) A Progressive Transition Scheme applies progressive linear interpolation between the target text embedding and its fine-tuned version to generate a transition embedding that maintains non-rigid editing capability. (III) A Balanced Attention Module (BAM) balances the tradeoff between textual description and image semantics. By combining the self-attention map from the reconstruction process with the cross-attention map from the transition process, the guidance of target text embeddings in the diffusion process is optimized. To demonstrate the editing capability, effectiveness and efficiency of the proposed BARET, we conducted extensive qualitative and quantitative experiments. Moreover, results derived from a user study and an ablation study further demonstrate its superiority over other methods. \ No newline at end of file diff --git a/data/2024/aaai/BBScore: A Brownian Bridge Based Metric for Assessing Text Coherence b/data/2024/aaai/BBScore: A Brownian Bridge Based Metric for Assessing Text Coherence new file mode 100644 index 0000000000..f086dbe5c9 --- /dev/null +++ b/data/2024/aaai/BBScore: A Brownian Bridge Based Metric for Assessing Text Coherence @@ -0,0 +1,3 @@ +Measuring the coherence of text is a vital aspect of evaluating the quality of written content. Recent advancements in neural coherence modeling have demonstrated their efficacy in capturing entity coreference and discourse relations, thereby enhancing coherence evaluation. However, many existing methods heavily depend on static embeddings or focus narrowly on nearby context, constraining their capacity to measure the overarching coherence of long texts.
+In this paper, we posit that coherent texts inherently manifest a sequential and cohesive interplay among sentences, effectively conveying the central theme, purpose, or standpoint. To explore this abstract relationship, we introduce the "BB Score," a novel reference-free metric grounded in Brownian bridge theory for assessing text coherence. Our findings showcase that when synergized with a simple additional classification component, this metric attains a performance level comparable to state-of-the-art techniques on standard artificial discrimination tasks. +We also establish in downstream tasks that this metric effectively differentiates between human-written documents and text generated by large language models within specific domains. Furthermore, we illustrate the efficacy of this approach in detecting written styles attributed to various large language models, underscoring its potential for generalizability. In summary, we present a novel Brownian bridge coherence metric capable of measuring both local and global text coherence, while circumventing the need for end-to-end model training. This flexibility allows for its application in various downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning b/data/2024/aaai/BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning new file mode 100644 index 0000000000..9b72b62e10 --- /dev/null +++ b/data/2024/aaai/BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning @@ -0,0 +1 @@ +Correspondence pruning aims to establish reliable correspondences between two related images and recover relative camera motion. Existing approaches often employ a progressive strategy to handle the local and global contexts, with a prominent emphasis on transitioning from local to global, resulting in the neglect of interactions between different contexts. To tackle this issue, we propose a parallel context learning strategy that involves acquiring bilateral consensus for the two-view correspondence pruning task. In our approach, we design a distinctive self-attention block to capture global context and process it in parallel with the established local context learning module, which enables us to simultaneously capture both local and global consensuses. By combining these local and global consensuses, we derive the required bilateral consensus. We also design a recalibration block, reducing the influence of erroneous consensus information and enhancing the robustness of the model. The culmination of our efforts is the Bilateral Consensus Learning Network (BCLNet), which efficiently estimates camera pose and identifies inliers (true correspondences). Extensive experimental results demonstrate that our network not only surpasses state-of-the-art methods on benchmark datasets but also showcases robust generalization abilities across various feature extraction techniques. Notably, BCLNet obtains significant gains over the second-best method on the unknown outdoor dataset and markedly accelerates model training.
\ No newline at end of file diff --git a/data/2024/aaai/BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind b/data/2024/aaai/BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind new file mode 100644 index 0000000000..c438069355 --- /dev/null +++ b/data/2024/aaai/BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind @@ -0,0 +1 @@ +As a foundational component of cognitive intelligence, theory of mind (ToM) can make AI more closely resemble human thought processes, thereby enhancing its interaction and collaboration with humans. In particular, it can significantly improve a model's comprehension of videos in complex scenes. However, current video question answering (VideoQA) datasets focus on studying causal reasoning within events, with few of them genuinely incorporating human ToM. Consequently, there is a lack of development in ToM reasoning tasks within the area of VideoQA. This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM. BDIQA is inspired by the cognitive development of children's ToM and addresses the current deficiencies in machine ToM within datasets and tasks. Specifically, it offers tasks at two difficulty levels, assessing Belief, Desire and Intention (BDI) reasoning in both simple and complex scenarios. We conduct evaluations on several mainstream VideoQA methods and diagnose their capabilities with zero-shot, few-shot and supervised learning. We find that the performance of pre-trained models on cognitive reasoning tasks remains unsatisfactory. To counter this challenge, we undertake thorough analysis and experimentation, ultimately presenting two guidelines, derived from ablation analysis, for enhancing cognitive reasoning. \ No newline at end of file diff --git a/data/2024/aaai/BERTground: A Transformer-Based Model of Background Spectra on the ISS-Based NICER Space Telescope b/data/2024/aaai/BERTground: A Transformer-Based Model of Background Spectra on the ISS-Based NICER Space Telescope new file mode 100644 index 0000000000..b7de4b8773 --- /dev/null +++ b/data/2024/aaai/BERTground: A Transformer-Based Model of Background Spectra on the ISS-Based NICER Space Telescope @@ -0,0 +1 @@ +The Neutron star Interior Composition Explorer (NICER) is an International Space Station (ISS)-based space telescope developed by NASA and devoted to the study of high-energy X-ray sources in the universe, including but not limited to neutron stars, pulsars, and black holes in stellar systems and active galactic nuclei (AGN). One prominent problem with NICER observations is the highly variable background spectra, which obscure the actual signals of astrophysical sources and negatively affect scientific analysis of the targets. Therefore, obtaining accurate estimations of the background spectra is crucial for filtering the noise and facilitating better scientific discoveries of new astronomical objects. In this paper, we propose the very first Deep Neural Network architecture to model the NICER background spectra variation using information about the spacecraft and telescope associated with each observation. In particular, we develop a BERT-based architecture with tokenizers applied to different groups of features in our tabular dataset. We also introduce an adapted Tabular Deep Residual Network architecture as the predictor following the Transformer modules in our network.
We show that our model outperforms the current state-of-the-art background model developed by the NICER team on most evaluation metrics. Finally, we discuss pathways and future work for the deployment of this model in NASA’s next versions of the HEASARC software packages. \ No newline at end of file diff --git a/data/2024/aaai/BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios b/data/2024/aaai/BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios new file mode 100644 index 0000000000..ea788f9cc7 --- /dev/null +++ b/data/2024/aaai/BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios @@ -0,0 +1 @@ +Existing LiDAR-based 3D object detection methods for autonomous driving scenarios mainly adopt the training-from-scratch paradigm. Unfortunately, this paradigm heavily relies on large-scale labeled data, whose collection can be expensive and time-consuming. Self-supervised pre-training is an effective and desirable way to alleviate this dependence on extensive annotated data. In this work, we present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving. Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder to learn feature representations from a BEV perspective and avoid complex decoder design during pre-training. Furthermore, we introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder with fine-tuning for masked point cloud inputs. Based on the property of outdoor point clouds in autonomous driving scenarios, i.e., that the point clouds of distant objects are sparser, we propose point density prediction to enable the 3D encoder to learn location information, which is essential for object detection. Experimental results show that BEV-MAE surpasses prior state-of-the-art self-supervised methods and achieves favorable pre-training efficiency. Furthermore, based on TransFusion-L, BEV-MAE achieves new state-of-the-art LiDAR-based 3D object detection results, with 73.6 NDS and 69.6 mAP on the nuScenes benchmark. The source code will be released at https://github.com/VDIGPKU/BEV-MAE. \ No newline at end of file diff --git a/data/2024/aaai/BLADE: Box-Level Supervised Amodal Segmentation through Directed Expansion b/data/2024/aaai/BLADE: Box-Level Supervised Amodal Segmentation through Directed Expansion new file mode 100644 index 0000000000..ea0c782953 --- /dev/null +++ b/data/2024/aaai/BLADE: Box-Level Supervised Amodal Segmentation through Directed Expansion @@ -0,0 +1 @@ +Perceiving the complete shape of occluded objects is essential for human and machine intelligence. While the amodal segmentation task is to predict the complete mask of partially occluded objects, it is time-consuming and labor-intensive to annotate the pixel-level ground truth amodal masks. Box-level supervised amodal segmentation addresses this challenge by relying solely on ground truth bounding boxes and instance classes as supervision, thereby alleviating the need for exhaustive pixel-level annotations. Nevertheless, current box-level methodologies suffer from notable limitations, producing low-resolution masks and imprecise boundaries and thus failing to meet the demands of practical real-world applications.
We present a novel solution to tackle this problem by introducing a directed expansion approach from visible masks to corresponding amodal masks. Our approach involves a hybrid end-to-end network based on the overlapping region, i.e., the area where different instances intersect. Diverse segmentation strategies are applied to overlapping regions and non-overlapping regions according to their distinct characteristics. To guide the expansion of visible masks, we introduce an elaborately-designed connectivity loss for overlapping regions, which leverages correlations with visible masks and facilitates accurate amodal segmentation. Experiments are conducted on several challenging datasets and the results show that our proposed method can outperform existing state-of-the-art methods by large margins. \ No newline at end of file diff --git a/data/2024/aaai/BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions b/data/2024/aaai/BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions new file mode 100644 index 0000000000..85797206aa --- /dev/null +++ b/data/2024/aaai/BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions @@ -0,0 +1 @@ +Vision Language Models (VLMs), which extend Large Language Models (LLMs) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited by the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model in capturing intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76% on the OCR-VQA benchmark) and in undertaking general (not particularly text-rich) VQA benchmarks (up to 7.9% on the Visual Spatial Reasoning benchmark), and achieves a 17.72% overall improvement in a comprehensive multimodal LLM benchmark (MME), compared to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 11 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.
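A minimal PyTorch-style sketch of the dual visual prompt idea described in the BLIVA abstract above: learned query embeddings and linearly projected patch embeddings are concatenated into one soft prompt for the LLM. The module names, dimensions and single-linear-projection design here are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class DualVisualPrompt(nn.Module):
    # Combines Q-Former-style query embeddings with raw encoder patch embeddings
    # into a single soft prompt for an LLM (hypothetical dimensions).
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.query_proj = nn.Linear(vis_dim, llm_dim)  # projects query embeddings
        self.patch_proj = nn.Linear(vis_dim, llm_dim)  # projects patch embeddings

    def forward(self, query_embeds, patch_embeds):
        # query_embeds: (B, num_queries, vis_dim); patch_embeds: (B, num_patches, vis_dim)
        soft_prompt = torch.cat(
            [self.query_proj(query_embeds), self.patch_proj(patch_embeds)], dim=1
        )
        return soft_prompt  # prepended to the text token embeddings fed to the LLM

In such a design the patch branch retains fine-grained spatial detail that a fixed number of query tokens may drop, which is the motivation the abstract gives for the extra projection.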
\ No newline at end of file diff --git a/data/2024/aaai/BLiRF: Bandlimited Radiance Fields for Dynamic Scene Modeling b/data/2024/aaai/BLiRF: Bandlimited Radiance Fields for Dynamic Scene Modeling new file mode 100644 index 0000000000..2a2606d530 --- /dev/null +++ b/data/2024/aaai/BLiRF: Bandlimited Radiance Fields for Dynamic Scene Modeling @@ -0,0 +1 @@ +Inferring the 3D structure of a non-rigid dynamic scene from a single moving camera is an under-constrained problem. Inspired by the remarkable progress of neural radiance fields (NeRFs) in photo-realistic novel view synthesis of static scenes, they have also been extended to dynamic settings. Such methods heavily rely on implicit neural priors to regularize the problem. In this work, we take a step back and investigate how current implementations may entail deleterious effects, including limited expressiveness, entanglement of light and density fields, and sub-optimal motion localization. Further, we devise a factorisation-based framework that represents the scene as a composition of bandlimited, high-dimensional signals. We demonstrate compelling results across complex dynamic scenes that involve changes in lighting, texture and long-range dynamics. \ No newline at end of file diff --git a/data/2024/aaai/BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining b/data/2024/aaai/BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining new file mode 100644 index 0000000000..a4bf7a7324 --- /dev/null +++ b/data/2024/aaai/BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining @@ -0,0 +1 @@ +The current research direction in generative models, such as the recently developed GPT4, aims to find relevant knowledge information for multimodal and multilingual inputs to provide answers. Under these research circumstances, the demand for multilingual evaluation of visual question answering (VQA) tasks, a representative task of multimodal systems, has increased. Accordingly, we propose a bilingual outside-knowledge VQA (BOK-VQA) dataset in this study that can be extended to a multilingual setting. The proposed data include 17K images, 17K question-answer pairs for both Korean and English and 280K instances of knowledge information related to question-answer content. We also present a framework that can effectively inject knowledge information into a VQA system by pretraining the knowledge information of BOK-VQA data in the form of graph embeddings. Finally, through in-depth analysis, we demonstrate the actual effect of the knowledge information contained in the constructed training data on VQA. \ No newline at end of file diff --git a/data/2024/aaai/BVT-IMA: Binary Vision Transformer with Information-Modified Attention b/data/2024/aaai/BVT-IMA: Binary Vision Transformer with Information-Modified Attention new file mode 100644 index 0000000000..d6022ef1fd --- /dev/null +++ b/data/2024/aaai/BVT-IMA: Binary Vision Transformer with Information-Modified Attention @@ -0,0 +1 @@ +As a compression method that can significantly reduce computation and memory costs, model binarization has been extensively studied for convolutional neural networks. However, the recently popular vision transformer models pose new challenges to such a technique, in which the binarized models suffer from serious performance drops.
In this paper, an attention shift is observed in the binary multi-head self-attention module, which can influence the information fusion between tokens and thus hurt the model performance. From the perspective of information theory, we find a correlation between attention scores and the information quantity, further indicating that a reason for such a phenomenon may be the loss of information quantity induced by the constant moduli of binarized tokens. Finally, we reveal the information quantity hidden in the attention maps of binary vision transformers and propose a simple approach that modifies the attention values with look-up information tables so as to improve model performance. Extensive experiments on CIFAR-100/TinyImageNet/ImageNet-1k demonstrate the effectiveness of the proposed information-modified attention on binary vision transformers. \ No newline at end of file diff --git a/data/2024/aaai/BaCon: Boosting Imbalanced Semi-supervised Learning via Balanced Feature-Level Contrastive Learning b/data/2024/aaai/BaCon: Boosting Imbalanced Semi-supervised Learning via Balanced Feature-Level Contrastive Learning new file mode 100644 index 0000000000..c161e3e1e8 --- /dev/null +++ b/data/2024/aaai/BaCon: Boosting Imbalanced Semi-supervised Learning via Balanced Feature-Level Contrastive Learning @@ -0,0 +1 @@ +Semi-supervised Learning (SSL) reduces the need for extensive annotations in deep learning, but the more realistic challenge of imbalanced data distribution in SSL remains largely unexplored. In Class Imbalanced Semi-supervised Learning (CISSL), the bias introduced by unreliable pseudo-labels can be exacerbated by imbalanced data distributions. Most existing methods address this issue at the instance level through reweighting or resampling, but their performance is heavily limited by their reliance on biased backbone representations. Some other methods do perform feature-level adjustments, such as feature blending, but might introduce unfavorable noise. In this paper, we discuss the benefit of a more balanced feature distribution for the CISSL problem, and further propose a Balanced Feature-Level Contrastive Learning method (BaCon). Our method directly regularizes the distribution of instances' representations in a well-designed contrastive manner. Specifically, class-wise feature centers are computed as the positive anchors, while negative anchors are selected by a straightforward yet effective mechanism. A distribution-related temperature adjustment is leveraged to control the class-wise contrastive degrees dynamically. Our method demonstrates its effectiveness through comprehensive experiments on the CIFAR10-LT, CIFAR100-LT, STL10-LT, and SVHN-LT datasets across various settings. For example, BaCon surpasses the instance-level method FixMatch-based ABC on CIFAR10-LT with a 1.21% accuracy improvement, and outperforms the state-of-the-art feature-level method CoSSL on CIFAR100-LT with a 0.63% accuracy improvement. Under more extreme degrees of imbalance, BaCon also shows better robustness than other methods.
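A minimal sketch of a feature-level contrastive loss with class-wise centers as positive anchors and a distribution-related temperature, loosely following the BaCon abstract above; the exact loss form, negative-anchor selection and temperature schedule shown here are assumptions, not the paper's definitions.

import torch
import torch.nn.functional as F

def balanced_center_contrast(features, labels, num_classes, base_tau=0.1):
    # features: (B, D) backbone representations; labels: (B,) class indices.
    features = F.normalize(features, dim=1)
    centers = torch.stack([
        features[labels == c].mean(dim=0) if (labels == c).any()
        else features.new_zeros(features.size(1))
        for c in range(num_classes)
    ])
    centers = F.normalize(centers, dim=1)
    counts = torch.bincount(labels, minlength=num_classes).clamp(min=1).float()
    tau = base_tau * (counts.max() / counts).sqrt()  # rarer class -> softer temperature (assumed schedule)
    logits = features @ centers.t() / tau[labels].unsqueeze(1)
    return F.cross_entropy(logits, labels)  # pulls each sample toward its class center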
\ No newline at end of file diff --git a/data/2024/aaai/Backdoor Adjustment via Group Adaptation for Debiased Coupon Recommendations b/data/2024/aaai/Backdoor Adjustment via Group Adaptation for Debiased Coupon Recommendations new file mode 100644 index 0000000000..69edaac9c7 --- /dev/null +++ b/data/2024/aaai/Backdoor Adjustment via Group Adaptation for Debiased Coupon Recommendations @@ -0,0 +1,2 @@ +Accurate prediction of coupon usage is crucial for promoting user consumption through targeted coupon recommendations. However, in real-world coupon recommendations, the coupon allocation process is not solely determined by the model trained on the historical interaction data but is also interfered with by marketing tactics intended to fulfill specific commercial goals. This interference creates an imbalance in the interactions, which causes the data to deviate from the user's natural preferences. We refer to this deviation as the matching bias. Such biased interaction data affects the efficacy of the model, and thus it is necessary to employ debiasing techniques to prevent any negative impact. +We investigate the mitigation of matching bias in coupon recommendations from a causal-effect perspective. By treating the attributes of users and coupons associated with marketing tactics as confounders, we find that the confounders open a backdoor path between user-coupon matching and conversion, which introduces spurious correlations. To remove this harmful effect, we propose a novel training paradigm named Backdoor Adjustment via Group Adaptation (BAGA) for debiased coupon recommendations, which performs intervened training and inference, i.e., separately modeling each user-coupon group pair. However, modeling all possible group pairs greatly increases the computational complexity and cost. To address the efficiency challenge, we further present a simple but effective dual-tower multi-task framework and leverage the Customized Gate Control (CGC) model architecture, which models each user and coupon group with a separate expert module. We instantiate BAGA on five representative models: FM, DNN, NCF, MASKNET, and DEEPFM, and conduct comprehensive offline and online experiments to demonstrate the efficacy of our proposed paradigm. \ No newline at end of file diff --git a/data/2024/aaai/Backdoor Attacks via Machine Unlearning b/data/2024/aaai/Backdoor Attacks via Machine Unlearning new file mode 100644 index 0000000000..9f618107a9 --- /dev/null +++ b/data/2024/aaai/Backdoor Attacks via Machine Unlearning @@ -0,0 +1 @@ +As a new paradigm to erase data from a model and protect user privacy, machine unlearning has drawn significant attention. However, existing studies on machine unlearning mainly focus on its effectiveness and efficiency, neglecting the security challenges introduced by this technique. In this paper, we aim to bridge this gap and study the possibility of conducting malicious attacks leveraging machine unlearning. Specifically, we consider the backdoor attack via machine unlearning, where an attacker seeks to inject a backdoor into the unlearned model by submitting malicious unlearning requests, so that the prediction made by the unlearned model can be changed when a particular trigger is present. In our study, we propose two attack approaches. The first attack approach does not require the attacker to poison any training data of the model. The attacker can achieve the attack goal only by requesting to unlearn a small subset of their contributed training data.
The second approach allows the attacker to poison a few training instances with a pre-defined trigger upfront, and then activate the attack via submitting a malicious unlearning request. Both attack approaches are proposed with the goal of maximizing the attack utility while ensuring attack stealthiness. The effectiveness of the proposed attacks is demonstrated with different machine unlearning algorithms as well as different models on different datasets. \ No newline at end of file diff --git a/data/2024/aaai/Backpropagation Through Agents b/data/2024/aaai/Backpropagation Through Agents new file mode 100644 index 0000000000..9614d816dc --- /dev/null +++ b/data/2024/aaai/Backpropagation Through Agents @@ -0,0 +1 @@ +A fundamental challenge in multi-agent reinforcement learning (MARL) is to learn the joint policy in an extremely large search space, which grows exponentially with the number of agents. Moreover, fully decentralized policy factorization significantly restricts the search space, which may lead to sub-optimal policies. In contrast, the auto-regressive joint policy can represent a much richer class of joint policies by factorizing the joint policy into the product of a series of conditional individual policies. While such factorization introduces the action dependency among agents explicitly in sequential execution, it does not take full advantage of the dependency during learning. In particular, the subsequent agents do not give the preceding agents feedback about their decisions. In this paper, we propose a new framework Back-Propagation Through Agents (BPTA) that directly accounts for both agents' own policy updates and the learning of their dependent counterparts. This is achieved by propagating the feedback through action chains. With the proposed framework, our Bidirectional Proximal Policy Optimisation (BPPO) outperforms the state-of-the-art methods. Extensive experiments on matrix games, StarCraftII v2, Multi-agent MuJoCo, and Google Research Football demonstrate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Backward Responsibility in Transition Systems Using General Power Indices b/data/2024/aaai/Backward Responsibility in Transition Systems Using General Power Indices new file mode 100644 index 0000000000..cba62a5d7a --- /dev/null +++ b/data/2024/aaai/Backward Responsibility in Transition Systems Using General Power Indices @@ -0,0 +1,5 @@ +To improve reliability and the understanding of AI systems, there is increasing interest in the use of formal methods, e.g. model checking. Model checking tools produce a counterexample when a model does not satisfy a property. Understanding these counterexamples is critical for efficient debugging, as it allows the developer to focus on the parts of the program that caused the issue. + +To this end, we present a new technique that ascribes a responsibility value to each state in a transition system that does not satisfy a given safety property. The value is higher if the non-deterministic choices in a state have more power to change the outcome, given the behaviour observed in the counterexample. For this, we employ a concept from cooperative game theory – namely general power indices, such as the Shapley value – to compute the responsibility of the states. + +We present an optimistic and pessimistic version of responsibility that differ in how they treat the states that do not lie on the counterexample. 
We give a characterisation of optimistic responsibility that leads to an efficient algorithm for it and show computational hardness of the pessimistic version. We also present a tool to compute responsibility and show how a stochastic algorithm can be used to approximate responsibility in larger models. These methods can be deployed in the design phase, at runtime and at inspection time to gain insights into causal relations within the behavior of AI systems. \ No newline at end of file diff --git a/data/2024/aaai/Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection b/data/2024/aaai/Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection new file mode 100644 index 0000000000..0e7ca8da75 --- /dev/null +++ b/data/2024/aaai/Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection @@ -0,0 +1 @@ +Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT 3.5 could generally expose fake news and provide desirable multi-perspective rationales but still underperforms the basic SLM, fine-tuned BERT. Our subsequent analysis attributes such a gap to the LLM's inability to select and integrate rationales properly to reach a conclusion. Based on these findings, we propose that current LLMs may not substitute for fine-tuned SLMs in fake news detection but can be a good advisor for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLMs' rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which serves cost-sensitive scenarios without querying LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods, including SLM-based, LLM-based, and combinations of small and large language models.
In contrast to the previous methods that utilize sample-agnostic trigger patterns, BadRL dynamically generates distinct trigger patterns based on targeted state observations, thereby enhancing its effectiveness. Theoretical analysis shows that the targeted backdoor attack is always viable and remains stealthy under specific assumptions. Empirical results on various classic RL tasks illustrate that BadRL can substantially degrade the performance of a victim agent with minimal poisoning efforts (0.003% of total training steps) during training and infrequent attacks during testing. Code is available at: https://github.com/7777777cc/code. \ No newline at end of file diff --git a/data/2024/aaai/BadSAM: Exploring Security Vulnerabilities of SAM via Backdoor Attacks (Student Abstract) b/data/2024/aaai/BadSAM: Exploring Security Vulnerabilities of SAM via Backdoor Attacks (Student Abstract) new file mode 100644 index 0000000000..35464a15a4 --- /dev/null +++ b/data/2024/aaai/BadSAM: Exploring Security Vulnerabilities of SAM via Backdoor Attacks (Student Abstract) @@ -0,0 +1 @@ +Image segmentation is foundational to computer vision applications, and the Segment Anything Model (SAM) has become a leading base model for these tasks. However, SAM falters in specialized downstream challenges, leading to various customized SAM models. We introduce BadSAM, a backdoor attack tailored for SAM, revealing that customized models can harbor malicious behaviors. Using the CAMO dataset, we confirm BadSAM's efficacy and identify SAM vulnerabilities. This study paves the way for the development of more secure and customizable vision foundation models. \ No newline at end of file diff --git a/data/2024/aaai/Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation b/data/2024/aaai/Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation new file mode 100644 index 0000000000..bd640ed7b1 --- /dev/null +++ b/data/2024/aaai/Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation @@ -0,0 +1 @@ +Ensuring the safety of Reinforcement Learning (RL) is crucial for its deployment in real-world applications. Nevertheless, managing the trade-off between reward and safety during exploration presents a significant challenge. Improving reward performance through policy adjustments may adversely affect safety performance. In this study, we aim to address this conflicting relation by leveraging the theory of gradient manipulation. Initially, we analyze the conflict between reward and safety gradients. Subsequently, we tackle the balance between reward and safety optimization by proposing a soft switching policy optimization method, for which we provide convergence analysis. Based on our theoretical examination, we provide a safe RL framework to overcome the aforementioned challenge, and we develop a Safety-MuJoCo Benchmark to assess the performance of safe RL algorithms. Finally, we evaluate the effectiveness of our method on the Safety-MuJoCo Benchmark and a popular safe benchmark, Omnisafe. Experimental results demonstrate that our algorithms outperform several state-of-the-art baselines in terms of balancing reward and safety optimization. 
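To make the gradient-manipulation perspective of the safe-RL abstract above concrete, here is a generic conflict-handling rule: project the reward gradient off the safety gradient when their inner product is negative, and prioritise safety when the constraint is violated. This is an illustrative recipe only; the paper's soft switching scheme may differ.

import torch

def combine_reward_safety_grads(g_reward, g_safety, safety_violated):
    # g_reward, g_safety: flattened policy gradients for the reward and safety objectives.
    if safety_violated:
        return g_safety                      # constraint violated: focus on restoring safety
    dot = torch.dot(g_reward, g_safety)
    if dot < 0:                              # conflicting update directions
        # remove the component of the reward gradient that opposes the safety gradient
        g_reward = g_reward - dot / (g_safety.norm() ** 2 + 1e-12) * g_safety
    return g_reward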
\ No newline at end of file diff --git a/data/2024/aaai/Balancing Humans and Machines: A Study on Integration Scale and Its Impact on Collaborative Performance b/data/2024/aaai/Balancing Humans and Machines: A Study on Integration Scale and Its Impact on Collaborative Performance new file mode 100644 index 0000000000..947a403da9 --- /dev/null +++ b/data/2024/aaai/Balancing Humans and Machines: A Study on Integration Scale and Its Impact on Collaborative Performance @@ -0,0 +1 @@ +In the evolving artificial intelligence domain, hybrid human-machine systems have emerged as a transformative research area. While many studies have concentrated on individual human-machine interactions, there is a lack of focus on multi-human and multi-machine dynamics. This paper delves into these nuances by introducing a novel statistical framework that discerns integration accuracy in terms of precision and diversity. Empirical studies reveal that performance surges consistently with scale, either in human or machine settings. However, hybrid systems present complexities. Their performance is intricately tied to the human-to-machine ratio. Interestingly, as the scale expands, integration performance growth isn't limitless. It reaches a threshold influenced by model diversity. This introduces a pivotal `knee point', signifying the optimal balance between performance and scale. This knowledge is vital for resource allocation in practical applications. Grounded in rigorous evaluations using public datasets, our findings emphasize the framework's robustness in refining integrated systems. \ No newline at end of file diff --git a/data/2024/aaai/Barely Supervised Learning for Graph-Based Fraud Detection b/data/2024/aaai/Barely Supervised Learning for Graph-Based Fraud Detection new file mode 100644 index 0000000000..448ae0902e --- /dev/null +++ b/data/2024/aaai/Barely Supervised Learning for Graph-Based Fraud Detection @@ -0,0 +1 @@ +In recent years, graph-based fraud detection methods have garnered increasing attention for their superior ability to tackle the issue of camouflage in fraudulent scenarios. However, these methods often rely on a substantial proportion of samples as the training set, disregarding the reality of scarce annotated samples in real-life scenarios. As a theoretical framework within semi-supervised learning, the principle of consistency regularization posits that unlabeled samples should be classified into the same category as their own perturbations. Inspired by this principle, this study incorporates unlabeled samples as an auxiliary during model training, designing a novel barely supervised learning method to address the challenge of limited annotated samples in fraud detection. Specifically, to tackle the issue of camouflage in fraudulent scenarios, we employ disentangled representation learning based on edge information for a small subset of annotated nodes. This approach partitions node features into three distinct components representing different connected edges, providing a foundation for the subsequent augmentation of unlabeled samples. For the unlabeled nodes used in auxiliary training, we apply both strong and weak augmentation and design regularization losses to enhance the detection performance of the model in the context of extremely limited labeled samples. Across five publicly available datasets, the proposed model showcases its superior detection capability over baseline models. 
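For the consistency-regularization principle cited in the fraud-detection abstract above, a minimal FixMatch-style sketch: pseudo-label a weakly augmented view and train the strongly augmented view to match it. The confidence threshold and exact loss form are assumptions for illustration, not the paper's design.

import torch
import torch.nn.functional as F

def consistency_loss(model, x_weak, x_strong, threshold=0.95):
    # x_weak / x_strong: weakly and strongly augmented views of the same unlabeled samples.
    with torch.no_grad():
        probs = F.softmax(model(x_weak), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()   # keep only confident pseudo-labels
    logits_strong = model(x_strong)
    return (F.cross_entropy(logits_strong, pseudo, reduction="none") * mask).mean()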
\ No newline at end of file diff --git a/data/2024/aaai/Batch Normalization Is Blind to the First and Second Derivatives of the Loss b/data/2024/aaai/Batch Normalization Is Blind to the First and Second Derivatives of the Loss new file mode 100644 index 0000000000..7be3fe3eea --- /dev/null +++ b/data/2024/aaai/Batch Normalization Is Blind to the First and Second Derivatives of the Loss @@ -0,0 +1 @@ +We prove that, in the Taylor series expansion of the loss function, the BN operation blocks the influence of the first-order term and most of the influence of the second-order term of the loss. We also find that such a problem is caused by the standardization phase of the BN operation. We believe that proving the blocking of certain loss terms provides an analytic perspective on potential defects of a deep model with BN operations, although the blocking problem is not fully equivalent to significant damage in all tasks on benchmark datasets. Experiments show that the BN operation significantly affects feature representations in specific tasks. \ No newline at end of file diff --git a/data/2024/aaai/Bayesian Inference with Complex Knowledge Graph Evidence b/data/2024/aaai/Bayesian Inference with Complex Knowledge Graph Evidence new file mode 100644 index 0000000000..87e4b0617e --- /dev/null +++ b/data/2024/aaai/Bayesian Inference with Complex Knowledge Graph Evidence @@ -0,0 +1 @@ +Knowledge Graphs (KGs) provide a widely used format for representing entities and their relationships and have found use in diverse applications including question answering and recommendation. A majority of current research on KG inference has focused on reasoning with atomic facts (triples) and has disregarded the possibility of making complex evidential observations involving logical operators (negation, conjunction, disjunction) and quantifiers (existential, universal). Further, while the application of complex evidence has been explored in KG-based query answering (KGQA) research, in many practical online settings, observations are made sequentially. For example, in KGQA, additional context may be incrementally suggested to narrow down the answer. Or, in interactive recommendation, user critiques may be expressed sequentially in order to narrow down a set of preferred items. Both settings are indicative of information filtering or tracking tasks that are reminiscent of belief tracking in Bayesian inference. In fact, in this paper, we precisely cast the problem of belief tracking over unknown KG entities given incremental complex KG evidence as a Bayesian filtering problem. Specifically, we leverage Knowledge-based Model Construction (KBMC) over the logical KG evidence to instantiate a Markov Random Field (MRF) likelihood representation to perform closed-form Bayesian inference with complex KG evidence (BIKG). We experimentally evaluate BIKG in incremental KGQA and interactive recommendation tasks, demonstrating that it outperforms non-incremental methodologies and leads to better incorporation of conjunctive evidence vs. existing complex KGQA methods like CQD that leverage fuzzy T-norm operators. Overall, this work demonstrates a novel, efficient, and unified perspective of logic, KGs, and online inference through the lens of closed-form BIKG.
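For reference, the Batch Normalization abstract above appeals to the standard second-order Taylor expansion of the loss around the current features; in generic form,

L(f + \Delta f) \approx L(f) + \nabla_f L^{\top} \Delta f + \tfrac{1}{2} \Delta f^{\top} H \Delta f, \qquad H = \nabla_f^{2} L,

where the expansion itself is standard, while the claim that the standardization phase of BN blocks the influence of the first-order term and most of the second-order term is the paper's result.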
\ No newline at end of file diff --git a/data/2024/aaai/Behavioral Recognition of Skeletal Data Based on Targeted Dual Fusion Strategy b/data/2024/aaai/Behavioral Recognition of Skeletal Data Based on Targeted Dual Fusion Strategy new file mode 100644 index 0000000000..2874c05bb5 --- /dev/null +++ b/data/2024/aaai/Behavioral Recognition of Skeletal Data Based on Targeted Dual Fusion Strategy @@ -0,0 +1 @@ +The deployment of a multi-stream fusion strategy for behavioral recognition from skeletal data can extract complementary features from different information streams and improve recognition accuracy, but it suffers from high model complexity and a large number of parameters. Besides, existing multi-stream methods using a fixed adjacency matrix homogenize the model’s discrimination process across diverse actions, reducing the actual gain of the multi-stream model. Finally, attention mechanisms are commonly applied to the multi-dimensional features, including the spatial, temporal and channel dimensions. But their attention scores are typically fused in a concatenated manner, leading to the neglect of the interrelation between joints in complex actions. To alleviate these issues, the Front-Rear dual Fusion Graph Convolutional Network (FRF-GCN) is proposed to provide a lightweight model based on skeletal data. Targeted adjacency matrices are also designed for different front fusion streams, allowing the model to focus on actions of varying magnitudes. Simultaneously, the mechanism of Spatial-Temporal-Channel Parallel Attention (STC-P), which processes attention in parallel and places greater emphasis on useful information, is proposed to further improve the model’s performance. FRF-GCN demonstrates significant competitiveness compared to the current state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120 and Kinetics-Skeleton 400 datasets. Our code is available at: https://github.com/sunbeam-kkt/FRF-GCN-master. \ No newline at end of file diff --git a/data/2024/aaai/BeliefFlow: A Framework for Logic-Based Belief Diffusion via Iterated Belief Change b/data/2024/aaai/BeliefFlow: A Framework for Logic-Based Belief Diffusion via Iterated Belief Change new file mode 100644 index 0000000000..899e0994da --- /dev/null +++ b/data/2024/aaai/BeliefFlow: A Framework for Logic-Based Belief Diffusion via Iterated Belief Change @@ -0,0 +1 @@ +This paper presents BeliefFlow, a novel framework for representing how logical beliefs spread among interacting agents within a network. In a Belief Flow Network (BFN), agents communicate asynchronously. The agents' beliefs are represented using epistemic states, which encompass their current beliefs and conditional beliefs guiding future changes. When communication occurs between two connected agents, the receiving agent changes its epistemic state using an improvement operator, a well-known type of rational iterated belief change operator that generalizes belief revision operators. We show that BFNs satisfy appealing properties, leading to two significant outcomes. First, in any BFN with strong network connectivity, the beliefs of all agents converge towards a global consensus. Second, within any BFN, we show that it is possible to compute an optimal strategy for influencing the global beliefs. This strategy, which involves controlling the beliefs of a minimal number of agents through bribery, can be identified from the topology of the network and can be computed in polynomial time.
\ No newline at end of file diff --git a/data/2024/aaai/Benchmarking Large Language Models in Retrieval-Augmented Generation b/data/2024/aaai/Benchmarking Large Language Models in Retrieval-Augmented Generation new file mode 100644 index 0000000000..61927d7990 --- /dev/null +++ b/data/2024/aaai/Benchmarking Large Language Models in Retrieval-Augmented Generation @@ -0,0 +1 @@ +Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs. \ No newline at end of file diff --git a/data/2024/aaai/Benchmarking Large Language Models on Controllable Generation under Diversified Instructions b/data/2024/aaai/Benchmarking Large Language Models on Controllable Generation under Diversified Instructions new file mode 100644 index 0000000000..178b02fad8 --- /dev/null +++ b/data/2024/aaai/Benchmarking Large Language Models on Controllable Generation under Diversified Instructions @@ -0,0 +1 @@ +While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of LLMs. To fill this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraint-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate on the candidate task taxonomy with even finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further developments. Different from existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time.
We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints and showing that there is still a significant gap between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval. \ No newline at end of file diff --git a/data/2024/aaai/BertRLFuzzer: A BERT and Reinforcement Learning Based Fuzzer (Student Abstract) b/data/2024/aaai/BertRLFuzzer: A BERT and Reinforcement Learning Based Fuzzer (Student Abstract) new file mode 100644 index 0000000000..c404f012cc --- /dev/null +++ b/data/2024/aaai/BertRLFuzzer: A BERT and Reinforcement Learning Based Fuzzer (Student Abstract) @@ -0,0 +1 @@ +We present a novel tool, BertRLFuzzer, a BERT and Reinforcement Learning (RL) based fuzzer aimed at finding security vulnerabilities in Web applications. BertRLFuzzer works as follows: given a set of seed inputs, the fuzzer performs grammar-adhering and attack-provoking mutation operations on them to generate candidate attack vectors. The key insight of BertRLFuzzer is the use of RL with a BERT model as an agent to guide the fuzzer to efficiently learn grammar-adhering and attack-provoking mutation operators. To establish the efficacy of BertRLFuzzer, we compare it against a total of 13 black-box and white-box fuzzers over a benchmark of 9 victim websites with over 16K LOC. We observed a significant improvement relative to the nearest competing tool in terms of time to first attack (54% less), new vulnerabilities found (17 new vulnerabilities), and attack rate (4.4% more attack vectors generated). \ No newline at end of file diff --git a/data/2024/aaai/Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling b/data/2024/aaai/Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling new file mode 100644 index 0000000000..86ac6b420d --- /dev/null +++ b/data/2024/aaai/Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling @@ -0,0 +1 @@ +Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive and time-consuming. To save labor and costs, researchers usually perform human evaluation on a small subset of data sampled from the whole dataset in practice. However, different selected subsets will lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold-standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples for obtaining a more correct inter-system ranking. Experimental results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate that CASF achieves 93.18% top-ranked system recognition accuracy and ranks first or second on 90.91% of the human metrics, with an overall inter-system ranking Kendall correlation of 0.83. Code and data are publicly available online.
\ No newline at end of file diff --git a/data/2024/aaai/Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory b/data/2024/aaai/Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory new file mode 100644 index 0000000000..ae5bef387a --- /dev/null +++ b/data/2024/aaai/Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory @@ -0,0 +1 @@ +A major limitation on the scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute. Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy. Experiments with language modeling tasks show perplexity improvement as the number of processed input segments increases. These results underscore the effectiveness of our method, which has significant potential to enhance long-term dependency handling in natural language understanding and generation tasks, as well as enable large-scale context processing for memory-intensive applications. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Entities: A Large-Scale Multi-Modal Knowledge Graph with Triplet Fact Grounding b/data/2024/aaai/Beyond Entities: A Large-Scale Multi-Modal Knowledge Graph with Triplet Fact Grounding new file mode 100644 index 0000000000..a8cd58570b --- /dev/null +++ b/data/2024/aaai/Beyond Entities: A Large-Scale Multi-Modal Knowledge Graph with Triplet Fact Grounding @@ -0,0 +1 @@ +Much effort has been devoted to building multi-modal knowledge graphs by visualizing entities on images, while ignoring the multi-modal information of the relations between entities. Hence, in this paper, we aim to construct a new large-scale multi-modal knowledge graph with triplet facts grounded on images that reflect not only entities but also their relations. To achieve this purpose, we propose a novel pipeline method, including triplet fact filtering, image retrieving, entity-based image filtering, relation-based image filtering, and image clustering. In this way, a multi-modal knowledge graph named ImgFact is constructed, which contains 247,732 triplet facts and 3,730,805 images. In experiments, the manual and automatic evaluations prove the reliable quality of our ImgFact. We further use the obtained images to enhance model performance on two tasks. In particular, the model optimized by our ImgFact achieves an impressive 8.38% and 9.87% improvement over the solutions enhanced by an existing multi-modal knowledge graph and VisualChatGPT in F1 of relation classification. We release ImgFact and its instructions at https://github.com/kleinercubs/ImgFact. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Expected Return: Accounting for Policy Reproducibility When Evaluating Reinforcement Learning Algorithms b/data/2024/aaai/Beyond Expected Return: Accounting for Policy Reproducibility When Evaluating Reinforcement Learning Algorithms new file mode 100644 index 0000000000..9b551946e0 --- /dev/null +++ b/data/2024/aaai/Beyond Expected Return: Accounting for Policy Reproducibility When Evaluating Reinforcement Learning Algorithms @@ -0,0 +1,2 @@ +Many applications in Reinforcement Learning (RL) have noise or stochasticity present in the environment.
Beyond their impact on learning, these uncertainties lead the exact same policy to perform differently, i.e., yield different returns, from one roll-out to another. Common evaluation procedures in RL summarise the consequent return distributions using solely the expected return, which does not account for the spread of the distribution. Our work defines this spread as policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications. We highlight that existing procedures that only use the expected return are limited on two fronts: first, an infinite number of return distributions with a wide range of performance-reproducibility trade-offs can have the same expected return, limiting the metric's effectiveness when used for comparing policies; second, the expected return metric does not leave any room for practitioners to choose the best trade-off value for considered applications. In this work, we address these limitations by recommending the use of the Lower Confidence Bound, a metric taken from Bayesian optimisation that provides the user with a preference parameter to choose a desired performance-reproducibility trade-off. +We also formalise and quantify policy reproducibility, and demonstrate the benefit of our metrics using extensive experiments with popular RL algorithms on common uncertain RL tasks. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities b/data/2024/aaai/Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities new file mode 100644 index 0000000000..387d303f35 --- /dev/null +++ b/data/2024/aaai/Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities @@ -0,0 +1,4 @@ +Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist due to the same events being referred to on many semantic levels. For example, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and an airplane being "shot" (in text), leading to a hierarchical, multimodal relationship between the events. + + +In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method, which demonstrates improved performance on this task, and highlight opportunities for future research.
Data: https://github.com/hayyubi/multihieve \ No newline at end of file diff --git a/data/2024/aaai/Beyond Mimicking Under-Represented Emotions: Deep Data Augmentation with Emotional Subspace Constraints for EEG-Based Emotion Recognition b/data/2024/aaai/Beyond Mimicking Under-Represented Emotions: Deep Data Augmentation with Emotional Subspace Constraints for EEG-Based Emotion Recognition new file mode 100644 index 0000000000..5e7b7d626c --- /dev/null +++ b/data/2024/aaai/Beyond Mimicking Under-Represented Emotions: Deep Data Augmentation with Emotional Subspace Constraints for EEG-Based Emotion Recognition @@ -0,0 +1,2 @@ +In recent years, using Electroencephalography (EEG) to recognize emotions has garnered considerable attention. Despite advancements, limited EEG data restricts its potential. Thus, Generative Adversarial Networks (GANs) are proposed to mimic the observed distributions and generate EEG data. However, for imbalanced datasets, GANs struggle to produce reliable augmentations for under-represented minority emotions by merely mimicking them. Thus, we introduce Emotional Subspace Constrained Generative Adversarial Networks (ESC-GAN) as an alternative to existing frameworks. We first propose the EEG editing paradigm, editing reference EEG signals from well-represented to under-represented emotional subspaces. Then, we introduce diversity-aware and +boundary-aware losses to constrain the augmented subspace. Here, the diversity-aware loss encourages a diverse emotional subspace by enlarging the sample difference, while boundary-aware loss constrains the augmented subspace near the decision boundary where recognition models can be vulnerable. Experiments show ESC-GAN boosts emotion recognition performance on benchmark datasets, DEAP, AMIGOS, and SEED, while protecting against potential adversarial attacks. Finally, the proposed method opens new avenues for editing EEG signals under emotional subspace constraints, facilitating unbiased and secure EEG data augmentation. \ No newline at end of file diff --git a/data/2024/aaai/Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning b/data/2024/aaai/Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning new file mode 100644 index 0000000000..4d16ba78ff --- /dev/null +++ b/data/2024/aaai/Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning @@ -0,0 +1 @@ +Offline reinforcement learning (RL) aims to learn a policy using only pre-collected and fixed data. Although avoiding the time-consuming online interactions in RL, it poses challenges for out-of-distribution (OOD) state actions and often suffers from data inefficiency for training. Despite many efforts being devoted to addressing OOD state actions, the latter (data inefficiency) receives little attention in offline RL. To address this, this paper proposes the cross-domain offline RL, which assumes offline data incorporate additional source-domain data from varying transition dynamics (environments), and expects it to contribute to the offline data efficiency. To do so, we identify a new challenge of OOD transition dynamics, beyond the common OOD state actions issue, when utilizing cross-domain offline data. Then, we propose our method BOSA, which employs two support-constrained objectives to address the above OOD issues. 
Through extensive experiments in the cross-domain offline RL setting, we demonstrate BOSA can greatly improve offline data efficiency: using only 10% of the target data, BOSA could achieve 74.4% of the SOTA offline RL performance that uses 100% of the target data. Additionally, we also show BOSA can be effortlessly plugged into model-based offline RL and noising data augmentation techniques (used for generating source-domain data), which naturally avoids the potential dynamics mismatch between target-domain data and newly generated source-domain data. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning b/data/2024/aaai/Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning new file mode 100644 index 0000000000..93ef15b162 --- /dev/null +++ b/data/2024/aaai/Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning @@ -0,0 +1 @@ +One of the ultimate goals of representation learning is to achieve compactness within a class and well-separability between classes. Many outstanding metric-based and prototype-based methods following the Expectation-Maximization paradigm have been proposed for this objective. However, they inevitably introduce biases into the learning process, particularly with long-tail distributed training data. In this paper, we reveal that the class prototype does not necessarily have to be derived from training features, and propose a novel perspective in which pre-defined class anchors serve as feature centroids to unidirectionally guide feature learning. However, the pre-defined anchors may have a large semantic distance from the pixel features, which prevents them from being directly applied. To address this issue and generate a feature centroid independent from feature learning, a simple yet effective Semantic Anchor Regularization (SAR) is proposed. SAR ensures the inter-class separability of semantic anchors in the semantic space by employing a classifier-aware auxiliary cross-entropy loss during training via disentanglement learning. By pulling the learned features to these semantic anchors, several advantages can be attained: 1) intra-class compactness and naturally inter-class separability, 2) induced bias or errors from feature learning can be avoided, and 3) robustness to the long-tailed problem. The proposed SAR can be used in a plug-and-play manner in existing models. Extensive experiments demonstrate that SAR performs better than previous sophisticated prototype-based methods. The implementation is available at https://github.com/geyanqi/SAR. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Traditional Threats: A Persistent Backdoor Attack on Federated Learning b/data/2024/aaai/Beyond Traditional Threats: A Persistent Backdoor Attack on Federated Learning new file mode 100644 index 0000000000..d4a48f264b --- /dev/null +++ b/data/2024/aaai/Beyond Traditional Threats: A Persistent Backdoor Attack on Federated Learning @@ -0,0 +1 @@ +Backdoors on federated learning will be diluted by subsequent benign updates. This is reflected in the significant reduction of attack success rate as iterations increase, ultimately failing. We use a new metric to quantify the degree of this weakened backdoor effect, called attack persistence. Given that research to improve this performance has not been widely noted, we propose a Full Combination Backdoor Attack (FCBA) method.
It aggregates more combined trigger information for a more complete backdoor pattern in the global model. The trained backdoored global model is more resilient to benign updates, leading to a higher attack success rate on the test set. We test on three datasets and evaluate with two models across various settings. FCBA's persistence outperforms SOTA federated learning backdoor attacks. On GTSRB, 120 rounds after the attack, our attack success rate exceeded the baseline by over 50%. The core code of our method is available at https://github.com/PhD-TaoLiu/FCBA. \ No newline at end of file diff --git a/data/2024/aaai/Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions for Tree Ensembles b/data/2024/aaai/Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions for Tree Ensembles new file mode 100644 index 0000000000..f3e06a7c07 --- /dev/null +++ b/data/2024/aaai/Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions for Tree Ensembles @@ -0,0 +1 @@ +While shallow decision trees may be interpretable, larger ensemble models like gradient-boosted trees, which often set the state of the art in machine learning problems involving tabular data, remain black-box models. As a remedy, the Shapley value (SV) is a well-known concept in explainable artificial intelligence (XAI) research for quantifying additive feature attributions of predictions. The model-specific TreeSHAP methodology solves the exponential complexity for retrieving exact SVs from tree-based models. Expanding beyond individual feature attribution, Shapley interactions reveal the impact of intricate feature interactions of any order. In this work, we present TreeSHAP-IQ, an efficient method to compute any-order additive Shapley interactions for predictions of tree-based models. TreeSHAP-IQ is supported by a mathematical framework that exploits polynomial arithmetic to compute the interaction scores in a single recursive traversal of the tree, akin to Linear TreeSHAP. We apply TreeSHAP-IQ on state-of-the-art tree ensembles and explore interactions on well-established benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Beyond the Label Itself: Latent Labels Enhance Semi-supervised Point Cloud Panoptic Segmentation b/data/2024/aaai/Beyond the Label Itself: Latent Labels Enhance Semi-supervised Point Cloud Panoptic Segmentation new file mode 100644 index 0000000000..d745c4567f --- /dev/null +++ b/data/2024/aaai/Beyond the Label Itself: Latent Labels Enhance Semi-supervised Point Cloud Panoptic Segmentation @@ -0,0 +1 @@ +Given the exorbitant expense of labeling autopilot datasets and the growing trend of utilizing unlabeled data, semi-supervised segmentation on point clouds becomes increasingly imperative. Intuitively, finding out more "unspoken words" (i.e., latent instance information) beyond the label itself should be helpful to improve performance. In this paper, we discover two types of latent labels behind the displayed label embedded in LiDAR and image data. First, in the LiDAR Branch, we propose a novel augmentation, Cylinder-Mix, which is able to augment additional yet reliable samples for training. Second, in the Image Branch, we propose the Instance Position-scale Learning (IPSL) Module to learn and fuse instance position and scale information, which comes from a pre-trained 2D detector and is a type of latent label obtained from 3D-to-2D projection. Finally, the two latent labels are embedded into the multi-modal panoptic segmentation network.
The ablation of the IPSL module demonstrates its robust adaptability, and the experiments on SemanticKITTI and nuScenes demonstrate that our model outperforms the state-of-the-art method, LaserMix. \ No newline at end of file diff --git a/data/2024/aaai/Bi-ViT: Pushing the Limit of Vision Transformer Quantization b/data/2024/aaai/Bi-ViT: Pushing the Limit of Vision Transformer Quantization new file mode 100644 index 0000000000..d160b8b714 --- /dev/null +++ b/data/2024/aaai/Bi-ViT: Pushing the Limit of Vision Transformer Quantization @@ -0,0 +1 @@ +Vision transformer (ViT) quantization offers a promising prospect to facilitate deploying large pre-trained networks on resource-limited devices. Fully-binarized ViTs (Bi-ViT), which push the quantization of ViTs to its limit, remain largely unexplored and are still a very challenging task due to their unacceptable performance. Through extensive empirical analyses, we identify that the severe drop in ViT binarization is caused by attention distortion in self-attention, which technically stems from gradient vanishing and ranking disorder. To address these issues, we first introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses. We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework. Bi-ViT achieves significant improvements over popular DeiT and Swin backbones in terms of Top-1 accuracy and FLOPs. For example, with DeiT-Tiny and Swin-Tiny, our method significantly outperforms baselines by 22.1% and 21.4% respectively, while achieving 61.5x and 56.1x theoretical acceleration in terms of FLOPs compared with real-valued counterparts on ImageNet. Our code and models are available at https://github.com/YanjingLi0202/Bi-ViT/. \ No newline at end of file diff --git a/data/2024/aaai/Bi-directional Adapter for Multimodal Tracking b/data/2024/aaai/Bi-directional Adapter for Multimodal Tracking new file mode 100644 index 0000000000..4be790bab0 --- /dev/null +++ b/data/2024/aaai/Bi-directional Adapter for Multimodal Tracking @@ -0,0 +1,2 @@ +Due to the rapid development of computer vision, single-modal (RGB) object tracking has made significant progress in recent years. Considering the limitation of a single imaging +sensor, multi-modal images (RGB, infrared, etc.) are introduced to compensate for this deficiency for all-weather object tracking in complex environments. However, as acquiring sufficient multi-modal tracking data is hard while the dominant modality changes with the open environment, most existing techniques fail to extract multi-modal complementary information dynamically, yielding unsatisfactory tracking performance. To handle this problem, we propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter, cross-prompting multiple modalities mutually. Our model consists of a universal bi-directional adapter and multiple modality-specific transformer encoder branches with shared parameters. The encoders extract features of each modality separately by using a frozen, pre-trained foundation model. We develop a simple but effective lightweight feature adapter to transfer modality-specific information from one modality to another, performing visual feature prompt fusion in an adaptive manner.
By adding only a small number (0.32M) of trainable parameters, our model achieves superior tracking performance in comparison with both full fine-tuning methods and prompt learning-based methods. Our code is available at: https://github.com/SparkTempest/BAT. \ No newline at end of file diff --git a/data/2024/aaai/Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video b/data/2024/aaai/Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video new file mode 100644 index 0000000000..d568bdc9ba --- /dev/null +++ b/data/2024/aaai/Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video @@ -0,0 +1 @@ +Temporal Sentence Grounding in Video (TSGV) is troubled by the dataset bias issue, which is caused by the uneven temporal distribution of the target moments for samples with similar semantic components in input videos or query texts. Existing methods resort to utilizing prior knowledge about bias to artificially break this uneven distribution, which only removes a limited amount of significant language biases. In this work, we propose the bias-conflict sample synthesis and adversarial removal debias strategy (BSSARD), which dynamically generates bias-conflict samples by explicitly leveraging potentially spurious correlations between single-modality features and the temporal position of the target moments. Through adversarial training, its bias generators continuously introduce biases and generate bias-conflict samples to deceive its grounding model. Meanwhile, the grounding model continuously eliminates the introduced biases, which requires it to model multi-modality alignment information. BSSARD will cover most kinds of coupling relationships and disrupt language and visual biases simultaneously. Extensive experiments on Charades-CD and ActivityNet-CD demonstrate the promising debiasing capability of BSSARD. Source codes are available at https://github.com/qzhb/BSSARD. \ No newline at end of file diff --git a/data/2024/aaai/Biases Mitigation and Expressiveness Preservation in Language Models: A Comprehensive Pipeline (Student Abstract) b/data/2024/aaai/Biases Mitigation and Expressiveness Preservation in Language Models: A Comprehensive Pipeline (Student Abstract) new file mode 100644 index 0000000000..fa658cd561 --- /dev/null +++ b/data/2024/aaai/Biases Mitigation and Expressiveness Preservation in Language Models: A Comprehensive Pipeline (Student Abstract) @@ -0,0 +1 @@ +Pre-trained language models (PLMs) have greatly transformed various downstream tasks, yet frequently display social biases from training data, raising fairness concerns. Recent efforts to debias PLMs come with limitations: they either fine-tune all the parameters of PLMs, which is time-consuming and disregards the expressiveness of PLMs, or ignore the biases reintroduced by downstream tasks when debiased models are applied to them. Hence, we propose a two-stage pipeline to mitigate biases from both internal and downstream contexts while preserving expressiveness in language models. Specifically, for the debiasing procedure, we resort to continuous prefix-tuning rather than fully fine-tuning the PLM, in which we design a debiasing term for optimization and an alignment term to keep words’ relative distances and ensure the model's expressiveness. For downstream tasks, we perform causal intervention across different demographic groups for invariant predictions.
Results on three GLUE tasks show that our method alleviates biases from internal and downstream contexts, while keeping PLM expressiveness intact. \ No newline at end of file diff --git a/data/2024/aaai/Bidirectional Contrastive Split Learning for Visual Question Answering b/data/2024/aaai/Bidirectional Contrastive Split Learning for Visual Question Answering new file mode 100644 index 0000000000..6204be3cbd --- /dev/null +++ b/data/2024/aaai/Bidirectional Contrastive Split Learning for Visual Question Answering @@ -0,0 +1 @@ +Visual Question Answering (VQA) based on multi-modal data facilitates real-life applications such as home robots and medical diagnoses. One significant challenge is to devise a robust decentralized learning framework for various client models where centralized data collection is avoided due to confidentiality concerns. This work aims to tackle privacy-preserving VQA by decoupling a multi-modal model into representation modules and a contrastive module, leveraging inter-module gradient sharing and inter-client weight sharing. To this end, we propose Bidirectional Contrastive Split Learning (BiCSL) to train a global multi-modal model on the entire data distribution of decentralized clients. We employ a contrastive loss that enables more efficient self-supervised learning of decentralized modules. Comprehensive experiments are conducted on the VQA-v2 dataset based on five SOTA VQA models, demonstrating the effectiveness of the proposed method. Furthermore, we inspect BiCSL's robustness against a dual-key backdoor attack on VQA. Consequently, BiCSL shows significantly enhanced resilience when exposed to the multi-modal adversarial attack compared to the centralized learning method, which provides a promising approach to decentralized multi-modal learning. \ No newline at end of file diff --git a/data/2024/aaai/Bidirectional Temporal Plan Graph: Enabling Switchable Passing Orders for More Efficient Multi-Agent Path Finding Plan Execution b/data/2024/aaai/Bidirectional Temporal Plan Graph: Enabling Switchable Passing Orders for More Efficient Multi-Agent Path Finding Plan Execution new file mode 100644 index 0000000000..d607ce4f6d --- /dev/null +++ b/data/2024/aaai/Bidirectional Temporal Plan Graph: Enabling Switchable Passing Orders for More Efficient Multi-Agent Path Finding Plan Execution @@ -0,0 +1 @@ +The Multi-Agent Path Finding (MAPF) problem involves planning collision-free paths for multiple agents in a shared environment. The majority of MAPF solvers rely on the assumption that an agent can arrive at a specific location at a specific timestep. However, real-world execution uncertainties can cause agents to deviate from this assumption, leading to collisions and deadlocks. Prior research solves this problem by having agents follow a Temporal Plan Graph (TPG), enforcing a consistent passing order at every location as defined in the MAPF plan. However, we show that TPGs are overly strict because, in some circumstances, satisfying the passing order requires agents to wait unnecessarily, leading to longer execution times. To overcome this issue, we introduce a new graphical representation called a Bidirectional Temporal Plan Graph (BTPG), which allows switching passing orders during execution to avoid unnecessary waiting time. We design two anytime algorithms for constructing a BTPG: BTPG-naïve and BTPG-optimized. Experimental results show that following BTPGs consistently outperforms following TPGs, reducing unnecessary waits by 8-20%.
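To make the switchable-passing-order idea in the BTPG abstract above concrete, here is a minimal, illustrative Python sketch (not the paper's BTPG-naïve or BTPG-optimized algorithm): a precedence constraint between two agents at a shared location is left undecided and is only fixed at execution time in favor of whichever agent arrives first. The class and method names are assumptions.

# Illustrative sketch of a switchable passing-order constraint; not the authors' implementation.
class SwitchableEdge:
    """Passing-order constraint between two agents at one shared location."""
    def __init__(self, agent_a, agent_b):
        self.agents = (agent_a, agent_b)
        self.first = None  # passing order is undecided until execution time

    def settle(self, arrived_agent):
        # The first agent to actually reach the location fixes the order.
        if self.first is None and arrived_agent in self.agents:
            self.first = arrived_agent
        return self.first

    def must_wait(self, agent, other_has_passed):
        # An agent waits only if the other agent was fixed to go first and has not passed yet.
        if self.first is None or self.first == agent:
            return False
        return not other_has_passed

edge = SwitchableEdge("a1", "a2")
edge.settle("a2")                                      # a2 happens to arrive first
print(edge.must_wait("a1", other_has_passed=False))    # True: a1 waits for a2
print(edge.must_wait("a1", other_has_passed=True))     # False: a1 may proceed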
\ No newline at end of file diff --git a/data/2024/aaai/Big Learning Expectation Maximization b/data/2024/aaai/Big Learning Expectation Maximization new file mode 100644 index 0000000000..a8105c0bce --- /dev/null +++ b/data/2024/aaai/Big Learning Expectation Maximization @@ -0,0 +1 @@ +Mixture models serve as a fundamental tool with versatile applications. However, their training techniques, like the popular Expectation Maximization (EM) algorithm, are notoriously sensitive to parameter initialization and often suffer from bad local optima that can be arbitrarily worse than the optimum. To address the long-lasting bad-local-optima challenge, we draw inspiration from the recent ground-breaking foundation models and propose to leverage their underlying big learning principle to upgrade EM. Specifically, we present the Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs joint, marginal, and orthogonally transformed marginal matchings between data and model distributions. Through simulated experiments, we empirically show that BigLearn-EM is capable of delivering the optimal solution with high probability; comparisons on benchmark clustering datasets further demonstrate its effectiveness and advantages over existing techniques. The code is available at https://github.com/YulaiCong/Big-Learning-Expectation-Maximization. \ No newline at end of file diff --git a/data/2024/aaai/Bilateral Gradual Semantics for Weighted Argumentation b/data/2024/aaai/Bilateral Gradual Semantics for Weighted Argumentation new file mode 100644 index 0000000000..859664de7d --- /dev/null +++ b/data/2024/aaai/Bilateral Gradual Semantics for Weighted Argumentation @@ -0,0 +1 @@ +Abstract argumentation is a reasoning model for evaluating arguments. Recently, gradual semantics has received considerable attention in weighted argumentation, which assigns an acceptability degree to each argument as its strength. In this paper, we aim to enhance gradual semantics by non-reciprocally incorporating the notion of rejectability degree. Such a setting offers a bilateral perspective on argument strength, enabling more comprehensive argument evaluations in practical situations. To this end, we first provide a set of principles for our semantics, taking both the acceptability and rejectability degrees into account, and propose three novel semantics conforming to the above principles. These semantics are defined as the limits of iterative sequences that always converge in any given weighted argumentation system, making them preferable for real-world applications. \ No newline at end of file diff --git a/data/2024/aaai/Biomedical Knowledge Graph Embedding with Householder Projection (Student Abstract) b/data/2024/aaai/Biomedical Knowledge Graph Embedding with Householder Projection (Student Abstract) new file mode 100644 index 0000000000..6597d43165 --- /dev/null +++ b/data/2024/aaai/Biomedical Knowledge Graph Embedding with Householder Projection (Student Abstract) @@ -0,0 +1 @@ +Researchers have applied knowledge graph embedding (KGE) techniques with advanced neural network techniques, such as capsule networks, for predicting drug-drug interactions (DDIs) and achieved remarkable results. However, most of them ignore the molecular structure and position features between drug pairs. They also cannot model the relational mapping properties (RMPs: 1-N, N-1, N-N) that are significant in the biomedical field.
To solve these problems, we propose CDHse, which consists of two crucial modules: 1) Entity embedding module: we obtain the position feature with PubMedBERT and a Convolutional Neural Network (CNN), obtain the molecular structure feature with a Graph Neural Network (GNN), obtain the entity embedding feature of drug pairs, and then incorporate these features into one synthetic feature. 2) Knowledge graph embedding module: the synthetic feature is transformed by Householder projections and then embedded in the complex vector space for training. In this paper, we have selected several advanced models for the DDI task and performed experiments on three standard BioKGs to validate the effectiveness of CDHse. \ No newline at end of file diff --git a/data/2024/aaai/BirdCollect: A Comprehensive Benchmark for Analyzing Dense Bird Flock Attributes b/data/2024/aaai/BirdCollect: A Comprehensive Benchmark for Analyzing Dense Bird Flock Attributes new file mode 100644 index 0000000000..c492564ace --- /dev/null +++ b/data/2024/aaai/BirdCollect: A Comprehensive Benchmark for Analyzing Dense Bird Flock Attributes @@ -0,0 +1 @@ +Automatic recognition of bird behavior from long-term, uncontrolled outdoor imagery can contribute to conservation efforts by enabling large-scale monitoring of bird populations. Current techniques in AI-based wildlife monitoring have focused on short-term tracking and monitoring birds individually rather than in species-rich flocks. We present Bird-Collect, a comprehensive benchmark dataset for monitoring dense bird flock attributes. It includes a unique collection of more than 6,000 high-resolution images of Demoiselle Cranes (Anthropoides virgo) feeding and nesting in the vicinity of the Khichan region of Rajasthan. Particularly, each image contains an average of 190 individual birds, illustrating the complex dynamics of densely populated bird flocks on a scale that has not previously been studied. In addition, a total of 433 distinct pictures captured at Keoladeo National Park, Bharatpur provide a comprehensive representation of 34 distinct bird species belonging to various taxonomic groups. These images offer insights into the diversity and behaviour of birds in vital natural ecosystems along the migratory flyways. Additionally, we provide a set of 2,500 point-annotated samples which serve as ground truth for benchmarking various computer vision tasks like crowd counting, density estimation, segmentation, and species classification. The benchmark performance for these tasks highlights the need for tailored approaches for specific wildlife applications, which involve varied conditions, including viewpoints, illumination, and resolutions. At around 46.2 GB in size, encompassing data collected from two distinct nesting grounds, it is the largest bird dataset containing detailed annotations, showcasing a substantial leap in bird research possibilities. We intend to publicly release the dataset to the research community.
The database is available at: https://iab-rubric.org/resources/wildlife-dataset/birdcollect \ No newline at end of file diff --git a/data/2024/aaai/Blind Face Restoration under Extreme Conditions: Leveraging 3D-2D Prior Fusion for Superior Structural and Texture Recovery b/data/2024/aaai/Blind Face Restoration under Extreme Conditions: Leveraging 3D-2D Prior Fusion for Superior Structural and Texture Recovery new file mode 100644 index 0000000000..1ad8fc4eca --- /dev/null +++ b/data/2024/aaai/Blind Face Restoration under Extreme Conditions: Leveraging 3D-2D Prior Fusion for Superior Structural and Texture Recovery @@ -0,0 +1 @@ +Blind face restoration under extreme conditions involves reconstructing high-quality face images from severely degraded inputs. These input images are often of poor quality and have extreme facial poses, leading to errors in facial structure and unnatural artifacts within the restored images. In this paper, we show that utilizing 3D priors effectively compensates for structure knowledge deficiencies in 2D priors while preserving the texture details. Based on this, we introduce FREx (Face Restoration under Extreme conditions) that combines structure-accurate 3D priors and texture-rich 2D priors in pretrained generative networks for blind face restoration under extreme conditions. To fuse the different information in 3D and 2D priors, we introduce an adaptive weight module that adjusts the importance of features based on the input image's condition. With this approach, our model can restore structure-accurate and natural-looking faces even when the images have lost substantial information due to degradation and extreme pose. Extensive experimental results on synthetic and real-world datasets validate the effectiveness of our methods. \ No newline at end of file diff --git a/data/2024/aaai/Blind-Touch: Homomorphic Encryption-Based Distributed Neural Network Inference for Privacy-Preserving Fingerprint Authentication b/data/2024/aaai/Blind-Touch: Homomorphic Encryption-Based Distributed Neural Network Inference for Privacy-Preserving Fingerprint Authentication new file mode 100644 index 0000000000..be653f356b --- /dev/null +++ b/data/2024/aaai/Blind-Touch: Homomorphic Encryption-Based Distributed Neural Network Inference for Privacy-Preserving Fingerprint Authentication @@ -0,0 +1 @@ +Fingerprint authentication is a popular security mechanism for smartphones and laptops. However, its adoption in web and cloud environments has been limited due to privacy concerns over storing and processing biometric data on servers. This paper introduces Blind-Touch, a novel machine learning-based fingerprint authentication system leveraging homomorphic encryption to address these privacy concerns. Homomorphic encryption allows computations on encrypted data without decrypting it. Thus, Blind-Touch can keep fingerprint data encrypted on the server while performing machine learning operations.
Blind-Touch combines three strategies to efficiently utilize homomorphic encryption in machine learning: (1) It optimizes the feature vector for a distributed architecture, processing the first fully connected layer (FC-16) in plaintext on the client side and the subsequent layer (FC-1) post-encryption on the server, thereby minimizing encrypted computations; (2) It employs a homomorphic encryption-compatible data compression technique capable of handling 8,192 authentication results concurrently; and (3) It utilizes a clustered server architecture to simultaneously process authentication results, thereby enhancing scalability with increasing user numbers. Blind-Touch achieves high accuracy on two benchmark fingerprint datasets, with a 93.6% F1-score for the PolyU dataset and a 98.2% F1-score for the SOKOTO dataset. Moreover, Blind-Touch can match a fingerprint among 5,000 candidates in about 0.65 seconds. With its privacy-focused design, high accuracy, and efficiency, Blind-Touch is a promising alternative to conventional fingerprint authentication for web and cloud applications. \ No newline at end of file diff --git a/data/2024/aaai/Block Image Compressive Sensing with Local and Global Information Interaction b/data/2024/aaai/Block Image Compressive Sensing with Local and Global Information Interaction new file mode 100644 index 0000000000..b3d6a56576 --- /dev/null +++ b/data/2024/aaai/Block Image Compressive Sensing with Local and Global Information Interaction @@ -0,0 +1,9 @@ +Block image compressive sensing methods, which divide a single image into small blocks for efficient sampling and reconstruction, have achieved significant success. +However, these methods process each block locally and thus disregard the global communication among different blocks in the reconstruction step. +Existing methods have attempted to address this issue with local filters or by directly reconstructing the entire image, but they achieve only limited communication among adjacent pixels or bypass the problem altogether. +To directly confront the communication problem among blocks and effectively resolve it, we propose a novel approach called Block Reconstruction with Blocks' Communication Network (BRBCN). +BRBCN focuses on both local and global information, while further taking their interactions into account. +Specifically, BRBCN comprises dual CNN and Transformer architectures, in which the CNN is used to reconstruct each block for powerful local processing and the Transformer is used to calculate the global communication among all the blocks. +Moreover, we propose a global-to-local module (G2L) and a local-to-global module (L2G) to effectively integrate the representations of the CNN and Transformer, with which our BRBCN network realizes the bidirectional interaction between local and global information. +Extensive experiments show our BRBCN method outperforms existing state-of-the-art methods by a large margin. +The code is available at https://github.com/kongxiuxiu/BRBCN \ No newline at end of file diff --git a/data/2024/aaai/Block-Level Goal Recognition Design b/data/2024/aaai/Block-Level Goal Recognition Design new file mode 100644 index 0000000000..27a658e77e --- /dev/null +++ b/data/2024/aaai/Block-Level Goal Recognition Design @@ -0,0 +1 @@ +Existing works on goal recognition design (GRD) consider the underlying domain as a classical planning domain and apply modifications to the domain to minimize the worst-case distinctiveness.
In this paper, we propose replacing existing modifications with blocks, which group several closely related modifications together such that a block can modify a region in a search space with respect to some design constraints. Moreover, there could be blocks within blocks such that the design space becomes hierarchical for modifications at different levels of granularity. We present 1) a new version of pruned-reduce, a successful pruning rule for GRD, for block-level GRD, and 2) a new pruning rule for pruning some branches in both hierarchical and non-hierarchical design space. Our experiments show that searching in hierarchical design spaces greatly speeds up the redesign process. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping b/data/2024/aaai/Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping new file mode 100644 index 0000000000..6f781c7c2d --- /dev/null +++ b/data/2024/aaai/Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping @@ -0,0 +1 @@ +Adversarial examples generated by a surrogate model typically exhibit limited transferability to unknown target systems. To address this problem, many transferability enhancement approaches (e.g., input transformation and model augmentation) have been proposed. However, they show poor performances in attacking systems having different model genera from the surrogate model. In this paper, we propose a novel and generic attacking strategy, called Deformation-Constrained Warping Attack (DeCoWA), that can be effectively applied to cross model genus attack. Specifically, DeCoWA firstly augments input examples via an elastic deformation, namely Deformation-Constrained Warping (DeCoW), to obtain rich local details of the augmented input. To avoid severe distortion of global semantics led by random deformation, DeCoW further constrains the strength and direction of the warping transformation by a novel adaptive control strategy. Extensive experiments demonstrate that the transferable examples crafted by our DeCoWA on CNN surrogates can significantly hinder the performance of Transformers (and vice versa) on various tasks, including image classification, video action recognition, and audio recognition. Code is made available at https://github.com/LinQinLiang/DeCoWA. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Few-Shot Learning via Attentive Feature Regularization b/data/2024/aaai/Boosting Few-Shot Learning via Attentive Feature Regularization new file mode 100644 index 0000000000..2ef1398135 --- /dev/null +++ b/data/2024/aaai/Boosting Few-Shot Learning via Attentive Feature Regularization @@ -0,0 +1 @@ +Few-shot learning (FSL) based on manifold regularization aims to improve the recognition capacity of novel objects with limited training samples by mixing two samples from different categories with a blending factor. However, this mixing operation weakens the feature representation due to the linear interpolation and the overlooking of the importance of specific channels. To solve these issues, this paper proposes attentive feature regularization (AFR) which aims to improve the feature representativeness and discriminability. In our approach, we first calculate the relations between different categories of semantic labels to pick out the related features used for regularization. 
Then, we design two attention-based calculations at both the instance and channel levels. These calculations enable the regularization procedure to focus on two crucial aspects: the feature complementarity through adaptive interpolation in related categories and the emphasis on specific feature channels. Finally, we combine these regularization strategies to significantly improve the classifier performance. Empirical studies on several popular FSL benchmarks demonstrate the effectiveness of AFR, which improves the recognition accuracy of novel categories without the need to retrain any feature extractor, especially in the 1-shot setting. Furthermore, the proposed AFR can seamlessly integrate into other FSL methods to improve classification performance. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Multiple Instance Learning Models for Whole Slide Image Classification: A Model-Agnostic Framework Based on Counterfactual Inference b/data/2024/aaai/Boosting Multiple Instance Learning Models for Whole Slide Image Classification: A Model-Agnostic Framework Based on Counterfactual Inference new file mode 100644 index 0000000000..efc56712a2 --- /dev/null +++ b/data/2024/aaai/Boosting Multiple Instance Learning Models for Whole Slide Image Classification: A Model-Agnostic Framework Based on Counterfactual Inference @@ -0,0 +1 @@ +Multiple instance learning is an effective paradigm for whole slide image (WSI) classification, where labels are only provided at the bag level. However, instance-level prediction is also crucial as it offers insights into fine-grained regions of interest. Existing multiple instance learning methods either solely focus on training a bag classifier or have the insufficient capability of exploring instance prediction. In this work, we propose a novel model-agnostic framework to boost existing multiple instance learning models, to improve the WSI classification performance in both bag and instance levels. Specifically, we propose a counterfactual inference-based sub-bag assessment method and a hierarchical instance searching strategy to help to search reliable instances and obtain their accurate pseudo labels. Furthermore, an instance classifier is well-trained to produce accurate predictions. The instance embedding it generates is treated as a prompt to refine the instance feature for bag prediction. This framework is model-agnostic, capable of adapting to existing multiple instance learning models, including those without specific mechanisms like attention. Extensive experiments on three datasets demonstrate the competitive performance of our method. Code will be available at https://github.com/centurion-crawler/CIMIL. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Neural Cognitive Diagnosis with Student's Affective State Modeling b/data/2024/aaai/Boosting Neural Cognitive Diagnosis with Student's Affective State Modeling new file mode 100644 index 0000000000..42b7bcecd1 --- /dev/null +++ b/data/2024/aaai/Boosting Neural Cognitive Diagnosis with Student's Affective State Modeling @@ -0,0 +1 @@ +Cognitive Diagnosis Modeling aims to infer students' proficiency level on knowledge concepts from their response logs. Existing methods typically model students’ response processes as the interaction between students and exercises or concepts based on hand-crafted or deeply-learned interaction functions. 
Despite their promising achievements, they fail to consider the relationship between students' cognitive states and affective states in learning, e.g., the feelings of frustration, boredom, or confusion with the learning content, which is insufficient for comprehensive cognitive diagnosis in intelligent education. To fill the research gap, we propose a novel Affect-aware Cognitive Diagnosis (ACD) model which can effectively diagnose the knowledge proficiency levels of students by taking into consideration the affective factors. Specifically, we first design a student affect perception module under the assumption that the affective state is jointly influenced by the student's affect trait and the difficulty of the exercise. Then, our inferred affective distribution is further used to estimate the student's subjective factors, i.e., guessing and slipping. Finally, we integrate the estimated guessing and slipping parameters with the basic neural cognitive diagnosis framework based on the DINA model, which facilitates the modeling of complex exercise interactions in a more accurate and interpretable fashion. Besides, we also extend our affect perception module to an unsupervised learning setting based on contrastive learning, thus significantly improving the compatibility of our ACD. To the best of our knowledge, we are the first to unify cognition modeling and affect modeling into the same framework for student cognitive diagnosis. Extensive experiments on real-world datasets clearly demonstrate the effectiveness of our ACD. Our code is available at https://github.com/zeng-zhen/ACD. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Residual Networks with Group Knowledge b/data/2024/aaai/Boosting Residual Networks with Group Knowledge new file mode 100644 index 0000000000..1c11f2ab0a --- /dev/null +++ b/data/2024/aaai/Boosting Residual Networks with Group Knowledge @@ -0,0 +1 @@ +Recent research understands residual networks from a new perspective, as implicit ensemble models. From this view, previous methods such as stochastic depth and stimulative training have further improved the performance of the residual network by sampling and training its subnets. However, they both use the same supervision for all subnets of different capacities and neglect the valuable knowledge generated by subnets during training. In this manuscript, we mitigate the significant knowledge distillation gap caused by using the same kind of supervision and advocate leveraging the subnets to provide diverse knowledge. Based on this motivation, we propose a group knowledge based training framework for boosting the performance of residual networks. Specifically, we implicitly divide all subnets into hierarchical groups by subnet-in-subnet sampling, aggregate the knowledge of different subnets in each group during training, and exploit upper-level group knowledge to supervise lower-level subnet groups. Meanwhile, we also develop a subnet sampling strategy that naturally samples larger subnets, which are found to be more helpful than smaller subnets in boosting performance for hierarchical groups. Compared with typical subnet training and other methods, our method achieves the best efficiency and performance trade-offs on multiple datasets and network structures. The code is at https://github.com/tsj-001/AAAI24-GKT.
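As a rough illustration of the group supervision described in the abstract above (not the authors' released code at the linked repository), the sketch below aggregates the logits of subnets in an upper-level group and distills them into a lower-level subnet with a temperature-scaled KL loss. The function name, grouping, and temperature are assumptions.

import torch
import torch.nn.functional as F

def group_distill_loss(lower_logits, upper_group_logits, temperature=4.0):
    """KL distillation from the averaged knowledge of an upper-level subnet group
    to one lower-level subnet. Illustrative sketch only, not the paper's exact loss."""
    teacher = torch.stack(upper_group_logits, dim=0).mean(dim=0)   # aggregate group knowledge
    p_teacher = F.softmax(teacher / temperature, dim=-1)
    log_p_student = F.log_softmax(lower_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Example with random logits standing in for subnet outputs (batch of 8, 10 classes).
upper = [torch.randn(8, 10) for _ in range(3)]   # three subnets in the upper-level group
lower = torch.randn(8, 10)                       # one lower-level subnet
print(group_distill_loss(lower, upper).item())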
\ No newline at end of file diff --git a/data/2024/aaai/Bootstrapping Cognitive Agents with a Large Language Model b/data/2024/aaai/Bootstrapping Cognitive Agents with a Large Language Model new file mode 100644 index 0000000000..d6ce87d10b --- /dev/null +++ b/data/2024/aaai/Bootstrapping Cognitive Agents with a Large Language Model @@ -0,0 +1 @@ +Large language models contain noisy general knowledge of the world, yet are hard to train or fine-tune. In contrast, cognitive architectures have excellent interpretability and are flexible to update but require substantial manual work to instantiate. In this work, we combine the best of both worlds: bootstrapping a cognitive-based model with the noisy knowledge encoded in large language models. Through an embodied agent doing kitchen tasks, we show that our proposed framework yields better efficiency compared to an agent entirely based on large language models. Our experiments also indicate that the cognitive agent bootstrapped using this framework can generalize to novel environments and be scaled to complex tasks. \ No newline at end of file diff --git a/data/2024/aaai/Bootstrapping Large Language Models for Radiology Report Generation b/data/2024/aaai/Bootstrapping Large Language Models for Radiology Report Generation new file mode 100644 index 0000000000..aee8c6265f --- /dev/null +++ b/data/2024/aaai/Bootstrapping Large Language Models for Radiology Report Generation @@ -0,0 +1 @@ +Radiology report generation (RRG) aims to automatically generate a free-text description from a specific clinical radiograph, e.g., chest X-Ray images. Existing approaches tend to perform RRG with task-specific models trained from scratch on public yet limited data, which often leads to inferior performance owing to insufficient capability in both aligning visual and textual features and generating informative reports accordingly. Currently, large language models (LLMs) offer a promising solution to text generation with their power in learning from big data, especially for cross-modal scenarios such as RRG. However, most existing LLMs are pre-trained on general data, and suffer from the same problem as conventional approaches, caused by the knowledge gap between the general and medical domains, if they are applied to RRG. Therefore, in this paper, we propose an approach to bootstrapping LLMs for RRG with an in-domain instance induction and a coarse-to-fine decoding process. Specifically, the in-domain instance induction process learns to align the LLM to radiology reports from general texts through contrastive learning. The coarse-to-fine decoding performs a text elevating process for the reports from the ranker, further enhanced with visual features and refinement prompts. Experimental results on two prevailing RRG datasets, namely, IU X-Ray and MIMIC-CXR, demonstrate the superiority of our approach over previous state-of-the-art solutions. Further analyses illustrate that, for the LLM, the induction process enables it to better align with the medical domain and the coarse-to-fine generation allows it to conduct more precise text generation.
\ No newline at end of file diff --git a/data/2024/aaai/Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text b/data/2024/aaai/Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text new file mode 100644 index 0000000000..428a816881 --- /dev/null +++ b/data/2024/aaai/Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text @@ -0,0 +1 @@ +Recently, Transformer-based text detection techniques have sought to predict polygons by encoding the coordinates of individual boundary vertices using distinct query features. However, this approach incurs a significant memory overhead and struggles to effectively capture the intricate relationships between vertices belonging to the same instance. Consequently, irregular text layouts often lead to the prediction of outlined vertices, diminishing the quality of results. To address these challenges, we present an innovative approach rooted in Sparse R-CNN: a cascade decoding pipeline for polygon prediction. Our method ensures precision by iteratively refining polygon predictions, considering both the scale and location of preceding results. Leveraging this stabilized regression pipeline, even employing just a single feature vector to guide polygon instance regression yields promising detection results. Simultaneously, leveraging instance-level feature proposals substantially enhances memory efficiency (>50% less vs. the SOTA method DPText-DETR) and reduces inference time (>40% less vs. DPText-DETR) with comparable performance on benchmarks. The code is available at https://github.com/Albertchen98/Box2Poly.git. \ No newline at end of file diff --git a/data/2024/aaai/Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA b/data/2024/aaai/Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA new file mode 100644 index 0000000000..6066a7dcfd --- /dev/null +++ b/data/2024/aaai/Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA @@ -0,0 +1 @@ +In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are utilized in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach utilizes a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines 2D and 3D modalities and captures fine-grained correlations between modalities, allowing them to mutually augment each other. Integrating the mechanisms proposed above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art performance on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at https://github.com/matthewdm0816/BridgeQA.
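The question-conditional view selection described in the BridgeQA abstract above can be pictured with a small sketch (an assumption-laden illustration, not the code in the linked repository): candidate 2D views and the question are embedded with a vision-language model such as CLIP, and the top-scoring views by cosine similarity are kept as the 2D input.

import numpy as np

def select_views(view_embs, question_emb, top_k=2):
    """Rank candidate 2D view embeddings by cosine similarity with the question
    embedding and keep the top_k indices. Embeddings are assumed to come from a
    vision-language model (e.g., CLIP); this is an illustrative sketch."""
    v = view_embs / np.linalg.norm(view_embs, axis=1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    scores = v @ q
    return np.argsort(-scores)[:top_k]

# Example with random stand-in embeddings: 5 candidate views, 512-d features.
views = np.random.randn(5, 512)
question = np.random.randn(512)
print(select_views(views, question, top_k=2))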
\ No newline at end of file diff --git a/data/2024/aaai/Bridging the Gap between Source Code and Requirements Using GPT (Student Abstract) b/data/2024/aaai/Bridging the Gap between Source Code and Requirements Using GPT (Student Abstract) new file mode 100644 index 0000000000..026cb5e44b --- /dev/null +++ b/data/2024/aaai/Bridging the Gap between Source Code and Requirements Using GPT (Student Abstract) @@ -0,0 +1 @@ +Reverse engineering involves analyzing the design, architecture, and functionality of systems, and is crucial for legacy systems. Legacy systems are outdated software systems that are still in use and often lack proper documentation, which makes their maintenance and evolution challenging. To address this, we introduce SC2Req, utilizing the Generative Pre-trained Transformer (GPT) for automated code analysis and requirement generation. This approach aims to convert source code into understandable requirements and bridge the gap between the two. Through experiments on diverse software projects, SC2Req shows the potential to enhance the accuracy and efficiency of the translation process. This approach not only facilitates faster software development and easier maintenance of legacy systems but also lays a strong foundation for future research, promoting better understanding and communication in software development. \ No newline at end of file diff --git a/data/2024/aaai/Bridging the Semantic Latent Space between Brain and Machine: Similarity Is All You Need b/data/2024/aaai/Bridging the Semantic Latent Space between Brain and Machine: Similarity Is All You Need new file mode 100644 index 0000000000..ba53eab67b --- /dev/null +++ b/data/2024/aaai/Bridging the Semantic Latent Space between Brain and Machine: Similarity Is All You Need @@ -0,0 +1 @@ +How our brain encodes complex concepts has been a longstanding mystery in neuroscience. The answer to this problem can lead to new understandings about how the brain retrieves information in large-scale data with high efficiency and robustness. Neuroscience studies suggest the brain represents concepts using a locality-sensitive hashing (LSH) strategy, i.e., similar concepts will be represented by similar responses. This finding has inspired the design of similarity-based algorithms, especially in contrastive learning. Here, we hypothesize that the brain and large neural network models, both using similarity-based learning rules, could contain a similar semantic embedding space. To verify that, this paper proposes a functional Magnetic Resonance Imaging (fMRI) semantic learning network named BrainSem, aimed at seeking a joint semantic latent space that bridges the brain and a Contrastive Language-Image Pre-training (CLIP) model. Given that our perception is inherently cross-modal, we introduce a fuzzy (one-to-many) matching loss function to encourage the models to extract high-level semantic components from neural signals. Our results show that, using only a small set of fMRI recordings for semantic space alignment, we can obtain a shared embedding that remains valid for categories unseen during training, which provides potential evidence for the semantic representation similarity between the brain and large neural networks. In a zero-shot classification task, our BrainSem achieves an 11.6% improvement over the state-of-the-art.
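The fuzzy (one-to-many) matching loss is only named in the abstract; one plausible reading, sketched below under assumed details, is a soft cross-entropy in which each fMRI embedding is matched to a weighted set of CLIP concept embeddings rather than to a single hard positive. Everything here (weighting scheme, temperature, shapes) is an illustrative assumption.

```python
# Illustrative (not the authors') fuzzy one-to-many matching loss: each fMRI
# embedding is encouraged to be close to several weighted CLIP embeddings
# instead of one hard positive.
import torch
import torch.nn.functional as F

def fuzzy_match_loss(fmri_emb, clip_emb, soft_targets, temperature=0.1):
    """
    fmri_emb:     (batch, dim)          embeddings predicted from fMRI recordings
    clip_emb:     (num_concepts, dim)   CLIP embeddings of candidate concepts
    soft_targets: (batch, num_concepts) nonnegative match weights; each row sums to 1
    """
    f = F.normalize(fmri_emb, dim=-1)
    c = F.normalize(clip_emb, dim=-1)
    log_probs = F.log_softmax(f @ c.t() / temperature, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()   # soft cross-entropy

# Toy usage: 4 fMRI samples, 6 candidate concepts, two plausible matches per sample.
fmri = torch.randn(4, 128)
concepts = torch.randn(6, 128)
targets = torch.zeros(4, 6)
targets[torch.arange(4), torch.randint(0, 6, (4,))] = 0.5
targets[torch.arange(4), torch.randint(0, 6, (4,))] += 0.5
print(float(fuzzy_match_loss(fmri, concepts, targets)))
```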
\ No newline at end of file diff --git a/data/2024/aaai/Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model b/data/2024/aaai/Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model new file mode 100644 index 0000000000..1b4c7b3085 --- /dev/null +++ b/data/2024/aaai/Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model @@ -0,0 +1 @@ +Recently, diffusion-based image generation methods have been credited for their remarkable text-to-image generation capabilities, yet they still face challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Our model outputs a photo-realistic image given a text of any language along with a textual description of a scene. The model leverages rendered sketch images as priors, thus awakening the potential multilingual generation ability of the pre-trained Stable Diffusion. Based on the observed influence of the cross-attention map on object placement in generated images, we propose a localized attention constraint in the cross-attention layer to address the unreasonable positioning problem of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both the accuracy of text recognition and the naturalness of foreground-background blending. \ No newline at end of file diff --git a/data/2024/aaai/Build Your Own Robot Friend: An Open-Source Learning Module for Accessible and Engaging AI Education b/data/2024/aaai/Build Your Own Robot Friend: An Open-Source Learning Module for Accessible and Engaging AI Education new file mode 100644 index 0000000000..462c35ac66 --- /dev/null +++ b/data/2024/aaai/Build Your Own Robot Friend: An Open-Source Learning Module for Accessible and Engaging AI Education @@ -0,0 +1 @@ +As artificial intelligence (AI) is playing an increasingly important role in our society and global economy, AI education and literacy have become necessary components in college and K-12 education to prepare students for an AI-powered society. However, current AI curricula have not yet been made accessible and engaging enough for students and schools from all socio-economic backgrounds with different educational goals. In this work, we developed an open-source learning module for college and high school students, which allows students to build their own robot companion from the ground up. This open platform can be used to provide hands-on experience and introductory knowledge about various aspects of AI, including robotics, machine learning (ML), software engineering, and mechanical engineering. Because of the social and personal nature of a socially assistive robot companion, this module also puts a special emphasis on human-centered AI, enabling students to develop a better understanding of human-AI interaction and AI ethics through hands-on learning activities. With open-source documentation, assembly manuals, and affordable materials, students from different socio-economic backgrounds can personalize their learning experience based on their individual educational goals. To evaluate the student-perceived quality of our module, we conducted a usability testing workshop with 15 college students recruited from a minority-serving institution.
Our results indicate that our AI module is effective, easy-to-follow, and engaging, and it increases student interest in studying AI/ML and robotics in the future. We hope that this work will contribute toward accessible and engaging AI education in human-AI interaction for college and high school students. \ No newline at end of file diff --git a/data/2024/aaai/Building Conversational Artifacts to Enable Digital Assistant for APIs and RPAs b/data/2024/aaai/Building Conversational Artifacts to Enable Digital Assistant for APIs and RPAs new file mode 100644 index 0000000000..48cfa26d9f --- /dev/null +++ b/data/2024/aaai/Building Conversational Artifacts to Enable Digital Assistant for APIs and RPAs @@ -0,0 +1 @@ +In the realm of business automation, digital assistants/chatbots are emerging as the primary method for making automation software accessible to users in various business sectors. Access to automation primarily occurs through APIs and RPAs. To effectively convert APIs and RPAs into chatbots on a larger scale, it is crucial to establish an automated process for generating data and training models that can recognize user intentions, identify questions for conversational slot filling, and provide recommendations for subsequent actions. In this paper, we present a technique for enhancing and generating natural language conversational artifacts from API specifications using large language models (LLMs). The goal is to utilize LLMs in the "build" phase to assist humans in creating skills for digital assistants. As a result, the system doesn't need to rely on LLMs during conversations with business users, leading to efficient deployment. Experimental results highlight the effectiveness of our proposed approach. Our system is deployed in the IBM Watson Orchestrate product for general availability. \ No newline at end of file diff --git a/data/2024/aaai/Building Higher-Order Abstractions from the Components of Recommender Systems b/data/2024/aaai/Building Higher-Order Abstractions from the Components of Recommender Systems new file mode 100644 index 0000000000..b464718f74 --- /dev/null +++ b/data/2024/aaai/Building Higher-Order Abstractions from the Components of Recommender Systems @@ -0,0 +1 @@ +We present a modular recommender system framework that tightly integrates yet maintains the independence of individual components, thus satisfying two of the most critical aspects of industrial applications, generality and specificity. On the one hand, we ensure that each component remains self-contained and is ready to serve in other applications beyond recommender systems. On the other hand, when these components are combined, a unified theme emerges for recommender systems. We present the details of each component in the context of recommender systems and other applications. We release each component as an open-source library, and most importantly, we release their integration under MAB2REC, an industry-strength open-source software for building bandit-based recommender systems. By bringing standalone components together, Mab2Rec realizes a powerful and scalable toolchain to build and deploy business-relevant personalization applications. Finally, we share our experience and best practices for user training, adoption, performance evaluation, deployment, and model governance within the enterprise and the broader community. 
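As a concrete illustration of the bandit-based recommendation paradigm that Mab2Rec builds on, here is a generic epsilon-greedy sketch. It deliberately avoids the Mab2Rec API, whose interfaces are not shown here, and every name in it is illustrative rather than taken from the library.

```python
# Generic epsilon-greedy bandit recommender loop, shown only to illustrate the
# bandit-based recommendation idea referenced above. This is NOT the Mab2Rec API;
# consult the project's own documentation for its actual interfaces.
import random
from collections import defaultdict

class EpsilonGreedyRecommender:
    def __init__(self, items, epsilon=0.1):
        self.items = list(items)
        self.epsilon = epsilon
        self.counts = defaultdict(int)     # how often each item was recommended
        self.rewards = defaultdict(float)  # cumulative reward (e.g., clicks) per item

    def recommend(self):
        if random.random() < self.epsilon:               # explore
            return random.choice(self.items)
        # exploit: pick the item with the highest empirical mean reward so far
        return max(self.items,
                   key=lambda i: self.rewards[i] / self.counts[i] if self.counts[i] else 0.0)

    def update(self, item, reward):
        self.counts[item] += 1
        self.rewards[item] += reward

# Toy simulation: item "b" has the highest click probability and should dominate.
random.seed(0)
rec = EpsilonGreedyRecommender(["a", "b", "c"])
true_ctr = {"a": 0.1, "b": 0.5, "c": 0.2}
for _ in range(2000):
    item = rec.recommend()
    rec.update(item, 1.0 if random.random() < true_ctr[item] else 0.0)
print(max(rec.counts, key=rec.counts.get))  # most-recommended item, likely "b"
```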
\ No newline at end of file diff --git a/data/2024/aaai/Building Minimal and Reusable Causal State Abstractions for Reinforcement Learning b/data/2024/aaai/Building Minimal and Reusable Causal State Abstractions for Reinforcement Learning new file mode 100644 index 0000000000..032fe0e9b5 --- /dev/null +++ b/data/2024/aaai/Building Minimal and Reusable Causal State Abstractions for Reinforcement Learning @@ -0,0 +1,5 @@ +Two desiderata of reinforcement learning (RL) algorithms are the ability to learn from relatively little experience and the ability to learn policies that generalize to a range of problem specifications. +In factored state spaces, one approach towards achieving both goals is to learn state abstractions, which only keep the necessary variables for learning the tasks at hand. +This paper introduces Causal Bisimulation Modeling (CBM), a method that learns the causal relationships in the dynamics and reward functions for each task to derive a minimal, task-specific abstraction. +CBM leverages and improves implicit modeling to train a high-fidelity causal dynamics model that can be reused for all tasks in the same environment. +Empirical validation on two manipulation environments and four tasks reveals that CBM's learned implicit dynamics models identify the underlying causal relationships and state abstractions more accurately than explicit ones. Furthermore, the derived state abstractions allow a task learner to achieve near-oracle levels of sample efficiency and outperform baselines on all tasks. \ No newline at end of file diff --git a/data/2024/aaai/Building Variable-Sized Models via Learngene Pool b/data/2024/aaai/Building Variable-Sized Models via Learngene Pool new file mode 100644 index 0000000000..dd859b057f --- /dev/null +++ b/data/2024/aaai/Building Variable-Sized Models via Learngene Pool @@ -0,0 +1,2 @@ +Recently, Stitchable Neural Networks (SN-Net) was proposed to stitch pre-trained networks for quickly building numerous networks with different complexity and performance trade-offs. In this way, the burdens of designing or training variable-sized networks, which can be used in application scenarios with diverse resource constraints, are alleviated. However, SN-Net still faces a few challenges. 1) Stitching from multiple independently pre-trained anchors introduces high storage resource consumption. 2) SN-Net faces challenges in building smaller models for low resource constraints. 3) SN-Net uses an unlearned initialization method for stitch layers, limiting the final performance. +To overcome these challenges, motivated by the recently proposed Learngene framework, we propose a novel method called Learngene Pool. Briefly, Learngene distills the critical knowledge from a large pre-trained model into a small part (termed as learngene) and then expands this small part into a few variable-sized models. In our proposed method, we distill one pre-trained large model into multiple small models whose network blocks are used as learngene instances to construct the learngene pool. Since only one large model is used, we do not need to store multiple large models as SN-Net does, and after distillation, smaller learngene instances can be created to build small models to satisfy low resource constraints. We also insert learnable transformation matrices between the instances to stitch them into variable-sized models to improve the performance of these models.
Exhaustive experiments have been conducted, and the results validate the effectiveness of the proposed Learngene Pool compared with SN-Net. \ No newline at end of file diff --git a/data/2024/aaai/Byzantine-Robust Decentralized Learning via Remove-then-Clip Aggregation b/data/2024/aaai/Byzantine-Robust Decentralized Learning via Remove-then-Clip Aggregation new file mode 100644 index 0000000000..b2fcbc7f49 --- /dev/null +++ b/data/2024/aaai/Byzantine-Robust Decentralized Learning via Remove-then-Clip Aggregation @@ -0,0 +1,4 @@ +We consider decentralized learning over a network of workers with heterogeneous datasets, in the presence of Byzantine workers. +Byzantine workers may transmit arbitrary or malicious values to neighboring workers, leading to degradation in overall performance. The heterogeneous nature of the training data across various workers complicates the identification and mitigation of Byzantine workers. +To address this complex problem, we introduce a resilient decentralized learning approach that combines the gradient descent algorithm with a novel robust aggregator. Specifically, we propose a remove-then-clip aggregator, whereby each benign worker meticulously filters the neighbors' values and subsequently projects the remaining values to a sphere centered at its local value, with an appropriately selected radius. +We prove that our proposed method converges to a neighborhood of a stationary point for non-convex objectives under standard assumptions. Furthermore, empirical evaluations are provided to demonstrate the superior performance of our method in comparison to existing algorithms, under various Byzantine attack models. \ No newline at end of file diff --git a/data/2024/aaai/CAMEL: Capturing Metaphorical Alignment with Context Disentangling for Multimodal Emotion Recognition b/data/2024/aaai/CAMEL: Capturing Metaphorical Alignment with Context Disentangling for Multimodal Emotion Recognition new file mode 100644 index 0000000000..364cf77163 --- /dev/null +++ b/data/2024/aaai/CAMEL: Capturing Metaphorical Alignment with Context Disentangling for Multimodal Emotion Recognition @@ -0,0 +1 @@ +Understanding the emotional polarity of multimodal content with metaphorical characteristics, such as memes, poses a significant challenge in Multimodal Emotion Recognition (MER). Previous MER research has overlooked the phenomenon of metaphorical alignment in multimedia content, which involves non-literal associations between concepts to convey implicit emotional tones. Metaphor-agnostic MER methods may be misinformed by the isolated unimodal emotions, which are distinct from the real emotions blended in multimodal metaphors. Moreover, contextual semantics can further affect the emotions associated with similar metaphors, leading to the challenge of maintaining contextual compatibility. To address the issue of metaphorical alignment in MER, we propose to leverage a conditional generative approach for capturing metaphorical analogies. Our approach formulates schematic prompts and corresponding references based on theoretical foundations, which allows the model to better grasp metaphorical nuances. In order to maintain contextual sensitivity, we incorporate a disentangled contrastive matching mechanism, which undergoes curricular adjustment to regulate its intensity during the learning process.
The automatic and human evaluation experiments on two benchmarks show that our model provides considerable and stable improvements in recognizing multimodal emotion with metaphor attributes. \ No newline at end of file diff --git a/data/2024/aaai/CAR-Transformer: Cross-Attention Reinforcement Transformer for Cross-Lingual Summarization b/data/2024/aaai/CAR-Transformer: Cross-Attention Reinforcement Transformer for Cross-Lingual Summarization new file mode 100644 index 0000000000..59cc843b9a --- /dev/null +++ b/data/2024/aaai/CAR-Transformer: Cross-Attention Reinforcement Transformer for Cross-Lingual Summarization @@ -0,0 +1 @@ +Cross-Lingual Summarization (CLS) involves generating a summary for a given document in another language. Most of the existing approaches adopt multi-task training and knowledge distillation, which increases the training cost and improves the performance of CLS tasks intuitively but unexplainably. In this work, we propose a Cross-Attention Reinforcement (CAR) module and incorporate the module into the transformer backbone to formulate the CAR-Transformer. The CAR module formulates a pseudo summarization policy parameterized by the cross-attention weights reinforced by the ground-truth monolingual summary without introducing extra model parameters. Our approach demonstrates more consistent improvement across CLS tasks compared to traditional multi-task training methods and outperforms the fine-tuned vanilla mBART by 3.67 and the best-performing multi-task training approach by 1.48 in ROUGE-L F1 score on the WikiLingua Korean-to-English CLS task. \ No newline at end of file diff --git a/data/2024/aaai/CARAT: Contrastive Feature Reconstruction and Aggregation for Multi-Modal Multi-Label Emotion Recognition b/data/2024/aaai/CARAT: Contrastive Feature Reconstruction and Aggregation for Multi-Modal Multi-Label Emotion Recognition new file mode 100644 index 0000000000..e6ce31557d --- /dev/null +++ b/data/2024/aaai/CARAT: Contrastive Feature Reconstruction and Aggregation for Multi-Modal Multi-Label Emotion Recognition @@ -0,0 +1 @@ +Multi-modal multi-label emotion recognition (MMER) aims to identify relevant emotions from multiple modalities. The challenge of MMER is how to effectively capture discriminative features for multiple labels from heterogeneous data. Recent studies are mainly devoted to exploring various fusion strategies to integrate multi-modal information into a unified representation for all labels. However, such a learning scheme not only overlooks the specificity of each modality but also fails to capture individual discriminative features for different labels. Moreover, dependencies of labels and modalities cannot be effectively modeled. To address these issues, this paper presents ContrAstive feature Reconstruction and AggregaTion (CARAT) for the MMER task. Specifically, we devise a reconstruction-based fusion mechanism to better model fine-grained modality-to-label dependencies by contrastively learning modal-separated and label-specific features. To further exploit the modality complementarity, we introduce a shuffle-based aggregation strategy to enrich co-occurrence collaboration among labels. Experiments on two benchmark datasets, CMU-MOSEI and M3ED, demonstrate the effectiveness of CARAT over state-of-the-art methods. Code is available at https://github.com/chengzju/CARAT.
\ No newline at end of file diff --git a/data/2024/aaai/CASE: Exploiting Intra-class Compactness and Inter-class Separability of Feature Embeddings for Out-of-Distribution Detection b/data/2024/aaai/CASE: Exploiting Intra-class Compactness and Inter-class Separability of Feature Embeddings for Out-of-Distribution Detection new file mode 100644 index 0000000000..551e0f43ca --- /dev/null +++ b/data/2024/aaai/CASE: Exploiting Intra-class Compactness and Inter-class Separability of Feature Embeddings for Out-of-Distribution Detection @@ -0,0 +1 @@ +Detecting out-of-distribution (OOD) inputs is critical for reliable machine learning, but deep neural networks often make overconfident predictions, even for OOD inputs that deviate from the distribution of training data. Prior methods rely on the widely used softmax cross-entropy (CE) loss, which is adequate for classifying in-distribution (ID) samples but not optimally designed for OOD detection. To address this issue, we propose CASE, a simple and effective OOD detection method that explicitly improves intra-class Compactness And inter-class Separability of feature Embeddings. To enhance the separation between ID and OOD samples, CASE uses a dual-loss framework, which includes a separability loss that maximizes the inter-class Euclidean distance to promote separability among different class centers, along with a compactness loss that minimizes the intra-class Euclidean distance to encourage samples to be close to their class centers. In particular, the class centers are defined as a free optimization parameter of the model and updated by gradient descent, which is simple and further enhances the OOD detection performance. Extensive experiments demonstrate the superiority of CASE, which reduces the average FPR95 by 37.11% and improves the average AUROC by 15.89% compared to the baseline method using a softmax confidence score on the more challenging CIFAR-100 model. \ No newline at end of file diff --git a/data/2024/aaai/CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments b/data/2024/aaai/CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments new file mode 100644 index 0000000000..faea51f845 --- /dev/null +++ b/data/2024/aaai/CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments @@ -0,0 +1 @@ +Audio-visual navigation of an agent towards locating an audio goal is a challenging task, especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpreting free-form, potentially noisy responses from the oracle based on the audio-visual context.
To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language-based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large-scale dataset, AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that use only uni-directional interaction. \ No newline at end of file diff --git a/data/2024/aaai/CCTR: Calibrating Trajectory Prediction for Uncertainty-Aware Motion Planning in Autonomous Driving b/data/2024/aaai/CCTR: Calibrating Trajectory Prediction for Uncertainty-Aware Motion Planning in Autonomous Driving new file mode 100644 index 0000000000..53d64bb930 --- /dev/null +++ b/data/2024/aaai/CCTR: Calibrating Trajectory Prediction for Uncertainty-Aware Motion Planning in Autonomous Driving @@ -0,0 +1 @@ +Autonomous driving systems rely on precise trajectory prediction for safe and efficient motion planning. Despite considerable efforts to enhance prediction accuracy, inherent uncertainties persist due to data noise and incomplete observations. Many strategies entail formalizing prediction outcomes into distributions and utilizing variance to represent uncertainty. However, our experimental investigation reveals that existing trajectory prediction models yield unreliable uncertainty estimates, necessitating additional customized calibration processes. On the other hand, directly applying current calibration techniques to prediction outputs may yield sub-optimal results due to using a universal scaler for all predictions and neglecting informative data cues. In this paper, we propose Customized Calibration Temperature with Regularizer (CCTR), a generic framework that calibrates the output distribution. Specifically, CCTR 1) employs a calibration-based regularizer to align output variance with the discrepancy between prediction and ground truth and 2) generates a tailor-made temperature scaler for each prediction using a post-processing network guided by context and historical information. Extensive evaluation involving multiple prediction and planning methods demonstrates the superiority of CCTR over existing calibration algorithms and uncertainty-aware methods, with significant improvements of 11%-22% in calibration quality and 17%-46% in motion planning. \ No newline at end of file diff --git a/data/2024/aaai/CDPNet: Cross-Modal Dual Phases Network for Point Cloud Completion b/data/2024/aaai/CDPNet: Cross-Modal Dual Phases Network for Point Cloud Completion new file mode 100644 index 0000000000..7361ad734d --- /dev/null +++ b/data/2024/aaai/CDPNet: Cross-Modal Dual Phases Network for Point Cloud Completion @@ -0,0 +1 @@ +Point cloud completion aims at completing shapes from their partial observations. Most existing methods utilize shape prior information for point cloud completion, such as inputting the partial point cloud and obtaining the complete one through an encoder-decoder deep learning structure.
However, the generation process often loses information because the missing areas are not visible. Unlike most existing methods, which directly infer the missing points using shape priors, we address completion as a cross-modality task. We propose a new Cross-modal Dual Phases Network (CDPNet) for shape completion. Our key idea is that the global information of the shape is obtained from the extra single-view image, and the partial point clouds provide the geometric information. The multi-modal features then jointly guide the recovery of the specific structural information. To learn the geometric details of the shape, we use patches to preserve local geometric features. In this way, we can generate shapes with enough geometric details. Experimental results show that our method achieves state-of-the-art performance on point cloud completion. \ No newline at end of file diff --git a/data/2024/aaai/CEDFlow: Latent Contour Enhancement for Dark Optical Flow Estimation b/data/2024/aaai/CEDFlow: Latent Contour Enhancement for Dark Optical Flow Estimation new file mode 100644 index 0000000000..f01bf8ec9b --- /dev/null +++ b/data/2024/aaai/CEDFlow: Latent Contour Enhancement for Dark Optical Flow Estimation @@ -0,0 +1 @@ +Accurately computing optical flow in low-contrast and noisy dark images is challenging, especially when contour information is degraded or difficult to extract. This paper proposes CEDFlow, a latent-space contour enhancement method for estimating optical flow in dark environments. By leveraging spatial frequency feature decomposition, CEDFlow effectively encodes local and global motion features. Importantly, we introduce the 2nd-order Gaussian difference operation to select salient contour features in the latent space precisely. It is specifically designed for large-scale contour components essential in dark optical flow estimation. Experimental results on the FCDN and VBOF datasets demonstrate that CEDFlow outperforms state-of-the-art methods in terms of the EPE index and produces more accurate and robust flow estimation. Our code is available at: https://github.com/xautstuzfy. \ No newline at end of file diff --git a/data/2024/aaai/CEGAR-Based Approach for Solving Combinatorial Optimization Modulo Quantified Linear Arithmetics Problems b/data/2024/aaai/CEGAR-Based Approach for Solving Combinatorial Optimization Modulo Quantified Linear Arithmetics Problems new file mode 100644 index 0000000000..8f3e864321 --- /dev/null +++ b/data/2024/aaai/CEGAR-Based Approach for Solving Combinatorial Optimization Modulo Quantified Linear Arithmetics Problems @@ -0,0 +1 @@ +Bioinformatics has always been a prolific domain for generating complex satisfiability and optimization problems. For instance, the synthesis of multi-scale models of biological networks has recently been associated with the resolution of optimization problems mixing Boolean logic and universally quantified linear constraints (OPT+qLP), which can be benchmarked on real-world models. In this paper, we introduce a Counter-Example-Guided Abstraction Refinement (CEGAR) approach to solve such problems efficiently. Our CEGAR exploits monotone properties inherent to linear optimization in order to generalize counter-examples of Boolean relaxations. We implemented our approach by extending the Answer Set Programming (ASP) solver Clingo with a quantified linear constraints propagator. Our prototype exploits the independence of sub-formulas to further strengthen the generalization of counter-examples.
We evaluate the impact of refinement and partitioning on two sets of OPT+qLP problems inspired by system biology. Additionally, we conducted a comparison with the state-of-the-art ASP solver Clingo[lpx] that handles non-quantified linear constraints, showing the advantage of our CEGAR approach for solving large problems. \ No newline at end of file diff --git a/data/2024/aaai/CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning b/data/2024/aaai/CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning new file mode 100644 index 0000000000..5d085cad77 --- /dev/null +++ b/data/2024/aaai/CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning @@ -0,0 +1 @@ +Neural Radiance Fields have demonstrated impressive performance in novel view synthesis. However, NeRF and most of its variants still rely on traditional complex pipelines to provide extrinsic and intrinsic camera parameters, such as COLMAP. Recent works, like NeRFmm, BARF, and L2G-NeRF, directly treat camera parameters as learnable and estimate them through differential volume rendering. However, these methods work for forward-looking scenes with slight motions and fail to tackle the rotation scenario in practice. To overcome this limitation, we propose a novel camera parameter free neural radiance field (CF-NeRF), which incrementally reconstructs 3D representations and recovers the camera parameters inspired by incremental structure from motion. Given a sequence of images, CF-NeRF estimates camera parameters of images one by one and reconstructs the scene through initialization, implicit localization, and implicit optimization. To evaluate our method, we use a challenging real-world dataset, NeRFBuster, which provides 12 scenes under complex trajectories. Results demonstrate that CF-NeRF is robust to rotation and achieves state-of-the-art results without providing prior information and constraints. \ No newline at end of file diff --git a/data/2024/aaai/CFEVER: A Chinese Fact Extraction and VERification Dataset b/data/2024/aaai/CFEVER: A Chinese Fact Extraction and VERification Dataset new file mode 100644 index 0000000000..86d7b6ea29 --- /dev/null +++ b/data/2024/aaai/CFEVER: A Chinese Fact Extraction and VERification Dataset @@ -0,0 +1 @@ +We present CFEVER, a Chinese dataset designed for Fact Extraction and VERification. CFEVER comprises 30,012 manually created claims based on content in Chinese Wikipedia. Each claim in CFEVER is labeled as “Supports”, “Refutes”, or “Not Enough Info” to depict its degree of factualness. Similar to the FEVER dataset, claims in the “Supports” and “Refutes” categories are also annotated with corresponding evidence sentences sourced from single or multiple pages in Chinese Wikipedia. Our labeled dataset holds a Fleiss’ kappa value of 0.7934 for five-way inter-annotator agreement. In addition, through the experiments with the state-of-the-art approaches developed on the FEVER dataset and a simple baseline for CFEVER, we demonstrate that our dataset is a new rigorous benchmark for factual extraction and verification, which can be further used for developing automated systems to alleviate human fact-checking efforts. CFEVER is available at https://ikmlab.github.io/CFEVER. 
\ No newline at end of file diff --git a/data/2024/aaai/CFR-ICL: Cascade-Forward Refinement with Iterative Click Loss for Interactive Image Segmentation b/data/2024/aaai/CFR-ICL: Cascade-Forward Refinement with Iterative Click Loss for Interactive Image Segmentation new file mode 100644 index 0000000000..2ca3965309 --- /dev/null +++ b/data/2024/aaai/CFR-ICL: Cascade-Forward Refinement with Iterative Click Loss for Interactive Image Segmentation @@ -0,0 +1 @@ +Click-based interactive segmentation aims to extract the object of interest from an image with the guidance of user clicks. Recent work has achieved great overall performance by employing feedback from the output. However, in most state-of-the-art approaches, 1) the inference stage involves inflexible heuristic rules and requires a separate refinement model, and 2) the number of user clicks and model performance cannot be balanced. To address the challenges, we propose a click-based and mask-guided interactive image segmentation framework containing three novel components: Cascade-Forward Refinement (CFR), Iterative Click Loss (ICL), and SUEM image augmentation. The CFR offers a unified inference framework to generate segmentation results in a coarse-to-fine manner. The proposed ICL allows model training to improve segmentation and reduce user interactions simultaneously. The proposed SUEM augmentation is a comprehensive way to create large and diverse training sets for interactive image segmentation. Extensive experiments demonstrate the state-of-the-art performance of the proposed approach on five public datasets. Remarkably, our model reduces the number of clicks required to surpass an IoU of 0.95 by 33.2% and 15.5% relative to the previous state-of-the-art approach on the Berkeley and DAVIS sets, respectively. \ No newline at end of file diff --git a/data/2024/aaai/CGMGM: A Cross-Gaussian Mixture Generative Model for Few-Shot Semantic Segmentation b/data/2024/aaai/CGMGM: A Cross-Gaussian Mixture Generative Model for Few-Shot Semantic Segmentation new file mode 100644 index 0000000000..c6f0b2f2c2 --- /dev/null +++ b/data/2024/aaai/CGMGM: A Cross-Gaussian Mixture Generative Model for Few-Shot Semantic Segmentation @@ -0,0 +1 @@ +Few-shot semantic segmentation (FSS) aims to segment unseen objects in a query image using a few pixel-wise annotated support images, thus expanding the capabilities of semantic segmentation. The main challenge lies in extracting sufficient information from the limited support images to guide the segmentation process. Conventional methods typically address this problem by generating single or multiple prototypes from the support images and calculating their cosine similarity to the query image. However, these methods often fail to capture meaningful information for modeling the de facto joint distribution of pixel and category. Consequently, they result in incomplete segmentation of foreground objects and mis-segmentation of the complex background. To overcome this issue, we propose the Cross Gaussian Mixture Generative Model (CGMGM), a novel Gaussian Mixture Models (GMMs)-based FSS method, which establishes the joint distribution of pixel and category in both the support and query images. Specifically, our method initially matches the feature representations of the query image with those of the support images to generate and refine an initial segmentation mask. It then employs GMMs to accurately model the joint distribution of foreground and background using the support masks and the initial segmentation mask.
Subsequently, a parametric decoder applies Bayes' theorem to the joint distribution to obtain the posterior probability of the pixels in the query image and generate the final segmentation mask. Experimental results on the PASCAL-5i and COCO-20i datasets demonstrate our CGMGM's effectiveness and superior performance compared to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/CGS-Mask: Making Time Series Predictions Intuitive for All b/data/2024/aaai/CGS-Mask: Making Time Series Predictions Intuitive for All new file mode 100644 index 0000000000..654ca7ac9e --- /dev/null +++ b/data/2024/aaai/CGS-Mask: Making Time Series Predictions Intuitive for All @@ -0,0 +1 @@ +Artificial intelligence (AI) has immense potential in time series prediction, but most explainable tools have limited capabilities in providing a systematic understanding of important features over time. These tools typically rely on evaluating a single time point, overlook the time ordering of inputs, and neglect the time-sensitive nature of time series applications. These factors make it difficult for users, particularly those without domain knowledge, to comprehend AI model decisions and obtain meaningful explanations. We propose CGS-Mask, a post-hoc and model-agnostic cellular genetic strip mask-based saliency approach to address these challenges. CGS-Mask uses consecutive time steps as a cohesive entity to evaluate the impact of features on the final prediction, providing binary and sustained feature importance scores over time. Our algorithm optimizes the mask population iteratively to obtain the optimal mask in a reasonable time. We evaluated CGS-Mask on synthetic and real-world datasets, and it outperformed state-of-the-art methods in elucidating the importance of features over time. According to our pilot user study via a questionnaire survey, CGS-Mask is the most effective approach in presenting easily understandable time series prediction results, enabling users to comprehend the decision-making process of AI models with ease. \ No newline at end of file diff --git a/data/2024/aaai/CHICOT: A Developer-Assistance Toolkit for Code Search with High-Level Contextual Information b/data/2024/aaai/CHICOT: A Developer-Assistance Toolkit for Code Search with High-Level Contextual Information new file mode 100644 index 0000000000..b707174741 --- /dev/null +++ b/data/2024/aaai/CHICOT: A Developer-Assistance Toolkit for Code Search with High-Level Contextual Information @@ -0,0 +1,5 @@ +We propose a source code search system named CHICOT (Code search with HIgh level COnText) to assist developers in reusing existing code. +While previous studies have examined code search on the basis of code-level, fine-grained specifications such as functionality, logic, or implementation, CHICOT addresses a unique mission: code search with high-level contextual information, such as the purpose or domain of a developer's project. +It achieves this feature by first extracting the context information from codebases and then considering this context during the search. +It provides a VSCode plugin for daily coding assistance, and the built-in crawler ensures up-to-date code suggestions. +The case study attests to the utility of CHICOT in real-world scenarios.
\ No newline at end of file diff --git a/data/2024/aaai/CHRONOS: A Schema-Based Event Understanding and Prediction System b/data/2024/aaai/CHRONOS: A Schema-Based Event Understanding and Prediction System new file mode 100644 index 0000000000..c64dc7ac98 --- /dev/null +++ b/data/2024/aaai/CHRONOS: A Schema-Based Event Understanding and Prediction System @@ -0,0 +1 @@ +Chronological and Hierarchical Reasoning Over Naturally Occurring Schemas (CHRONOS) is a system that combines language model-based natural language processing with symbolic knowledge representations to analyze and make predictions about newsworthy events. CHRONOS consists of an event-centric information extraction pipeline and a complex event schema instantiation and prediction system. Resulting predictions are detailed with arguments, event types from Wikidata, schema-based justifications, and source document provenance. We evaluate our system by its ability to capture the structure of unseen events described in news articles and make plausible predictions as judged by human annotators. \ No newline at end of file diff --git a/data/2024/aaai/CI-STHPAN: Pre-trained Attention Network for Stock Selection with Channel-Independent Spatio-Temporal Hypergraph b/data/2024/aaai/CI-STHPAN: Pre-trained Attention Network for Stock Selection with Channel-Independent Spatio-Temporal Hypergraph new file mode 100644 index 0000000000..dd0f68fa6b --- /dev/null +++ b/data/2024/aaai/CI-STHPAN: Pre-trained Attention Network for Stock Selection with Channel-Independent Spatio-Temporal Hypergraph @@ -0,0 +1 @@ +Quantitative stock selection is one of the most challenging FinTech tasks due to the non-stationary dynamics and complex market dependencies. Existing studies rely on channel mixing methods, exacerbating the issue of distribution shift in financial time series. Additionally, the complex model structures they build make it difficult to handle very long sequences. Furthermore, most of them are based on predefined stock relationships, making it difficult to capture dynamic and highly volatile stock markets. To address the above issues, in this paper, we propose Channel-Independent based Spatio-Temporal Hypergraph Pre-trained Attention Networks (CI-STHPAN), a two-stage framework for stock selection, involving Transformer and HGAT based stock time series self-supervised pre-training and stock-ranking based downstream task fine-tuning. We calculate the similarity of stock time series of different channels in dynamic intervals based on Dynamic Time Warping (DTW), and further construct a channel-independent stock dynamic hypergraph based on the similarity. Experiments with NASDAQ and NYSE market data over five years show that our framework outperforms SOTA approaches in terms of investment return ratio (IRR) and Sharpe ratio (SR). Additionally, we find that even without introducing graph information, self-supervised learning based on the vanilla Transformer Encoder also surpasses SOTA results. Notable improvements are gained on the NYSE market. This is mainly attributed to the improvement of the fine-tuning approach on Information Coefficient (IC) and Information Ratio based IC (ICIR), indicating that the fine-tuning method enhances the accuracy and stability of the model prediction.
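The DTW-based hypergraph construction described in the CI-STHPAN abstract above can be illustrated with a small sketch: compute pairwise DTW distances between per-channel stock series and group sufficiently similar stocks into hyperedges. The plain quadratic DTW and the fixed threshold below are simplifying assumptions, not the CI-STHPAN implementation.

```python
# Illustrative sketch (assumptions: plain DTW, a fixed similarity threshold) of
# building stock hyperedges from time-series similarity, in the spirit of the
# construction the abstract outlines.
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Classic O(len(x)*len(y)) dynamic-time-warping distance between 1-D series."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def build_hyperedges(series: np.ndarray, threshold: float):
    """series: (num_stocks, time). One hyperedge per stock: itself plus all stocks within the DTW threshold."""
    n = series.shape[0]
    edges = []
    for i in range(n):
        members = [j for j in range(n)
                   if i == j or dtw_distance(series[i], series[j]) <= threshold]
        edges.append(members)
    return edges

# Toy usage on random "price" channels for 5 stocks over 30 days.
rng = np.random.default_rng(0)
prices = rng.normal(size=(5, 30)).cumsum(axis=1)
print(build_hyperedges(prices, threshold=15.0))
```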
\ No newline at end of file diff --git a/data/2024/aaai/CIDR: A Cooperative Integrated Dynamic Refining Method for Minimal Feature Removal Problem b/data/2024/aaai/CIDR: A Cooperative Integrated Dynamic Refining Method for Minimal Feature Removal Problem new file mode 100644 index 0000000000..49579eea14 --- /dev/null +++ b/data/2024/aaai/CIDR: A Cooperative Integrated Dynamic Refining Method for Minimal Feature Removal Problem @@ -0,0 +1,2 @@ +The minimal feature removal problem in the post-hoc explanation area aims to identify the minimal feature set (MFS). Prior studies that use the greedy algorithm to calculate the minimal feature set do not explore feature interactions and rely on a monotonicity assumption that cannot be satisfied in general scenarios. In order to address the above limitations, +we propose a Cooperative Integrated Dynamic Refining method (CIDR) to efficiently discover minimal feature sets. Specifically, we design Cooperative Integrated Gradients (CIG) to detect interactions between features. By incorporating CIG and characteristics of the minimal feature set, we transform the minimal feature removal problem into a knapsack problem. Additionally, we devise an auxiliary Minimal Feature Refinement algorithm to determine the minimal feature set from numerous candidate sets. To the best of our knowledge, our work is the first to address the minimal feature removal problem in the field of natural language processing. Extensive experiments demonstrate that CIDR is capable of tracing representative minimal feature sets with improved interpretability across various models and datasets. \ No newline at end of file diff --git a/data/2024/aaai/CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation b/data/2024/aaai/CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation new file mode 100644 index 0000000000..5260e3dea2 --- /dev/null +++ b/data/2024/aaai/CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation @@ -0,0 +1 @@ +New NLP benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present a meticulously designed evaluation benchmark that leverages the knowledge graph. This evaluation comprises 584 level-1 knowledge points and 1,989 level-2 knowledge points, thereby encompassing a comprehensive spectrum of K12 education domain knowledge. The primary objective is to comprehensively assess the high-level comprehension aptitude and reasoning capabilities of LLMs operating within the Chinese context. Our evaluation incorporates five distinct question types with 39,452 questions. We test current mainstream LLMs in three distinct modes. Firstly, four prompt evaluation modes were employed to assess the fundamental capacity. Additionally, for choice questions, a result-oriented evaluation approach was designed through data augmentation to assess the model's proficiency in advanced knowledge and reasoning. Moreover, a subset with reasoning processes is derived, and the process-oriented testing method is used to test the model's interpretability and higher-order reasoning capacity. We further report models' capabilities on our knowledge points, and anticipate that the evaluation can assist in the assessment of the strengths and deficiencies of LLMs on knowledge points, thus fostering their development within the Chinese context. Our dataset will be publicly available at https://github.com/tal-tech/chinese-k12-evaluation.
\ No newline at end of file diff --git a/data/2024/aaai/CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer b/data/2024/aaai/CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer new file mode 100644 index 0000000000..f816fc0225 --- /dev/null +++ b/data/2024/aaai/CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer @@ -0,0 +1 @@ +Cross-lingual cross-modal retrieval has garnered increasing attention recently, which aims to achieve the alignment between vision and target language (V-T) without using any annotated V-T data pairs. Current methods employ machine translation (MT) to construct pseudo-parallel data pairs, which are then used to learn a multi-lingual and multi-modal embedding space that aligns visual and target-language representations. However, the large heterogeneous gap between vision and text, along with the noise present in target language translations, poses significant challenges in effectively aligning their representations. To address these challenges, we propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and target language using cross-lingual transfer. This approach allows us to fully leverage the merits of multi-lingual pre-trained models (e.g., mBERT) and the benefits of the same modality structure, i.e., smaller gap, to provide reliable and comprehensive semantic correspondence (knowledge) for the cross-modal network. We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX. The results clearly demonstrate the effectiveness of our proposed method and its high potential for large-scale retrieval. \ No newline at end of file diff --git a/data/2024/aaai/CLIM: Contrastive Language-Image Mosaic for Region Representation b/data/2024/aaai/CLIM: Contrastive Language-Image Mosaic for Region Representation new file mode 100644 index 0000000000..c1401b7a95 --- /dev/null +++ b/data/2024/aaai/CLIM: Contrastive Language-Image Mosaic for Region Representation @@ -0,0 +1,2 @@ +Detecting objects accurately from a large or open vocabulary necessitates the vision-language alignment on region representations. However, learning such a region-text alignment by obtaining high-quality box annotations with text labels or descriptions is expensive and infeasible. In contrast, collecting image-text pairs is simpler but lacks precise object location information to associate regions with texts. In this paper, we propose a novel approach called Contrastive Language-Image Mosaic (CLIM), which leverages large-scale image-text pairs effectively for aligning region and text representations. CLIM combines multiple images into a mosaicked image and treats each image as a ‘pseudo region’. The feature of each pseudo region is extracted and trained to be similar to the corresponding text embedding while dissimilar from others by a contrastive loss, enabling the model to learn the region-text alignment without costly box annotations. As a generally +applicable approach, CLIM consistently improves different open-vocabulary object detection methods that use caption supervision. Furthermore, CLIM can effectively enhance the region representation of vision-language models, thus providing stronger backbones for open-vocabulary object detectors. 
Our experimental results demonstrate that CLIM improves different baseline open-vocabulary object detectors by a large margin on both the OV-COCO and OV-LVIS benchmarks. The code is available at https://github.com/wusize/CLIM. \ No newline at end of file diff --git a/data/2024/aaai/CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model b/data/2024/aaai/CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model new file mode 100644 index 0000000000..ec25bd8d02 --- /dev/null +++ b/data/2024/aaai/CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model @@ -0,0 +1 @@ +Gaze estimation methods often experience significant performance degradation when evaluated across different domains, due to the domain gap between the testing and training data. Existing methods try to address this issue using various domain generalization approaches, but with little success because of the limited diversity of gaze datasets in factors such as appearance, wearables, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to leverage the vision-and-language cross-modality approach for the gaze estimation task. Specifically, we extract the gaze-relevant feature by pushing it away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context optimization method for text prompt tuning. Furthermore, we utilize the relationship among gaze samples to refine the distribution of gaze-relevant features, thereby improving the generalization capability of the gaze estimation model. Extensive experiments demonstrate the excellent performance of CLIP-Gaze over existing methods on four cross-domain evaluations. \ No newline at end of file diff --git a/data/2024/aaai/CLIP-Guided Federated Learning on Heterogeneity and Long-Tailed Data b/data/2024/aaai/CLIP-Guided Federated Learning on Heterogeneity and Long-Tailed Data new file mode 100644 index 0000000000..d2ef05f9a5 --- /dev/null +++ b/data/2024/aaai/CLIP-Guided Federated Learning on Heterogeneity and Long-Tailed Data @@ -0,0 +1 @@ +Federated learning (FL) provides a decentralized machine learning paradigm where a server collaborates with a group of clients to learn a global model without accessing the clients' data. User heterogeneity is a significant challenge for FL, which together with class-distribution imbalance further increases the difficulty of FL. Great progress has been made in large vision-language models, such as Contrastive Language-Image Pre-training (CLIP), which paves a new way for image classification and object recognition. Inspired by the success of CLIP on few-shot and zero-shot learning, we use CLIP to optimize the federated learning between server and client models under its vision-language supervision. This is promising for mitigating user heterogeneity and class-distribution imbalance thanks to the powerful cross-modality representation and rich open-vocabulary prior knowledge. In this paper, we propose the CLIP-guided FL (CLIP2FL) method on heterogeneous and long-tailed data. In CLIP2FL, the knowledge of the off-the-shelf CLIP model is transferred to the client-server models, and a bridge is built between the client and server.
Specifically, for client-side learning, knowledge distillation is conducted between client models and CLIP to improve client-side feature representation. For server-side learning, in order to mitigate the heterogeneity and class-distribution imbalance, we generate federated features to retrain the server model. Prototype contrastive learning, supervised by the text encoder of CLIP, is introduced to generate federated features depending on the client-side gradients, and they are used to retrain a balanced server classifier. Extensive experimental results on several benchmarks demonstrate that CLIP2FL achieves impressive performance and effectively deals with data heterogeneity and long-tailed distributions. The code is available at https://github.com/shijiangming1/CLIP2FL. \ No newline at end of file diff --git a/data/2024/aaai/CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare b/data/2024/aaai/CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare new file mode 100644 index 0000000000..9aaa68f2be --- /dev/null +++ b/data/2024/aaai/CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare @@ -0,0 +1,2 @@ +In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of medical conditions, we introduce the Multimodal Medical Question Summarization (MMQS) Dataset. This dataset, a major contribution of our work, pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs. We also propose a framework that harnesses the power of Contrastive Language-Image Pretraining (CLIP), a multimodal foundation model, and various general-purpose Large Language Models (LLMs), comprising four main modules: the medical disorder identification module, the relevant context generation module, the context filtration module for distilling relevant medical concepts and knowledge, and finally, a general-purpose LLM to generate visually aware medical question summaries. Leveraging our MMQS dataset, we showcase how visual cues from images enhance the generation of medically nuanced summaries. This multimodal approach not only enhances the decision-making process in healthcare but also fosters a more nuanced understanding of patient queries, laying the groundwork for future research in personalized and responsive medical care. +Disclaimer: The article features graphic medical imagery, a result of the subject's inherent requirements.
\ No newline at end of file diff --git a/data/2024/aaai/CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection b/data/2024/aaai/CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection new file mode 100644 index 0000000000..9f81cf0277 --- /dev/null +++ b/data/2024/aaai/CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection @@ -0,0 +1 @@ +Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often do not generalize well to target domains outside the source (or training) data distribution. To reduce such domain gaps and thus to make 3DOD models more generalizable, we introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which (i) leverages visual semantic cues from an image modality (i.e., camera images) as an effective semantic bridge to close the domain gap in the cross-modal Bird's Eye View (BEV) representations. Further, (ii) we also introduce a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features, which disrupt the discrimination of whether a feature instance comes from a source or an unseen target domain. Overall, our CMDA framework guides the 3DOD model to generate highly informative and domain-adaptive features for novel data distributions. In our extensive experiments with large-scale benchmarks, such as nuScenes, Waymo, and KITTI, the components described above provide significant performance gains for UDA tasks, achieving state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/CMG-Net: Robust Normal Estimation for Point Clouds via Chamfer Normal Distance and Multi-Scale Geometry b/data/2024/aaai/CMG-Net: Robust Normal Estimation for Point Clouds via Chamfer Normal Distance and Multi-Scale Geometry new file mode 100644 index 0000000000..9c1260ba43 --- /dev/null +++ b/data/2024/aaai/CMG-Net: Robust Normal Estimation for Point Clouds via Chamfer Normal Distance and Multi-Scale Geometry @@ -0,0 +1 @@ +This work presents an accurate and robust method for estimating normals from point clouds. In contrast to predecessor approaches that minimize the deviations between the annotated and the predicted normals directly, leading to direction inconsistency, we first propose a new metric termed Chamfer Normal Distance to address this issue. This not only mitigates the challenge but also facilitates network training and substantially enhances the network robustness against noise. Subsequently, we devise an innovative architecture that encompasses Multi-scale Local Feature Aggregation and Hierarchical Geometric Information Fusion. This design empowers the network to capture intricate geometric details more effectively and alleviate the ambiguity in scale selection. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets, particularly in scenarios contaminated by noise. Our implementation is available at https://github.com/YingruiWoo/CMG-Net_Pytorch.
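A rough sketch of how a Chamfer-style normal distance could be computed, under the assumption that each predicted normal is compared against the ground-truth normal of its nearest neighbour in a clean reference cloud rather than the annotation at the noisy point itself; CMG-Net's exact metric may differ in detail.

```python
# Hedged sketch: Chamfer-style normal distance between predicted normals on a
# (possibly noisy) point cloud and ground-truth normals on a clean reference cloud.
import numpy as np

def chamfer_normal_distance(noisy_pts, pred_normals, clean_pts, gt_normals):
    """For each noisy point, find its nearest clean point and measure the angular
    error between the predicted normal and that point's ground-truth normal.
    Normals are assumed unit-length; sign ambiguity is handled with |cos|."""
    # Pairwise squared distances (N_noisy x N_clean); fine for small clouds.
    d2 = ((noisy_pts[:, None, :] - clean_pts[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)                      # index of the nearest clean point
    cos = np.abs((pred_normals * gt_normals[nn]).sum(-1)).clip(0.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()    # mean angular error in degrees

# Tiny usage example with random data.
rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 3))
noisy = clean + 0.01 * rng.normal(size=clean.shape)
normals = rng.normal(size=(100, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
print(chamfer_normal_distance(noisy, normals, clean, normals))
```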
\ No newline at end of file diff --git a/data/2024/aaai/COMBAT: Alternated Training for Effective Clean-Label Backdoor Attacks b/data/2024/aaai/COMBAT: Alternated Training for Effective Clean-Label Backdoor Attacks new file mode 100644 index 0000000000..52193b4180 --- /dev/null +++ b/data/2024/aaai/COMBAT: Alternated Training for Effective Clean-Label Backdoor Attacks @@ -0,0 +1 @@ +Backdoor attacks pose a critical concern to the practice of using third-party data for AI development. The data can be poisoned to make a trained model misbehave when a predefined trigger pattern appears, granting the attackers illegal benefits. While most proposed backdoor attacks are dirty-label, clean-label attacks are more desirable by keeping data labels unchanged to dodge human inspection. However, designing a working clean-label attack is a challenging task, and existing clean-label attacks show underwhelming performance. In this paper, we propose a novel mechanism to develop clean-label attacks with outstanding attack performance. The key component is a trigger pattern generator, which is trained together with a surrogate model in an alternating manner. Our proposed mechanism is flexible and customizable, allowing different backdoor trigger types and behaviors for either single or multiple target labels. Our backdoor attacks can reach near-perfect attack success rates and bypass all state-of-the-art backdoor defenses, as illustrated via comprehensive experiments on standard benchmark datasets. Our code is available at https://github.com/VinAIResearch/COMBAT. \ No newline at end of file diff --git a/data/2024/aaai/COMBHelper: A Neural Approach to Reduce Search Space for Graph Combinatorial Problems b/data/2024/aaai/COMBHelper: A Neural Approach to Reduce Search Space for Graph Combinatorial Problems new file mode 100644 index 0000000000..b1995e6cd5 --- /dev/null +++ b/data/2024/aaai/COMBHelper: A Neural Approach to Reduce Search Space for Graph Combinatorial Problems @@ -0,0 +1 @@ +Combinatorial Optimization (CO) problems over graphs appear routinely in many applications such as in optimizing traffic, viral marketing in social networks, and matching for job allocation. Due to their combinatorial nature, these problems are often NP-hard. Existing approximation algorithms and heuristics rely on the search space to find the solutions and become time-consuming when this space is large. In this paper, we design a neural method called COMBHelper to reduce this space and thus improve the efficiency of the traditional CO algorithms based on node selection. Specifically, it employs a Graph Neural Network (GNN) to identify promising nodes for the solution set. This pruned search space is then fed to the traditional CO algorithms. COMBHelper also uses a Knowledge Distillation (KD) module and a problem-specific boosting module to bring further efficiency and efficacy. Our extensive experiments show that the traditional CO algorithms with COMBHelper are at least 2 times faster than their original versions. \ No newline at end of file diff --git a/data/2024/aaai/COMMA: Co-articulated Multi-Modal Learning b/data/2024/aaai/COMMA: Co-articulated Multi-Modal Learning new file mode 100644 index 0000000000..be21c9011d --- /dev/null +++ b/data/2024/aaai/COMMA: Co-articulated Multi-Modal Learning @@ -0,0 +1 @@ +Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability over a series of downstream tasks. 
However, they are sensitive to the variation of input text prompts and require careful selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to dynamically learn the prompts as the textual inputs to avoid laborious hand-crafted prompt engineering in the fine-tuning process. We notice that these methods are suboptimal in two aspects. First, the prompts of the vision and language branches in these methods are usually separated or uni-directionally correlated. Thus, the prompts of both branches are not fully correlated and may not provide enough guidance to align the representations of both branches. Second, we observe that most previous methods achieve better performance on seen classes but cause performance degradation on unseen classes compared to CLIP. This is because the essential generic knowledge learned in the pretraining stage is partly forgotten in the fine-tuning process. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to handle the above limitations. Specifically, our method generates the prompts of each branch by considering prompts from both branches, enhancing the representation alignment between them. Besides, to alleviate forgetting of the essential generic knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of hand-crafted prompts in the pre-trained CLIP in the late transformer layers. We evaluate our method across three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Experimental results demonstrate the superiority of our method by exhibiting a favorable performance boost on all tasks with high efficiency. Code is available at https://github.com/hulianyuyy/COMMA. \ No newline at end of file diff --git a/data/2024/aaai/CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework b/data/2024/aaai/CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework new file mode 100644 index 0000000000..d6839e5e0c --- /dev/null +++ b/data/2024/aaai/CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework @@ -0,0 +1 @@ +Multilingual code retrieval aims to find code snippets relevant to a user's query from a multilingual codebase, which plays a crucial role in software development and expands application scenarios compared to classical monolingual code retrieval. Despite the performance improvements achieved by previous studies, two crucial problems are overlooked in the multilingual scenario. First, certain programming languages face data scarcity in specific domains, resulting in limited representation capabilities within those domains. Second, different programming languages can be used interchangeably within the same domain, making it challenging for multilingual models to accurately identify the intended programming language of a user's query. To address these issues, we propose the CommONalities and SpecIalties Driven Multilingual CodE Retrieval Framework (CONSIDER), which includes two modules. The first module enhances the representation of various programming languages by modeling pairwise and global commonalities among them. The second module introduces a novel contrastive learning negative sampling algorithm that leverages language confusion to automatically extract specific language features.
Through our experiments, we confirm the significant benefits of our model in real-world multilingual code retrieval scenarios in various aspects. Furthermore, an evaluation demonstrates the effectiveness of our proposed CONSIDER framework in monolingual scenarios as well. Our source code is available at https://github.com/smsquirrel/consider. \ No newline at end of file diff --git a/data/2024/aaai/CPN: Complementary Proposal Network for Unconstrained Text Detection b/data/2024/aaai/CPN: Complementary Proposal Network for Unconstrained Text Detection new file mode 100644 index 0000000000..86c22b653a --- /dev/null +++ b/data/2024/aaai/CPN: Complementary Proposal Network for Unconstrained Text Detection @@ -0,0 +1 @@ +Existing methods for scene text detection can be divided into two paradigms: segmentation-based and anchor-based. While segmentation-based methods are well-suited for irregular shapes, they struggle with compact or overlapping layouts. Conversely, anchor-based approaches excel at complex layouts but struggle with irregular shapes. To strengthen their merits and overcome their respective demerits, we propose a Complementary Proposal Network (CPN) that seamlessly integrates semantic and geometric information in parallel for superior performance. The CPN comprises two efficient networks for proposal generation: the Deformable Morphology Semantic Network, which generates semantic proposals employing an innovative deformable morphological operator, and the Balanced Region Proposal Network, which produces geometric proposals with pre-defined anchors. To further enhance the complementarity, we introduce an Interleaved Feature Attention module that enables semantic and geometric features to interact deeply before proposal generation. By leveraging both complementary proposals and features, CPN outperforms state-of-the-art approaches by significant margins under comparable computation cost. Specifically, our approach achieves improvements of 3.6%, 1.3%, and 1.0% on the challenging benchmarks ICDAR19-ArT, IC15, and MSRA-TD500, respectively. Code for our method will be released. \ No newline at end of file diff --git a/data/2024/aaai/CR-SAM: Curvature Regularized Sharpness-Aware Minimization b/data/2024/aaai/CR-SAM: Curvature Regularized Sharpness-Aware Minimization new file mode 100644 index 0000000000..9ef7efa138 --- /dev/null +++ b/data/2024/aaai/CR-SAM: Curvature Regularized Sharpness-Aware Minimization @@ -0,0 +1 @@ +The capacity to generalize to future unseen data stands as one of the most crucial attributes of deep neural networks. Sharpness-Aware Minimization (SAM) aims to enhance generalizability by minimizing the worst-case loss using one-step gradient ascent as an approximation. However, as training progresses, the non-linearity of the loss landscape increases, rendering one-step gradient ascent less effective. On the other hand, multi-step gradient ascent incurs higher training cost. In this paper, we introduce a normalized Hessian trace to accurately measure the curvature of the loss landscape on both training and test sets. In particular, to counter excessive non-linearity of the loss landscape, we propose Curvature Regularized SAM (CR-SAM), integrating the normalized Hessian trace as a SAM regularizer. Additionally, we present an efficient way to compute the trace via finite differences with parallelism. Our theoretical analysis based on PAC-Bayes bounds establishes the regularizer's efficacy in reducing generalization error.
Empirical evaluation on CIFAR and ImageNet datasets shows that CR-SAM consistently enhances classification performance for ResNet and Vision Transformer (ViT) models across various datasets. Our code is available at https://github.com/TrustAIoT/CR-SAM. \ No newline at end of file diff --git a/data/2024/aaai/CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers b/data/2024/aaai/CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers new file mode 100644 index 0000000000..9ec308022b --- /dev/null +++ b/data/2024/aaai/CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers @@ -0,0 +1 @@ +Point cloud completion is an indispensable task for recovering complete point clouds due to incompleteness caused by occlusion, limited sensor resolution, etc. The family of coarse-to-fine generation architectures has recently exhibited great success in point cloud completion and gradually became mainstream. In this work, we unveil one of the key ingredients behind these methods: meticulously devised feature extraction operations with explicit cross-resolution aggregation. We present Cross-Resolution Transformer that efficiently performs cross-resolution aggregation with local attention mechanisms. With the help of our recursive designs, the proposed operation can capture more scales of features than common aggregation operations, which is beneficial for capturing fine geometric characteristics. While prior methodologies have ventured into various manifestations of inter-level cross-resolution aggregation, the effectiveness of intra-level one and their combination has not been analyzed. With unified designs, Cross-Resolution Transformer can perform intra- or inter-level cross-resolution aggregation by switching inputs. We integrate two forms of Cross-Resolution Transformers into one up-sampling block for point generation, and following the coarse-to-fine manner, we construct CRA-PCN to incrementally predict complete shapes with stacked up-sampling blocks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin on several widely used benchmarks. Codes are available at https://github.com/EasyRy/CRA-PCN. \ No newline at end of file diff --git a/data/2024/aaai/CREAD: A Classification-Restoration Framework with Error Adaptive Discretization for Watch Time Prediction in Video Recommender Systems b/data/2024/aaai/CREAD: A Classification-Restoration Framework with Error Adaptive Discretization for Watch Time Prediction in Video Recommender Systems new file mode 100644 index 0000000000..62c41c3b10 --- /dev/null +++ b/data/2024/aaai/CREAD: A Classification-Restoration Framework with Error Adaptive Discretization for Watch Time Prediction in Video Recommender Systems @@ -0,0 +1 @@ +The watch time is a significant indicator of user satisfaction in video recommender systems. However, the prediction of watch time as a target variable is often hindered by its highly imbalanced distribution with a scarcity of observations for larger target values and over-populated samples for small values. State-of-the-art watch time prediction models discretize the continuous watch time into a set of buckets in order to consider the distribution of watch time. 
However, it remains largely uninvestigated how these discrete buckets should be created from the continuous watch time distribution, and existing discretization approaches suffer from either a large learning error or a large restoration error. To address this challenge, we propose a Classification-Restoration framework with Error-Adaptive-Discretization (CREAD) to accurately predict the watch time. The proposed framework contains a discretization module, a classification module, and a restoration module. It predicts the watch time through multiple classification problems. The discretization process is a key contribution of the CREAD framework. We theoretically analyze the impacts of the discretization on the learning error and the restoration error, and then propose the error-adaptive discretization (EAD) technique to better balance the two errors, which achieves better performance over traditional discretization approaches. We conduct detailed offline evaluations on a public dataset and an industrial dataset, both showing performance gains through the proposed approach. Moreover, we have fully launched our framework on an online video platform, where A/B testing showed a significant 0.29% increase in users' video watch time. These results highlight the effectiveness of the CREAD framework in watch time prediction in video recommender systems. \ No newline at end of file diff --git a/data/2024/aaai/CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen b/data/2024/aaai/CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen new file mode 100644 index 0000000000..b884ab6543 --- /dev/null +++ b/data/2024/aaai/CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen @@ -0,0 +1 @@ +Addressing Out-Of-Distribution (OOD) Segmentation and Zero-Shot Semantic Segmentation (ZS3) is challenging, as it necessitates segmenting unseen classes. Existing strategies adapt the class-agnostic Mask2Former (CA-M2F) tailored to specific tasks. However, these methods cater to singular tasks, demand training from scratch, and we demonstrate certain deficiencies in CA-M2F, which affect performance. We propose Class-Agnostic Structure-Constrained Learning (CSL), a plug-in framework that can integrate with existing methods, thereby embedding structural constraints and achieving performance gains on tasks that include the unseen, specifically OOD, ZS3, and domain adaptation (DA) tasks. There are two schemes for CSL to integrate with existing methods: (1) by distilling knowledge from a base teacher network, enforcing constraints across training and inference phases, or (2) by leveraging established models to obtain per-pixel distributions without retraining, appending constraints during the inference phase. Our soft assignment and mask split methodologies enhance OOD object segmentation. Empirical evaluations demonstrate CSL's prowess in boosting the performance of existing algorithms spanning OOD segmentation, ZS3, and DA segmentation, consistently transcending the state of the art across all three tasks.
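To make the classification-restoration idea in the CREAD entry above concrete, here is a generic sketch (our own illustration under stated assumptions, not the paper's code): watch time is discretized into buckets, a model outputs the probability of exceeding each bucket boundary, and the expectation is restored as a probability-weighted sum of bucket widths. The bucket boundaries and the input probabilities are placeholders; CREAD's error-adaptive discretization chooses the boundaries differently.

```python
# Illustrative classification-restoration sketch (not CREAD's exact formulation).
import numpy as np

boundaries = np.array([0.0, 5.0, 15.0, 30.0, 60.0, 120.0])  # placeholder buckets (seconds)
widths = np.diff(boundaries)                                  # width of each bucket

def restore_watch_time(exceed_probs):
    """exceed_probs[k] = predicted P(watch_time > boundaries[k]), one binary
    classifier per bucket; E[T] = integral of P(T > t) dt is approximated by a
    bucket-wise sum of widths weighted by the exceedance probabilities."""
    exceed_probs = np.clip(exceed_probs, 0.0, 1.0)
    return float((widths * exceed_probs).sum())

# Example: a user very likely to pass the early buckets, unlikely to finish.
print(restore_watch_time(np.array([0.95, 0.9, 0.7, 0.3, 0.05])))
```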
\ No newline at end of file diff --git a/data/2024/aaai/CTO-SLAM: Contour Tracking for Object-Level Robust 4D SLAM b/data/2024/aaai/CTO-SLAM: Contour Tracking for Object-Level Robust 4D SLAM new file mode 100644 index 0000000000..251125bdd9 --- /dev/null +++ b/data/2024/aaai/CTO-SLAM: Contour Tracking for Object-Level Robust 4D SLAM @@ -0,0 +1 @@ +The demand for 4D (3D+time) SLAM systems is increasingly urgent, especially for decision-making and scene understanding. However, most of the existing simultaneous localization and mapping (SLAM) systems primarily assume static environments. They fail to represent dynamic scenarios due to the challenge of establishing robust long-term spatiotemporal associations in dynamic object tracking. We address this limitation and propose CTO-SLAM, a monocular and RGB-D object-level 4D SLAM system to track moving objects and estimate their motion simultaneously. In this paper, we propose contour tracking, which introduces contour features to enhance the keypoint representation of dynamic objects and is coupled with pixel tracking to achieve long-term robust object tracking. Based on contour tracking, we propose a novel sampling-based object pose initialization algorithm and a subsequent adapted bundle adjustment (BA) optimization algorithm to estimate dynamic object poses with high accuracy. The CTO-SLAM system is verified on both the KITTI and VKITTI datasets. The experimental results demonstrate that our system effectively addresses cumulative errors in long-term spatiotemporal association and hence obtains substantial improvements over the state-of-the-art systems. The source code is available at https://github.com/realXiaohan/CTO-SLAM. \ No newline at end of file diff --git a/data/2024/aaai/CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning b/data/2024/aaai/CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning new file mode 100644 index 0000000000..e7ee7ae118 --- /dev/null +++ b/data/2024/aaai/CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning @@ -0,0 +1 @@ +Offline reinforcement learning (RL) aims to learn an effective policy from a pre-collected dataset. Most existing works focus on developing sophisticated learning algorithms, with less emphasis on improving the data collection process. Moreover, it is even more challenging to extend beyond the single-task setting and collect a task-agnostic dataset that allows an agent to perform multiple downstream tasks. In this paper, we propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to expand the feature space using adaptive temporal distances for task-agnostic data collection and ultimately improve learning efficiency and capabilities for multi-task offline RL. To achieve this, CUDC estimates the probability of the k-step future states being reachable from the current states, and adapts how many steps into the future the dynamics model should predict. With this adaptive reachability mechanism in place, the feature representation can be diversified, and the agent can navigate itself to collect higher-quality data with curiosity. Empirically, CUDC surpasses existing unsupervised methods in efficiency and learning performance in various downstream offline RL tasks of the DeepMind Control suite.
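One plausible reading of the adaptive temporal distance in the CUDC entry above, written as a toy sketch: when the k-step future is still easy to reach (high estimated reachability), the horizon k is increased to keep the prediction task challenging, and decreased otherwise. The thresholds and the reachability estimator are our assumptions, not the paper's mechanism.

```python
# Toy sketch of an adaptive temporal-distance rule (our assumption of the mechanism).
def adapt_horizon(k, reachability, k_min=1, k_max=30, hi=0.8, lo=0.3):
    """reachability: estimated probability that the k-step future state is
    reachable from the current state (e.g., from a learned classifier).
    Increase k when prediction is too easy, decrease it when too hard."""
    if reachability > hi:
        k = min(k + 1, k_max)
    elif reachability < lo:
        k = max(k - 1, k_min)
    return k

k = 5
for r in [0.9, 0.85, 0.2, 0.5]:
    k = adapt_horizon(k, r)
    print(k)   # 6, 7, 6, 6
```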
\ No newline at end of file diff --git a/data/2024/aaai/CUTS+: High-Dimensional Causal Discovery from Irregular Time-Series b/data/2024/aaai/CUTS+: High-Dimensional Causal Discovery from Irregular Time-Series new file mode 100644 index 0000000000..e137b1826e --- /dev/null +++ b/data/2024/aaai/CUTS+: High-Dimensional Causal Discovery from Irregular Time-Series @@ -0,0 +1 @@ +Causal discovery in time-series is a fundamental problem in the machine learning community, enabling causal reasoning and decision-making in complex scenarios. Recently, researchers have successfully discovered causality by combining neural networks with Granger causality, but performance degrades significantly on high-dimensional data because of highly redundant network designs and huge causal graphs. Moreover, missing entries in the observations further hamper causal structure learning. To overcome these limitations, we propose CUTS+, which builds on the Granger-causality-based causal discovery method CUTS and improves scalability by introducing a technique called Coarse-to-fine-discovery (C2FD) and leveraging a message-passing-based graph neural network (MPGNN). On simulated, quasi-real, and real datasets, we show that CUTS+ largely improves causal discovery performance over previous methods on high-dimensional data with different types of irregular sampling. \ No newline at end of file diff --git a/data/2024/aaai/CaMIL: Causal Multiple Instance Learning for Whole Slide Image Classification b/data/2024/aaai/CaMIL: Causal Multiple Instance Learning for Whole Slide Image Classification new file mode 100644 index 0000000000..2e8b6b6131 --- /dev/null +++ b/data/2024/aaai/CaMIL: Causal Multiple Instance Learning for Whole Slide Image Classification @@ -0,0 +1 @@ +Whole slide image (WSI) classification is a crucial component in automated pathology analysis. Due to the inherent challenges of high-resolution WSIs and the absence of patch-level labels, most of the proposed methods follow the multiple instance learning (MIL) formulation. While MIL has been equipped with excellent instance feature extractors and aggregators, it is prone to learning spurious associations that undermine the performance of the model. For example, relying solely on color features may lead to erroneous diagnoses due to spurious associations between the disease and the color of patches. To address this issue, we develop a causal MIL framework for WSI classification, effectively distinguishing between causal and spurious associations. Specifically, we use the expectation of the intervention P(Y | do(X)) for bag prediction rather than the traditional likelihood P(Y | X). By applying the front-door adjustment, the spurious association is effectively blocked, where the intervened mediator is aggregated from patch-level features. We evaluate our proposed method on two publicly available WSI datasets, Camelyon16 and TCGA-NSCLC. Our causal MIL framework shows outstanding performance and is plug-and-play, seamlessly integrating with various feature extractors and aggregators.
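For reference, the front-door adjustment that the CaMIL entry above relies on can be written in its standard textbook form (a general causal-inference identity for a mediator M between a bag X and label Y; how CaMIL parameterises each term is described in the paper itself):

```latex
P\bigl(Y \mid \mathrm{do}(X=x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P\bigl(Y \mid x', m\bigr)\, P(x')
```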
\ No newline at end of file diff --git a/data/2024/aaai/Cached Transformers: Improving Transformers with Differentiable Memory Cachde b/data/2024/aaai/Cached Transformers: Improving Transformers with Differentiable Memory Cachde new file mode 100644 index 0000000000..27a656146a --- /dev/null +++ b/data/2024/aaai/Cached Transformers: Improving Transformers with Differentiable Memory Cachde @@ -0,0 +1 @@ +This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOPs, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and demonstrates applicability to a broader range of situations. \ No newline at end of file diff --git a/data/2024/aaai/CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models b/data/2024/aaai/CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models new file mode 100644 index 0000000000..8cb5612400 --- /dev/null +++ b/data/2024/aaai/CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models @@ -0,0 +1 @@ +Camouflaged Object Detection (COD) is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. Existing COD methods struggle with nuanced object boundaries and overconfident incorrect predictions. In response, we propose a new paradigm that treats COD as a conditional mask-generation task leveraging diffusion models. Our method, dubbed CamoDiffusion, employs the denoising process to progressively refine predictions while incorporating image conditions. Due to the stochastic sampling process of diffusion, our model is capable of sampling multiple possible predictions, avoiding the problem of overconfident point estimation. Moreover, we develop specialized network architecture, training, and sampling strategies to enhance the model's expressive power and refinement capabilities, and to suppress overconfident mis-segmentations, thus aptly tailoring the diffusion model to the demands of COD. Extensive experiments on three COD datasets attest to the superior performance of our model compared to existing state-of-the-art methods, particularly on the most challenging COD10K dataset, where our approach achieves 0.019 in terms of MAE. Codes and models are available at https://github.com/Rapisurazurite/CamoDiffusion. \ No newline at end of file diff --git a/data/2024/aaai/Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation b/data/2024/aaai/Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation new file mode 100644 index 0000000000..91db7e7462 --- /dev/null +++ b/data/2024/aaai/Can LLM Replace Stack Overflow?
A Study on Robustness and Reliability of Large Language Model Code Generation @@ -0,0 +1 @@ +Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and generate programming code. It has been a common practice for software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of the code generated by LLMs have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. For example, the misuse of APIs in the generated code could lead to severe problems, such as resource leaks, program crashes, etc. Existing code evaluation benchmarks and datasets focus on crafting small tasks such as programming questions in coding interviews, which, however, deviates from the questions developers actually ask LLMs for real-world coding help. To fill the missing piece, in this work, we propose RobustAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from Stack Overflow on 18 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate them on current popular LLMs. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code is introduced into real-world software. \ No newline at end of file diff --git a/data/2024/aaai/Can LLMs Fix Issues with Reasoning Models? Towards More Likely Models for AI Planning b/data/2024/aaai/Can LLMs Fix Issues with Reasoning Models? Towards More Likely Models for AI Planning new file mode 100644 index 0000000000..37d4989b48 --- /dev/null +++ b/data/2024/aaai/Can LLMs Fix Issues with Reasoning Models? Towards More Likely Models for AI Planning @@ -0,0 +1 @@ +This is the first work to look at the application of large language models (LLMs) for the purpose of model space edits in automated planning tasks. To set the stage for this union, we explore two different flavors of model space problems that have been studied in the AI planning literature and explore the effect of an LLM on those tasks. We empirically demonstrate how the performance of an LLM contrasts with combinatorial search (CS) – an approach that has been traditionally used to solve model space tasks in planning, both with the LLM in the role of a standalone model space reasoner as well as in the role of a statistical signal in concert with the CS approach as part of a two-stage process. Our experiments show promising results suggesting further forays of LLMs into the exciting world of model space reasoning for planning tasks in the future. \ No newline at end of file diff --git a/data/2024/aaai/Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis b/data/2024/aaai/Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis new file mode 100644 index 0000000000..cd86f5d1f8 --- /dev/null +++ b/data/2024/aaai/Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis @@ -0,0 +1 @@ +Game theory, as an analytical tool, is frequently utilized to analyze human behavior in social science research.
With the high alignment between the behavior of Large Language Models (LLMs) and humans, a promising research direction is to employ LLMs as substitutes for humans in game experiments, enabling social science research. However, despite numerous empirical studies combining LLMs and game theory, the capability boundaries of LLMs in game theory remain unclear. In this research, we endeavor to systematically analyze LLMs in the context of game theory. Specifically, rationality, as the fundamental principle of game theory, serves as the metric for evaluating players' behavior --- building a clear desire, refining belief about uncertainty, and taking optimal actions. Accordingly, we select three classical games (dictator game, Rock-Paper-Scissors, and ring-network game) to analyze to what extent LLMs can achieve rationality in these three aspects. The experimental results indicate that even the current state-of-the-art LLM (GPT-4) exhibits substantial disparities compared to humans in game theory. For instance, LLMs struggle to build desires based on uncommon preferences, fail to refine belief from many simple patterns, and may overlook or modify refined belief when taking actions. Therefore, we argue that introducing LLMs into game experiments in the social sciences should be approached with greater caution. \ No newline at end of file diff --git a/data/2024/aaai/Can Large Language Models Understand Real-World Complex Instructions? b/data/2024/aaai/Can Large Language Models Understand Real-World Complex Instructions? new file mode 100644 index 0000000000..c9eaa0f56f --- /dev/null +++ b/data/2024/aaai/Can Large Language Models Understand Real-World Complex Instructions? @@ -0,0 +1 @@ +Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that require multiple tasks and constraints, or complex input that contains long context, noise, heterogeneous information and multi-turn format. Due to these features, LLMs often ignore semantic constraints from task descriptions, generate incorrect formats, violate length or sample count constraints, and are unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions, as they are close-ended and simple. To bridge this gap, we propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically. We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four criteria and develop corresponding metrics, as current ones are inadequate, biased, or too strict and coarse-grained. We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. \ No newline at end of file diff --git a/data/2024/aaai/Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It's Complicated b/data/2024/aaai/Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It's Complicated new file mode 100644 index 0000000000..2709df8c90 --- /dev/null +++ b/data/2024/aaai/Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning?
It's Complicated @@ -0,0 +1 @@ +Preference-based Reinforcement Learning (PbRL) enables non-experts to train Reinforcement Learning models using preference feedback. However, the effort required to collect preference labels from real humans means that PbRL research primarily relies on synthetic labellers. We validate the most common synthetic labelling strategy by comparing against labels collected from a crowd of humans on three DeepMind Control (DMC) suite tasks: stand, walk, and run. We find that: (1) the synthetic labels are a good proxy for real humans under some circumstances, (2) strong preference label agreement between human and synthetic labels is not necessary for similar policy performance, (3) policy performance is higher at the start of training with human feedback and higher at the end of training with synthetic feedback, and (4) training on only examples with high levels of inter-annotator agreement does not meaningfully improve policy performance. Our results justify the use of synthetic labellers to develop and ablate PbRL methods, and provide insight into how human labelling changes over the course of policy training. \ No newline at end of file diff --git a/data/2024/aaai/Carbon Footprint Reduction for Sustainable Data Centers in Real-Time b/data/2024/aaai/Carbon Footprint Reduction for Sustainable Data Centers in Real-Time new file mode 100644 index 0000000000..285e7b198c --- /dev/null +++ b/data/2024/aaai/Carbon Footprint Reduction for Sustainable Data Centers in Real-Time @@ -0,0 +1 @@ +As machine learning workloads are significantly increasing energy consumption, sustainable data centers with low carbon emissions are becoming a top priority for governments and corporations worldwide. This requires a paradigm shift in optimizing power consumption in cooling and IT loads, shifting flexible loads based on the availability of renewable energy in the power grid, and leveraging battery storage from the uninterruptible power supply in data centers, using collaborative agents. The complex association between these optimization strategies and their dependencies on variable external factors like weather and the power grid carbon intensity makes this a hard problem. Currently, a real-time controller to optimize all these goals simultaneously in a dynamic real-world setting is lacking. We propose a Data Center Carbon Footprint Reduction (DC-CFR) multi-agent Reinforcement Learning (MARL) framework that optimizes data centers for the multiple objectives of carbon footprint reduction, energy consumption, and energy cost. The results show that the DC-CFR MARL agents effectively resolved the complex interdependencies in optimizing cooling, load shifting, and energy storage in real time for various locations under real-world dynamic weather and grid carbon intensity conditions. DC-CFR significantly outperformed the industry-standard ASHRAE controller with a considerable reduction in carbon emissions (14.5%), energy usage (14.4%), and energy cost (13.7%) when evaluated over one year across multiple geographical regions.
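As an illustration of the kind of synthetic labelling strategy discussed in the PbRL entry above (commonly, a simulated teacher that prefers the trajectory segment with the larger ground-truth return, optionally with Boltzmann-style noise), here is a minimal sketch; the exact strategy validated in the paper may differ in its noise handling.

```python
# Minimal sketch of a synthetic preference labeller for PbRL (our illustration).
import numpy as np

def synthetic_label(returns_a, returns_b, beta=None, rng=None):
    """Prefer the segment with the larger ground-truth return.
    If beta is given, sample a Boltzmann-rational preference instead of the
    deterministic argmax (a common way to model a noisy teacher)."""
    ra, rb = float(np.sum(returns_a)), float(np.sum(returns_b))
    if beta is None:
        return 0 if ra >= rb else 1          # 0 = prefer segment A, 1 = prefer B
    rng = rng or np.random.default_rng()
    p_a = 1.0 / (1.0 + np.exp(-beta * (ra - rb)))
    return 0 if rng.random() < p_a else 1

# Example: segment A accumulates more reward, so the deterministic teacher picks it.
print(synthetic_label([1.0, 0.5, 0.2], [0.1, 0.1, 0.1]))   # -> 0
```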
\ No newline at end of file diff --git a/data/2024/aaai/CariesXrays: Enhancing Caries Detection in Hospital-Scale Panoramic Dental X-rays via Feature Pyramid Contrastive Learning b/data/2024/aaai/CariesXrays: Enhancing Caries Detection in Hospital-Scale Panoramic Dental X-rays via Feature Pyramid Contrastive Learning new file mode 100644 index 0000000000..773008ae67 --- /dev/null +++ b/data/2024/aaai/CariesXrays: Enhancing Caries Detection in Hospital-Scale Panoramic Dental X-rays via Feature Pyramid Contrastive Learning @@ -0,0 +1 @@ +Dental caries has been widely recognized as one of the most prevalent chronic diseases in the field of public health. Despite advancements in automated diagnosis across various medical domains, dental caries detection remains a substantial challenge due to its inherent variability and intricacies. To bridge this gap, we release a hospital-scale panoramic dental X-ray benchmark, namely “CariesXrays”, to facilitate the advancements in high-precision computer-aided diagnosis for dental caries. It comprises 6,000 panoramic dental X-ray images, with a total of 13,783 instances of dental caries, all meticulously annotated by dental professionals. In this paper, we propose a novel Feature Pyramid Contrastive Learning (FPCL) framework that jointly incorporates feature pyramid learning and contrastive learning within a unified diagnostic paradigm for automated dental caries detection. Specifically, a robust dual-directional feature pyramid network (D2D-FPN) is designed to adaptively capture rich and informative contextual information from multi-level feature maps, thus enhancing the generalization ability of caries detection across different scales. Furthermore, our model is augmented with an effective proposals-prototype contrastive regularization learning (P2P-CRL) mechanism, which can flexibly bridge the semantic gaps among diverse dental caries with varying appearances, resulting in high-quality dental caries proposals. Extensive experiments on our newly-established CariesXrays benchmark demonstrate the potential of FPCL to make a significant social impact on caries diagnosis. \ No newline at end of file diff --git a/data/2024/aaai/CatFormer: Category-Level 6D Object Pose Estimation with Transformer b/data/2024/aaai/CatFormer: Category-Level 6D Object Pose Estimation with Transformer new file mode 100644 index 0000000000..55f401264b --- /dev/null +++ b/data/2024/aaai/CatFormer: Category-Level 6D Object Pose Estimation with Transformer @@ -0,0 +1 @@ +Although there has been significant progress in category-level object pose estimation in recent years, there is still considerable room for improvement. In this paper, we propose a novel transformer-based category-level 6D pose estimation method called CatFormer to enhance pose estimation accuracy. CatFormer comprises three main parts: a coarse deformation part, a fine deformation part, and a recurrent refinement part. In the coarse and fine deformation parts, we introduce a transformer-based deformation module that performs point cloud deformation and completion in the feature space. Additionally, after each deformation, we incorporate a transformer-based graph module to adjust fused features and establish geometric and topological relationships between points based on these features. Furthermore, we present an end-to-end recurrent refinement module that enables the prior point cloud to deform multiple times according to real scene features.
We evaluate CatFormer's performance by training and testing it on the CAMERA25 and REAL275 datasets. Experimental results demonstrate that CatFormer surpasses state-of-the-art methods. Moreover, we extend the usage of CatFormer to instance-level object pose estimation on the LINEMOD dataset, as well as object pose estimation in real-world scenarios. The experimental results validate the effectiveness and generalization capabilities of CatFormer. Our code and the supplemental materials are available at https://github.com/BIT-robot-group/CatFormer. \ No newline at end of file diff --git a/data/2024/aaai/Catalyst for Clustering-Based Unsupervised Object Re-identification: Feature Calibration b/data/2024/aaai/Catalyst for Clustering-Based Unsupervised Object Re-identification: Feature Calibration new file mode 100644 index 0000000000..76f26667d2 --- /dev/null +++ b/data/2024/aaai/Catalyst for Clustering-Based Unsupervised Object Re-identification: Feature Calibration @@ -0,0 +1 @@ +Clustering-based methods are emerging as a ubiquitous technology in unsupervised object Re-Identification (ReID), which alternate between pseudo-label generation and representation learning. Recent advances in this field mainly fall into two groups: pseudo-label correction and robust representation learning. In contrast, in this work, we improve unsupervised object ReID through feature calibration, a completely different but complementary insight to current approaches. Specifically, we propose to insert a conceptually simple yet empirically powerful Feature Calibration Module (FCM) before pseudo-label generation. In practice, FCM calibrates the features using a nonparametric graph attention network, enforcing similar instances to move together in the feature space while allowing dissimilar instances to separate. As a result, we can generate more reliable pseudo-labels using the calibrated features and further improve subsequent representation learning. FCM is simple, effective, parameter-free, training-free, plug-and-play, and can be considered as a catalyst, increasing the ’chemical reaction’ between pseudo-label generation and representation learning. Moreover, it maintains testing-time efficiency with negligible impact on training time. In this paper, we insert FCM into a simple baseline. Experiments across different scenarios and benchmarks show that FCM consistently improves the baseline (e.g., 8.2% mAP gain on MSMT17), and achieves new state-of-the-art results. Code is available at: https://github.com/lhf12278/FCM-ReID. \ No newline at end of file diff --git a/data/2024/aaai/Catch-Up Mix: Catch-Up Class for Struggling Filters in CNN b/data/2024/aaai/Catch-Up Mix: Catch-Up Class for Struggling Filters in CNN new file mode 100644 index 0000000000..7aa5973f2e --- /dev/null +++ b/data/2024/aaai/Catch-Up Mix: Catch-Up Class for Struggling Filters in CNN @@ -0,0 +1 @@ +Deep learning has made significant advances in computer vision, particularly in image classification tasks. Despite their high accuracy on training data, deep learning models often face challenges related to complexity and overfitting. One notable concern is that such models often rely heavily on a limited subset of filters for making predictions. This dependency can result in compromised generalization and an increased vulnerability to minor variations. While regularization techniques like weight decay, dropout, and data augmentation are commonly used to address this issue, they may not directly tackle the reliance on specific filters.
Our observations reveal that the heavy reliance problem becomes severe when slow-learning filters are deprived of learning opportunities due to fast-learning filters. Drawing inspiration from image augmentation research that combats over-reliance on specific image regions by removing and replacing parts of images, our idea is to mitigate the problem of over-reliance on strong filters by substituting highly activated features. To this end, we present a novel method called Catch-up Mix, which provides learning opportunities to a wide range of filters during training, focusing on filters that may lag behind. By mixing activation maps with relatively lower norms, Catch-up Mix promotes the development of more diverse representations and reduces reliance on a small subset of filters. Experimental results demonstrate the superiority of our method on various vision classification datasets, providing enhanced robustness. \ No newline at end of file diff --git a/data/2024/aaai/CatmullRom Splines-Based Regression for Image Forgery Localization b/data/2024/aaai/CatmullRom Splines-Based Regression for Image Forgery Localization new file mode 100644 index 0000000000..1c739f17ad --- /dev/null +++ b/data/2024/aaai/CatmullRom Splines-Based Regression for Image Forgery Localization @@ -0,0 +1 @@ +Image Forgery Localization (IFL) helps secure digital media forensics. However, many methods suffer from false detections (i.e., FPs) and inaccurate boundaries. In this paper, we propose the CatmullRom Splines-based Regression Network (CSR-Net), which first rethinks the IFL task from the perspective of regression to deal with this problem. Specifically, we propose an adaptive CatmullRom splines fitting scheme for coarse localization of the tampered regions. Then, for false positive cases, we first develop a novel re-scoring mechanism, which aims to filter out samples that cannot have responses on both the classification branch and the instance branch. Subsequently, to further constrain the boundaries, we design a learnable texture extraction module, which refines and enhances the contour representation by decoupling horizontal and vertical forgery features, thus suppressing FPs. Compared to segmentation-based methods, our method is simple but effective because it requires no post-processing. Extensive experiments show the superiority of CSR-Net to existing state-of-the-art methods, not only on standard natural image datasets but also on social media datasets. \ No newline at end of file diff --git a/data/2024/aaai/Causal Adversarial Perturbations for Individual Fairness and Robustness in Heterogeneous Data Spaces b/data/2024/aaai/Causal Adversarial Perturbations for Individual Fairness and Robustness in Heterogeneous Data Spaces new file mode 100644 index 0000000000..dee41fa08f --- /dev/null +++ b/data/2024/aaai/Causal Adversarial Perturbations for Individual Fairness and Robustness in Heterogeneous Data Spaces @@ -0,0 +1 @@ +As responsible AI gains importance in machine learning algorithms, properties like fairness, adversarial robustness, and causality have received considerable attention in recent years. However, despite their individual significance, there remains a critical gap in simultaneously exploring and integrating these properties.
In this paper, we propose a novel approach that examines the relationship between individual fairness, adversarial robustness, and structural causal models (SCMs) in heterogeneous data spaces, particularly when dealing with discrete sensitive attributes. We use SCMs and sensitive attributes to create a fair metric and apply it to measure semantic similarity among individuals. By introducing a novel causal adversarial perturbation (CAP) and applying adversarial training, we create a new regularizer that combines individual fairness, causality, and robustness in the classifier. Our method is evaluated on both real-world and synthetic datasets, demonstrating its effectiveness in achieving an accurate classifier that simultaneously exhibits fairness, adversarial robustness, and causal awareness. \ No newline at end of file diff --git a/data/2024/aaai/Causal Discovery from Poisson Branching Structural Causal Model Using High-Order Cumulant with Path Analysis b/data/2024/aaai/Causal Discovery from Poisson Branching Structural Causal Model Using High-Order Cumulant with Path Analysis new file mode 100644 index 0000000000..2bf289e9b8 --- /dev/null +++ b/data/2024/aaai/Causal Discovery from Poisson Branching Structural Causal Model Using High-Order Cumulant with Path Analysis @@ -0,0 +1 @@ +Count data naturally arise in many fields, such as finance, neuroscience, and epidemiology, and discovering causal structure among count data is a crucial task in various scientific and industrial scenarios. One of the most common characteristics of count data is the inherent branching structure described by a binomial thinning operator and an independent Poisson distribution that captures both branching and noise. For instance, in a population count scenario, mortality and immigration contribute to the count, where survival follows a Bernoulli distribution, and immigration follows a Poisson distribution. However, causal discovery from such data is challenging due to the non-identifiability issue: a single causal pair is Markov equivalent, i.e., X->Y and Y->X are distributionally equivalent. Fortunately, in this work, we find that the causal order from X to its child Y is identifiable if X is a root vertex and has at least two directed paths to Y, or if the ancestor of X with the most directed paths to X has a directed path to Y that does not pass through X. Specifically, we propose a Poisson Branching Structural Causal Model (PB-SCM) and perform a path analysis on PB-SCM using high-order cumulants. Theoretical results establish the connection between paths and cumulants and demonstrate that the path information can be obtained from the cumulants. With the path information, causal order is identifiable under some graphical conditions. A practical algorithm for learning causal structure under PB-SCM is proposed, and experiments verify the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Causal Representation Learning via Counterfactual Intervention b/data/2024/aaai/Causal Representation Learning via Counterfactual Intervention new file mode 100644 index 0000000000..28ba024467 --- /dev/null +++ b/data/2024/aaai/Causal Representation Learning via Counterfactual Intervention @@ -0,0 +1 @@ +Existing causal representation learning methods are based on the causal graph they build. However, due to the omission of bias within the causal graph, they essentially encourage models to learn biased causal effects in latent space.
In this paper, we propose a novel causally disentangling framework that aims to learn unbiased causal effects. We first introduce inductive and dataset biases into traditional causal graph for the physical concepts of interest. Then, we eliminate the negative effects from these two biases by counterfactual intervention with reweighted loss function for learning unbiased causal effects. Finally, we employ the causal effects into the VAE to endow the latent representations with causality. In particular, we highlight that removing biases in this paper is regarded as a part of learning process for unbiased causal effects, which is crucial for causal disentanglement performance improvement. Through extensive experiments on real-world and synthetic datasets, we show that our method outperforms different baselines and obtains the state-of-the-art results for achieving causal representation learning. \ No newline at end of file diff --git a/data/2024/aaai/Causal Strategic Learning with Competitive Selection b/data/2024/aaai/Causal Strategic Learning with Competitive Selection new file mode 100644 index 0000000000..9eff4887e7 --- /dev/null +++ b/data/2024/aaai/Causal Strategic Learning with Competitive Selection @@ -0,0 +1,10 @@ +We study the problem of agent selection in causal strategic learning under multiple decision makers and address two key challenges that come with it. +Firstly, while much of prior work focuses on studying a fixed pool of agents that remains static regardless of their evaluations, we consider the impact of selection procedure by which agents are not only evaluated, but also selected. +When each decision maker unilaterally selects agents by maximising their own utility, we show that the optimal selection rule is a trade-off between selecting the best agents and providing incentives to maximise the agents' improvement. +Furthermore, this optimal selection rule relies on incorrect predictions of agents' outcomes. +Hence, we study the conditions under which a decision maker's optimal selection rule will not lead to deterioration of agents' outcome nor cause unjust reduction in agents' selection chance. +To that end, we provide an analytical form of the optimal selection rule and a mechanism to retrieve the causal parameters from observational data, under certain assumptions on agents' behaviour. +Secondly, when there are multiple decision makers, the interference between selection rules introduces another source of biases in estimating the underlying causal parameters. +To address this problem, we provide a cooperative protocol which all decision makers must collectively adopt to recover the true causal parameters. +Lastly, we complement our theoretical results with simulation studies. +Our results highlight not only the importance of causal modeling as a strategy to mitigate the effect of gaming, as suggested by previous work, but also the need of a benevolent regulator to enable it. \ No newline at end of file diff --git a/data/2024/aaai/Causal Walk: Debiasing Multi-Hop Fact Verification with Front-Door Adjustment b/data/2024/aaai/Causal Walk: Debiasing Multi-Hop Fact Verification with Front-Door Adjustment new file mode 100644 index 0000000000..e4b6848e98 --- /dev/null +++ b/data/2024/aaai/Causal Walk: Debiasing Multi-Hop Fact Verification with Front-Door Adjustment @@ -0,0 +1 @@ +Multi-hop fact verification aims to detect the veracity of the given claim by integrating and reasoning over multiple pieces of evidence. 
Conventional multi-hop fact verification models are prone to rely on spurious correlations from annotation artifacts, leading to an obvious performance decline on unbiased datasets. Among the various debiasing works, causal inference-based methods have become popular by performing theoretically guaranteed debiasing such as causal intervention or counterfactual reasoning. However, existing causal inference-based debiasing methods, which mainly formulate fact verification as a single-hop reasoning task to tackle shallow bias patterns, cannot deal with the complicated bias patterns hidden in multiple hops of evidence. To address the challenge, we propose Causal Walk, a novel method for debiasing multi-hop fact verification from a causal perspective with front-door adjustment. Specifically, in the structural causal model, the reasoning path between the treatment (the input claim-evidence graph) and the outcome (the veracity label) is introduced as the mediator to block the confounder. With the front-door adjustment, the causal effect between the treatment and the outcome is decomposed into the causal effect between the treatment and the mediator, which is estimated by applying the idea of random walk, and the causal effect between the mediator and the outcome, which is estimated with normalized weighted geometric mean approximation. To investigate the effectiveness of the proposed method, an adversarial multi-hop fact verification dataset and a symmetric multi-hop fact verification dataset are constructed with the help of a large language model. Experimental results show that Causal Walk outperforms some previous debiasing methods on both existing datasets and the newly constructed datasets. Code and data will be released at https://github.com/zcccccz/CausalWalk. \ No newline at end of file diff --git a/data/2024/aaai/Causal-Driven Skill Prerequisite Structure Discovery b/data/2024/aaai/Causal-Driven Skill Prerequisite Structure Discovery new file mode 100644 index 0000000000..027def2432 --- /dev/null +++ b/data/2024/aaai/Causal-Driven Skill Prerequisite Structure Discovery @@ -0,0 +1 @@ +Knowing a prerequisite structure among skills in a subject domain effectively enables several educational applications, including intelligent tutoring systems and curriculum planning. Traditionally, educators or domain experts use intuition to determine the skills' prerequisite relationships, which is time-consuming and prone to blind spots. In this paper, we focus on inferring the prerequisite structure given access to students' performance on exercises in a subject. Nevertheless, this is challenging since students' mastery of skills cannot be directly observed but can only be estimated, i.e., it is latent in nature. To tackle this problem, we propose a causal-driven skill prerequisite structure discovery (CSPS) method in a two-stage learning framework. In the first stage, we learn the skills' correlation relationships, captured in a covariance matrix, from the student performance data; in the second stage, we apply a heuristic method based on conditional independence tests and standardized partial variance to the predicted covariance matrix to discover the prerequisite structure. We demonstrate the performance of the new approach with both simulated and real-world data. The experimental results show the effectiveness of the proposed model for identifying the skills' prerequisite structure.
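For readers unfamiliar with the front-door adjustment invoked in the Causal Walk abstract above, the textbook identity (with treatment X, mediator M, and outcome Y, under the standard front-door criterion) is sketched below; this is the general formula, not the paper's specific estimator:

```latex
% Front-door adjustment (Pearl): effect of do(X = x) on Y through mediator M
P(y \mid \mathrm{do}(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m)\, P(x')
```

The two inner factors correspond to the two causal effects the abstract decomposes: treatment-to-mediator and mediator-to-outcome.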
\ No newline at end of file diff --git a/data/2024/aaai/Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval b/data/2024/aaai/Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval new file mode 100644 index 0000000000..7b174503fb --- /dev/null +++ b/data/2024/aaai/Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval @@ -0,0 +1 @@ +Text-based Person Retrieval (TPR) aims to retrieve relevant images of specific pedestrians based on the given textual query. The mainstream approaches primarily leverage pretrained deep neural networks to learn the mapping of visual and textual modalities into a common latent space for cross-modality matching. Despite their remarkable achievements, existing efforts mainly focus on learning the statistical cross-modality correlation found in training data, rather than the intrinsic causal correlation. As a result, they often struggle to retrieve accurately in the face of environmental changes such as illumination, pose, and occlusion, or when encountering images with similar attributes. In this regard, we pioneer the study of TPR from a causal view. Specifically, we assume that each image is composed of a mixture of causal factors (which are semantically consistent with text descriptions) and non-causal factors (retrieval-irrelevant, e.g., background), and only the former can lead to reliable retrieval judgments. Our goal is to extract text-critical robust visual representations (i.e., causal factors) and establish domain-invariant cross-modality correlations for accurate and reliable retrieval. However, causal/non-causal factors are unobserved, so we emphasize that ideal causal factors that can simulate causal scenes should satisfy two basic principles: 1) Independence: being independent of non-causal factors, and 2) Sufficiency: being causally sufficient for TPR across different environments. Building on that, we propose an Invariant Representation Learning method for TPR (IRLT) that enforces the visual representations to satisfy the two aforementioned critical properties. Extensive experiments on three datasets clearly demonstrate the advantages of IRLT over leading baselines in terms of accuracy and generalization. \ No newline at end of file diff --git a/data/2024/aaai/Causally Aware Generative Adversarial Networks for Light Pollution Control b/data/2024/aaai/Causally Aware Generative Adversarial Networks for Light Pollution Control new file mode 100644 index 0000000000..0424e12101 --- /dev/null +++ b/data/2024/aaai/Causally Aware Generative Adversarial Networks for Light Pollution Control @@ -0,0 +1 @@ +Artificial light plays an integral role in modern cities, significantly enhancing human productivity and the efficiency of civilization. However, excessive illumination can lead to light pollution, imposing non-negligible economic burdens and threatening ecosystems and human health. Despite its critical importance, the exploration of its causes remains relatively limited within the field of artificial intelligence, leaving the factors contributing to light pollution incompletely understood and sustainable illumination planning a distant goal. To address this gap, we introduce a novel framework named Causally Aware Generative Adversarial Networks (CAGAN).
This innovative approach aims to uncover the fundamental drivers of light pollution within cities and offer intelligent solutions for optimal illumination resource allocation in the context of sustainable urban development. We commence by examining light pollution across 33,593 residential areas in seven global metropolises. Our findings reveal substantial influences on light pollution levels from various building types, notably grasslands, commercial centers and residential buildings as significant contributors. These discovered causal relationships are seamlessly integrated into the generative modeling framework, guiding the process of generating light pollution maps for diverse residential areas. Extensive experiments showcase CAGAN’s potential to inform and guide the implementation of effective strategies to mitigate light pollution. Our code and data are publicly available at https://github.com/zhangyuuao/Light_Pollution_CAGAN. \ No newline at end of file diff --git a/data/2024/aaai/Cautiously-Optimistic Knowledge Sharing for Cooperative Multi-Agent Reinforcement Learning b/data/2024/aaai/Cautiously-Optimistic Knowledge Sharing for Cooperative Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..274d76b04a --- /dev/null +++ b/data/2024/aaai/Cautiously-Optimistic Knowledge Sharing for Cooperative Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +While decentralized training is attractive in multi-agent reinforcement learning (MARL) for its excellent scalability and robustness, its inherent coordination challenges in collaborative tasks result in numerous interactions for agents to learn good policies. To alleviate this problem, action advising methods make experienced agents share their knowledge about what to do, while less experienced agents strictly follow the received advice. However, this method of sharing and utilizing knowledge may hinder the team's exploration of better states, as agents can be unduly influenced by suboptimal or even adverse advice, especially in the early stages of learning. Inspired by the fact that humans can learn not only from the success but also from the failure of others, this paper proposes a novel knowledge sharing framework called Cautiously-Optimistic kNowledge Sharing (CONS). CONS enables each agent to share both positive and negative knowledge and cautiously assimilate knowledge from others, thereby enhancing the efficiency of early-stage exploration and the agents' robustness to adverse advice. Moreover, considering the continuous improvement of policies, agents value negative knowledge more in the early stages of learning and shift their focus to positive knowledge in the later stages. Our framework can be easily integrated into existing Q-learning based methods without introducing additional training costs. We evaluate CONS in several challenging multi-agent tasks and find it excels in environments where optimal behavioral patterns are difficult to discover, surpassing the baselines in terms of convergence rate and final performance. 
\ No newline at end of file diff --git a/data/2024/aaai/CcDPM: A Continuous Conditional Diffusion Probabilistic Model for Inverse Design b/data/2024/aaai/CcDPM: A Continuous Conditional Diffusion Probabilistic Model for Inverse Design new file mode 100644 index 0000000000..c6ef25bb84 --- /dev/null +++ b/data/2024/aaai/CcDPM: A Continuous Conditional Diffusion Probabilistic Model for Inverse Design @@ -0,0 +1 @@ +Engineering design methods aim to generate new designs that meet desired performance requirements. Past work has directly introduced conditional Generative Adversarial Networks (cGANs) into this field and achieved promising results in single-point design problems (one performance requirement under one working condition). However, these methods assume that the performance requirements are distributed in categorical space, which is not reasonable in these scenarios. Although Continuous conditional GANs (CcGANs) introduce Vicinal Risk Minimization (VRM) to reduce the performance loss caused by this assumption, they still face the following challenges: 1) CcGANs cannot handle multi-point design problems (multiple performance requirements under multiple working conditions). 2) Their training process is time-consuming due to the high computational complexity of the vicinal loss. To address these issues, a Continuous conditional Diffusion Probabilistic Model (CcDPM) is proposed, which for the first time introduces the diffusion model into the engineering design area and VRM into the diffusion model. CcDPM adopts a novel sampling method called multi-point design sampling to deal with multi-point design problems. Moreover, a k-d tree is used in the training process of CcDPM to shorten the calculation time of the vicinal loss, speeding up the training process by 2-300 times in our experiments. Experiments on a synthetic problem and three real-world design problems demonstrate that CcDPM outperforms the state-of-the-art GAN models. \ No newline at end of file diff --git a/data/2024/aaai/Ced-NeRF: A Compact and Efficient Method for Dynamic Neural Radiance Fields b/data/2024/aaai/Ced-NeRF: A Compact and Efficient Method for Dynamic Neural Radiance Fields new file mode 100644 index 0000000000..8561380f10 --- /dev/null +++ b/data/2024/aaai/Ced-NeRF: A Compact and Efficient Method for Dynamic Neural Radiance Fields @@ -0,0 +1 @@ +Rendering photorealistic dynamic scenes has been a focus of recent research, with applications in virtual and augmented reality. While the Neural Radiance Field (NeRF) has shown remarkable rendering quality for static scenes, achieving real-time rendering of dynamic scenes remains challenging due to the expensive computation over the time dimension. The incorporation of explicit-based methods, specifically voxel grids, has been proposed to accelerate the training and rendering of neural radiance fields with a hybrid representation. However, employing a hybrid representation for dynamic scenes results in overfitting due to fast convergence, which can produce artifacts (e.g., floaters, noisy geometry) on novel views. To address this, we propose a compact and efficient method for dynamic neural radiance fields, namely Ced-NeRF, which only requires a small number of additional parameters to construct a hybrid representation of dynamic NeRF. Evaluation on dynamic scene datasets shows that our Ced-NeRF achieves fast rendering speeds while maintaining high-quality rendering results.
Our method outperforms the current state-of-the-art methods in terms of quality, training and rendering speed. \ No newline at end of file diff --git a/data/2024/aaai/Cell Graph Transformer for Nuclei Classification b/data/2024/aaai/Cell Graph Transformer for Nuclei Classification new file mode 100644 index 0000000000..2b12059160 --- /dev/null +++ b/data/2024/aaai/Cell Graph Transformer for Nuclei Classification @@ -0,0 +1 @@ +Nuclei classification is a critical step in computer-aided diagnosis with histopathology images. In the past, various methods have employed graph neural networks (GNN) to analyze cell graphs that model inter-cell relationships by considering nuclei as vertices. However, they are limited by the GNN mechanism that only passes messages among local nodes via fixed edges. To address the issue, we develop a cell graph transformer (CGT) that treats nodes and edges as input tokens to enable learnable adjacency and information exchange among all nodes. Nevertheless, training the transformer with a cell graph presents another challenge. Poorly initialized features can lead to noisy self-attention scores and inferior convergence, particularly when processing the cell graphs with numerous connections. Thus, we further propose a novel topology-aware pretraining method that leverages a graph convolutional network (GCN) to learn a feature extractor. The pre-trained features may suppress unreasonable correlations and hence ease the finetuning of CGT. Experimental results suggest that the proposed cell graph transformer with topology-aware pretraining significantly improves the nuclei classification results, and achieves the state-of-the-art performance. Code and models are available at https://github.com/lhaof/CGT \ No newline at end of file diff --git a/data/2024/aaai/Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control b/data/2024/aaai/Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control new file mode 100644 index 0000000000..76301cc825 --- /dev/null +++ b/data/2024/aaai/Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control @@ -0,0 +1 @@ +This study aims to improve the generation of 3D gestures by utilizing multimodal information from human speech. Previous studies have focused on incorporating additional modalities to enhance the quality of generated gestures. However, these methods perform poorly when certain modalities are missing during inference. To address this problem, we suggest using speech-derived multimodal priors to improve gesture generation. We introduce a novel method that separates priors from speech and employs multimodal priors as constraints for generating gestures. Our approach utilizes a chain-like modeling method to generate facial blendshapes, body movements, and hand gestures sequentially. Specifically, we incorporate rhythm cues derived from facial deformation and stylization prior based on speech emotions, into the process of generating gestures. By incorporating multimodal priors, our method improves the quality of generated gestures and eliminate the need for expensive setup preparation during inference. Extensive experiments and user studies confirm that our proposed approach achieves state-of-the-art performance. 
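To make the chain-like conditioning described in the Chain of Generation abstract above concrete, here is a minimal sketch of cascaded conditional generation in PyTorch; the module names, feature dimensions, and simple MLP stages are illustrative assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class CascadedGestureGenerator(nn.Module):
    """Illustrative chain: speech features -> facial blendshapes -> body motion -> hand motion.
    Each later stage is conditioned on the speech features plus all earlier outputs."""
    def __init__(self, d_speech=256, d_face=52, d_body=63, d_hands=90):
        super().__init__()
        self.face_net = nn.Sequential(nn.Linear(d_speech, 256), nn.ReLU(), nn.Linear(256, d_face))
        self.body_net = nn.Sequential(nn.Linear(d_speech + d_face, 256), nn.ReLU(), nn.Linear(256, d_body))
        self.hand_net = nn.Sequential(nn.Linear(d_speech + d_face + d_body, 256), nn.ReLU(), nn.Linear(256, d_hands))

    def forward(self, speech_feat):
        face = self.face_net(speech_feat)                                 # stage 1: facial blendshapes
        body = self.body_net(torch.cat([speech_feat, face], dim=-1))      # stage 2: body conditioned on face
        hands = self.hand_net(torch.cat([speech_feat, face, body], dim=-1))  # stage 3: hands conditioned on both
        return face, body, hands
```

The design point is simply that each later stage receives all earlier outputs as additional conditions, so the priors derived from earlier modalities constrain the later ones.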
\ No newline at end of file diff --git a/data/2024/aaai/Chain-of-Thought Improves Text Generation with Citations in Large Language Models b/data/2024/aaai/Chain-of-Thought Improves Text Generation with Citations in Large Language Models new file mode 100644 index 0000000000..3e436730a1 --- /dev/null +++ b/data/2024/aaai/Chain-of-Thought Improves Text Generation with Citations in Large Language Models @@ -0,0 +1 @@ +Previous studies disclose that Large Language Models (LLMs) suffer from hallucinations when generating texts, giving rise to a novel and challenging research topic that centers on enabling LLMs to generate texts with citations. Existing work exposes two limitations when using LLMs to generate answers to questions with provided documents: unsatisfactory answer correctness and poor citation quality. To tackle the above issues, we investigate using Chain-of-Thought (CoT) to elicit LLMs’ ability to synthesize correct answers from multiple documents, as well as properly cite these documents. Moreover, we propose a Citation Insurance Mechanism, which enables LLMs to detect missing citations and add them. We conduct experiments on the ALCE benchmark with six open-source LLMs. Experimental results demonstrate that: (1) the CoT prompting strategy significantly improves the quality of text generation with citations; (2) the Citation Insurance Mechanism delivers impressive gains in citation quality at a low cost; (3) our best approach performs comparably to the previous best ChatGPT-based baselines. Extensive analyses further validate the effectiveness of the proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Characterizing Information Seeking Events in Health-Related Social Discourse b/data/2024/aaai/Characterizing Information Seeking Events in Health-Related Social Discourse new file mode 100644 index 0000000000..a2aff280a3 --- /dev/null +++ b/data/2024/aaai/Characterizing Information Seeking Events in Health-Related Social Discourse @@ -0,0 +1 @@ +Social media sites have become a popular platform for individuals to seek and share health information. Despite the progress in natural language processing for social media mining, a gap remains in analyzing health-related texts in social discourse in the context of events. Event-driven analysis can offer insights into different facets of healthcare at an individual and collective level, including treatment options, misconceptions, knowledge gaps, etc. This paper presents a paradigm to characterize health-related information-seeking in social discourse through the lens of events. Events here are broad categories defined with domain experts that capture the trajectory of the treatment/medication. To illustrate the value of this approach, we analyze Reddit posts regarding medications for Opioid Use Disorder (OUD), a critical global health concern. To the best of our knowledge, this is the first attempt to define event categories for characterizing information-seeking in OUD social discourse. Guided by domain experts, we develop TREAT-ISE, a novel multilabel treatment information-seeking event dataset to analyze online discourse within an event-based framework. This dataset contains Reddit posts on information-seeking events related to recovery from OUD, where each post is annotated based on the type of event. We also establish a strong performance benchmark (77.4% F1 score) for the task by employing several machine learning and deep learning classifiers.
Finally, we thoroughly investigate the performance and errors of ChatGPT on this task, providing valuable insights into the LLM's capabilities and ongoing characterization efforts. \ No newline at end of file diff --git a/data/2024/aaai/Chasing Fairness in Graphs: A GNN Architecture Perspective b/data/2024/aaai/Chasing Fairness in Graphs: A GNN Architecture Perspective new file mode 100644 index 0000000000..704b60614d --- /dev/null +++ b/data/2024/aaai/Chasing Fairness in Graphs: A GNN Architecture Perspective @@ -0,0 +1,3 @@ +There has been significant progress in improving the performance of graph neural networks (GNNs) through enhancements in graph data, model architecture design, and training strategies. For fairness in graphs, recent studies achieve fair representations and predictions through either graph data pre-processing (e.g., node feature masking and topology rewiring) or fair training strategies (e.g., regularization, adversarial debiasing, and fair contrastive learning). How to achieve fairness in graphs from the model architecture perspective is less explored. More importantly, GNNs exhibit worse fairness performance compared to multilayer perceptrons (MLPs) since their model architecture (i.e., neighbor aggregation) amplifies biases. To this end, we aim to achieve fairness via a new GNN architecture. We propose Fair Message Passing (FMP) designed within a unified optimization framework for GNNs. Notably, FMP explicitly renders sensitive attribute usage in forward propagation for the node classification task using cross-entropy loss, without data pre-processing. In FMP, the aggregation is first adopted to utilize neighbors' information, and then a bias mitigation step explicitly pushes demographic group node representation centers together. +In this way, the FMP scheme can aggregate useful information from neighbors and mitigate bias to achieve a better fairness-prediction tradeoff. +Experiments on node classification tasks demonstrate that the proposed FMP outperforms several baselines in terms of fairness and accuracy on three real-world datasets. The code is available at https://github.com/zhimengj0326/FMP. \ No newline at end of file diff --git a/data/2024/aaai/ChatGPT-Generated Code Assignment Detection Using Perplexity of Large Language Models (Student Abstract) b/data/2024/aaai/ChatGPT-Generated Code Assignment Detection Using Perplexity of Large Language Models (Student Abstract) new file mode 100644 index 0000000000..8943ccd0a1 --- /dev/null +++ b/data/2024/aaai/ChatGPT-Generated Code Assignment Detection Using Perplexity of Large Language Models (Student Abstract) @@ -0,0 +1 @@ +In the era of large language models like ChatGPT, maintaining academic integrity in programming education has become challenging due to potential misuse. There is a pressing need for reliable detectors to identify ChatGPT-generated code. While previous studies have tackled model-generated text detection, identifying such code remains uncharted territory. In this paper, we introduce a novel method to discern ChatGPT-generated code. We employ targeted masking perturbation, emphasizing code sections with high perplexity. Fine-tuned CodeBERT is utilized to replace these masked sections, generating subtly perturbed samples. Our scoring system amalgamates overall perplexity, variations in code line perplexity, and burstiness. In this scoring scheme, a higher rank for the original code suggests it is more likely to be ChatGPT-generated.
The underlying principle is that code generated by models typically exhibits consistent, low perplexity and reduced burstiness, with its ranking remaining relatively stable even after subtle modifications. In contrast, human-written code, when perturbed, is more likely to produce samples that the model prefers. Our approach significantly outperforms current detectors, especially against OpenAI's text-davinci-003 model, with the average AUC rising from 0.56 (GPTZero baseline) to 0.87. \ No newline at end of file diff --git a/data/2024/aaai/Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing b/data/2024/aaai/Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing new file mode 100644 index 0000000000..0ab31a238e --- /dev/null +++ b/data/2024/aaai/Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing @@ -0,0 +1 @@ +Deep reinforcement learning (DRL) has gained immense success in many applications, including gaming AI, robotics, and system scheduling. Distributed algorithms and architectures (e.g., the actor-learner architecture) have been widely proposed to accelerate DRL training with large-scale server-based clusters. However, training on-policy algorithms with the actor-learner architecture unavoidably induces resource waste due to synchronization between learners and actors, thus resulting in significant extra billing. As a promising alternative, serverless computing naturally fits on-policy synchronization and alleviates resource waste in distributed DRL training with pay-as-you-go pricing. Yet, no prior work has leveraged serverless computing to facilitate DRL training. This paper proposes MinionsRL, the first serverless distributed DRL training framework, which aims to improve DRL training speed and cost-efficiency with dynamic actor scaling. We prototype MinionsRL on top of Microsoft Azure Container Instances and evaluate it with popular DRL tasks from OpenAI Gym. Extensive experiments show that MinionsRL reduces total training time by up to 52% and training cost by 86% compared to the latest solutions. \ No newline at end of file diff --git a/data/2024/aaai/Check-In Desk Scheduling Optimisation at CDG International Airport b/data/2024/aaai/Check-In Desk Scheduling Optimisation at CDG International Airport new file mode 100644 index 0000000000..1ee0554d6f --- /dev/null +++ b/data/2024/aaai/Check-In Desk Scheduling Optimisation at CDG International Airport @@ -0,0 +1,4 @@ +More than ever, air transport players (i.e., airline and airport companies) in an intensely competitive climate need to benefit from a carefully optimized management of airport resources to improve the quality of service and control the induced costs. +In this paper, we investigate the Airport Check-in Desk Assignment Problem. +We propose a Constraint Programming (CP) model for this problem, and present some promising experimental results on data from ADP (Aéroport de Paris). +Our work has been deployed in a pre-production environment for one year. \ No newline at end of file diff --git a/data/2024/aaai/Chinese Spelling Correction as Rephrasing Language Model b/data/2024/aaai/Chinese Spelling Correction as Rephrasing Language Model new file mode 100644 index 0000000000..f93427d1ce --- /dev/null +++ b/data/2024/aaai/Chinese Spelling Correction as Rephrasing Language Model @@ -0,0 +1 @@ +This paper studies Chinese Spelling Correction (CSC), which aims to detect and correct potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. However, we note a critical flaw in the process of tagging one character to another: the correction is excessively conditioned on the error. This is the opposite of the human mindset, where individuals rephrase the complete sentence based on its semantics, rather than relying solely on previously memorized error patterns. Such a counter-intuitive learning process creates a bottleneck in the generalizability and transferability of machine spelling correction. To address this, we propose Rephrasing Language Modeling (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging. This novel training paradigm achieves new state-of-the-art results across fine-tuned and zero-shot CSC benchmarks, outperforming previous counterparts by a large margin. Our method also learns transferable language representations when CSC is jointly trained with other tasks. \ No newline at end of file diff --git a/data/2024/aaai/ChromaFusionNet (CFNet): Natural Fusion of Fine-Grained Color Editing b/data/2024/aaai/ChromaFusionNet (CFNet): Natural Fusion of Fine-Grained Color Editing new file mode 100644 index 0000000000..dfcd005335 --- /dev/null +++ b/data/2024/aaai/ChromaFusionNet (CFNet): Natural Fusion of Fine-Grained Color Editing @@ -0,0 +1 @@ +Digital image enhancement aims to deliver visually striking, pleasing images that align with human perception. While global techniques can elevate the image's overall aesthetics, fine-grained color enhancement can further boost visual appeal and expressiveness. However, colorists frequently face challenges in achieving accurate, localized color adjustments. Direct composition of these local edits can result in spatial color inconsistencies. Existing methods, including color style transfer and image harmonization, exhibit inconsistencies, especially at boundary regions. Addressing this, we present ChromaFusionNet (CFNet), a novel approach that views the color fusion problem through the lens of image color inpainting. Built on the Vision Transformer architecture, CFNet captures global context and delivers high-fidelity outputs, seamlessly blending colors while preserving boundary integrity. Empirical studies on ImageNet and COCO datasets demonstrate CFNet's superiority over existing methods in maintaining color harmony and color fidelity. Robustness evaluations and user studies have further validated the effectiveness of CFNet. In conclusion, CFNet introduces an innovative approach to seamless, fine-grained color fusion, paving the way for advancements in the domain of fine-grained color editing. Code and pretrained models are available at our project page: https://yidong.pro/projects/cfnet. \ No newline at end of file diff --git a/data/2024/aaai/Chronic Poisoning: Backdoor Attack against Split Learning b/data/2024/aaai/Chronic Poisoning: Backdoor Attack against Split Learning new file mode 100644 index 0000000000..c1f7fdefba --- /dev/null +++ b/data/2024/aaai/Chronic Poisoning: Backdoor Attack against Split Learning @@ -0,0 +1 @@ +Split learning is a computing resource-friendly distributed learning framework that protects client training data by splitting the model between the client and server. Previous work has proved that split learning faces a severe risk of privacy leakage, as a malicious server can recover the client's private data by hijacking the training process.
In this paper, we first explore the vulnerability of split learning to server-side backdoor attacks, where our goal is to compromise the model's integrity. Since the server-side attacker cannot access the training data and client model in split learning, traditional poisoning-based backdoor attack methods are no longer applicable. Therefore, constructing backdoor attacks in split learning poses significant challenges. Our strategy involves the attacker establishing a shadow model on the server side that can encode backdoor samples and guiding the client model to learn from this model during the training process, thereby enabling the client to acquire the same capability. Based on these insights, we propose a three-stage backdoor attack framework named SFI. Our attack framework minimizes assumptions about the attacker's background knowledge and ensures that the attack process remains imperceptible to the client. We implement SFI on various benchmark datasets, and extensive experimental results demonstrate its effectiveness and generality. For example, the success rates of our attack on the MNIST, Fashion, and CIFAR10 datasets all exceed 90%, with limited impact on the main task. \ No newline at end of file diff --git a/data/2024/aaai/CityPulse: Fine-Grained Assessment of Urban Change with Street View Time Series b/data/2024/aaai/CityPulse: Fine-Grained Assessment of Urban Change with Street View Time Series new file mode 100644 index 0000000000..b71824f585 --- /dev/null +++ b/data/2024/aaai/CityPulse: Fine-Grained Assessment of Urban Change with Street View Time Series @@ -0,0 +1 @@ +Urban transformations have profound societal impact on both individuals and communities at large. Accurately assessing these shifts is essential for understanding their underlying causes and ensuring sustainable urban planning. Traditional measurements often encounter constraints in spatial and temporal granularity, failing to capture real-time physical changes. Street view imagery, capturing the heartbeat of urban spaces from a pedestrian point of view, can serve as a high-definition, up-to-date, and on-the-ground visual proxy of urban change. We curate the largest street view time series dataset to date, and propose an end-to-end change detection model to effectively capture physical alterations in the built environment at scale. We demonstrate the effectiveness of our proposed method through benchmark comparisons with previous literature and by implementing it at the city-wide level. Our approach has the potential to supplement existing datasets and serve as a fine-grained and accurate assessment of urban change. \ No newline at end of file diff --git a/data/2024/aaai/Clarifying the Behavior and the Difficulty of Adversarial Training b/data/2024/aaai/Clarifying the Behavior and the Difficulty of Adversarial Training new file mode 100644 index 0000000000..8f9ea93192 --- /dev/null +++ b/data/2024/aaai/Clarifying the Behavior and the Difficulty of Adversarial Training @@ -0,0 +1 @@ +Adversarial training is usually difficult to optimize. This paper provides conceptual and analytic insights into the difficulty of adversarial training via a simple theoretical study, where we derive the approximate dynamics of a recursive multi-step attack in a simple setting. Despite the simplicity of our theory, it still reveals verifiable predictions about various phenomena in adversarial training under real-world settings.
First, compared to vanilla training, adversarial training is more likely to boost the influence of input samples with large gradient norms in an exponential manner. Besides, adversarial training also strengthens the influence of the Hessian matrix of the loss w.r.t. network parameters, which is more likely to make network parameters oscillate and boosts the difficulty of adversarial training. \ No newline at end of file diff --git a/data/2024/aaai/Class-Attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective b/data/2024/aaai/Class-Attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective new file mode 100644 index 0000000000..c79992fd80 --- /dev/null +++ b/data/2024/aaai/Class-Attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective @@ -0,0 +1 @@ +Modern classification problems exhibit heterogeneities across individual classes: Each class may have unique attributes, such as sample size, label quality, or predictability (easy vs difficult), and variable importance at test-time. Without care, these heterogeneities impede the learning process, most notably, when optimizing fairness objectives. Confirming this, under a gaussian mixture setting, we show that the optimal SVM classifier for balanced accuracy needs to be adaptive to the class attributes. This motivates us to propose CAP: An effective and general method that generates a class-specific learning strategy (e.g.~hyperparameter) based on the attributes of that class. This way, optimization process better adapts to heterogeneities. CAP leads to substantial improvements over the naive approach of assigning separate hyperparameters to each class. We instantiate CAP for loss function design and post-hoc logit adjustment, with emphasis on label-imbalanced problems. We show that CAP is competitive with prior art and its flexibility unlocks clear benefits for fairness objectives beyond balanced accuracy. Finally, we evaluate CAP on problems with label noise as well as weighted test objectives to showcase how CAP can jointly adapt to different heterogeneities. \ No newline at end of file diff --git a/data/2024/aaai/Cluster-Based Sampling in Hindsight Experience Replay for Robotic Tasks (Student Abstract) b/data/2024/aaai/Cluster-Based Sampling in Hindsight Experience Replay for Robotic Tasks (Student Abstract) new file mode 100644 index 0000000000..4ad5b6c930 --- /dev/null +++ b/data/2024/aaai/Cluster-Based Sampling in Hindsight Experience Replay for Robotic Tasks (Student Abstract) @@ -0,0 +1 @@ +In multi-goal reinforcement learning with a sparse binary reward, training agents is particularly challenging, due to a lack of successful experiences. To solve this problem, hindsight experience replay (HER) generates successful experiences even from unsuccessful ones. However, generating successful experiences from uniformly sampled ones is not an efficient process. In this paper, the impact of exploiting the property of achieved goals in generating successful experiences is investigated and a novel cluster-based sampling strategy is proposed. The proposed sampling strategy groups episodes with different achieved goals by using a cluster model and samples experiences in the manner of HER to create the training batch. The proposed method is validated by experiments with three robotic control tasks of the OpenAI Gym. The results of experiments demonstrate that the proposed method is substantially sample efficient and achieves better performance than baseline approaches. 
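As a concrete illustration of the cluster-based sampling strategy sketched in the HER student abstract above, the following minimal Python sketch groups episodes by their achieved goals and samples relabeled transitions per cluster; the use of k-means over final achieved goals, the per-cluster batch composition, and the "future" relabeling probability are illustrative assumptions rather than the authors' exact algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_cluster_her_batch(episodes, batch_size, n_clusters=8, future_p=0.8, rng=np.random):
    """episodes: list of dicts with 'obs', 'actions', 'achieved_goals' arrays of length T.
    Clusters episodes by final achieved goal, then samples transitions evenly across clusters,
    relabeling goals with the standard HER 'future' strategy."""
    final_goals = np.stack([ep["achieved_goals"][-1] for ep in episodes])
    k = min(n_clusters, len(episodes))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(final_goals)

    batch = []
    per_cluster = max(1, batch_size // k)
    for c in set(labels):
        idx = np.flatnonzero(labels == c)
        for _ in range(per_cluster):
            ep = episodes[rng.choice(idx)]
            T = len(ep["actions"])
            t = rng.randint(T)
            goal = ep["achieved_goals"][-1]
            if rng.rand() < future_p:                      # HER: relabel with a future achieved goal
                goal = ep["achieved_goals"][rng.randint(t, T)]
            batch.append((ep["obs"][t], ep["actions"][t], goal))
    return batch[:batch_size]
```

The intent is simply that every cluster of achieved goals contributes relabeled experiences to the batch, rather than sampling episodes uniformly.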
\ No newline at end of file diff --git a/data/2024/aaai/Co-designing AI Education Curriculum with Cross-Disciplinary High School Teachers b/data/2024/aaai/Co-designing AI Education Curriculum with Cross-Disciplinary High School Teachers new file mode 100644 index 0000000000..b667cfc4b6 --- /dev/null +++ b/data/2024/aaai/Co-designing AI Education Curriculum with Cross-Disciplinary High School Teachers @@ -0,0 +1 @@ +High school teachers from many disciplines have growing interests in teaching about artificial intelligence (AI). This cross-disciplinary interest reflects the prevalence of AI tools across society, such as Generative AI tools built upon Large Language Models (LLM). However, high school classes are unique and complex environments, led by teachers with limited time and resources with priorities that vary by class and the students they serve. Therefore, developing curricula about AI for classes that span many disciplines (e.g. history, art, math) must involve centering the expertise of cross-disciplinary teachers. In this study, we conducted five collaborative curricular co-design sessions with eight teachers who taught high school humanities and STEM classes. We sought to understand how teachers considered AI when it was taught in art, math, and social studies contexts, as well as opportunities and challenges they identified with incorporating AI tools into their instruction. We found that teachers considered technical skills and ethical debates around AI, opportunities for "dual exploration" between AI and disciplinary learning, and limitations of AI tools as supporting engagement and reflection but also potentially distracting. We interpreted our findings relative to co-designing adaptable AI curricula to support teaching about and with AI across high school disciplines. \ No newline at end of file diff --git a/data/2024/aaai/CoLAL: Co-learning Active Learning for Text Classification b/data/2024/aaai/CoLAL: Co-learning Active Learning for Text Classification new file mode 100644 index 0000000000..ae7d56f68e --- /dev/null +++ b/data/2024/aaai/CoLAL: Co-learning Active Learning for Text Classification @@ -0,0 +1 @@ +In the machine learning field, the challenge of effectively learning with limited data has become increasingly crucial. Active Learning (AL) algorithms play a significant role in this by enhancing model performance. We introduce a novel AL algorithm, termed Co-learning (CoLAL), designed to select the most diverse and representative samples within a training dataset. This approach utilizes noisy labels and predictions made by the primary model on unlabeled data. By leveraging a probabilistic graphical model, we combine two multi-class classifiers into a binary one. This classifier determines if both the main and the peer models agree on a prediction. If they do, the unlabeled sample is assumed to be easy to classify and is thus not beneficial to increase the target model's performance. We prioritize data that represents the unlabeled set without overlapping decision boundaries. The discrepancies between these boundaries can be estimated by the probability that two models result in the same prediction. Through theoretical analysis and experimental validation, we reveal that the integration of noisy labels into the peer model effectively identifies target model's potential inaccuracies. 
We evaluated the CoLAL method across seven benchmark datasets: four text datasets (AGNews, DBPedia, PubMed, SST-2) against text-based state-of-the-art (SOTA) baselines, and three image datasets (CIFAR100, MNIST, OpenML-155) against computer vision SOTA baselines. The results show that our CoLAL method significantly outperforms existing SOTA in text-based AL, and is competitive with SOTA image-based AL techniques. \ No newline at end of file diff --git a/data/2024/aaai/CoPL: Contextual Prompt Learning for Vision-Language Understanding b/data/2024/aaai/CoPL: Contextual Prompt Learning for Vision-Language Understanding new file mode 100644 index 0000000000..ace61af228 --- /dev/null +++ b/data/2024/aaai/CoPL: Contextual Prompt Learning for Vision-Language Understanding @@ -0,0 +1,2 @@ +Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. More recently, their generalization ability has been further extended by incorporating trainable prompts, borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we observe that these prompts are trained based on global image features, which limits them in two ways: First, by using global features, these prompts could focus less on the discriminative foreground of the image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally whereas intuitively, prompts should be reweighted according to the semantics of the image. We address these as part of our proposed Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to +the localized features of the image. Our key innovations over earlier works include using local image features as part of the prompt learning process, and more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand. This gives us dynamic prompts that are both aligned to local image features and aware of local contextual relationships. Our extensive experiments on a variety of standard and few-shot datasets show that our method produces substantially improved performance compared to current state-of-the-art methods. We also demonstrate both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features. \ No newline at end of file diff --git a/data/2024/aaai/CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding b/data/2024/aaai/CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding new file mode 100644 index 0000000000..0f6cd10a45 --- /dev/null +++ b/data/2024/aaai/CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding @@ -0,0 +1 @@ +This paper studies the spatio-temporal video grounding task, which aims to localize a spatio-temporal tube in an untrimmed video based on the given text description of an event. Existing one-stage approaches suffer from insufficient space-time interaction in two aspects: i) less precise prediction of event temporal boundaries, and ii) inconsistency in object prediction for the same event across adjacent frames. To address these issues, we propose a framework of Comprehensive Space-Time entAnglement (CoSTA) to densely entangle space-time multi-modal features for spatio-temporal localization.
Specifically, we propose a space-time collaborative encoder to extract comprehensive video features and leverage Transformer to perform spatio-temporal multi-modal understanding. Our entangled decoder couples temporal boundary prediction and spatial localization via an entangled query, boasting an enhanced ability to capture object-event relationships. We conduct extensive experiments on the challenging benchmarks of HC-STVG and VidSTG, where CoSTA outperforms existing state-of-the-art methods, demonstrating its effectiveness for this task. \ No newline at end of file diff --git a/data/2024/aaai/CoVR: Learning Composed Video Retrieval from Web Video Captions b/data/2024/aaai/CoVR: Learning Composed Video Retrieval from Web Video Captions new file mode 100644 index 0000000000..8b9f6722f0 --- /dev/null +++ b/data/2024/aaai/CoVR: Learning Composed Video Retrieval from Web Video Captions @@ -0,0 +1 @@ +Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr. \ No newline at end of file diff --git a/data/2024/aaai/Coalition Formation for Task Allocation Using Multiple Distance Metrics (Student Abstract) b/data/2024/aaai/Coalition Formation for Task Allocation Using Multiple Distance Metrics (Student Abstract) new file mode 100644 index 0000000000..6f790cbfd0 --- /dev/null +++ b/data/2024/aaai/Coalition Formation for Task Allocation Using Multiple Distance Metrics (Student Abstract) @@ -0,0 +1 @@ +Simultaneous Coalition Structure Generation and Assignment (SCSGA) is an important research problem in multi-agent systems. Given n agents and m tasks, the aim of SCSGA is to form m disjoint coalitions of n agents such that between the coalitions and tasks there is a one-to-one mapping, which ensures each coalition is capable of accomplishing the assigned task. SCSGA with Multi-dimensional Features (SCSGA-MF) extends the problem by introducing a d-dimensional vector for each agent and task. We propose a heuristic algorithm called Multiple Distance Metric (MDM) approach to solve SCSGA-MF. Experimental results confirm that MDM produces near optimal solutions, while being feasible for large-scale inputs within a reasonable time frame. 
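To illustrate the flavor of distance-based heuristics for SCSGA-MF described in the student abstract above, here is a minimal greedy sketch in Python; the Euclidean metric and the gap-filling assignment rule are illustrative assumptions, not the authors' MDM algorithm:

```python
import numpy as np

def greedy_coalition_assignment(agent_feats, task_feats):
    """agent_feats: (n, d) array of agent feature vectors; task_feats: (m, d) array of task requirements.
    Greedily assigns each agent to the task whose remaining requirement gap it matches best
    under Euclidean distance, yielding m disjoint coalitions."""
    n, d = agent_feats.shape
    m = task_feats.shape[0]
    coalitions = [[] for _ in range(m)]
    covered = np.zeros((m, d))
    for a in range(n):
        gaps = task_feats - covered                          # what each task still needs
        dists = np.linalg.norm(gaps - agent_feats[a], axis=1)
        best = int(np.argmin(dists))                         # task whose gap the agent fits best
        coalitions[best].append(a)
        covered[best] += agent_feats[a]
    return coalitions
```

Swapping the distance function (e.g., Manhattan or cosine) is where a multiple-distance-metric variant would differ from this single-metric sketch.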
\ No newline at end of file diff --git a/data/2024/aaai/Code-Style In-Context Learning for Knowledge-Based Question Answering b/data/2024/aaai/Code-Style In-Context Learning for Knowledge-Based Question Answering new file mode 100644 index 0000000000..b5ed9fc674 --- /dev/null +++ b/data/2024/aaai/Code-Style In-Context Learning for Knowledge-Based Question Answering @@ -0,0 +1 @@ +Current methods for Knowledge-Based Question Answering (KBQA) usually rely on complex training techniques and model frameworks, leading to many limitations in practical applications. Recently, the emergence of In-Context Learning (ICL) capabilities in Large Language Models (LLMs) provides a simple and training-free semantic parsing paradigm for KBQA: Given a small number of questions and their labeled logical forms as demo examples, LLMs can understand the task intent and generate the logic form for a new question. However, current powerful LLMs have little exposure to logic forms during pre-training, resulting in a high format error rate. To solve this problem, we propose a code-style in-context learning method for KBQA, which converts the generation process of unfamiliar logical form into the more familiar code generation process for LLMs. Experimental results on three mainstream datasets show that our method dramatically mitigated the formatting error problem in generating logic forms while realizing a new SOTA on WebQSP, GrailQA, and GraphQ under the few-shot setting. The code and supplementary files are released at https://github.com/Arthurizijar/KB-Coder. \ No newline at end of file diff --git a/data/2024/aaai/Coevolutionary Algorithm for Building Robust Decision Trees under Minimax Regret b/data/2024/aaai/Coevolutionary Algorithm for Building Robust Decision Trees under Minimax Regret new file mode 100644 index 0000000000..8855985627 --- /dev/null +++ b/data/2024/aaai/Coevolutionary Algorithm for Building Robust Decision Trees under Minimax Regret @@ -0,0 +1 @@ +In recent years, there has been growing interest in developing robust machine learning (ML) models that can withstand adversarial attacks, including one of the most widely adopted, efficient, and interpretable ML algorithms—decision trees (DTs). This paper proposes a novel coevolutionary algorithm (CoEvoRDT) designed to create robust DTs capable of handling noisy high-dimensional data in adversarial contexts. Motivated by the limitations of traditional DT algorithms, we leverage adaptive coevolution to allow DTs to evolve and learn from interactions with perturbed input data. CoEvoRDT alternately evolves competing populations of DTs and perturbed features, enabling construction of DTs with desired properties. CoEvoRDT is easily adaptable to various target metrics, allowing the use of tailored robustness criteria such as minimax regret. Furthermore, CoEvoRDT has potential to improve the results of other state-of-the-art methods by incorporating their outcomes (DTs they produce) into the initial population and optimize them in the process of coevolution. Inspired by the game theory, CoEvoRDT utilizes mixed Nash equilibrium to enhance convergence. The method is tested on 20 popular datasets and shows superior performance compared to 4 state-of-the-art algorithms. It outperformed all competing methods on 13 datasets with adversarial accuracy metrics, and on all 20 considered datasets with minimax regret. 
Strong experimental results and flexibility in choosing the error measure make CoEvoRDT a promising approach for constructing robust DTs in real-world applications. \ No newline at end of file diff --git a/data/2024/aaai/ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field b/data/2024/aaai/ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field new file mode 100644 index 0000000000..b377a5ead4 --- /dev/null +++ b/data/2024/aaai/ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have demonstrated impressive potential in synthesizing novel views from dense input, however, their effectiveness is challenged when dealing with sparse input. Existing approaches that incorporate additional depth or semantic supervision can alleviate this issue to an extent. However, the process of supervision collection is not only costly but also potentially inaccurate. In our work, we introduce a novel model: the Collaborative Neural Radiance Fields (ColNeRF) designed to work with sparse input. The collaboration in ColNeRF includes the cooperation among sparse input source images and the cooperation among the output of the NeRF. Through this, we construct a novel collaborative module that aligns information from various views and meanwhile imposes self-supervised constraints to ensure multi-view consistency in both geometry and appearance. A Collaborative Cross-View Volume Integration module (CCVI) is proposed to capture complex occlusions and implicitly infer the spatial location of objects. Moreover, we introduce self-supervision of target rays projected in multiple directions to ensure geometric and color consistency in adjacent regions. Benefiting from the collaboration at the input and output ends, ColNeRF is capable of capturing richer and more generalized scene representation, thereby facilitating higher-quality results of the novel view synthesis. Our extensive experimental results demonstrate that ColNeRF outperforms state-of-the-art sparse input generalizable NeRF methods. Furthermore, our approach exhibits superiority in fine-tuning towards adapting to new scenes, achieving competitive performance compared to per-scene optimized NeRF-based methods while significantly reducing computational costs. Our code is available at: https://github.com/eezkni/ColNeRF. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Consortium of Foundation Models for Open-World Few-Shot Learning b/data/2024/aaai/Collaborative Consortium of Foundation Models for Open-World Few-Shot Learning new file mode 100644 index 0000000000..f085cfcd72 --- /dev/null +++ b/data/2024/aaai/Collaborative Consortium of Foundation Models for Open-World Few-Shot Learning @@ -0,0 +1 @@ +Open-World Few-Shot Learning (OFSL) is a crucial research field dedicated to accurately identifying target samples in scenarios where data is limited and labels are unreliable. This research holds significant practical implications and is highly relevant to real-world applications. Recently, the advancements in foundation models like CLIP and DINO have showcased their robust representation capabilities even in resource-constrained settings with scarce data. This realization has brought about a transformative shift in focus, moving away from “building models from scratch” towards “effectively harnessing the potential of foundation models to extract pertinent prior knowledge suitable for OFSL and utilizing it sensibly”. 
Motivated by this perspective, we introduce the Collaborative Consortium of Foundation Models (CO3), which leverages CLIP, DINO, GPT-3, and DALL-E to collectively address the OFSL problem. CO3 comprises four key blocks: (1) the Label Correction Block (LC-Block) corrects unreliable labels, (2) the Data Augmentation Block (DA-Block) enhances available data, (3) the Feature Extraction Block (FE-Block) extracts multi-modal features, and (4) the Text-guided Fusion Adapter (TeFu-Adapter) integrates multiple features while mitigating the impact of noisy labels through semantic constraints. Only the adapter's parameters are adjustable, while the others remain frozen. Through collaboration among these foundation models, CO3 effectively unlocks their potential and unifies their capabilities to achieve state-of-the-art performance on multiple benchmark datasets. https://github.com/The-Shuai/CO3. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Learning across Heterogeneous Systems with Pre-Trained Models b/data/2024/aaai/Collaborative Learning across Heterogeneous Systems with Pre-Trained Models new file mode 100644 index 0000000000..a766ab47ba --- /dev/null +++ b/data/2024/aaai/Collaborative Learning across Heterogeneous Systems with Pre-Trained Models @@ -0,0 +1 @@ +The increasingly decentralized and private nature of data in our digital society has motivated the development of personalized, collaborative intelligent systems that enable knowledge aggregation across multiple data owners while accommodating for their data privacy and system constraints. However, collaborative learning has only been investigated in simple and limited settings: isolated task scenarios where learning begins from scratch and does not build on prior expertise; learned model is represented in task-specific forms which are not generalizable to unseen, emerging scenarios; and more often, a universal model representation is assumed across collaborators, ignoring their local compute constraints or input representations. This restricts its practicality in continual learning scenarios with limited task data, which demand continuous adaptation and knowledge transfer across different information silos, tasks, and learning models, as well as the utilization of prior solution expertises. To overcome these limitations, my research has been focused on developing effective and scalable resource-aware collaborative learning frameworks across heterogeneous systems. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Synthesis of Patient Records through Multi-Visit Health State Inference b/data/2024/aaai/Collaborative Synthesis of Patient Records through Multi-Visit Health State Inference new file mode 100644 index 0000000000..1fc1ba3936 --- /dev/null +++ b/data/2024/aaai/Collaborative Synthesis of Patient Records through Multi-Visit Health State Inference @@ -0,0 +1 @@ +Electronic health records (EHRs) have become the foundation of machine learning applications in healthcare, while the utility of real patient records is often limited by privacy and security concerns. Synthetic EHR generation provides an additional perspective to compensate for this limitation. Most existing methods synthesize new records based on real EHR data, without consideration of different types of events in EHR data, which cannot control the event combinations in line with medical common sense. 
In this paper, we propose MSIC, a Multi-visit health Status Inference model for Collaborative EHR synthesis, to address these limitations. First, we formulate the synthetic EHR generation process as a probabilistic graphical model and tightly connect different types of events by modeling the latent health states. Then, we derive a health state inference method tailored for the multi-visit scenario to effectively utilize previous records to synthesize current and future records. Furthermore, we propose to generate medical reports to add textual descriptions for each medical event, providing broader applications for synthesized EHR data. For generating different paragraphs in each visit, we incorporate a multi-generator deliberation framework to coordinate message passing among multiple generators and employ a two-phase decoding strategy to generate high-quality reports. Our extensive experiments on the widely used benchmarks, MIMIC-III and MIMIC-IV, demonstrate that MSIC advances state-of-the-art results on the quality of synthetic data while maintaining low privacy risks. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Tooth Motion Diffusion Model in Digital Orthodontics b/data/2024/aaai/Collaborative Tooth Motion Diffusion Model in Digital Orthodontics new file mode 100644 index 0000000000..9b54f76846 --- /dev/null +++ b/data/2024/aaai/Collaborative Tooth Motion Diffusion Model in Digital Orthodontics @@ -0,0 +1,9 @@ +Tooth motion generation is an essential task in digital orthodontic treatment for precise and quick dental healthcare; it aims to generate the whole intermediate tooth motion process given the initial pathological and target ideal tooth alignments. +Most prior works on such multi-agent motion planning problems result in complex solutions. +Moreover, the occlusal relationship between upper and lower teeth is often overlooked. +In this paper, we propose a collaborative tooth motion diffusion model. +The critical insight is to remodel the problem as a diffusion process. +In this sense, we model the whole tooth motion distribution with a diffusion model and transform the planning problem into a sampling process from this distribution. +We design a tooth latent representation to provide accurate conditional guides, consisting of two key components: the tooth frame represents the position and posture, and the tooth latent shape code represents the geometric morphology. +Subsequently, we present a collaborative diffusion model to learn the multi-tooth motion distribution based on inter-tooth and occlusal constraints, which are implemented by a graph structure and new loss functions, respectively. +Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in the application of orthodontics compared with state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis b/data/2024/aaai/Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis new file mode 100644 index 0000000000..1855495122 --- /dev/null +++ b/data/2024/aaai/Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis @@ -0,0 +1 @@ +Video Correlation Learning (VCL), which aims to analyze the relationships between videos, has been widely studied and applied in various general video tasks. 
However, applying VCL to instructional videos is still quite challenging due to their intrinsic procedural temporal structure. Specifically, procedural knowledge is critical for accurate correlation analyses on instructional videos. Nevertheless, current procedure-learning methods heavily rely on step-level annotations, which are costly and not scalable. To address this problem, we introduce a weakly supervised framework called Collaborative Procedure Alignment (CPA) for procedure-aware correlation learning on instructional videos. Our framework comprises two core modules: collaborative step mining and frame-to-step alignment. The collaborative step mining module enables simultaneous and consistent step segmentation for paired videos, leveraging the semantic and temporal similarity between frames. Based on the identified steps, the frame-to-step alignment module performs alignment between the frames and steps across videos. The alignment result serves as a measurement of the correlation distance between two videos. We instantiate our framework in two distinct instructional video tasks: sequence verification and action quality assessment. Extensive experiments validate the effectiveness of our approach in providing accurate and interpretable correlation analyses for instructional videos. \ No newline at end of file diff --git a/data/2024/aaai/Color Event Enhanced Single-Exposure HDR Imaging b/data/2024/aaai/Color Event Enhanced Single-Exposure HDR Imaging new file mode 100644 index 0000000000..da7f65d0b6 --- /dev/null +++ b/data/2024/aaai/Color Event Enhanced Single-Exposure HDR Imaging @@ -0,0 +1,27 @@ +Single-exposure high dynamic range (HDR) imaging aims +to reconstruct the wide-range intensities of a scene by using +its single low dynamic range (LDR) image, thus providing +significant efficiency. Existing methods pay high attention to +restoring the luminance by inversing the tone-mapping process, +while the color in the over-/under-exposed area cannot +be well restored due to the information loss of the single +LDR image. To address this issue, we introduce color +events into the imaging pipeline, which record asynchronous +pixel-wise color changes in a high dynamic range, enabling +edge-like scene perception under challenging lighting conditions. +Specifically, we propose a joint framework that incorporates +color events and a single LDR image to restore +both content and color of an HDR image, where an exposureaware +transformer (EaT) module is designed to propagate the +informative hints, provided by the normal-exposed LDR regions +and the event streams, to the missing areas. In this +module, an exposure-aware mask is estimated to suppress +distractive information and strengthen the restoration of the +over-/under-exposed regions. To our knowledge, we are the +first to use color events to enhance single-exposure HDR +imaging. We also contribute corresponding datasets, consisting +of synthesized datasets and a real-world dataset collected +by a DAVIS346-color camera. The datasets can be found at +https://www.kaggle.com/datasets/mengyaocui/ce-hdr. Extensive +experiments demonstrate the effectiveness of the proposed +method. 
\ No newline at end of file diff --git a/data/2024/aaai/Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling b/data/2024/aaai/Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling new file mode 100644 index 0000000000..84e5908657 --- /dev/null +++ b/data/2024/aaai/Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling @@ -0,0 +1 @@ +Proximal Policy Optimization (PPO), a popular on-policy deep reinforcement learning method, employs a stochastic policy for exploration. In this paper, we propose a colored noise-based stochastic policy variant of PPO. Previous research highlighted the importance of temporal correlation in action noise for effective exploration in off-policy reinforcement learning. Building on this, we investigate whether correlated noise can also enhance exploration in on-policy methods like PPO. We discovered that correlated noise for action selection improves learning performance and outperforms the currently popular uncorrelated white noise approach in on-policy methods. Unlike off-policy learning, where pink noise was found to be highly effective, we found that a colored noise, intermediate between white and pink, performed best for on-policy learning in PPO. We examined the impact of varying the amount of data collected for each update by modifying the number of parallel simulation environments for data collection and observed that with a larger number of parallel environments, more strongly correlated noise is beneficial. Due to the significant impact and ease of implementation, we recommend switching to correlated noise as the default noise source in PPO. \ No newline at end of file diff --git a/data/2024/aaai/Colorizing Monochromatic Radiance Fields b/data/2024/aaai/Colorizing Monochromatic Radiance Fields new file mode 100644 index 0000000000..f270bf1c81 --- /dev/null +++ b/data/2024/aaai/Colorizing Monochromatic Radiance Fields @@ -0,0 +1 @@ +Though Neural Radiance Fields (NeRF) can produce colorful 3D representations of the world by using a set of 2D images, such ability becomes non-existent when only monochromatic images are provided. Since color is necessary in representing the world, reproducing color from monochromatic radiance fields becomes crucial. To achieve this goal, instead of manipulating the monochromatic radiance fields directly, we consider it as a representation-prediction task in the Lab color space. By first constructing the luminance and density representation using monochromatic images, our prediction stage can recreate color representation on the basis of an image colorization module. We then reproduce a colorful implicit model through the representation of luminance, density, and color. Extensive experiments have been conducted to validate the effectiveness of our approaches. Our project page: https://liquidammonia.github.io/color-nerf. \ No newline at end of file diff --git a/data/2024/aaai/Colour Passing Revisited: Lifted Model Construction with Commutative Factors b/data/2024/aaai/Colour Passing Revisited: Lifted Model Construction with Commutative Factors new file mode 100644 index 0000000000..bde6aab26a --- /dev/null +++ b/data/2024/aaai/Colour Passing Revisited: Lifted Model Construction with Commutative Factors @@ -0,0 +1 @@ +Lifted probabilistic inference exploits symmetries in a probabilistic model to allow for tractable probabilistic inference with respect to domain sizes. 
To apply lifted inference, a lifted representation has to be obtained, and to do so, the so-called colour passing algorithm is the state of the art. The colour passing algorithm, however, is bound to a specific inference algorithm and we found that it ignores commutativity of factors while constructing a lifted representation. We contribute a modified version of the colour passing algorithm that uses logical variables to construct a lifted representation independent of a specific inference algorithm while at the same time exploiting commutativity of factors during an offline-step. Our proposed algorithm efficiently detects more symmetries than the state of the art and thereby drastically increases compression, yielding significantly faster online query times for probabilistic inference when the resulting model is applied. \ No newline at end of file diff --git a/data/2024/aaai/Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators b/data/2024/aaai/Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators new file mode 100644 index 0000000000..bbd42d1744 --- /dev/null +++ b/data/2024/aaai/Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators @@ -0,0 +1 @@ +Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) emerges to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. This work studies a more practical and challenging scenario of FSSL, where data distribution is different not only across clients but also within a client between labeled and unlabeled data. To address this challenge, we propose a novel FSSL framework with dual regulators, FedDure. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with two regulators. Theoretically, we show the convergence guarantee of the dual regulators. Empirically, we demonstrate that FedDure is superior to the existing methods across a wide range of settings, notably by more than 11% on CIFAR-10 and CINIC-10 datasets. \ No newline at end of file diff --git a/data/2024/aaai/Combating Insider Threat in the Open-World Environments: Identification, Monitoring, and Data Augmentation b/data/2024/aaai/Combating Insider Threat in the Open-World Environments: Identification, Monitoring, and Data Augmentation new file mode 100644 index 0000000000..4539670c91 --- /dev/null +++ b/data/2024/aaai/Combating Insider Threat in the Open-World Environments: Identification, Monitoring, and Data Augmentation @@ -0,0 +1 @@ +Recent years have witnessed a dramatic increase in a class of security threats known as "insider threats". 
These threats occur when individuals with authorized access to an organization's network engage in harmful activities, potentially leading to the disclosure of vital information or adversely affecting the organization's systems (e.g., financial loss, system crashes, and national security challenges). Distinct from other types of terror attacks, combating insider threats exhibits several unique challenges, including (1) rarity, (2) non-separability, (3) label scarcity, (4) dynamics, and (5) heterogeneity, making them extremely difficult to identify and mitigate. We target the challenging problem of combating insider threats in open-world environments by leveraging a variety of data sources (e.g., internal system logs, employee networks, human trafficking and smuggling networks). To effectively combat these intricate threats, we introduce an interactive learning mechanism that is composed of three mutually beneficial learning modules: insider identification, insider monitoring, and data augmentation. Each module plays a crucial role in enhancing our ability to detect and mitigate insider threats, thereby contributing to a more secure and resilient organizational environment. \ No newline at end of file diff --git a/data/2024/aaai/Combinatorial CNN-Transformer Learning with Manifold Constraints for Semi-supervised Medical Image Segmentation b/data/2024/aaai/Combinatorial CNN-Transformer Learning with Manifold Constraints for Semi-supervised Medical Image Segmentation new file mode 100644 index 0000000000..fc0f826816 --- /dev/null +++ b/data/2024/aaai/Combinatorial CNN-Transformer Learning with Manifold Constraints for Semi-supervised Medical Image Segmentation @@ -0,0 +1,7 @@ +Semi-supervised learning (SSL), one of the dominant approaches, aims to leverage unlabeled data to deal with the annotation dilemma of supervised learning and has attracted much attention in medical image segmentation. +Most of the existing approaches leverage a unitary network built on convolutional neural networks (CNNs), enforcing consistency of predictions under small perturbations applied to inputs or models. +The drawbacks of such a learning paradigm are that (1) CNN-based models place severe limitations on global learning; (2) rich and diverse class-level distributions are inhibited. +In this paper, we present a novel CNN-Transformer learning framework in the manifold space for semi-supervised medical image segmentation. +First, at the intra-student level, we propose a novel class-wise consistency loss to facilitate the learning of both discriminative and compact target feature representations. +Then, at the inter-student level, we align the CNN and Transformer features using a prototype-based optimal transport method. +Extensive experiments show that our method outperforms previous state-of-the-art methods on three public medical image segmentation benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Combinatorial Stochastic-Greedy Bandit b/data/2024/aaai/Combinatorial Stochastic-Greedy Bandit new file mode 100644 index 0000000000..664dcfc4c1 --- /dev/null +++ b/data/2024/aaai/Combinatorial Stochastic-Greedy Bandit @@ -0,0 +1 @@ +We propose a novel combinatorial stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of n arms at each time step t in [T] is observed. 
SGB adopts an optimized stochastic-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms. Unlike existing methods that explore the entire set of unselected base arms during each selection step, our SGB algorithm samples only an optimized proportion of unselected arms and selects actions from this subset. We prove that our algorithm achieves a (1-1/e)-regret bound of O(n^(1/3) k^(2/3) T^(2/3) log(T)^(2/3)) for monotone stochastic submodular rewards, which outperforms the state-of-the-art in terms of the cardinality constraint k. Furthermore, we empirically evaluate the performance of our algorithm in the context of online constrained social influence maximization. Our results demonstrate that our proposed approach consistently outperforms the other algorithms, increasing the performance gap as k grows. \ No newline at end of file diff --git a/data/2024/aaai/Combining Deep Learning and Street View Imagery to Map Smallholder Crop Types b/data/2024/aaai/Combining Deep Learning and Street View Imagery to Map Smallholder Crop Types new file mode 100644 index 0000000000..68bffd2981 --- /dev/null +++ b/data/2024/aaai/Combining Deep Learning and Street View Imagery to Map Smallholder Crop Types @@ -0,0 +1,3 @@ +Accurate crop type maps are an essential source of information for monitoring yield progress at scale, projecting global crop production, and planning effective policies. To date, however, crop type maps remain challenging to create in low- and middle-income countries due to a lack of ground truth labels for training machine learning models. Field surveys are the gold standard in terms of accuracy but require an often-prohibitively large amount of time, money, and statistical capacity. +In recent years, street-level imagery, such as Google Street View, KartaView, and Mapillary, has become available around the world. Such imagery contains rich information about crop types grown at particular locations and times. +In this work, we develop an automated system to generate crop type ground references using deep learning and Google Street View imagery. The method efficiently curates a set of street-view images containing crop fields, trains a model to predict crop types using either weakly-labeled images from disparate out-of-domain sources or zero-shot labeled street view images with GPT-4V, and combines the predicted labels with remote sensing time series to create a wall-to-wall crop type map. We show that, in Thailand, the resulting country-wide map of rice, cassava, maize, and sugarcane achieves an accuracy of 93%. We publicly release the first-ever crop type map for all of Thailand 2022 at 10m-resolution with no gaps. To our knowledge, this is the first time a 10m-resolution, multi-crop map has been created for any smallholder country. As the availability of roadside imagery expands, our pipeline provides a way to map crop types at scale around the globe, especially in underserved smallholder regions. 
\ No newline at end of file diff --git a/data/2024/aaai/Combining Graph Transformers Based Multi-Label Active Learning and Informative Data Augmentation for Chest Xray Classification b/data/2024/aaai/Combining Graph Transformers Based Multi-Label Active Learning and Informative Data Augmentation for Chest Xray Classification new file mode 100644 index 0000000000..56ef24e059 --- /dev/null +++ b/data/2024/aaai/Combining Graph Transformers Based Multi-Label Active Learning and Informative Data Augmentation for Chest Xray Classification @@ -0,0 +1 @@ +Informative sample selection in active learning (AL) helps a machine learning system attain optimum performance with minimum labeled samples, thus improving human-in-the-loop computer-aided diagnosis systems with limited labeled data. Data augmentation is highly effective for enlarging datasets when labeled data are scarce. Combining informative sample selection and data augmentation should leverage their respective advantages and improve the performance of AL systems. We propose a novel approach to combine informative sample selection and data augmentation for multi-label active learning. Conventional informative sample selection approaches have mostly focused on the single-label case and do not perform optimally in the multi-label setting. We improve upon state-of-the-art multi-label active learning techniques by representing disease labels as graph nodes and using graph attention transformers (GAT) to learn more effective inter-label relationships and identify the most informative samples. We then generate transformations of these informative samples that are themselves informative. Experiments on public chest xray datasets show improved results over state-of-the-art multi-label AL techniques in terms of classification performance, learning rates, and robustness. We also perform qualitative analysis to determine the realism of generated images. \ No newline at end of file diff --git a/data/2024/aaai/Combining Machine Learning and Queueing Theory for Data-Driven Incarceration-Diversion Program Management b/data/2024/aaai/Combining Machine Learning and Queueing Theory for Data-Driven Incarceration-Diversion Program Management new file mode 100644 index 0000000000..fb9b6ca60b --- /dev/null +++ b/data/2024/aaai/Combining Machine Learning and Queueing Theory for Data-Driven Incarceration-Diversion Program Management @@ -0,0 +1 @@ +Incarceration-diversion programs have proven effective in reducing recidivism. Accurate prediction of the number of individuals with different characteristics in the program and their program outcomes based on given eligibility criteria is crucial for successful implementation, because this prediction serves as the foundation for determining the appropriate program size and the consequent staffing requirements. However, this task poses challenges due to the complexities arising from varied outcomes and lengths-of-stay for the diverse individuals in incarceration-diversion programs. In collaboration with an Illinois government agency, we develop a framework to address these issues. Our framework combines ML and queueing model simulation, providing accurate predictions for the program census and interpretable insights into program dynamics and the impact of different decisions in counterfactual scenarios. Additionally, we deploy a beta version of a user-friendly web app that allows program managers to visualize census data by counties and race groups. 
We showcase two decision support use cases: Changing program admission criteria and launching similar programs in new counties. \ No newline at end of file diff --git a/data/2024/aaai/Combining Multiple Supervision for Robust Zero-Shot Dense Retrieval b/data/2024/aaai/Combining Multiple Supervision for Robust Zero-Shot Dense Retrieval new file mode 100644 index 0000000000..624cba8da0 --- /dev/null +++ b/data/2024/aaai/Combining Multiple Supervision for Robust Zero-Shot Dense Retrieval @@ -0,0 +1,8 @@ +Recently, dense retrieval (DR) models, which represent queries and documents with fixed-width vectors and retrieve relevant ones via nearest neighbor search, have drawn increasing attention from the IR community. +However, previous studies have shown that the effectiveness of DR critically relies on sufficient training signals, which leads to severe performance degradation when applied in out-of-domain scenarios, where large-scale training data are usually unavailable. +To solve this problem, existing studies adopt a data-augmentation-plus-joint-training paradigm to construct weak/pseudo supervisions on the target domain and combine them with the large-scale human annotated data on the source domain to train the DR models. However, they don't explicitly distinguish the data and the supervision signals in the training process and simply assume that the DR models are mighty enough to capture and memorize different domain knowledge and relevance matching patterns without guidance, which, as shown in this paper, is not true. +Based on this observation, we propose a Robust Multi-Supervision Combining strategy (RMSC) that +decouples the domain and supervision signals by explicitly telling the DR models how the domain data and supervision signals are combined in the training data with specially designed soft tokens. +With the extra soft tokens to store the domain-specific and supervision-specific knowledge, RMSC allows the DR models +to conduct retrieval based on human-like relevance matching patterns and target-specific language distribution on the target domain without human annotations. +Extensive experiments on zero-shot DR benchmarks show that RMSC significantly improves the ranking performance on the target domain compared to strong DR baselines and domain adaptation methods, while being stable during training and can be combined with query generation or second-stage pre-training. \ No newline at end of file diff --git a/data/2024/aaai/Commonsense for Zero-Shot Natural Language Video Localization b/data/2024/aaai/Commonsense for Zero-Shot Natural Language Video Localization new file mode 100644 index 0000000000..875a479c0d --- /dev/null +++ b/data/2024/aaai/Commonsense for Zero-Shot Natural Language Video Localization @@ -0,0 +1 @@ +Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. 
CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL. \ No newline at end of file diff --git a/data/2024/aaai/Communication Efficient Distributed Newton Method over Unreliable Networks b/data/2024/aaai/Communication Efficient Distributed Newton Method over Unreliable Networks new file mode 100644 index 0000000000..a57aac48da --- /dev/null +++ b/data/2024/aaai/Communication Efficient Distributed Newton Method over Unreliable Networks @@ -0,0 +1 @@ +Distributed optimization on resource-constrained devices demands both communication efficiency and fast convergence rates. Newton-type methods are becoming preferable due to their superior convergence rates compared to first-order methods. In this paper, we study a new problem concerning second-order distributed optimization over unreliable networks. The working devices are power-limited or operate in unfavorable wireless channels, experiencing packet losses during their uplink transmission to the server. Our scenario is very common in the real world and leads to instability of classical distributed optimization methods, especially second-order methods, because of their sensitivity to imprecision in the local Hessian matrices. To achieve robustness to high packet loss, communication efficiency, and fast convergence rates, we propose a novel distributed second-order method, called RED-New (Packet loss Resilient Distributed Approximate Newton). Each iteration of RED-New comprises two rounds of lightweight and lossy transmissions, in which the server aggregates the local information with a newly developed scaling strategy. We prove the linear-quadratic convergence rate of RED-New. Experimental results demonstrate its advantage over first-order and second-order baselines, and its tolerance to packet loss rates ranging from 5% to 40%. \ No newline at end of file diff --git a/data/2024/aaai/Communication-Efficient Collaborative Regret Minimization in Multi-Armed Bandits b/data/2024/aaai/Communication-Efficient Collaborative Regret Minimization in Multi-Armed Bandits new file mode 100644 index 0000000000..2bf75895f9 --- /dev/null +++ b/data/2024/aaai/Communication-Efficient Collaborative Regret Minimization in Multi-Armed Bandits @@ -0,0 +1 @@ +In this paper, we study the collaborative learning model, which concerns the tradeoff between parallelism and communication overhead in multi-agent multi-armed bandits. For regret minimization in multi-armed bandits, we present the first set of tradeoffs between the number of rounds of communication between the agents and the regret of the collaborative learning process. 
\ No newline at end of file diff --git a/data/2024/aaai/Compact HD Map Construction via Douglas-Peucker Point Transformer b/data/2024/aaai/Compact HD Map Construction via Douglas-Peucker Point Transformer new file mode 100644 index 0000000000..741cc8e75a --- /dev/null +++ b/data/2024/aaai/Compact HD Map Construction via Douglas-Peucker Point Transformer @@ -0,0 +1 @@ +High-definition (HD) map construction requires a comprehensive understanding of traffic environments, encompassing centimeter-level localization and rich semantic information. Previous works face challenges from redundant point representations or high-complexity curve modeling. In this paper, we present a flexible yet effective map element detector that synthesizes hierarchical information with a compact Douglas-Peucker (DP) point representation in a transformer architecture for robust and reliable predictions. Specifically, our proposed representation approximates class-agnostic map elements with DP points, which are sparsely located at crucial positions of the structures and avoid redundancy and complexity. In addition, we design a position constraint with uncertainty to avoid potential ambiguities. Moreover, pairwise-point shape matching constraints are proposed to balance local structural information of different scales. Experiments on the public nuScenes dataset demonstrate that our method outperforms current state-of-the-art methods. Extensive ablation studies validate each component of our method. Codes will be released at https://github.com/sweety121/DPFormer. \ No newline at end of file diff --git a/data/2024/aaai/Complementary Knowledge Distillation for Robust and Privacy-Preserving Model Serving in Vertical Federated Learning b/data/2024/aaai/Complementary Knowledge Distillation for Robust and Privacy-Preserving Model Serving in Vertical Federated Learning new file mode 100644 index 0000000000..03e8e4e22e --- /dev/null +++ b/data/2024/aaai/Complementary Knowledge Distillation for Robust and Privacy-Preserving Model Serving in Vertical Federated Learning @@ -0,0 +1 @@ +Vertical Federated Learning (VFL) enables an active party with labeled data to enhance model performance (utility) by collaborating with multiple passive parties that possess auxiliary features corresponding to the same sample identifiers (IDs). Model serving in VFL is vital for real-world, delay-sensitive applications, and it faces two major challenges: 1) robustness against arbitrarily-aligned data and stragglers; and 2) privacy protection, ensuring minimal label leakage to passive parties. Existing methods fail to transfer knowledge among parties to improve robustness in a privacy-preserving way. In this paper, we introduce a privacy-preserving knowledge transfer framework, Complementary Knowledge Distillation (CKD), designed to enhance the robustness and privacy of multi-party VFL systems. Specifically, we formulate a Complementary Label Coding (CLC) objective to encode only complementary label information of the active party's local model for passive parties to learn. Then, CKD selectively transfers the CLC-encoded complementary knowledge 1) from the passive parties to the active party, and 2) among the passive parties themselves. Experimental results on four real-world datasets demonstrate that CKD outperforms existing approaches in terms of robustness against arbitrarily-aligned data, while also minimizing label privacy leakage. 
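(Context for the Douglas-Peucker (DP) point representation in the DPFormer abstract above: DP is the classic polyline-simplification algorithm that keeps a curve's endpoints and recursively retains only the points that deviate from the current chord by more than a tolerance. The NumPy sketch below illustrates that classic algorithm only; it is not the paper's learned DP-style representation, and the function name and epsilon value are assumptions.)

```python
import numpy as np

def douglas_peucker(points, epsilon):
    """Classic Douglas-Peucker simplification of a 2D polyline: keep the two
    endpoints, find the interior point farthest from the chord joining them,
    and recurse only where that distance exceeds the tolerance epsilon."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    if norm == 0.0:
        # degenerate chord (coincident endpoints): use distance from the endpoint
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # perpendicular distance of every point to the start-end chord
        dists = np.abs(chord[0] * (points[:, 1] - start[1])
                       - chord[1] * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        left = douglas_peucker(points[:idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return np.vstack([left[:-1], right])  # drop the duplicated split point
    return np.vstack([start, end])

# Example: a gently curved lane boundary collapses to a handful of key points.
curve = np.array([[x, 0.02 * x ** 2] for x in np.linspace(0.0, 10.0, 50)])
print(douglas_peucker(curve, epsilon=0.1))
```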
\ No newline at end of file diff --git a/data/2024/aaai/Complete Neural Networks for Complete Euclidean Graphs b/data/2024/aaai/Complete Neural Networks for Complete Euclidean Graphs new file mode 100644 index 0000000000..5d81c867a1 --- /dev/null +++ b/data/2024/aaai/Complete Neural Networks for Complete Euclidean Graphs @@ -0,0 +1 @@ +Neural networks for point clouds, which respect their natural invariance to permutation and rigid motion, have enjoyed recent success in modeling geometric phenomena, from molecular dynamics to recommender systems. Yet, to date, no architecture with polynomial complexity is known to be complete, that is, able to distinguish between any pair of non-isomorphic point clouds. We fill this theoretical gap by showing that point clouds can be completely determined, up to permutation and rigid motion, by applying the 3-WL graph isomorphism test to the point cloud's centralized Gram matrix. Moreover, we formulate an Euclidean variant of the 2-WL test and show that it is also sufficient to achieve completeness. We then show how our complete Euclidean WL tests can be simulated by an Euclidean graph neural network of moderate size and demonstrate their separation capability on highly symmetrical point clouds. \ No newline at end of file diff --git a/data/2024/aaai/Completing Priceable Committees: Utilitarian and Representation Guarantees for Proportional Multiwinner Voting b/data/2024/aaai/Completing Priceable Committees: Utilitarian and Representation Guarantees for Proportional Multiwinner Voting new file mode 100644 index 0000000000..2a4d68ea50 --- /dev/null +++ b/data/2024/aaai/Completing Priceable Committees: Utilitarian and Representation Guarantees for Proportional Multiwinner Voting @@ -0,0 +1 @@ +When selecting committees based on preferences of voters, a variety of different criteria can be considered. Two natural objectives are maximizing the utilitarian welfare (the sum of voters' utilities) and coverage (the number of represented voters) of the selected committee. Previous work has studied the impact on utilitarian welfare and coverage when requiring the committee to satisfy minimal requirements such as justified representation or weak proportionality. In this paper, we consider the impact of imposing much more demanding proportionality axioms. We identify a class of voting rules that achieve strong guarantees on utilitarian welfare and coverage when combined with appropriate completions. This class is defined via a weakening of priceability and contains prominent rules such as the Method of Equal Shares. We show that committees selected by these rules (i) can be completed to achieve optimal coverage and (ii) can be completed to achieve an asymptotically optimal approximation to the utilitarian welfare if they additionally satisfy EJR+. Answering an open question of Elkind et al. (2022), we use the Greedy Justified Candidate Rule to obtain the best possible utilitarian guarantee subject to proportionality. We also consider completion methods suggested in the participatory budgeting literature and other objectives besides welfare and coverage. 
\ No newline at end of file diff --git a/data/2024/aaai/Complexity of Credulous and Skeptical Acceptance in Epistemic Argumentation Framework b/data/2024/aaai/Complexity of Credulous and Skeptical Acceptance in Epistemic Argumentation Framework new file mode 100644 index 0000000000..dffcbd1d9b --- /dev/null +++ b/data/2024/aaai/Complexity of Credulous and Skeptical Acceptance in Epistemic Argumentation Framework @@ -0,0 +1 @@ +Dung’s Argumentation Framework (AF) has been extended in several directions. Among the numerous proposed extensions, three are of particular interest and closely related to one another. These extensions are: constrained AF (CAF), where AF is augmented with (strong) constraints; epistemic AF (EAF), where AF is augmented with epistemic constraints; and incomplete AF (iAF), where arguments and attacks can be uncertain. While the complexity and expressiveness of CAF and iAF have been studied, that of EAF has not been explored so far. In this paper we investigate the complexity and expressiveness of EAF. To this end, we first introduce the Labeled CAF (LCAF), a variation of CAF where constraints are defined over the alphabet of labeled arguments. Then, we investigate the complexity of credulous and skeptical reasoning and show that: i) EAF is more expressive than iAF (under preferred semantics), ii) although LCAF is a restriction of EAF where modal operators are not allowed, these frameworks have the same complexity, and iii) the results for LCAF close a gap in the characterization of the complexity of CAF. Interestingly, even though EAF has the same complexity as LCAF, it allows modeling domain knowledge in a more natural and easy-to-understand way. \ No newline at end of file diff --git a/data/2024/aaai/Component Fourier Neural Operator for Singularly Perturbed Differential Equations b/data/2024/aaai/Component Fourier Neural Operator for Singularly Perturbed Differential Equations new file mode 100644 index 0000000000..0ededecae9 --- /dev/null +++ b/data/2024/aaai/Component Fourier Neural Operator for Singularly Perturbed Differential Equations @@ -0,0 +1 @@ +Solving Singularly Perturbed Differential Equations (SPDEs) poses computational challenges arising from the rapid transitions in their solutions within thin regions. The effectiveness of deep learning in addressing differential equations motivates us to employ these methods for solving SPDEs. In this paper, we introduce the Component Fourier Neural Operator (ComFNO), an innovative operator learning method that builds upon the Fourier Neural Operator (FNO) while incorporating valuable prior knowledge obtained from asymptotic analysis. Our approach is not limited to FNO and can be applied to other neural network frameworks, such as the Deep Operator Network (DeepONet), potentially yielding similar solvers for SPDEs. Experimental results across diverse classes of SPDEs demonstrate that ComFNO significantly improves accuracy compared to vanilla FNO. Furthermore, ComFNO exhibits natural adaptability to diverse data distributions and performs well in few-shot scenarios, showcasing its excellent generalization ability in practical situations. 
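(Background for the ComFNO abstract above, which builds on the Fourier Neural Operator: the core of a standard FNO layer is a spectral convolution that moves the input to Fourier space, keeps a fixed number of low-frequency modes, multiplies them by learned complex weights, and transforms back, alongside a pointwise linear path. The PyTorch sketch below shows only such a vanilla 1D layer; it is not the authors' ComFNO, which additionally injects priors from asymptotic analysis, and the class and parameter names are assumptions.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralConv1d(nn.Module):
    """Spectral convolution of a vanilla 1D FNO layer: FFT the input, keep a
    fixed number of low-frequency modes, multiply them by learned complex
    weights, and inverse-FFT back onto the spatial grid."""
    def __init__(self, in_channels, out_channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (in_channels * out_channels)
        self.weights = nn.Parameter(
            scale * torch.randn(in_channels, out_channels, modes, dtype=torch.cfloat))

    def forward(self, x):                         # x: (batch, in_channels, grid)
        x_ft = torch.fft.rfft(x)                  # (batch, in_channels, grid//2 + 1)
        m = min(self.modes, x_ft.size(-1))
        out_ft = torch.zeros(x.size(0), self.weights.size(1), x_ft.size(-1),
                             dtype=torch.cfloat, device=x.device)
        # channel mixing, applied independently to each retained Fourier mode
        out_ft[:, :, :m] = torch.einsum("bim,iom->bom",
                                        x_ft[:, :, :m], self.weights[:, :, :m])
        return torch.fft.irfft(out_ft, n=x.size(-1))

class FNOBlock(nn.Module):
    """One FNO block: spectral path plus a pointwise (1x1 conv) linear path."""
    def __init__(self, channels, modes):
        super().__init__()
        self.spectral = SpectralConv1d(channels, channels, modes)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        return F.gelu(self.spectral(x) + self.pointwise(x))

# Example: a batch of 4 functions sampled on a 128-point grid with 32 channels.
y = FNOBlock(channels=32, modes=16)(torch.randn(4, 32, 128))
```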
\ No newline at end of file diff --git a/data/2024/aaai/Composing Biases by Using CP to Decompose Minimal Functional Dependencies for Acquiring Complex Formulae b/data/2024/aaai/Composing Biases by Using CP to Decompose Minimal Functional Dependencies for Acquiring Complex Formulae new file mode 100644 index 0000000000..0b6da97521 --- /dev/null +++ b/data/2024/aaai/Composing Biases by Using CP to Decompose Minimal Functional Dependencies for Acquiring Complex Formulae @@ -0,0 +1 @@ +Given a table with a minimal set of input columns that functionally determines an output column, we introduce a method that tries to gradually decompose the corresponding minimal functional dependency (mfd) to acquire a formula expressing the output column in terms of the input columns. A first key element of the method is to create sub-problems that are easier to solve than the original formula acquisition problem, either because it learns formulae with fewer inputs parameters, or as it focuses on formulae of a particular class, such as Boolean formulae; as a result, the acquired formulae can mix different learning biases such as polynomials, conditionals or Boolean expressions. A second key feature of the method is that it can be applied recursively to find formulae that combine polynomial, conditional or Boolean sub-terms in a nested manner. The method was tested on data for eight families of combinatorial objects; new conjectures were found that were previously unattainable. The method often creates conjectures that combine several formulae into one with a limited number of automatically found Boolean terms. \ No newline at end of file diff --git a/data/2024/aaai/Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions b/data/2024/aaai/Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions new file mode 100644 index 0000000000..14acd3ade3 --- /dev/null +++ b/data/2024/aaai/Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions @@ -0,0 +1 @@ +Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for ‘numbats.’ Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., “numbat digging in the ground.” In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of “difficult-to-name but easy-to-draw” objects and text describing “difficult-to-sketch but easy-to-verbalize” object's attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of ~2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNet (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. 
Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at: https://vl2g.github.io/projects/cstbir. \ No newline at end of file diff --git a/data/2024/aaai/Compositional Generalization for Multi-Label Text Classification: A Data-Augmentation Approach b/data/2024/aaai/Compositional Generalization for Multi-Label Text Classification: A Data-Augmentation Approach new file mode 100644 index 0000000000..50791c2fb2 --- /dev/null +++ b/data/2024/aaai/Compositional Generalization for Multi-Label Text Classification: A Data-Augmentation Approach @@ -0,0 +1 @@ +Despite significant advancements in multi-label text classification, the ability of existing models to generalize to novel and seldom-encountered complex concepts, which are compositions of elementary ones, remains underexplored. This research addresses this gap. By creating unique data splits across three benchmarks, we assess the compositional generalization ability of existing multi-label text classification models. Our results show that these models often fail to generalize to compositional concepts encountered infrequently during training, leading to inferior performance on tests with these new combinations. To address this, we introduce a data augmentation method that leverages two innovative text generation models designed to enhance the classification models' capacity for compositional generalization. Our experiments show that this data augmentation approach significantly improves the compositional generalization capabilities of classification models on our benchmarks, with both generation models surpassing other text generation baselines. Our codes available at https://github.com/yychai74/LD-VAE. \ No newline at end of file diff --git a/data/2024/aaai/Compositional Inversion for Stable Diffusion Models b/data/2024/aaai/Compositional Inversion for Stable Diffusion Models new file mode 100644 index 0000000000..98881cb2ab --- /dev/null +++ b/data/2024/aaai/Compositional Inversion for Stable Diffusion Models @@ -0,0 +1 @@ +Inversion methods, such as Textual Inversion, generate personalized images by incorporating concepts of interest provided by user images. However, existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space. To address this issue, we propose a method that guides the inversion process towards the core distribution for compositional embeddings. Additionally, we introduce a spatial regularization approach to balance the attention on the concepts being composed. Our method is designed as a post-training approach and can be seamlessly integrated with other inversion methods. Experimental results demonstrate the effectiveness of our proposed approach in mitigating the overfitting problem and generating more diverse and balanced compositions of concepts in the synthesized images. The source code is available at https://github.com/zhangxulu1996/Compositional-Inversion. 
\ No newline at end of file diff --git a/data/2024/aaai/Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models b/data/2024/aaai/Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models new file mode 100644 index 0000000000..a9d53fc4e2 --- /dev/null +++ b/data/2024/aaai/Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models @@ -0,0 +1 @@ +Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, they fail to semantically align the generated images with the prompts due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, a unique mask control is applied to the cross- and self-attention maps. Our approach produces a more semantically accurate synthesis by constraining the attention regions of each token in the prompt to the image. In addition, the proposed method is straightforward and effective and can be readily integrated into existing cross-attention-based T2I generators. We compare our approach to competing methods and demonstrate that it can faithfully convey the semantics of the original text to the generated content and achieve high availability as a ready-to-use plugin. Please refer to https://github.com/OPPO-Mente-Lab/attention-mask-control. \ No newline at end of file diff --git a/data/2024/aaai/Compound Text-Guided Prompt Tuning via Image-Adaptive Cues b/data/2024/aaai/Compound Text-Guided Prompt Tuning via Image-Adaptive Cues new file mode 100644 index 0000000000..561eacf9b0 --- /dev/null +++ b/data/2024/aaai/Compound Text-Guided Prompt Tuning via Image-Adaptive Cues @@ -0,0 +1 @@ +Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable generalization capabilities to downstream tasks. However, existing prompt-tuning-based frameworks need to parallelize learnable textual inputs for all categories, suffering from massive GPU memory consumption when there is a large number of categories in the target dataset. Moreover, previous works require category names to be included within prompts, exhibiting subpar performance when dealing with ambiguous category names. To address these shortcomings, we propose Compound Text-Guided Prompt Tuning (TGP-T), which significantly reduces resource demand while achieving superior performance. We introduce text supervision to the optimization of prompts, which enables two benefits: 1) releasing the model's reliance on pre-defined category names during inference, thereby enabling more flexible prompt generation; and 2) reducing the number of inputs to the text encoder, which decreases GPU memory consumption significantly. Specifically, we found that compound text supervision, i.e., category-wise and content-wise, is highly effective, since the two forms provide inter-class separability and capture intra-class variations, respectively. Moreover, we condition the prompt generation on visual features through a module called Bonder, which facilitates the alignment between prompts and visual features. 
Extensive experiments on few-shot recognition and domain generalization demonstrate that TGP-T achieves superior performance with consistently lower training costs. It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet. The code is available at https://github.com/EricTan7/TGP-T. \ No newline at end of file diff --git a/data/2024/aaai/Comprehensive View Embedding Learning for Single-Cell Multimodal Integration b/data/2024/aaai/Comprehensive View Embedding Learning for Single-Cell Multimodal Integration new file mode 100644 index 0000000000..0ea1d5dd72 --- /dev/null +++ b/data/2024/aaai/Comprehensive View Embedding Learning for Single-Cell Multimodal Integration @@ -0,0 +1 @@ +Motivation: Advances in single-cell measurement techniques provide rich multimodal data, which helps us to explore the life state of cells more deeply. However, multimodal integration, i.e., learning joint embeddings from multimodal data, remains a challenge. The difficulty in integrating unpaired single-cell multimodal data is that different modalities have different feature spaces, which easily leads to information loss in joint embedding. Moreover, few existing methods have fully exploited and fused the information in single-cell multimodal data. Result: In this study, we propose CoVEL, a deep learning method for unsupervised integration of single-cell multimodal data. CoVEL learns single-cell representations from a comprehensive view, including regulatory relationships between modalities, fine-grained representations of cells, and relationships between different cells. The comprehensive view embedding enables CoVEL to remove the gap between modalities while protecting biological heterogeneity. Experimental results on multiple public datasets show that CoVEL is accurate and robust for single-cell multimodal integration. Data availability: https://github.com/shapsider/scintegration. \ No newline at end of file diff --git a/data/2024/aaai/Comprehensive Visual Grounding for Video Description b/data/2024/aaai/Comprehensive Visual Grounding for Video Description new file mode 100644 index 0000000000..572b5f1649 --- /dev/null +++ b/data/2024/aaai/Comprehensive Visual Grounding for Video Description @@ -0,0 +1 @@ +The grounding accuracy of existing video captioners still falls short of expectations. The majority of existing methods perform grounded video captioning on sparse entity annotations, and the captioning accuracy often suffers from degenerated object appearances in the annotated area, such as motion blur and video defocus. Moreover, these methods seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network to improve video captioning, by explicitly linking the entities and actions to the visual clues across the video frames. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames, even though the entity is annotated in only one frame of a video. The action grounding dynamically associates the verbs with related subjects and the corresponding context, which keeps fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by a soft grounding supervision, which brings architecture simplification and improves training efficiency as well. 
We conduct extensive experiments on two challenging datasets, and demonstrate significant performance improvements of +2.3 CIDEr on ActivityNet-Entities and +2.2 CIDEr on MSR-VTT compared to state-of-the-arts. \ No newline at end of file diff --git a/data/2024/aaai/Compressing Image-to-Image Translation GANs Using Local Density Structures on Their Learned Manifold b/data/2024/aaai/Compressing Image-to-Image Translation GANs Using Local Density Structures on Their Learned Manifold new file mode 100644 index 0000000000..242b07f031 --- /dev/null +++ b/data/2024/aaai/Compressing Image-to-Image Translation GANs Using Local Density Structures on Their Learned Manifold @@ -0,0 +1 @@ +Generative Adversarial Networks (GANs) have shown remarkable success in modeling complex data distributions for image-to-image translation. Still, their high computational demands prohibit their deployment in practical scenarios like edge devices. Existing GAN compression methods mainly rely on knowledge distillation or convolutional classifiers' pruning techniques. Thus, they neglect the critical characteristic of GANs: their local density structure over their learned manifold. Accordingly, we approach GAN compression from a new perspective by explicitly encouraging the pruned model to preserve the density structure of the original parameter-heavy model on its learned manifold. We facilitate this objective for the pruned model by partitioning the learned manifold of the original generator into local neighborhoods around its generated samples. Then, we propose a novel pruning objective to regularize the pruned model to preserve the local density structure over each neighborhood, resembling the kernel density estimation method. Also, we develop a collaborative pruning scheme in which the discriminator and generator are pruned by two pruning agents. We design the agents to capture interactions between the generator and discriminator by exchanging their peer's feedback when determining corresponding models' architectures. Thanks to such a design, our pruning method can efficiently find performant sub-networks and can maintain the balance between the generator and discriminator more effectively compared to baselines during pruning, thereby showing more stable pruning dynamics. Our experiments on image translation GAN models, Pix2Pix and CycleGAN, with various benchmark datasets and architectures demonstrate our method's effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Computing the Why-Provenance for Datalog Queries via SAT Solvers b/data/2024/aaai/Computing the Why-Provenance for Datalog Queries via SAT Solvers new file mode 100644 index 0000000000..4b6ea726a7 --- /dev/null +++ b/data/2024/aaai/Computing the Why-Provenance for Datalog Queries via SAT Solvers @@ -0,0 +1 @@ +Explaining an answer to a Datalog query is an essential task towards Explainable AI, especially nowadays where Datalog plays a critical role in the development of ontology-based applications. A well-established approach for explaining a query answer is the so-called why-provenance, which essentially collects all the subsets of the input database that can be used to obtain that answer via some derivation process, typically represented as a proof tree. It is well known, however, that computing the why-provenance for Datalog queries is computationally expensive, and thus, very few attempts can be found in the literature. 
The goal of this work is to demonstrate how off-the-shelf SAT solvers can be exploited towards an efficient computation of the why-provenance for Datalog queries. Interestingly, our SAT-based approach allows us to build the why-provenance in an incremental fashion, that is, one explanation at a time, which is much more useful in a practical context than the one-shot computation of the whole set of explanations as done by existing approaches. \ No newline at end of file diff --git a/data/2024/aaai/ConSequence: Synthesizing Logically Constrained Sequences for Electronic Health Record Generation b/data/2024/aaai/ConSequence: Synthesizing Logically Constrained Sequences for Electronic Health Record Generation new file mode 100644 index 0000000000..eadfa2723e --- /dev/null +++ b/data/2024/aaai/ConSequence: Synthesizing Logically Constrained Sequences for Electronic Health Record Generation @@ -0,0 +1 @@ +Generative models can produce synthetic patient records for analytical tasks when real data is unavailable or limited. However, current methods struggle with adhering to domain-specific knowledge and removing invalid data. We present ConSequence, an effective approach to integrating domain knowledge into sequential generative neural network outputs. Our rule-based formulation includes temporal aggregation and antecedent evaluation modules, ensured by an efficient matrix multiplication formulation, to satisfy hard and soft logical constraints across time steps. Existing constraint methods often fail to guarantee constraint satisfaction, lack the ability to handle temporal constraints, and hinder the learning and computational efficiency of the model. In contrast, our approach efficiently handles all types of constraints with guaranteed logical coherence. We demonstrate ConSequence's effectiveness in generating electronic health records, outperforming competitors in achieving complete temporal and spatial constraint satisfaction without compromising runtime performance or generative quality. Specifically, ConSequence successfully prevents all rule violations while improving the model quality in reducing its test perplexity by 5% and incurring less than a 13% slowdown in generation speed compared to an unconstrained model. \ No newline at end of file diff --git a/data/2024/aaai/ConVQG: Contrastive Visual Question Generation with Multimodal Guidance b/data/2024/aaai/ConVQG: Contrastive Visual Question Generation with Multimodal Guidance new file mode 100644 index 0000000000..53896d6443 --- /dev/null +++ b/data/2024/aaai/ConVQG: Contrastive Visual Question Generation with Multimodal Guidance @@ -0,0 +1 @@ +Asking questions about visual environments is a crucial way for intelligent agents to understand rich multi-faceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded to the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or leverage external commonsense knowledge that can not be obtained from the image content only. However, generating focused questions using textual constraints while enforcing a high relevance to the image content remains a challenge, as VQG systems often ignore one or both forms of grounding. 
In this work, we propose Contrastive Visual Question Generation (ConVQG), a method using a dual contrastive objective to discriminate questions generated using both modalities from those based on a single one. Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions. Our human evaluation results also show preference for ConVQG questions compared to non-contrastive baselines. \ No newline at end of file diff --git a/data/2024/aaai/ConcaveQ: Non-monotonic Value Function Factorization via Concave Representations in Deep Multi-Agent Reinforcement Learning b/data/2024/aaai/ConcaveQ: Non-monotonic Value Function Factorization via Concave Representations in Deep Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..30af9f30df --- /dev/null +++ b/data/2024/aaai/ConcaveQ: Non-monotonic Value Function Factorization via Concave Representations in Deep Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Value function factorization has achieved great success in multi-agent reinforcement learning by optimizing joint action-value functions through the maximization of factorized per-agent utilities. To ensure Individual-Global-Maximum property, existing works often focus on value factorization using monotonic functions, which are known to result in restricted representation expressiveness. In this paper, we analyze the limitations of monotonic factorization and present ConcaveQ, a novel non-monotonic value function factorization approach that goes beyond monotonic mixing functions and employs neural network representations of concave mixing functions. Leveraging the concave property in factorization, an iterative action selection scheme is developed to obtain optimal joint actions during training. It is used to update agents’ local policy networks, enabling fully decentralized execution. The effectiveness of the proposed ConcaveQ is validated across scenarios involving multi-agent predator-prey environment and StarCraft II micromanagement tasks. Empirical results exhibit significant improvement of ConcaveQ over state-of-the-art multi-agent reinforcement learning approaches. \ No newline at end of file diff --git a/data/2024/aaai/Concealing Sensitive Samples against Gradient Leakage in Federated Learning b/data/2024/aaai/Concealing Sensitive Samples against Gradient Leakage in Federated Learning new file mode 100644 index 0000000000..11e4c050b5 --- /dev/null +++ b/data/2024/aaai/Concealing Sensitive Samples against Gradient Leakage in Federated Learning @@ -0,0 +1,2 @@ +Federated Learning (FL) is a distributed learning paradigm that enhances users' privacy by eliminating the need for clients to share raw, private data with the server. +Despite the success, recent studies expose the vulnerability of FL to model inversion attacks, where adversaries reconstruct users’ private data via eavesdropping on the shared gradient information. We hypothesize that a key factor in the success of such attacks is the low entanglement among gradients per data within the batch during stochastic optimization. This creates a vulnerability that an adversary can exploit to reconstruct the sensitive data. Building upon this insight, we present a simple, yet effective defense strategy that obfuscates the gradients of the sensitive data with concealed samples. 
To achieve this, we propose synthesizing concealed samples to mimic the sensitive data at the gradient level while ensuring their visual dissimilarity from the actual sensitive data. Compared to the previous art, our empirical evaluations suggest that the proposed technique provides the strongest protection while simultaneously maintaining the FL performance. Code is located at https://github.com/JingWu321/DCS-2. \ No newline at end of file diff --git a/data/2024/aaai/Concept-Guided Prompt Learning for Generalization in Vision-Language Models b/data/2024/aaai/Concept-Guided Prompt Learning for Generalization in Vision-Language Models new file mode 100644 index 0000000000..6004374a16 --- /dev/null +++ b/data/2024/aaai/Concept-Guided Prompt Learning for Generalization in Vision-Language Models @@ -0,0 +1,11 @@ +The Contrastive Language-Image Pretraining (CLIP) model has exhibited remarkable efficacy in establishing cross-modal connections between texts and images, yielding impressive +performance across a broad spectrum of downstream applications through fine-tuning. However, for generalization tasks, the current fine-tuning methods for CLIP, such as CoOp and +CoCoOp, demonstrate relatively low performance on some fine-grained datasets. We recognize that the underlying reason is that these previous methods only projected global features +into the prompt, neglecting the various visual concepts, such as colors, shapes, and sizes, which are naturally transferable +across domains and play a crucial role in generalization tasks. To address this issue, in this work, we propose +Concept-Guided Prompt Learning (CPL) for vision-language models. Specifically, we leverage the well-learned knowledge +of CLIP to create a visual concept cache to enable concept-guided prompting. In order to refine the text features, we further +develop a projector that transforms multi-level visual features into text features. We observe that this concept-guided +prompt learning approach is able to achieve enhanced consistency between visual and linguistic modalities. Extensive +experimental results demonstrate that our CPL method significantly improves generalization capabilities compared to +the current state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models b/data/2024/aaai/ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models new file mode 100644 index 0000000000..6bcbe4d198 --- /dev/null +++ b/data/2024/aaai/ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models @@ -0,0 +1 @@ +The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-definition and realistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models in learning and synthesizing novel visual concepts (a.k.a. personalized T2I), we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts.
Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in target images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving the compositionality, which existing approaches struggle to overcome. The data, code, and interactive demo are available at: https://conceptbed.github.io/ \ No newline at end of file diff --git a/data/2024/aaai/ConditionVideo: Training-Free Condition-Guided Video Generation b/data/2024/aaai/ConditionVideo: Training-Free Condition-Guided Video Generation new file mode 100644 index 0000000000..13c5f4b766 --- /dev/null +++ b/data/2024/aaai/ConditionVideo: Training-Free Condition-Guided Video Generation @@ -0,0 +1 @@ +Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and requiring a large amount of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D ControlNet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method exhibits superior performance in terms of frame consistency, CLIP score, and conditional accuracy, outperforming other compared methods. \ No newline at end of file diff --git a/data/2024/aaai/Conditional Backdoor Attack via JPEG Compression b/data/2024/aaai/Conditional Backdoor Attack via JPEG Compression new file mode 100644 index 0000000000..fd445dde13 --- /dev/null +++ b/data/2024/aaai/Conditional Backdoor Attack via JPEG Compression @@ -0,0 +1 @@ +Deep neural network (DNN) models have been proven vulnerable to backdoor attacks. One trend of backdoor attacks is developing more invisible and dynamic triggers to make attacks stealthier. However, these invisible and dynamic triggers can be inadvertently mitigated by some widely used passive denoising operations, such as image compression, making the efforts under this trend questionable. Another trend is to exploit the full potential of backdoor attacks by proposing new triggering paradigms, such as hibernated or opportunistic backdoors. In line with these trends, our work investigates the first conditional backdoor attack, where the backdoor is activated by a specific condition rather than pre-defined triggers.
Specifically, we take the JPEG compression as our condition and jointly optimize the compression operator and the target model's loss function, which can force the target model to accurately learn the JPEG compression behavior as the triggering condition. In this case, besides the conditional triggering feature, our attack is also stealthy and robust to denoising operations. Extensive experiments on the MNIST, GTSRB, and CelebA datasets verify our attack's effectiveness, stealthiness, and resistance to existing backdoor defenses and denoising operations. As a new triggering paradigm, the conditional backdoor attack brings a new angle for assessing the vulnerability of DNN models, and conditioning on JPEG compression magnifies its threat due to the universal usage of JPEG. \ No newline at end of file diff --git a/data/2024/aaai/Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment b/data/2024/aaai/Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment new file mode 100644 index 0000000000..ebbf000317 --- /dev/null +++ b/data/2024/aaai/Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment @@ -0,0 +1 @@ +Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multi-modal task, there exists an inherent modality gap between sign language videos and spoken language text, which makes the cross-modal alignment between visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to help alleviate the cross-modal problem, thereby neglecting the alignment across modalities, which may lead to compromised results. To address this issue, we propose a novel framework based on Conditional Variational autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, our CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences to regularize the outputs of the encoder and decoder, respectively. In the prior path, the model solely relies on visual information to predict the target text; whereas in the posterior path, it simultaneously encodes visual information and textual knowledge to reconstruct the target text. The first KL divergence optimizes the conditional variational autoencoder and regularizes the encoder outputs, while the second KL divergence performs a self-distillation from the posterior path to the prior path, ensuring the consistency of decoder outputs. We further enhance the integration of textual information into the posterior path by employing a shared Attention Residual Gaussian Distribution (ARGD), which considers the textual information in the posterior path as a residual component relative to the prior path. Extensive experiments conducted on public datasets demonstrate the effectiveness of our framework, achieving new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy. The code and models are available at https://github.com/rzhao-zhsq/CV-SLT.
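As a rough illustration of the two-path objective described in the CV-SLT abstract above, the sketch below (not the authors' code; module names, tensor shapes, and the equal loss weighting are assumptions) pairs a reconstruction term on the posterior path with one KL between the latent distributions of the two paths and one KL that distills the posterior decoder distribution into the prior path.

```python
# Hedged sketch of a two-path CVAE-style SLT objective; illustrative only.
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, averaged over the batch.
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                  - 1.0).sum(-1).mean()

def two_path_loss(prior_mu, prior_logvar, post_mu, post_logvar,
                  prior_logits, post_logits, target_ids, pad_id=0):
    # Reconstruction: the posterior path (video + text) reconstructs the target text.
    recon = F.cross_entropy(post_logits.flatten(0, 1), target_ids.flatten(),
                            ignore_index=pad_id)
    # KL 1: regularize the text-aware posterior latent toward the video-only prior latent.
    kl_latent = gaussian_kl(post_mu, post_logvar, prior_mu, prior_logvar)
    # KL 2: self-distillation, pushing prior-path decoder outputs toward posterior-path outputs.
    kl_decoder = F.kl_div(F.log_softmax(prior_logits, dim=-1),
                          F.softmax(post_logits.detach(), dim=-1),
                          reduction="batchmean")
    return recon + kl_latent + kl_decoder  # weights between the terms are assumed equal here
```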
\ No newline at end of file diff --git a/data/2024/aaai/Confidence Is All You Need for MI Attacks (Student Abstract) b/data/2024/aaai/Confidence Is All You Need for MI Attacks (Student Abstract) new file mode 100644 index 0000000000..177c63fcc7 --- /dev/null +++ b/data/2024/aaai/Confidence Is All You Need for MI Attacks (Student Abstract) @@ -0,0 +1 @@ +In this evolving era of machine learning security, membership inference attacks have emerged as a potent threat to the confidentiality of sensitive data. In this attack, adversaries aim to determine whether a particular point was used during the training of a target model. This paper proposes a new method to gauge a data point’s membership in a model’s training set. Instead of correlating loss with membership, as is traditionally done, we have leveraged the fact that training examples generally exhibit higher confidence values when classified into their actual class. During training, the model is essentially being 'fit' to the training data and might face particular difficulties in generalization to unseen data. This asymmetry leads to the model achieving higher confidence on the training data as it exploits the specific patterns and noise present in the training data. Our proposed approach leverages the confidence values generated by the machine-learning model. These confidence values provide a probabilistic measure of the model’s certainty in its predictions and can further be used to infer the membership of a given data point. Additionally, we also introduce another variant of our method that allows us to carry out this attack without knowing the ground truth (true class) of a given data point, thus offering an edge over existing label-dependent attack methods. \ No newline at end of file diff --git a/data/2024/aaai/Conformal Autoregressive Generation: Beam Search with Coverage Guarantees b/data/2024/aaai/Conformal Autoregressive Generation: Beam Search with Coverage Guarantees new file mode 100644 index 0000000000..5ff64f4d86 --- /dev/null +++ b/data/2024/aaai/Conformal Autoregressive Generation: Beam Search with Coverage Guarantees @@ -0,0 +1,4 @@ +We introduce two new extensions to the beam search algorithm based on conformal predictions (CP) to produce sets of sequences with theoretical coverage guarantees. +The first method is very simple and proposes dynamically-sized subsets of beam search results but, unlike typical CP procedures, has an upper bound on the achievable guarantee depending on a post-hoc calibration measure. +Our second algorithm introduces the conformal set prediction procedure as part of the decoding process, producing a variable beam width which adapts to the current uncertainty. +While more complex, this procedure can achieve coverage guarantees selected a priori. We provide marginal coverage bounds as well as calibration-conditional guarantees for each method, and evaluate them empirically on a selection of tasks drawing from natural language processing and chemistry.
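The first extension described above can be pictured with a toy split-conformal filter over beam candidates: calibrate a score threshold on held-out references and keep every hypothesis scoring under it, which yields a dynamically sized set. Function names and the choice of nonconformity score are assumptions, and, as the abstract notes, the achievable guarantee is capped by how often the reference sequence appears in the beam at all.

```python
# Toy split-conformal selection over beam-search outputs; illustrative sketch only.
import numpy as np

def calibrate_threshold(cal_scores, delta):
    """cal_scores: nonconformity score of the reference sequence for each calibration
    example (e.g., its negative length-normalized log-likelihood under the model)."""
    n = len(cal_scores)
    level = min(1.0, np.ceil((n + 1) * (1 - delta)) / n)  # finite-sample correction
    return np.quantile(cal_scores, level, method="higher")

def conformal_beam_set(hypotheses, scores, q_hat):
    # Keep every beam hypothesis whose nonconformity score is below the threshold.
    return [h for h, s in zip(hypotheses, scores) if s <= q_hat]

# Toy usage: calibrate on 500 fake scores, then filter a beam of five hypotheses.
q_hat = calibrate_threshold(np.random.rand(500), delta=0.1)
kept = conformal_beam_set([f"hyp{i}" for i in range(5)], np.random.rand(5), q_hat)
```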
\ No newline at end of file diff --git a/data/2024/aaai/Conformal Crystal Graph Transformer with Robust Encoding of Periodic Invariance b/data/2024/aaai/Conformal Crystal Graph Transformer with Robust Encoding of Periodic Invariance new file mode 100644 index 0000000000..bf21c36952 --- /dev/null +++ b/data/2024/aaai/Conformal Crystal Graph Transformer with Robust Encoding of Periodic Invariance @@ -0,0 +1 @@ +Machine learning techniques, especially in the realm of materials design, hold immense promise in predicting the properties of crystal materials and aiding in the discovery of novel crystals with desirable traits. However, crystals possess unique geometric constraints—namely, E(3) invariance for primitive cell and periodic invariance—which need to be accurately reflected in crystal representations. Though past research has explored various construction techniques to preserve periodic invariance in crystal representations, their robustness remains inadequate. Furthermore, effectively capturing angular information within 3D crystal structures continues to pose a significant challenge for graph-based approaches. This study introduces novel solutions to these challenges. We first present a graph construction method that robustly encodes periodic invariance and a strategy to capture angular information in neural networks without compromising efficiency. We further introduce CrystalFormer, a pioneering graph transformer architecture that emphasizes angle preservation and enhances long-range information. Through comprehensive evaluation, we verify our model's superior performance in 5 crystal prediction tasks, reaffirming the efficiency of our proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/Conformal Prediction Regions for Time Series Using Linear Complementarity Programming b/data/2024/aaai/Conformal Prediction Regions for Time Series Using Linear Complementarity Programming new file mode 100644 index 0000000000..99840453ec --- /dev/null +++ b/data/2024/aaai/Conformal Prediction Regions for Time Series Using Linear Complementarity Programming @@ -0,0 +1 @@ +Conformal prediction is a statistical tool for producing prediction regions of machine learning models that are valid with high probability. However, applying conformal prediction to time series data leads to conservative prediction regions. In fact, to obtain prediction regions over T time steps with confidence 1--delta, previous works require that each individual prediction region is valid with confidence 1--delta/T. We propose an optimization-based method for reducing this conservatism to enable long horizon planning and verification when using learning-enabled time series predictors. Instead of considering prediction errors individually at each time step, we consider a parameterized prediction error over multiple time steps. By optimizing the parameters over an additional dataset, we find prediction regions that are not conservative. We show that this problem can be cast as a mixed integer linear complementarity program (MILCP), which we then relax into a linear complementarity program (LCP). Additionally, we prove that the relaxed LP has the same optimal cost as the original MILCP. Finally, we demonstrate the efficacy of our method on case studies using pedestrian trajectory predictors and F16 fighter jet altitude predictors. 
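The conservatism the time-series abstract starts from, and the remedy it proposes, can both be sketched in a few lines: the naive route calibrates every step at level 1 - delta/T (a union bound), while a single joint score over all T steps is calibrated once at level 1 - delta using per-step scalings. In the sketch below the scaling vector alpha is simply assumed to be given, whereas the paper obtains it by optimizing over an additional dataset via the (MI)LCP formulation.

```python
# Split-conformal regions for a T-step forecast; hedged sketch, not the paper's code.
import numpy as np

def bonferroni_regions(cal_errors, delta):
    """Per-step calibration with a union bound: each of the T steps at level 1 - delta/T.
    cal_errors: array (n_cal, T) of absolute prediction errors per time step."""
    n, T = cal_errors.shape
    level = min(1.0, np.ceil((n + 1) * (1 - delta / T)) / n)
    return np.quantile(cal_errors, level, axis=0)  # one (conservative) radius per step

def parameterized_regions(cal_errors, delta, alpha):
    """One joint score max_t |e_t| / alpha_t calibrated once at level 1 - delta.
    alpha: positive per-step scalings of shape (T,), assumed given here."""
    n = cal_errors.shape[0]
    scores = (cal_errors / alpha).max(axis=1)
    level = min(1.0, np.ceil((n + 1) * (1 - delta)) / n)
    q = np.quantile(scores, level)
    return q * alpha  # radius of the prediction region at each of the T steps
```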
\ No newline at end of file diff --git a/data/2024/aaai/Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum b/data/2024/aaai/Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum new file mode 100644 index 0000000000..5ccc58f25f --- /dev/null +++ b/data/2024/aaai/Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum @@ -0,0 +1 @@ +Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extending the capability of LLMs. Although there are some works that employ open-source LLMs for the tool-learning task, most of them are trained in a controlled environment in which LLMs only learn to execute the human-provided tools. However, selecting proper tools from a large toolset is also a crucial ability for the tool-learning model to be applied in real-world applications. Existing methods usually directly employ self-instruction methods to train the model, which ignores differences in tool complexity. In this paper, we propose Confucius, a novel tool-learning framework to train LLMs to use complicated tools in real-world scenarios, which contains two main phases: (1) we first propose a multi-stage learning method to teach the LLM to use various tools from an easy-to-difficult curriculum; (2) we then propose the Iterative Self-instruct from Introspective Feedback (ISIF) to dynamically construct the dataset to improve the ability to use complicated tools. Extensive experiments conducted in both controlled and real-world settings demonstrate the superiority of our tool-learning framework in the real-world application scenario compared to both tuning-free (e.g., ChatGPT, Claude) and tuning-based baselines (e.g., GPT4Tools). \ No newline at end of file diff --git a/data/2024/aaai/Confusing Pair Correction Based on Category Prototype for Domain Adaptation under Noisy Environments b/data/2024/aaai/Confusing Pair Correction Based on Category Prototype for Domain Adaptation under Noisy Environments new file mode 100644 index 0000000000..de365913a9 --- /dev/null +++ b/data/2024/aaai/Confusing Pair Correction Based on Category Prototype for Domain Adaptation under Noisy Environments @@ -0,0 +1 @@ +In this paper, we address unsupervised domain adaptation under noisy environments, which is more challenging and practical than traditional domain adaptation. In this scenario, the model is prone to overfitting noisy labels, resulting in a more pronounced domain shift and a notable decline in the overall model performance. Previous methods employed prototype methods for domain adaptation on robust feature spaces. However, these approaches struggle to effectively classify classes with similar features under noisy environments. To address this issue, we propose a new method to detect and correct confusing class pairs. We first divide classes into easy and hard classes based on the small-loss criterion. We then leverage the top-2 predictions for each sample after aligning the source and target domains to find the confusing pair among the hard classes. We apply label correction to the noisy samples within the confusing pair. With the proposed label correction method, we can train our model with more accurate labels. Extensive experiments confirm the effectiveness of our method and demonstrate its favorable performance compared with existing state-of-the-art methods. Our codes are publicly available at https://github.com/Hehxcf/CPC/.
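The pair-detection step that the abstract above walks through can be caricatured in a few lines: among the hard classes, count how often two classes co-occur as a target sample's top-2 predictions and flag the most frequent pairs. The helper name and the simple counting rule are illustrative assumptions rather than the released implementation.

```python
# Toy sketch of detecting confusing class pairs from top-2 predictions.
from collections import Counter

def find_confusing_pairs(top2_preds, hard_classes, k=1):
    """top2_preds: iterable of (first_choice, second_choice) predicted classes for
    target-domain samples; hard_classes: set of classes flagged by the small-loss split."""
    counts = Counter(
        tuple(sorted(pair)) for pair in top2_preds
        if pair[0] != pair[1] and pair[0] in hard_classes and pair[1] in hard_classes
    )
    return [pair for pair, _ in counts.most_common(k)]

# Toy usage: classes 3 and 7 keep appearing together, so they form the confusing pair.
pairs = find_confusing_pairs([(3, 7), (7, 3), (3, 7), (1, 2)], hard_classes={3, 7, 9})
```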
\ No newline at end of file diff --git a/data/2024/aaai/Considering Nonstationary within Multivariate Time Series with Variational Hierarchical Transformer for Forecasting b/data/2024/aaai/Considering Nonstationary within Multivariate Time Series with Variational Hierarchical Transformer for Forecasting new file mode 100644 index 0000000000..144049a23e --- /dev/null +++ b/data/2024/aaai/Considering Nonstationary within Multivariate Time Series with Variational Hierarchical Transformer for Forecasting @@ -0,0 +1 @@ +The forecasting of Multivariate Time Series (MTS) has long been an important but challenging task. Due to the non-stationary problem across long-distance time steps, previous studies primarily adopt stationarization methods to attenuate the non-stationary problem of the original series for better predictability. However, existing methods always adopt the stationarized series, which ignores the inherent non-stationarity, and have difficulty in modeling MTS with complex distributions due to the lack of stochasticity. To tackle these problems, we first develop a powerful hierarchical probabilistic generative module to consider the non-stationarity and stochasticity characteristics within MTS, and then combine it with a Transformer for a well-defined variational generative dynamic model named Hierarchical Time series Variational Transformer (HTV-Trans), which recovers the intrinsic non-stationary information into temporal dependencies. Being a powerful probabilistic model, HTV-Trans is utilized to learn expressive representations of MTS and applied to forecasting tasks. Extensive experiments on diverse datasets show the efficiency of HTV-Trans on MTS forecasting tasks. \ No newline at end of file diff --git a/data/2024/aaai/ConsistNER: Towards Instructive NER Demonstrations for LLMs with the Consistency of Ontology and Context b/data/2024/aaai/ConsistNER: Towards Instructive NER Demonstrations for LLMs with the Consistency of Ontology and Context new file mode 100644 index 0000000000..0e17826f13 --- /dev/null +++ b/data/2024/aaai/ConsistNER: Towards Instructive NER Demonstrations for LLMs with the Consistency of Ontology and Context @@ -0,0 +1 @@ +Named entity recognition (NER) aims to identify and classify specific entities mentioned in textual sentences. Most existing superior NER models employ the standard fully supervised paradigm, which requires a large amount of annotated data during training. In order to maintain performance with insufficient annotation resources (i.e., low resources), in-context learning (ICL) has drawn a lot of attention, due to its plug-and-play nature compared to other methods (e.g., meta-learning and prompt learning). In this manner, how to retrieve highly correlated demonstrations for target sentences serves as the key to emerging ICL ability. For the NER task, the correlation implies the consistency of both ontology (i.e., generalized entity type) and context (i.e., sentence semantics), which is ignored by previous NER demonstration retrieval techniques. To address this issue, we propose ConsistNER, a novel three-stage framework that incorporates ontological and contextual information for low-resource NER. Firstly, ConsistNER employs large language models (LLMs) to pre-recognize potential entities in a zero-shot manner.
Secondly, ConsistNER retrieves the sentence-specific demonstrations for each target sentence based on the two following considerations: (1) Regarding ontological consistency, demonstrations are filtered into a candidate set based on ontology distribution. (2) Regarding contextual consistency, an entity-aware self-attention mechanism is introduced to focus more on the potential entities and semantic-correlated tokens. Finally, ConsistNER feeds the retrieved demonstrations for all target sentences into LLMs for prediction. We conduct experiments on four widely-adopted NER datasets, including both general and specific domains. Experimental results show that ConsistNER achieves a 6.01%-26.37% and 3.07%-21.18% improvement over the state-of-the-art baselines on Micro-F1 scores under 1- and 5-shot settings, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Consistency-GAN: Training GANs with Consistency Model b/data/2024/aaai/Consistency-GAN: Training GANs with Consistency Model new file mode 100644 index 0000000000..82478150ab --- /dev/null +++ b/data/2024/aaai/Consistency-GAN: Training GANs with Consistency Model @@ -0,0 +1 @@ +For generative learning tasks, there are three crucial criteria for generating samples from the models: quality, coverage/diversity, and sampling speed. Among the existing generative models, Generative adversarial networks (GANs) and diffusion models demonstrate outstanding quality performance while suffering from notable limitations. GANs can generate high-quality results and enable fast sampling, their drawbacks, however, lie in the limited diversity of the generated samples. On the other hand, diffusion models excel at generating high-quality results with a commendable diversity. Yet, its iterative generation process necessitates hundreds to thousands of sampling steps, leading to slow speeds that are impractical for real-time scenarios. To address the aforementioned problem, this paper proposes a novel Consistency-GAN model. In particular, to aid in the training of the GAN, we introduce instance noise, which employs consistency models using only a few steps compared to the conventional diffusion process. Our evaluations on various datasets indicate that our approach significantly accelerates sampling speeds compared to traditional diffusion models, while preserving sample quality and diversity. Furthermore, our approach also has better model coverage than traditional adversarial training methods. \ No newline at end of file diff --git a/data/2024/aaai/Consistency-Guided Temperature Scaling Using Style and Content Information for Out-of-Domain Calibration b/data/2024/aaai/Consistency-Guided Temperature Scaling Using Style and Content Information for Out-of-Domain Calibration new file mode 100644 index 0000000000..0f055c412e --- /dev/null +++ b/data/2024/aaai/Consistency-Guided Temperature Scaling Using Style and Content Information for Out-of-Domain Calibration @@ -0,0 +1 @@ +Research interests in the robustness of deep neural networks against domain shifts have been rapidly increasing in recent years. Most existing works, however, focus on improving the accuracy of the model, not the calibration performance which is another important requirement for trustworthy AI systems. Temperature scaling (TS), an accuracy-preserving post-hoc calibration method, has been proven to be effective in in-domain settings, but not in out-of-domain (OOD) due to the difficulty in obtaining a validation set for the unseen domain beforehand. 
In this paper, we propose consistency-guided temperature scaling (CTS), a new temperature scaling strategy that can significantly enhance the OOD calibration performance by providing mutual supervision among data samples in the source domains. Motivated by our observation that over-confidence stemming from inconsistent sample predictions is the main obstacle to OOD calibration, we propose to guide the scaling process by taking consistencies into account in terms of two different aspects - style and content - which are the key components that can well-represent data samples in multi-domain settings. Experimental results demonstrate that our proposed strategy outperforms existing works, achieving superior OOD calibration performance on various datasets. This can be accomplished by employing only the source domains without compromising accuracy, making our scheme directly applicable to various trustworthy AI systems. \ No newline at end of file diff --git a/data/2024/aaai/ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference b/data/2024/aaai/ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference new file mode 100644 index 0000000000..8f66930112 --- /dev/null +++ b/data/2024/aaai/ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference @@ -0,0 +1 @@ +Early Exiting is one of the most popular methods to achieve efficient inference. Current early exiting methods adopt the (weighted) sum of the cross entropy loss of all internal classifiers as the objective function during training, imposing all these classifiers to predict all instances correctly. However, during inference, as long as one internal classifier predicts an instance correctly, it can accelerate without losing accuracy. Thus, there is a notable gap between training and inference. We propose ConsistentEE, an early exiting method that is consistent in training and inference. ConsistentEE formulates the early exiting process as a reinforcement learning problem. A policy network is added to decide whether an instance should exit or continue. The training objective of ConsistentEE only requires each instance to be predicted correctly by one internal classifier. Additionally, we introduce the concept "Memorized Layer" to measure the hardness of an instance. We incorporate the memorized layer into reward function design, which allows "easy'' instances to focus more on acceleration while ``hard'' instances to focus more on accuracy. Experimental results show that our method outperforms other baselines on various natural language understanding and generation tasks using PLMs and LLMs as backbones respectively. \ No newline at end of file diff --git a/data/2024/aaai/Constrained Bayesian Optimization under Partial Observations: Balanced Improvements and Provable Convergence b/data/2024/aaai/Constrained Bayesian Optimization under Partial Observations: Balanced Improvements and Provable Convergence new file mode 100644 index 0000000000..723b75fbd1 --- /dev/null +++ b/data/2024/aaai/Constrained Bayesian Optimization under Partial Observations: Balanced Improvements and Provable Convergence @@ -0,0 +1 @@ +The partially observable constrained optimization problems (POCOPs) impede data-driven optimization techniques since an infeasible solution of POCOPs can provide little information about the objective as well as the constraints. 
We endeavor to design an efficient and provable method for expensive POCOPs under the framework of constrained Bayesian optimization. Our method consists of two key components. Firstly, we present an improved design of the acquisition functions that introduce balanced exploration during optimization. We rigorously study the convergence properties of this design to demonstrate its effectiveness. Secondly, we propose Gaussian processes embedding different likelihoods as the surrogate model for partially observable constraints. This model leads to a more accurate representation of the feasible regions compared to traditional classification-based models. Our proposed method is empirically studied on both synthetic and real-world problems. The results demonstrate the competitiveness of our method for solving POCOPs. \ No newline at end of file diff --git a/data/2024/aaai/Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming b/data/2024/aaai/Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming new file mode 100644 index 0000000000..61c2a5263d --- /dev/null +++ b/data/2024/aaai/Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming @@ -0,0 +1 @@ +Despite remarkable achievements in artificial intelligence, the deployability of learning-enabled systems in high-stakes real-world environments still faces persistent challenges. For example, in safety-critical domains like autonomous driving, robotic manipulation, and healthcare, it is crucial not only to achieve high performance but also to comply with given constraints. Furthermore, adaptability becomes paramount in non-stationary domains, where environmental parameters are subject to change. While safety and adaptability are recognized as key qualities for the new generation of AI, current approaches have not demonstrated effective adaptable performance in constrained settings. Hence, this paper breaks new ground by studying the unique challenges of ensuring safety in nonstationary environments by solving constrained problems through the lens of the meta-learning approach (learning to learn). While unconstrained meta-learning already encounters complexities in end to end differentiation of the loss due to the bi-level nature, its constrained counterpart introduces an additional layer of difficulty, since the constraints imposed on task-level updates complicate the differentiation process. To address the issue, we first employ successive convex-constrained policy updates across multiple tasks with differentiable convex programming, which allows meta-learning in constrained scenarios by enabling end-to-end differentiation. This approach empowers the agent to rapidly adapt to new tasks under nonstationarity while ensuring compliance with safety constraints. We also provide a theoretical analysis demonstrating guaranteed monotonic improvement of our approach, justifying our algorithmic designs. Extensive simulations across diverse environments provide empirical validation with significant improvement over established benchmarks. 
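The building block underneath the constrained meta-learning approach sketched above is a convex-constrained policy update. A minimal single-task version of such an update (using cvxpy as an off-the-shelf convex solver; the linearization, the trust-region radius, and all names are assumptions, and none of the paper's end-to-end differentiable meta-learning machinery is reproduced) looks roughly like this:

```python
# Hedged sketch of one convex-constrained policy step around current parameters theta.
import numpy as np
import cvxpy as cp

def constrained_step(theta, grad_return, grad_cost, cost_value, budget, radius=0.1):
    d = cp.Variable(theta.shape[0])                       # parameter update direction
    objective = cp.Maximize(grad_return @ d)              # linearized expected return
    constraints = [cost_value + grad_cost @ d <= budget,  # linearized safety constraint
                   cp.norm(d, 2) <= radius]               # trust region keeps the step small
    cp.Problem(objective, constraints).solve()
    return theta + d.value

# Toy usage with random gradients for a 5-dimensional policy parameterization.
theta = np.zeros(5)
theta_new = constrained_step(theta, np.random.randn(5), np.random.randn(5),
                             cost_value=0.8, budget=1.0)
```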
\ No newline at end of file diff --git a/data/2024/aaai/Constraint Latent Space Matters: An Anti-anomalous Waveform Transformation Solution from Photoplethysmography to Arterial Blood Pressure b/data/2024/aaai/Constraint Latent Space Matters: An Anti-anomalous Waveform Transformation Solution from Photoplethysmography to Arterial Blood Pressure new file mode 100644 index 0000000000..ca05a42478 --- /dev/null +++ b/data/2024/aaai/Constraint Latent Space Matters: An Anti-anomalous Waveform Transformation Solution from Photoplethysmography to Arterial Blood Pressure @@ -0,0 +1 @@ +Arterial blood pressure (ABP) holds substantial promise for proactive cardiovascular health management. Notwithstanding its potential, the invasive nature of ABP measurements confines their utility primarily to clinical environments, limiting their applicability for continuous monitoring beyond medical facilities. The conversion of photoplethysmography (PPG) signals into ABP equivalents has garnered significant attention due to its potential in revolutionizing cardiovascular disease management. Recent strides in PPG-to-ABP prediction encompass the integration of generative and discriminative models. Despite these advances, the efficacy of these models is curtailed by the latent space shift predicament, stemming from alterations in PPG data distribution across disparate hardware and individuals, potentially leading to distorted ABP waveforms. To tackle this problem, we present an innovative solution named the Latent Space Constraint Transformer (LSCT), leveraging a quantized codebook to yield robust latent spaces by employing multiple discretizing bases. To facilitate improved reconstruction, the Correlation-boosted Attention Module (CAM) is introduced to systematically query pertinent bases on a global scale. Furthermore, to enhance expressive capacity, we propose the Multi-Spectrum Enhancement Knowledge (MSEK), which fosters local information flow within the channels of latent code and provides additional embedding for reconstruction. Through comprehensive experimentation on both publicly available datasets and a private downstream task dataset, the proposed approach demonstrates noteworthy performance enhancements compared to existing methods. Extensive ablation studies further substantiate the effectiveness of each introduced module. \ No newline at end of file diff --git a/data/2024/aaai/Constructing Dreams Using Generative AI b/data/2024/aaai/Constructing Dreams Using Generative AI new file mode 100644 index 0000000000..958f189569 --- /dev/null +++ b/data/2024/aaai/Constructing Dreams Using Generative AI @@ -0,0 +1 @@ +Generative AI tools introduce new and accessible forms of media creation for youth. They also raise ethical concerns about the generation of fake media, data protection, privacy and ownership of AI-generated art. Since generative AI is already being used in products used by youth, it is critical that they understand how these tools work and how they can be used or misused. In this work, we facilitated students’ generative AI learning through expression of their imagined future identities. We designed a learning workshop - Dreaming with AI - where students learned about the inner workings of generative AI tools, used text-to-image generation algorithms to create their imaged future dreams, reflected on the potential benefits and harms of generative AI tools and voiced their opinions about policies for the use of these tools in classrooms. 
In this paper, we present the learning activities and experiences of 34 high school students who engaged in our workshops. Students reached creative learning objectives by using prompt engineering to create their future dreams, gained technical knowledge by learning the abilities, limitations, text-visual mappings and applications of generative AI, and identified most potential societal benefits and harms of generative AI. \ No newline at end of file diff --git a/data/2024/aaai/ContactGen: Contact-Guided Interactive 3D Human Generation for Partners b/data/2024/aaai/ContactGen: Contact-Guided Interactive 3D Human Generation for Partners new file mode 100644 index 0000000000..b76a8c29e7 --- /dev/null +++ b/data/2024/aaai/ContactGen: Contact-Guided Interactive 3D Human Generation for Partners @@ -0,0 +1 @@ +Among various interactions between humans, such as eye contact and gestures, physical interactions by contact can act as an essential moment in understanding human behaviors. Inspired by this fact, given a 3D partner human with the desired interaction label, we introduce a new task of 3D human generation in terms of physical contact. Unlike previous works of interacting with static objects or scenes, a given partner human can have diverse poses and different contact regions according to the type of interaction. To handle this challenge, we propose a novel method of generating interactive 3D humans for a given partner human based on a guided diffusion framework (ContactGen in short). Specifically, we newly present a contact prediction module that adaptively estimates potential contact regions between two input humans according to the interaction label. Using the estimated potential contact regions as complementary guidances, we dynamically enforce ContactGen to generate interactive 3D humans for a given partner human within a guided diffusion model. We demonstrate ContactGen on the CHI3D dataset, where our method generates physically plausible and diverse poses compared to comparison methods. \ No newline at end of file diff --git a/data/2024/aaai/Content Filtering with Inattentive Information Consumers b/data/2024/aaai/Content Filtering with Inattentive Information Consumers new file mode 100644 index 0000000000..eb471d8a30 --- /dev/null +++ b/data/2024/aaai/Content Filtering with Inattentive Information Consumers @@ -0,0 +1 @@ +We develop a model of content filtering as a game between the filter and the content consumer, where the latter incurs information costs for examining the content. Motivating examples include censoring misinformation, spam/phish filtering, and recommender systems acting on a stream of content. When the attacker is exogenous, we show that improving the filter’s quality is weakly Pareto improving, but has no impact on equilibrium payoffs until the filter becomes sufficiently accurate. Further, if the filter does not internalize the consumer’s information costs, its lack of commitment power may render it useless and lead to inefficient outcomes. When the attacker is also strategic, improvements in filter quality may decrease equilibrium payoffs. 
\ No newline at end of file diff --git a/data/2024/aaai/Context Enhanced Transformer for Single Image Object Detection in Video Data b/data/2024/aaai/Context Enhanced Transformer for Single Image Object Detection in Video Data new file mode 100644 index 0000000000..2ed336e645 --- /dev/null +++ b/data/2024/aaai/Context Enhanced Transformer for Single Image Object Detection in Video Data @@ -0,0 +1 @@ +With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple inputs and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single image object detection, called Context Enhanced TRansformer (CETR), by incorporating temporal context into DETR using a newly designed memory module. To efficiently store temporal information, we construct a class-wise memory that collects contextual information across data. Additionally, we present a classification-based sampling technique to selectively utilize the relevant memory for the current image. At test time, we introduce a memory adaptation method that updates individual memory functions by considering the test distribution. Experiments with the CityCam and ImageNet VID datasets demonstrate the efficiency of the framework on various video systems. The project page and code will be made available at: https://ku-cvlab.github.io/CETR. \ No newline at end of file diff --git a/data/2024/aaai/Context-Aware Iteration Policy Network for Efficient Optical Flow Estimation b/data/2024/aaai/Context-Aware Iteration Policy Network for Efficient Optical Flow Estimation new file mode 100644 index 0000000000..04a5d4b91f --- /dev/null +++ b/data/2024/aaai/Context-Aware Iteration Policy Network for Efficient Optical Flow Estimation @@ -0,0 +1 @@ +Existing recurrent optical flow estimation networks are computationally expensive since they use a fixed, large number of iterations to update the flow field for each sample. An efficient network should skip iterations when the flow improvement is limited. In this paper, we develop a Context-Aware Iteration Policy Network for efficient optical flow estimation, which determines the optimal number of iterations per sample. The policy network achieves this by learning contextual information to realize whether flow improvement is bottlenecked or minimal. On the one hand, we use an iteration embedding and the historical hidden cell, which include information from previous iterations, to convey how the flow has changed over previous iterations. On the other hand, we use the incremental loss to make the policy network implicitly perceive the magnitude of optical flow improvement in the subsequent iteration. Furthermore, the computational complexity in our dynamic network is controllable, allowing us to satisfy various resource preferences with a single trained model. Our policy network can be easily integrated into state-of-the-art optical flow networks. Extensive experiments show that our method maintains performance while reducing FLOPs by about 40%/20% for the Sintel/KITTI datasets.
\ No newline at end of file diff --git a/data/2024/aaai/Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval b/data/2024/aaai/Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval new file mode 100644 index 0000000000..211db82bbd --- /dev/null +++ b/data/2024/aaai/Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval @@ -0,0 +1 @@ +Different from the Composed Image Retrieval task that requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent that could be related to domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to learn a more accurate image representation that has adaptive attention to the reference image for various manipulation descriptions. In this paper, we propose a novel context-dependent mapping network, named Context-I2W, for adaptively converting description-relevant Image information into a pseudo-word token composed of the description for accurate ZS-CIR. Specifically, an Intent View Selector first dynamically learns a rotation rule to map the identical image to a task-specific manipulation view. Then a Visual Target Extractor further captures local information covering the main targets in ZS-CIR tasks under the guidance of multiple learnable queries. The two complementary modules work together to map an image to a context-dependent pseudo-word token without extra supervision. Our model shows strong generalization ability on four ZS-CIR tasks, including domain conversion, object composition, object manipulation, and attribute manipulation. It obtains consistent and significant performance boosts ranging from 1.88% to 3.60% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://anonymous.4open.science/r/Context-I2W-4224/. \ No newline at end of file diff --git a/data/2024/aaai/Contextual Pandora's Box b/data/2024/aaai/Contextual Pandora's Box new file mode 100644 index 0000000000..5058e5408d --- /dev/null +++ b/data/2024/aaai/Contextual Pandora's Box @@ -0,0 +1 @@ +Pandora’s Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative, while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate distributions are given for the values of all the alternatives, while recent work studies the online variant of Pandora’s Box where the distributions are originally unknown. In this work, we study Pandora’s Box in the online setting, while incorporating context. At each round, we are presented with a number of alternatives each having a context, an exploration cost and an unknown value drawn from an unknown distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably well against the optimal algorithm which knows all prior distributions exactly. Our algorithm works even in the bandit setting where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to a sufficient statistic of each alternative’s distribution (its reservation value) rather than its mean. 
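The "reservation value" that the last sentence above refers to is the classic Pandora's Box index: the sigma solving E[(V - sigma)+] = inspection cost for an alternative with value distribution V. A small numerical sketch for discrete distributions (function names are illustrative) is given below; Weitzman's policy then opens boxes in decreasing order of sigma and stops once the best observed value exceeds every unopened box's reservation value.

```python
# Reservation value of a single box with a discrete value distribution; toy sketch.
import numpy as np

def reservation_value(values, probs, cost, iters=60):
    """Solve E[(V - sigma)_+] = cost for sigma by bisection."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    lo, hi = values.min() - 1.0, values.max()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        expected_gain = np.sum(probs * np.clip(values - mid, 0.0, None))
        # Gain is decreasing in sigma: too much gain means sigma is still too small.
        lo, hi = (mid, hi) if expected_gain > cost else (lo, mid)
    return 0.5 * (lo + hi)

# Toy usage: a box worth 10 with probability 0.3 (else 0) and inspection cost 1.
sigma = reservation_value([0.0, 10.0], [0.7, 0.3], cost=1.0)  # roughly 6.67
```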
\ No newline at end of file diff --git a/data/2024/aaai/Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning b/data/2024/aaai/Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning new file mode 100644 index 0000000000..b96b2438d6 --- /dev/null +++ b/data/2024/aaai/Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning @@ -0,0 +1 @@ +Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RMs), state machine abstractions that induce subtasks based on the current task’s rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Empirical results show that our representations improve sample efficiency and few-shot transfer in a variety of domains. \ No newline at end of file diff --git a/data/2024/aaai/Continual Learning in an Open and Dynamic World b/data/2024/aaai/Continual Learning in an Open and Dynamic World new file mode 100644 index 0000000000..8ba76dc87d --- /dev/null +++ b/data/2024/aaai/Continual Learning in an Open and Dynamic World @@ -0,0 +1,2 @@ +Building autonomous agents that can process massive amounts of real-time sensor-captured data is essential for many real-world applications, including autonomous vehicles, robotics, and AI in medicine. As the agent often needs to explore in a dynamic environment, it is a desirable as well as challenging goal to enable the agent to learn over time without performance degradation. Continual learning aims to build a continual learner which can learn new concepts over the data stream while preserving previously learnt concepts. In the talk, I will survey three pieces of my recent research on continual learning: (i) supervised continual learning, (ii) unsupervised continual learning, and (iii) multi-modal continual learning. In the first work, I will discuss a supervised +continual learning algorithm called MEGA which dynamically balances the old tasks and the new task. In the second work, I will discuss unsupervised continual learning algorithms which learn representations continually without access to the labels. In the third work, I will elaborate on an efficient continual learning algorithm that can learn multiple modalities continually without forgetting. \ No newline at end of file diff --git a/data/2024/aaai/Continual Relation Extraction via Sequential Multi-Task Learning b/data/2024/aaai/Continual Relation Extraction via Sequential Multi-Task Learning new file mode 100644 index 0000000000..96271efcf4 --- /dev/null +++ b/data/2024/aaai/Continual Relation Extraction via Sequential Multi-Task Learning @@ -0,0 +1 @@ +Building continual relation extraction (CRE) models that can adapt to an ever-growing ontology of relations is a cornerstone information extraction task that serves various dynamic real-world domains.
To mitigate catastrophic forgetting in CRE, existing state-of-the-art approaches have effectively utilized rehearsal techniques from continual learning and achieved remarkable success. However, managing multiple objectives associated with memory-based rehearsal remains underexplored, often relying on simple summation and overlooking complex trade-offs. In this paper, we propose Continual Relation Extraction via Sequential Multi-task Learning (CREST), a novel CRE approach built upon a tailored Multi-task Learning framework for continual learning. CREST takes into consideration the disparity in the magnitudes of gradient signals of different objectives, thereby effectively handling the inherent difference between multi-task learning and continual learning. Through extensive experiments on multiple datasets, CREST demonstrates significant improvements in CRE performance as well as superiority over other state-of-the-art Multi-task Learning frameworks, offering a promising solution to the challenges of continual learning in this domain. \ No newline at end of file diff --git a/data/2024/aaai/Continual Vision-Language Retrieval via Dynamic Knowledge Rectification b/data/2024/aaai/Continual Vision-Language Retrieval via Dynamic Knowledge Rectification new file mode 100644 index 0000000000..a4ea736a25 --- /dev/null +++ b/data/2024/aaai/Continual Vision-Language Retrieval via Dynamic Knowledge Rectification @@ -0,0 +1 @@ +The recent large-scale pre-trained models like CLIP have aroused great concern in vision-language tasks. However, when required to match image-text data collected in a streaming manner, namely Continual Vision-Language Retrieval (CVRL), their performances are still limited due to the catastrophic forgetting of the learned old knowledge. To handle this issue, advanced methods are proposed to distill the affinity knowledge between images and texts from the old model to the new one for anti-forgetting. Unfortunately, existing approaches neglect the impact of incorrect affinity, which prevents the balance between the anti-forgetting of old knowledge and the acquisition of new knowledge. Therefore, we propose a novel framework called Dynamic Knowledge Rectification (DKR) that simultaneously achieves incorrect knowledge filtering and rectification. Specifically, we first filter the incorrect affinity knowledge calculated by the old model on the new data. Then, a knowledge rectification method is designed to rectify the incorrect affinities while preserving the correct ones. In particular, for the new data that can only be correctly retrieved by the new model, we rectify them with the corresponding new affinity to protect them from negative transfer. Additionally, for those that can not be retrieved by either the old or the new model, we introduce paired ground-truth labels to promote the acquisition of both old and new knowledge. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our DKR and its superiority against state-of-the-art methods. 
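A rough sketch of the rectification idea described in the DKR abstract above, written against batch-level image-text affinity (similarity) matrices: keep the old model's affinities where it still retrieves correctly, switch to the new model's affinities where only the new model is correct, and fall back to the ground-truth pairing where neither is. The hard one-hot fallback and all variable names are assumptions, not the authors' exact formulation.

```python
# Hedged sketch of building rectified distillation targets for continual retrieval.
import torch

def rectified_targets(old_sim, new_sim, labels):
    """old_sim, new_sim: (B, B) image-to-text similarity matrices from the old and new
    models on a batch of paired data; labels: index of the matching text for each image."""
    old_correct = old_sim.argmax(dim=1) == labels
    new_correct = new_sim.argmax(dim=1) == labels
    target = old_sim.clone()
    # Only the new model retrieves correctly: trust its affinity to avoid negative transfer.
    only_new = ~old_correct & new_correct
    target[only_new] = new_sim[only_new]
    # Neither model retrieves correctly: fall back to the ground-truth one-hot pairing.
    neither = ~old_correct & ~new_correct
    one_hot = torch.zeros_like(old_sim)
    one_hot[torch.arange(labels.numel()), labels] = 1.0
    target[neither] = one_hot[neither]
    return target  # used as the (detached) target of a distillation loss on new_sim
```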
\ No newline at end of file diff --git a/data/2024/aaai/Continuous Piecewise-Affine Based Motion Model for Image Animation b/data/2024/aaai/Continuous Piecewise-Affine Based Motion Model for Image Animation new file mode 100644 index 0000000000..e1378fa83e --- /dev/null +++ b/data/2024/aaai/Continuous Piecewise-Affine Based Motion Model for Image Animation @@ -0,0 +1 @@ +Image animation aims to bring static images to life according to driving videos and create engaging visual content that can be used for various purposes such as animation, entertainment, and education. Recent unsupervised methods utilize affine and thin-plate spline transformations based on keypoints to transfer the motion in driving frames to the source image. However, limited by the expressive power of the transformations used, these methods always produce poor results when the gap between the motion in the driving frame and the source image is large. To address this issue, we propose to model motion from the source image to the driving frame in highly-expressive diffeomorphism spaces. Firstly, we introduce Continuous Piecewise-Affine based (CPAB) transformation to model the motion and present a well-designed inference algorithm to generate CPAB transformation from control keypoints. Secondly, we propose a SAM-guided keypoint semantic loss to further constrain the keypoint extraction process and improve the semantic consistency between the corresponding keypoints on the source and driving images. Finally, we design a structure alignment loss to align the structure-related features extracted from driving and generated images, thus helping the generator generate results that are more consistent with the driving action. Extensive experiments on four datasets demonstrate the effectiveness of our method against state-of-the-art competitors quantitatively and qualitatively. Code will be publicly available at: https://github.com/DevilPG/AAAI2024-CPABMM. \ No newline at end of file diff --git a/data/2024/aaai/Continuous Rotation Group Equivariant Network Inspired by Neural Population Coding b/data/2024/aaai/Continuous Rotation Group Equivariant Network Inspired by Neural Population Coding new file mode 100644 index 0000000000..75c02c30df --- /dev/null +++ b/data/2024/aaai/Continuous Rotation Group Equivariant Network Inspired by Neural Population Coding @@ -0,0 +1 @@ +Neural population coding can represent continuous information by neurons with a series of discrete preferred stimuli, and we find that the bell-shaped tuning curve plays an important role in this mechanism. Inspired by this, we incorporate a bell-shaped tuning curve into the discrete group convolution to achieve continuous group equivariance. Simply, we modulate group convolution kernels by Gauss functions to obtain bell-shaped tuning curves. Benefiting from the modulation, kernels also gain smooth gradients on geometric dimensions (e.g., location dimension and orientation dimension). It allows us to generate group convolution kernels from sparse weights with learnable geometric parameters, which can achieve both competitive performances and parameter efficiencies. Furthermore, we quantitatively prove that discrete group convolutions with proper tuning curves (bigger than 1x sampling step) can achieve continuous equivariance. 
Experimental results show that 1) our approach achieves very competitive performance on MNIST-rot with at least 75% fewer parameters than previous SOTA methods, demonstrating its parameter efficiency; 2) especially with small sample sizes, our approach exhibits more pronounced performance improvements (up to 24%); 3) it also has excellent rotation generalization ability on various datasets such as MNIST, CIFAR, and ImageNet with both plain and ResNet architectures. \ No newline at end of file diff --git a/data/2024/aaai/Continuous Treatment Effect Estimation Using Gradient Interpolation and Kernel Smoothing b/data/2024/aaai/Continuous Treatment Effect Estimation Using Gradient Interpolation and Kernel Smoothing new file mode 100644 index 0000000000..839f539eae --- /dev/null +++ b/data/2024/aaai/Continuous Treatment Effect Estimation Using Gradient Interpolation and Kernel Smoothing @@ -0,0 +1,25 @@ +We address the Individualized continuous treatment effect (ICTE) estimation problem where we predict the effect of any continuous-valued treatment on an individual using observational data. The main challenge in this estimation task is the potential confounding of treatment assignment with the individual's covariates in the training data, whereas during inference ICTE requires prediction on independently sampled treatments. In contrast to prior work that relied on regularizers or unstable GAN training, we advocate the direct approach of augmenting training individuals with independently sampled treatments and inferred counterfactual outcomes. We infer counterfactual outcomes using a two-pronged strategy: Gradient Interpolation for close-to-observed treatments, and Gaussian Process based Kernel Smoothing, which allows us to down-weight high-variance inferences. We evaluate our method on five benchmarks and show that our method outperforms six state-of-the-art methods on the counterfactual estimation error. We analyze the superior performance of our method by showing that (1) our inferred counterfactual responses are more accurate, and (2) adding them to the training data reduces the distributional distance between the confounded training distribution and the test distribution where treatment is independent of covariates. Our proposed method is model-agnostic and we show that it improves ICTE accuracy of several existing models.
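To make the kernel-smoothing half of the two-pronged strategy above concrete, here is a minimal, illustrative numpy sketch (assumed details, not the authors' code): a counterfactual outcome for a newly sampled treatment is inferred by Nadaraya-Watson smoothing over observed (treatment, outcome) pairs, and inferences with high local variance receive a low confidence weight when added to the training set.

import numpy as np

def kernel_smoothed_counterfactual(t_new, t_obs, y_obs, bandwidth=0.1):
    """Infer a counterfactual outcome for treatment t_new from observed
    (treatment, outcome) pairs via Nadaraya-Watson kernel smoothing, and return
    a variance-based confidence weight for the augmented sample."""
    w = np.exp(-0.5 * ((t_obs - t_new) / bandwidth) ** 2)
    w = w / (w.sum() + 1e-12)
    y_hat = np.sum(w * y_obs)
    # Local variance of the smoothed estimate: high variance -> low confidence.
    var = np.sum(w * (y_obs - y_hat) ** 2)
    confidence = 1.0 / (1.0 + var)
    return y_hat, confidence

# Toy usage: outcomes observed at confounded treatments, queried at a new treatment value.
rng = np.random.default_rng(0)
t_obs = rng.uniform(0, 1, 200)
y_obs = np.sin(3 * t_obs) + 0.1 * rng.normal(size=200)
y_cf, conf = kernel_smoothed_counterfactual(0.42, t_obs, y_obs)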
This forms a generic new likelihood specification explicitly accounting for intermittent edge-persistent networks, namely GraSSP: Graph Representation with Sequential Survival Process. We apply the developed framework to a recent continuous-time dynamic latent distance model characterizing network dynamics in terms of a sequence of piecewise linear movements of nodes in latent space. We quantitatively assess the developed framework in various downstream tasks, such as link prediction and network completion, demonstrating that the developed modeling framework, by accounting for link persistence and absence, accurately tracks the intrinsic trajectories of nodes in the latent space and captures the underlying characteristics of the evolving network structure. \ No newline at end of file diff --git a/data/2024/aaai/Contrastive Balancing Representation Learning for Heterogeneous Dose-Response Curves Estimation b/data/2024/aaai/Contrastive Balancing Representation Learning for Heterogeneous Dose-Response Curves Estimation new file mode 100644 index 0000000000..a35f007a1c --- /dev/null +++ b/data/2024/aaai/Contrastive Balancing Representation Learning for Heterogeneous Dose-Response Curves Estimation @@ -0,0 +1 @@ +Estimating individuals' potential responses to varying treatment doses is crucial for decision-making in areas such as precision medicine and management science. Most recent studies predict counterfactual outcomes by learning a covariate representation that is independent of the treatment variable. However, such independence constraints neglect much of the covariate information that is useful for counterfactual prediction, especially when the treatment variables are continuous. To tackle this issue, in this paper, we first theoretically demonstrate the importance of balancing and prognostic representations for unbiased estimation of heterogeneous dose-response curves; that is, the learned representations are constrained so that the covariates are conditionally independent of both the treatment variables and the potential responses. Based on this, we propose a novel Contrastive balancing Representation learning Network using a partial distance measure, called CRNet, for estimating the heterogeneous dose-response curves without losing the continuity of treatments. Extensive experiments are conducted on synthetic and real-world datasets, demonstrating that our proposal significantly outperforms previous methods. \ No newline at end of file diff --git a/data/2024/aaai/Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation b/data/2024/aaai/Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation new file mode 100644 index 0000000000..55d6c96f8c --- /dev/null +++ b/data/2024/aaai/Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation @@ -0,0 +1 @@ +Recently, because of the high-quality representations of contrastive learning methods, rehearsal-based contrastive continual learning has been proposed to explore how to continually learn transferable representation embeddings to avoid the catastrophic forgetting issue in traditional continual settings.
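Referring back to the GraSSP abstract earlier in this span: as a minimal illustration under an assumed functional form (not the paper's actual likelihood), a sequential survival process over a link's alternating present/absent durations can be scored with exponential survival terms whose hazards depend on the latent distance between the two nodes, so that nearby nodes form links faster and keep them longer.

import numpy as np

def link_duration_loglik(durations, states, z_u, z_v, a=0.0, b=0.0):
    """Log-likelihood of alternating link-on / link-off durations for one node pair
    under exponential survival times whose rates depend on latent distance."""
    dist = np.linalg.norm(z_u - z_v)
    rates = {1: np.exp(a + dist),    # state 1 = link present; dissolution rate grows with distance
             0: np.exp(b - dist)}    # state 0 = link absent; formation rate shrinks with distance
    ll = 0.0
    for d, s in zip(durations, states):
        ll += np.log(rates[s]) - rates[s] * d   # exponential log-density of the observed duration
    return ll

# Toy usage: a link that stayed on for 2.0, off for 0.5, then on for 1.3 time units.
print(link_duration_loglik([2.0, 0.5, 1.3], [1, 0, 1],
                           z_u=np.array([0.0, 0.0]), z_v=np.array([0.3, 0.4])))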
Based on this framework, we propose Contrastive Continual Learning via Importance Sampling (CCLIS) to preserve knowledge by recovering previous data distributions with a new strategy for Replay Buffer Selection (RBS), which minimizes the estimated variance so as to preserve hard negative samples for high-quality representation learning. Furthermore, we present the Prototype-instance Relation Distillation (PRD) loss, a technique designed to maintain the relationship between prototypes and sample representations using a self-distillation process. Experiments on standard continual learning benchmarks reveal that our method notably outperforms existing baselines in terms of knowledge preservation and thereby effectively counteracts catastrophic forgetting in online contexts. The code is available at https://github.com/lijy373/CCLIS. \ No newline at end of file diff --git a/data/2024/aaai/Contrastive Credibility Propagation for Reliable Semi-supervised Learning b/data/2024/aaai/Contrastive Credibility Propagation for Reliable Semi-supervised Learning new file mode 100644 index 0000000000..80f8e00dee --- /dev/null +++ b/data/2024/aaai/Contrastive Credibility Propagation for Reliable Semi-supervised Learning @@ -0,0 +1 @@ +Producing labels for unlabeled data is error-prone, making semi-supervised learning (SSL) troublesome. Often, little is known about when and why an algorithm fails to outperform a supervised baseline. Using benchmark datasets, we craft five common real-world SSL data scenarios: few-label, open-set, noisy-label, and class distribution imbalance and misalignment in the labeled and unlabeled sets. We propose a novel algorithm called Contrastive Credibility Propagation (CCP) for deep SSL via iterative transductive pseudo-label refinement. CCP unifies semi-supervised learning and noisy label learning for the goal of reliably outperforming a supervised baseline in any data scenario. Compared to prior methods which focus on a subset of scenarios, CCP uniquely outperforms the supervised baseline in all scenarios, supporting practitioners when the qualities of labeled or unlabeled data are unknown. \ No newline at end of file diff --git a/data/2024/aaai/Contrastive Learning for Low-Light Raw Denoising (Student Abstract) b/data/2024/aaai/Contrastive Learning for Low-Light Raw Denoising (Student Abstract) new file mode 100644 index 0000000000..c33ba3e5da --- /dev/null +++ b/data/2024/aaai/Contrastive Learning for Low-Light Raw Denoising (Student Abstract) @@ -0,0 +1 @@ +Image/video denoising in low-light scenes is an extremely challenging problem due to limited photon count and high noise. In this paper, we propose a novel approach with contrastive learning to address this issue. Inspired by the success of contrastive learning in some high-level computer vision tasks, we bring this idea to the low-level denoising task. To achieve this goal, we introduce a new denoising contrastive regularization (DCR) to exploit the information of noisy images and clean images. In the feature space, DCR makes the denoised image closer to the clean image and far away from the noisy image. In addition, we build a new feature embedding network called Wnet, which is more effective at extracting high-frequency information. We conduct experiments on a real low-light dataset that captures still images taken on a moonless clear night in 0.6 millilux and videos under starlight (no moon present). The results show that our method can achieve a higher PSNR and better visual quality compared with existing methods.
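A minimal sketch of the denoising contrastive regularization (DCR) idea just described (illustrative only, with a stand-in feature extractor rather than the paper's Wnet): in feature space, the denoised output is pulled toward the clean target and pushed away from the noisy input.

import torch
import torch.nn as nn

# Stand-in feature embedding network (the paper uses its own Wnet; this is a placeholder).
feat = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(16, 16, 3, padding=1))

def dcr_loss(denoised, clean, noisy, eps=1e-6):
    """Denoising contrastive regularization: the denoised image should be close to
    the clean image (numerator) and far from the noisy one (denominator)."""
    f_d, f_c, f_n = feat(denoised), feat(clean), feat(noisy)
    pos = torch.mean(torch.abs(f_d - f_c))   # attract toward the clean anchor
    neg = torch.mean(torch.abs(f_d - f_n))   # repel from the noisy anchor
    return pos / (neg + eps)

# Usage inside a denoiser's training step (shapes: batch x 1 x H x W raw frames).
loss = dcr_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))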
\ No newline at end of file diff --git a/data/2024/aaai/Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget b/data/2024/aaai/Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget new file mode 100644 index 0000000000..ac27bf2599 --- /dev/null +++ b/data/2024/aaai/Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget @@ -0,0 +1,2 @@ +Masked Image Modeling (MIM) methods, like Masked Autoencoders (MAE), efficiently learn a rich representation of the input. However, for adapting to downstream tasks, they require a sufficient amount of labeled data since their rich features encode not only objects but also less relevant image background. In contrast, Instance Discrimination (ID) methods focus on objects. In this work, we study how to combine the efficiency and scalability of MIM with the ability of ID to perform downstream classification in the absence of large amounts of labeled data. To this end, we introduce Masked Autoencoder Contrastive Tuning (MAE-CT), a sequential approach that utilizes the implicit clustering of the Nearest Neighbor Contrastive Learning (NNCLR) objective to induce abstraction in the topmost layers of a pre-trained MAE. MAE-CT tunes the rich features such that they form semantic clusters of objects without using any labels. Notably, MAE-CT does not rely on hand-crafted augmentations and frequently achieves its best performance while using only minimal augmentations (crop & flip). Further, MAE-CT is compute-efficient as it requires at most 10% overhead compared to MAE re-training. Applied to large and huge Vision Transformer (ViT) models, MAE-CT excels over previous self-supervised methods trained on ImageNet in linear probing, k-NN and low-shot classification accuracy as well as in unsupervised clustering accuracy. With ViT-H/16, MAE-CT achieves a new state of the art in linear probing of 82.2%. +Project page: github.com/ml-jku/MAE-CT. \ No newline at end of file diff --git a/data/2024/aaai/Controllable 3D Face Generation with Conditional Style Code Diffusion b/data/2024/aaai/Controllable 3D Face Generation with Conditional Style Code Diffusion new file mode 100644 index 0000000000..0376c0eedf --- /dev/null +++ b/data/2024/aaai/Controllable 3D Face Generation with Conditional Style Code Diffusion @@ -0,0 +1,2 @@ +Generating photorealistic 3D faces from given conditions is a challenging task. Existing methods often rely on time-consuming one-by-one optimization approaches, which are inefficient for modeling content from the same distribution, e.g., faces. Additionally, an ideal controllable 3D face generation model should consider both facial attributes and expressions. +Thus, we propose a novel approach called TEx-Face (TExt & Expression-to-Face) that addresses these challenges by dividing the task into three components, i.e., 3D GAN Inversion, Conditional Style Code Diffusion, and 3D Face Decoding. For 3D GAN inversion, we introduce two methods, which aim to enhance the representation of style codes and alleviate 3D inconsistencies. Furthermore, we design a style code denoiser to incorporate multiple conditions into the style code and propose a data augmentation strategy to address the issue of insufficient paired visual-language data. Extensive experiments conducted on FFHQ, CelebA-HQ, and CelebA-Dialog demonstrate the promising performance of our TEx-Face in achieving the efficient and controllable generation of photorealistic 3D faces. The code will be publicly available.
\ No newline at end of file diff --git a/data/2024/aaai/Controllable Mind Visual Diffusion Model b/data/2024/aaai/Controllable Mind Visual Diffusion Model new file mode 100644 index 0000000000..05af62c4a5 --- /dev/null +++ b/data/2024/aaai/Controllable Mind Visual Diffusion Model @@ -0,0 +1 @@ +Brain signal visualization has emerged as an active research area, serving as a critical interface between the human visual system and computer vision models. Diffusion-based methods have recently shown promise in analyzing functional magnetic resonance imaging (fMRI) data, including the reconstruction of high-quality images consistent with original visual stimuli. Nonetheless, it remains a critical challenge to effectively harness the semantic and silhouette information extracted from brain signals. In this paper, we propose a novel approach, termed Controllable Mind Visual Diffusion Model (CMVDM). Specifically, CMVDM first extracts semantic and silhouette information from fMRI data using attribute alignment and assistant networks. Then, a control model is introduced in conjunction with a residual block to fully exploit the extracted information for image synthesis, generating high-quality images that closely resemble the original visual stimuli in both semantic content and silhouette characteristics. Through extensive experimentation, we demonstrate that CMVDM outperforms existing state-of-the-art methods both qualitatively and quantitatively. Our code is available at https://github.com/zengbohan0217/CMVDM. \ No newline at end of file diff --git a/data/2024/aaai/Controller-Guided Partial Label Consistency Regularization with Unlabeled Data b/data/2024/aaai/Controller-Guided Partial Label Consistency Regularization with Unlabeled Data new file mode 100644 index 0000000000..e2901b6061 --- /dev/null +++ b/data/2024/aaai/Controller-Guided Partial Label Consistency Regularization with Unlabeled Data @@ -0,0 +1 @@ +Partial label learning (PLL) learns from training examples each associated with multiple candidate labels, among which only one is valid. In recent years, benefiting from the strong capability of dealing with ambiguous supervision and the impetus of modern data augmentation methods, consistency regularization-based PLL methods have achieved a series of successes and become mainstream. However, as the partial annotation becomes insufficient, their performance drops significantly. In this paper, we leverage easily accessible unlabeled examples to facilitate partial label consistency regularization. In addition to a partial supervised loss, our method performs controller-guided consistency regularization at both the label level and the representation level with the help of unlabeled data. To mitigate the limited capability of the initial supervised model, we use the controller to estimate the confidence of each current prediction to guide the subsequent consistency regularization. Furthermore, we dynamically adjust the confidence thresholds so that the number of samples of each class participating in consistency regularization remains roughly equal, alleviating the problem of class imbalance. Experiments show that our method achieves satisfactory performance in more practical situations, and its modules can be applied to existing PLL methods to enhance their capabilities.
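The class-balanced dynamic thresholding described at the end of the previous abstract can be pictured with a small, assumption-laden sketch (not the authors' code): confidence thresholds are chosen per class so that roughly the same number of confident predictions per class enter the consistency regularization.

import numpy as np

def class_balanced_thresholds(probs, per_class_quota):
    """Pick a confidence threshold per class so that roughly `per_class_quota`
    predictions of each class pass and join the consistency regularization."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    n_classes = probs.shape[1]
    thresholds = np.ones(n_classes)
    for c in range(n_classes):
        c_conf = np.sort(conf[preds == c])[::-1]     # class-c confidences, descending
        if len(c_conf) == 0:
            continue
        k = min(per_class_quota, len(c_conf)) - 1
        thresholds[c] = c_conf[k]                    # keep the top-k most confident samples
    mask = conf >= thresholds[preds]                 # samples selected for regularization
    return thresholds, mask

# Toy usage on softmax outputs for 1000 unlabeled samples and 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
thr, mask = class_balanced_thresholds(probs, per_class_quota=30)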
\ No newline at end of file diff --git a/data/2024/aaai/Conversational Modeling for Constraint Satisfaction b/data/2024/aaai/Conversational Modeling for Constraint Satisfaction new file mode 100644 index 0000000000..90e370c66a --- /dev/null +++ b/data/2024/aaai/Conversational Modeling for Constraint Satisfaction @@ -0,0 +1 @@ +Many problems, from Sudoku to factory scheduling, can be regarded as constraint satisfaction problems. A key component of real world problem solving is a conversation between a constraint programming expert and a problem domain expert to specify the problem to be solved. This presentation argues that the time is ripe for progress in automating the constraint programmer side of this conversation and suggests promising avenues for this pursuit. \ No newline at end of file diff --git a/data/2024/aaai/Convolutional Channel-Wise Competitive Learning for the Forward-Forward Algorithm b/data/2024/aaai/Convolutional Channel-Wise Competitive Learning for the Forward-Forward Algorithm new file mode 100644 index 0000000000..2d5381f6dd --- /dev/null +++ b/data/2024/aaai/Convolutional Channel-Wise Competitive Learning for the Forward-Forward Algorithm @@ -0,0 +1 @@ +The Forward-Forward (FF) Algorithm has been recently proposed to alleviate the issues of backpropagation (BP) commonly used to train deep neural networks. However, its current formulation exhibits limitations such as the generation of negative data, slower convergence, and inadequate performance on complex tasks. In this paper we take the main ideas of FF and improve them by leveraging channel-wise competitive learning in the context of convolutional neural networks for image classification tasks. A layer-wise loss function is introduced that promotes competitive learning and eliminates the need for negative data construction. To enhance both the learning of compositional features and feature space partitioning, a channel-wise feature separator and extractor block is proposed that complements the competitive learning process. Our method outperforms recent FF-based models on image classification tasks, achieving testing errors of 0.58%, 7.69%, 21.89%, and 48.77% on MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100 respectively. Our approach bridges the performance gap between FF learning and BP methods, indicating the potential of our proposed approach to learn useful representations in a layer-wise modular fashion, enabling more efficient and flexible learning. Our source code and supplementary material are available at https://github.com/andreaspapac/CwComp. \ No newline at end of file diff --git a/data/2024/aaai/Convolutional Spectral Kernel Learning with Generalization Guarantees (Abstract Reprint) b/data/2024/aaai/Convolutional Spectral Kernel Learning with Generalization Guarantees (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Cooper: Coordinating Specialized Agents towards a Complex Dialogue Goal b/data/2024/aaai/Cooper: Coordinating Specialized Agents towards a Complex Dialogue Goal new file mode 100644 index 0000000000..a3e3f1340b --- /dev/null +++ b/data/2024/aaai/Cooper: Coordinating Specialized Agents towards a Complex Dialogue Goal @@ -0,0 +1 @@ +In recent years, there has been a growing interest in exploring dialogues with more complex goals, such as negotiation, persuasion, and emotional support, which go beyond traditional service-focused dialogue systems. 
Apart from the requirement for much more sophisticated strategic reasoning and communication skills, a significant challenge of these tasks lies in the difficulty of objectively measuring the achievement of their goals in a quantifiable way, making it difficult for existing research to directly optimize the dialogue procedure towards them. In our work, we emphasize the multifaceted nature of complex dialogue goals and argue that it is more feasible to accomplish them by comprehensively considering and jointly promoting their different aspects. To this end, we propose a novel dialogue framework, Cooper, which coordinates multiple specialized agents, each dedicated to a specific dialogue goal aspect, to approach the complex objective. Through this divide-and-conquer manner, we make complex dialogue goals more approachable and elicit greater intelligence via the collaboration of individual agents. Experiments on persuasion and emotional support dialogues demonstrate the superiority of our method over a set of competitive baselines. Our code is available at https://github.com/YiCheng98/Cooper. \ No newline at end of file diff --git a/data/2024/aaai/Cooperative Knowledge Distillation: A Learner Agnostic Approach b/data/2024/aaai/Cooperative Knowledge Distillation: A Learner Agnostic Approach new file mode 100644 index 0000000000..e7c076a11e --- /dev/null +++ b/data/2024/aaai/Cooperative Knowledge Distillation: A Learner Agnostic Approach @@ -0,0 +1 @@ +Knowledge distillation is a simple but powerful way to transfer knowledge from a teacher model to a student model. Existing work suffers from at least one of the following key limitations in terms of the direction and scope of transfer, which restrict its use: all knowledge is transferred from teacher to student regardless of whether or not that knowledge is useful, the student is the only one learning in this exchange, and typically distillation transfers knowledge only from a single teacher to a single student. We formulate a novel form of knowledge distillation in which many models can act as both students and teachers, which we call cooperative distillation. The models cooperate as follows: a model (the student) identifies specific deficiencies in its performance and searches for another model (the teacher) that encodes learned knowledge into instructional virtual instances via counterfactual instance generation. Because different models may have different strengths and weaknesses, all models can act as either students or teachers (cooperation) when appropriate and only distill knowledge in areas specific to their strengths (focus). Since counterfactuals as a paradigm are not tied to any specific algorithm, we can use this method to distill knowledge between learners of different architectures, algorithms, and even feature spaces. We demonstrate that our approach not only outperforms baselines such as transfer learning, self-supervised learning, and multiple knowledge distillation algorithms on several datasets, but can also be used in settings where the aforementioned techniques cannot.
\ No newline at end of file diff --git a/data/2024/aaai/Coordination of Emergent Demand Changes via Value-Based Negotiation for Supply Chain Management (Student Abstract) b/data/2024/aaai/Coordination of Emergent Demand Changes via Value-Based Negotiation for Supply Chain Management (Student Abstract) new file mode 100644 index 0000000000..e72c1ef12d --- /dev/null +++ b/data/2024/aaai/Coordination of Emergent Demand Changes via Value-Based Negotiation for Supply Chain Management (Student Abstract) @@ -0,0 +1,4 @@ +We propose an automated negotiation mechanism that enables a reinforcement learning agent to adapt to unexpected situations such as demand changes in supply chain management (SCM).
+Existing studies that consider reinforcement learning and SCM assume a centralized environment where the coordination of chain components is hierarchical rather than through negotiations between agents.
+This study focuses on a negotiation agent that uses the value function of reinforcement learning for SCM as its utility function in automated negotiation.
+We demonstrate that the proposed approach can avoid inventory shortages under increased demand requests from the terminal customer. \ No newline at end of file diff --git a/data/2024/aaai/CoreRec: A Counterfactual Correlation Inference for Next Set Recommendation b/data/2024/aaai/CoreRec: A Counterfactual Correlation Inference for Next Set Recommendation new file mode 100644 index 0000000000..710bc6777d --- /dev/null +++ b/data/2024/aaai/CoreRec: A Counterfactual Correlation Inference for Next Set Recommendation @@ -0,0 +1 @@ +Next set recommendation aims to predict the items that are likely to be bought in the next purchase. Central to this endeavor is the task of capturing intra-set and cross-set correlations among items. However, the modeling of cross-set correlations poses challenges due to specific issues. Primarily, these correlations are often implicit, and the prevailing approach of establishing an indiscriminate link across the entire set of objects neglects factors like purchase frequency and correlations between purchased items. Such hastily formed connections across sets introduce substantial noise. Additionally, the preeminence of high-frequency items in numerous sets could potentially overshadow and distort correlation modeling with respect to low-frequency items. Thus, we devote this work to mitigating misleading inter-set correlations. With a fresh perspective rooted in causality, we delve into the question of whether correlations between a particular item and items from other sets should be relied upon for item representation learning and set prediction. Technically, we introduce the Counterfactual Correlation Inference framework for next set recommendation, denoted as CoreRec. This framework establishes a counterfactual scenario in which the recommendation model impedes cross-set correlations to generate intervened predictions. By contrasting these intervened predictions with the original ones, we gauge the causal impact of inter-set neighbors on set prediction, essentially assessing whether they contribute to spurious correlations. During testing, we introduce a post-trained switch module that selects between set-aware item representations derived from either the original or the counterfactual scenarios. To validate our approach, we conduct extensive experiments on three real-world datasets, affirming both the effectiveness of CoreRec and the cogency of our analytical approach.
\ No newline at end of file diff --git a/data/2024/aaai/Coreference Graph Guidance for Mind-Map Generation b/data/2024/aaai/Coreference Graph Guidance for Mind-Map Generation new file mode 100644 index 0000000000..2f50016f74 --- /dev/null +++ b/data/2024/aaai/Coreference Graph Guidance for Mind-Map Generation @@ -0,0 +1 @@ +Mind-map generation aims to process a document into a hierarchical structure to show its central idea and branches. Such a structure is more conducive to understanding the logic and semantics of the document than plain text. Recently, a state-of-the-art method encodes the sentences of a document sequentially and converts them to a relation graph via sequence-to-graph. Though this method can efficiently generate mind-maps in parallel, its mechanism focuses more on sequential features while hardly capturing structural information. Moreover, it struggles to model long-range semantic relations. In this work, we propose a coreference-guided mind-map generation network (CMGN) to incorporate external structure knowledge. Specifically, we construct a coreference graph based on coreference semantic relationships to introduce graph structure information. Then we employ a coreference graph encoder to mine the potential governing relations between sentences. In order to exclude noise and better utilize the information of the coreference graph, we adopt a graph enhancement module in a contrastive learning manner. Experimental results demonstrate that our model outperforms all the existing methods. The case study further proves that our model can more accurately and concisely reveal the structure and semantics of a document. Code and data are available at https://github.com/Cyno2232/CMGN. \ No newline at end of file diff --git a/data/2024/aaai/Correlation Matching Transformation Transformers for UHD Image Restoration b/data/2024/aaai/Correlation Matching Transformation Transformers for UHD Image Restoration new file mode 100644 index 0000000000..883b594b95 --- /dev/null +++ b/data/2024/aaai/Correlation Matching Transformation Transformers for UHD Image Restoration @@ -0,0 +1 @@ +This paper proposes UHDformer, a general Transformer for Ultra-High-Definition (UHD) image restoration. UHDformer contains two learning spaces: (a) learning in high-resolution space and (b) learning in low-resolution space. The former learns multi-level high-resolution features, fuses low- and high-resolution features, and reconstructs the residual images, while the latter learns more representative features from the high-resolution ones to facilitate better restoration. To better improve feature representation in low-resolution space, we propose to build a feature transformation from the high-resolution space to the low-resolution one. To that end, we propose two new modules: the Dual-path Correlation Matching Transformation module (DualCMT) and the Adaptive Channel Modulator (ACM). The DualCMT selects the top C/r (where r, greater than or equal to 1, controls the squeezing level) correlation channels from the max-pooling/mean-pooling high-resolution features to replace the low-resolution ones in Transformers, which effectively squeezes out useless content to improve the feature representation in low-resolution space and facilitate better recovery. The ACM is exploited to adaptively modulate multi-level high-resolution features, enabling it to provide more useful features to the low-resolution space for better learning.
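One way to picture the top-C/r correlation-channel selection just described (a loose, assumption-heavy sketch, not the released UHDformer code) is to rank high-resolution channels by their correlation with the low-resolution feature map and splice the best-matching ones in:

import torch
import torch.nn.functional as F

def select_top_correlation_channels(high_feat, low_feat, r=2):
    """Replace low-resolution channels with the top C/r high-resolution channels
    that correlate most strongly with the (pooled-to-match) low-res features."""
    b, c, h, w = low_feat.shape
    high = F.adaptive_avg_pool2d(high_feat, (h, w))           # mean-pool high-res to low-res size
    hf = high.flatten(2) - high.flatten(2).mean(-1, keepdim=True)
    lf = low_feat.flatten(2) - low_feat.flatten(2).mean(-1, keepdim=True)
    corr = F.cosine_similarity(hf, lf, dim=-1)                # per-channel correlation, (b, c)
    k = max(c // r, 1)
    idx = corr.topk(k, dim=1).indices                         # top C/r channels per sample
    out = low_feat.clone()
    for i in range(b):                                        # splice the selected channels in
        out[i, idx[i]] = high[i, idx[i]]
    return out

# Toy usage: 64x64 high-res features guiding 16x16 low-res features with 32 channels.
y = select_top_correlation_channels(torch.randn(2, 32, 64, 64), torch.randn(2, 32, 16, 16), r=4)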
Experimental results show that our UHDformer reduces model size by about ninety-seven percent compared with most state-of-the-art methods while significantly improving performance under different training sets on three UHD image restoration tasks, including low-light image enhancement, image dehazing, and image deblurring. The source code will be made available at https://github.com/supersupercong/UHDformer. \ No newline at end of file diff --git a/data/2024/aaai/Count What You Want: Exemplar Identification and Few-Shot Counting of Human Actions in the Wild b/data/2024/aaai/Count What You Want: Exemplar Identification and Few-Shot Counting of Human Actions in the Wild new file mode 100644 index 0000000000..60775ec71d --- /dev/null +++ b/data/2024/aaai/Count What You Want: Exemplar Identification and Few-Shot Counting of Human Actions in the Wild @@ -0,0 +1 @@ +This paper addresses the task of counting human actions of interest using sensor data from wearable devices. We propose a novel exemplar-based framework, allowing users to provide exemplars of the actions they want to count by vocalizing predefined sounds ``one'', ``two'', and ``three''. Our method first localizes temporal positions of these utterances from the audio sequence. These positions serve as the basis for identifying exemplars representing the action class of interest. A similarity map is then computed between the exemplars and the entire sensor data sequence, which is further fed into a density estimation module to generate a sequence of estimated density values. Summing these density values provides the final count. To develop and evaluate our approach, we introduce a diverse and realistic dataset consisting of real-world data from 37 subjects and 50 action categories, encompassing both sensor and audio data. The experiments on this dataset demonstrate the viability of the proposed method in counting instances of actions from new classes and subjects that were not part of the training data. On average, the discrepancy between the predicted count and the ground truth value is 7.47, significantly lower than the errors of the frequency-based and transformer-based methods. Our project, code and dataset can be found at https://github.com/cvlab-stonybrook/ExRAC.
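As a rough sketch of the exemplar-matching-and-density pipeline described in the counting abstract above (simplified and with invented module names; the real model lives in the linked repository), one can correlate exemplar embeddings with per-timestep sensor embeddings and regress a density sequence whose sum is the predicted count:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyActionCounter(nn.Module):
    """Exemplar-based counting: embed the sensor stream, compare it with exemplar
    embeddings to form a similarity map, and turn the map into a density sequence."""
    def __init__(self, in_ch=6, dim=32):
        super().__init__()
        self.encoder = nn.Conv1d(in_ch, dim, kernel_size=9, padding=4)
        self.density_head = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, sensor, exemplars):
        # sensor: (B, C, T); exemplars: list of (1, C, L) windows cut around the vocalized markers.
        seq = F.normalize(self.encoder(sensor), dim=1)                      # (B, dim, T)
        ex = torch.cat([self.encoder(e).mean(dim=-1) for e in exemplars])   # (K, dim)
        ex = F.normalize(ex, dim=1)
        sim = torch.einsum('bdt,kd->bkt', seq, ex).mean(dim=1, keepdim=True)  # (B, 1, T) similarity map
        density = F.softplus(self.density_head(sim))                        # non-negative density values
        return density.sum(dim=(1, 2))                                      # predicted count per sequence

# Toy usage: a 6-channel IMU stream of length 500 with two exemplar windows of length 40.
model = ToyActionCounter()
count = model(torch.randn(1, 6, 500), [torch.randn(1, 6, 40), torch.randn(1, 6, 40)])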
This gap in the literature is addressed here using a novel methodology that (i) gathers human-generated counterfactual explanations for misclassified images in two user studies and then (ii) compares these human-generated explanations to computationally generated explanations for the same misclassifications. Results indicate that humans do not “minimally edit” images when generating counterfactual explanations. Instead, they make larger, “meaningful” edits that better approximate prototypes in the counterfactual class. An analysis based on “explanation goals” is proposed to account for this divergence between human and machine explanations. The implications of these proposals for future work are discussed. \ No newline at end of file diff --git a/data/2024/aaai/Counterfactual Graph Learning for Anomaly Detection with Feature Disentanglement and Generation (Student Abstract) b/data/2024/aaai/Counterfactual Graph Learning for Anomaly Detection with Feature Disentanglement and Generation (Student Abstract) new file mode 100644 index 0000000000..d1d076cd3a --- /dev/null +++ b/data/2024/aaai/Counterfactual Graph Learning for Anomaly Detection with Feature Disentanglement and Generation (Student Abstract) @@ -0,0 +1 @@ +Graph anomaly detection has received remarkable research interest, and various techniques have been employed for enhancing detection performance. However, existing models tend to learn dataset-specific spurious correlations based on statistical associations. A well-trained model might suffer from performance degradation when applied to newly observed nodes with different environments. To handle this situation, we propose a CounterFactual Graph Anomaly Detection model, CFGAD. In this model, we design a gradient-based separator to disentangle node features into class features and environment features. Then, we present a weight-varying diffusion model to combine class features and environment features from different nodes to generate counterfactual samples. These counterfactual samples are then adopted to enhance model robustness. Comprehensive experiments demonstrate the effectiveness of our CFGAD. \ No newline at end of file diff --git a/data/2024/aaai/Counterfactual-Enhanced Information Bottleneck for Aspect-Based Sentiment Analysis b/data/2024/aaai/Counterfactual-Enhanced Information Bottleneck for Aspect-Based Sentiment Analysis new file mode 100644 index 0000000000..6bfa3d034c --- /dev/null +++ b/data/2024/aaai/Counterfactual-Enhanced Information Bottleneck for Aspect-Based Sentiment Analysis @@ -0,0 +1 @@ +Despite having achieved notable success for aspect-based sentiment analysis (ABSA), deep neural networks are susceptible to spurious correlations between input features and output labels, leading to poor robustness. In this paper, we propose a novel Counterfactual-Enhanced Information Bottleneck framework (called CEIB) to reduce spurious correlations for ABSA. CEIB extends the information bottleneck (IB) principle to a factual-counterfactual balancing setting by integrating augmented counterfactual data, with the goal of learning a robust ABSA model. Concretely, we first devise a multi-pattern prompting method, which utilizes a large language model (LLM) to generate high-quality counterfactual samples from the original samples. Then, we employ the information bottleneck principle and separate the mutual information into factual and counterfactual parts.
In this way, we can learn effective and robust representations for the ABSA task by balancing the predictive information of these two parts. Extensive experiments on five benchmark ABSA datasets show that our CEIB approach achieves superior prediction performance and robustness over the state-of-the-art baselines. Code and data to reproduce the results in this paper are available at: https://github.com/shesshan/CEIB. \ No newline at end of file diff --git a/data/2024/aaai/Coupled Confusion Correction: Learning from Crowds with Sparse Annotations b/data/2024/aaai/Coupled Confusion Correction: Learning from Crowds with Sparse Annotations new file mode 100644 index 0000000000..bb99688c39 --- /dev/null +++ b/data/2024/aaai/Coupled Confusion Correction: Learning from Crowds with Sparse Annotations @@ -0,0 +1 @@ +As datasets grow larger, accurately annotating them becomes increasingly impractical due to the expense in both time and money. Therefore, crowd-sourcing has been widely adopted to alleviate the cost of collecting labels, which also inevitably introduces label noise and eventually degrades the performance of the model. To learn from crowd-sourcing annotations, modeling the expertise of each annotator is a common but challenging paradigm, because the annotations collected by crowd-sourcing are usually highly sparse. To alleviate this problem, we propose Coupled Confusion Correction (CCC), where two models are simultaneously trained to correct the confusion matrices learned by each other. Via bi-level optimization, the confusion matrices learned by one model can be corrected by the distilled data from the other. Moreover, we cluster annotators into ``annotator groups'' that share similar expertise so that their confusion matrices can be corrected together. In this way, the expertise of the annotators, especially of those who seldom provide labels, can be better captured. Notably, we point out that annotation sparsity not only means that the average number of labels is low, but also that there are always some annotators who provide very few labels, which is neglected by previous works when constructing synthetic crowd-sourcing annotations. Based on that, we propose to use a Beta distribution to control the generation of the crowd-sourcing labels so that the synthetic annotations are more consistent with real-world ones. Extensive experiments are conducted on two types of synthetic datasets and three real-world datasets, the results of which demonstrate that CCC significantly outperforms state-of-the-art approaches. Source code is available at: https://github.com/Hansong-Zhang/CCC. \ No newline at end of file diff --git a/data/2024/aaai/Coupling Graph Neural Networks with Fractional Order Continuous Dynamics: A Robustness Study b/data/2024/aaai/Coupling Graph Neural Networks with Fractional Order Continuous Dynamics: A Robustness Study new file mode 100644 index 0000000000..9563b06f36 --- /dev/null +++ b/data/2024/aaai/Coupling Graph Neural Networks with Fractional Order Continuous Dynamics: A Robustness Study @@ -0,0 +1 @@ +In this work, we rigorously investigate the robustness of graph neural fractional-order differential equation (FDE) models. This framework extends beyond traditional graph neural (integer-order) ordinary differential equation (ODE) models by implementing the time-fractional Caputo derivative.
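To illustrate the Beta-distribution idea at the end of the CCC abstract above (a toy sketch under assumed parameters, not the authors' protocol), per-annotator labeling rates can be drawn from a Beta prior so that a few annotators label many samples while most label very few, and each provided label is corrupted through that annotator's confusion matrix:

import numpy as np

def synth_crowd_labels(true_labels, n_annotators=30, n_classes=10,
                       beta_a=0.3, beta_b=3.0, flip_prob=0.2, seed=0):
    """Generate sparse, noisy crowd-sourced annotations: labeling rates follow a
    Beta(a, b) prior (most annotators label little), and each label passes through
    a simple per-annotator confusion matrix."""
    rng = np.random.default_rng(seed)
    n = len(true_labels)
    rates = rng.beta(beta_a, beta_b, size=n_annotators)          # per-annotator labeling rate
    # Per-annotator confusion matrix: mostly correct, uniform noise otherwise.
    conf = np.full((n_annotators, n_classes, n_classes), flip_prob / (n_classes - 1))
    for a in range(n_annotators):
        np.fill_diagonal(conf[a], 1.0 - flip_prob)
    ann = np.full((n, n_annotators), -1)                          # -1 means "not annotated"
    for a in range(n_annotators):
        chosen = rng.random(n) < rates[a]
        for i in np.where(chosen)[0]:
            ann[i, a] = rng.choice(n_classes, p=conf[a, true_labels[i]])
    return ann

labels = synth_crowd_labels(np.random.default_rng(1).integers(0, 10, size=500))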
Utilizing fractional calculus allows our model to consider long-term memory during the feature updating process, diverging from the memoryless Markovian updates seen in traditional graph neural ODE models. The superiority of graph neural FDE models over graph neural ODE models has been established in environments free from attacks or perturbations. While traditional graph neural ODE models have been verified to possess a degree of stability and resilience in the presence of adversarial attacks in existing literature, the robustness of graph neural FDE models, especially under adversarial conditions, remains largely unexplored. This paper undertakes a detailed assessment of the robustness of graph neural FDE models. We establish a theoretical foundation outlining the robustness characteristics of graph neural FDE models, highlighting that they maintain more stringent output perturbation bounds in the face of input and graph topology disturbances, compared to their integer-order counterparts. Our empirical evaluations further confirm the enhanced robustness of graph neural FDE models, highlighting their potential in adversarially robust applications. \ No newline at end of file diff --git a/data/2024/aaai/Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data b/data/2024/aaai/Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data new file mode 100644 index 0000000000..4f02428d3c --- /dev/null +++ b/data/2024/aaai/Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data @@ -0,0 +1 @@ +Out-of-distribution (OOD) generalization has attracted increasing research attention in recent years, due to its promising experimental results in real-world applications. In this paper, we study the confidence set prediction problem in the OOD generalization setting. Split conformal prediction (SCP) is an efficient framework for handling the confidence set prediction problem. However, the validity of SCP requires the examples to be exchangeable, which is violated in the OOD setting. Empirically, we show that trivially applying SCP results in a failure to maintain the marginal coverage when the unseen target domain is different from the source domain. To address this issue, we develop a method for forming confident prediction sets in the OOD setting and theoretically prove the validity of our method. Finally, we conduct experiments on simulated data to empirically verify the correctness of our theory and the validity of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Critic-Guided Decision Transformer for Offline Reinforcement Learning b/data/2024/aaai/Critic-Guided Decision Transformer for Offline Reinforcement Learning new file mode 100644 index 0000000000..399d81dd94 --- /dev/null +++ b/data/2024/aaai/Critic-Guided Decision Transformer for Offline Reinforcement Learning @@ -0,0 +1 @@ +Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. 
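Referring to the split conformal prediction (SCP) baseline discussed in the Coverage-Guaranteed Prediction Sets abstract above, a minimal sketch of standard SCP for classification (the textbook recipe that the paper builds on and then adapts to the OOD setting, not the paper's own method) looks like this:

import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Standard split conformal prediction: calibrate a nonconformity quantile on
    held-out data, then return prediction sets with (1 - alpha) marginal coverage,
    assuming calibration and test points are exchangeable."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]            # nonconformity: 1 - prob of true class
    q_level = np.ceil((n + 1) * (1 - alpha)) / n                  # finite-sample corrected quantile level
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # Class y enters the prediction set iff its nonconformity score is at most qhat.
    return test_probs >= 1.0 - qhat

# Toy usage with softmax outputs for 3 classes.
rng = np.random.default_rng(0)
cal_p = rng.dirichlet(np.ones(3), size=200)
test_p = rng.dirichlet(np.ones(3), size=5)
sets = split_conformal_sets(cal_p, rng.integers(0, 3, 200), test_p, alpha=0.1)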
Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations in stochastic environments and on D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Class Feature Augmentation for Class Incremental Learning b/data/2024/aaai/Cross-Class Feature Augmentation for Class Incremental Learning new file mode 100644 index 0000000000..7c11fc9271 --- /dev/null +++ b/data/2024/aaai/Cross-Class Feature Augmentation for Class Incremental Learning @@ -0,0 +1 @@ +We propose a novel class incremental learning approach, which incorporates a feature augmentation technique motivated by adversarial attacks. We employ a classifier learned in the past to complement training examples of previous tasks. The proposed approach offers a unique perspective on utilizing previous knowledge in class incremental learning, since it augments features of arbitrary target classes using examples in other classes via adversarial attacks on a previously learned classifier. By allowing Cross-Class Feature Augmentation (CCFA), each class in the old tasks conveniently populates samples in the feature space, which alleviates the collapse of the decision boundaries caused by sample deficiency for the previous tasks, especially when the number of stored exemplars is small. This idea can be easily incorporated into existing class incremental learning algorithms without any architecture modification. Extensive experiments on the standard benchmarks show that our method consistently outperforms existing class incremental learning methods by significant margins in various scenarios, especially under an environment with an extremely limited memory budget.
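A minimal sketch of the adversarial-attack-style feature augmentation described in the CCFA abstract above (illustrative only; the step sizes, iteration count, and names are assumptions): features of samples from other classes are perturbed by gradient steps so that the frozen, previously learned classifier assigns them to a chosen old target class.

import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_class_feature_augment(old_classifier, feats, target_class, steps=10, step_size=0.1):
    """Push features of arbitrary samples toward a target old class via gradient
    steps against the frozen previously-learned classifier (an adversarial-attack
    style augmentation performed in feature space)."""
    old_classifier.eval()
    aug = feats.clone().detach().requires_grad_(True)
    target = torch.full((feats.size(0),), target_class, dtype=torch.long)
    for _ in range(steps):
        loss = F.cross_entropy(old_classifier(aug), target)
        grad, = torch.autograd.grad(loss, aug)
        # FGSM-style descent step that moves the features toward the target class.
        aug = (aug - step_size * grad.sign()).detach().requires_grad_(True)
    return aug.detach()

# Toy usage: a frozen linear head over 64-d features, synthesizing class-3 features
# from features that originally belonged to other classes.
old_head = nn.Linear(64, 10)
fake_old_class_feats = cross_class_feature_augment(old_head, torch.randn(8, 64), target_class=3)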
Current pose estimation methods are one-time feed-forward and lack the capability to gather feedback and adapt the inference outcome. To address this problem, we propose to explore the concept of progressive inference, where the network learns an observer to continuously detect the prediction error based on constraint matching, as well as an adjuster to refine its inference outcome based on these constraint errors. Within the context of 3D hand pose estimation, we find that this observer-adjuster design is relatively unstable since the observer operates in the 2D image domain while the adjuster operates in the 3D domain. To address this issue, we propose to construct two sets of observer-adjusters with complementary constraints from different perspectives. They operate in a dynamic sequential manner controlled by a decision network to progressively improve the 3D pose estimation. We refer to this method as Cross-Constrained Progressive Inference (CCPI). Our extensive experimental results on the FreiHAND and HO-3D benchmark datasets demonstrate that the proposed CCPI method is able to significantly improve the generalization capability and performance of 3D hand pose estimation. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Covariate Gait Recognition: A Benchmark b/data/2024/aaai/Cross-Covariate Gait Recognition: A Benchmark new file mode 100644 index 0000000000..c3ef5cd726 --- /dev/null +++ b/data/2024/aaai/Cross-Covariate Gait Recognition: A Benchmark @@ -0,0 +1 @@ +Gait datasets are essential for gait research. However, this paper observes that present benchmarks, whether conventional constrained or emerging real-world datasets, fall short regarding covariate diversity. To bridge this gap, we undertake an arduous 20-month effort to collect a cross-covariate gait recognition (CCGR) dataset. The CCGR dataset has 970 subjects and about 1.6 million sequences; almost every subject has 33 views and 53 different covariates. Compared to existing datasets, CCGR has both population-level and individual-level diversity. In addition, the views and covariates are well labeled, enabling the analysis of the effects of different factors. CCGR provides multiple types of gait data, including RGB, parsing, silhouette, and pose, offering researchers a comprehensive resource for exploration. In order to delve deeper into addressing cross-covariate gait recognition, we propose parsing-based gait recognition (ParsingGait) by utilizing the newly proposed parsing data. We have conducted extensive experiments. Our main results show: 1) Cross-covariate variation emerges as a pivotal challenge for practical applications of gait recognition. 2) ParsingGait demonstrates remarkable potential for further advancement. 3) Alarmingly, existing SOTA methods achieve less than 43% accuracy on CCGR, highlighting the urgency of exploring cross-covariate gait recognition. Link: https://github.com/ShinanZou/CCGR. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Domain Contrastive Learning for Time Series Clustering b/data/2024/aaai/Cross-Domain Contrastive Learning for Time Series Clustering new file mode 100644 index 0000000000..bcccf1ef05 --- /dev/null +++ b/data/2024/aaai/Cross-Domain Contrastive Learning for Time Series Clustering @@ -0,0 +1,3 @@ +Most deep learning-based time series clustering models concentrate on data representation in a separate process from clustering. As a result, the clustering loss cannot guide feature extraction.
Moreover, most methods solely analyze data from the temporal domain, disregarding the potential within the frequency domain. + +To address these challenges, we introduce a novel end-to-end Cross-Domain Contrastive learning model for time series Clustering (CDCC). Firstly, it integrates the clustering process and feature extraction using contrastive constraints at both the cluster level and the instance level. Secondly, the data is encoded simultaneously in both the temporal and frequency domains, leveraging contrastive learning to enhance within-domain representation. Thirdly, cross-domain constraints are proposed to align the latent representations and category distributions across domains. With the above strategies, CDCC not only operates end to end but also effectively integrates the temporal and frequency domains. Extensive experiments and visualization analysis are conducted on 40 time series datasets from UCR, demonstrating the superior performance of the proposed model. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Gate MLP with Protein Complex Invariant Embedding Is a One-Shot Antibody Designer b/data/2024/aaai/Cross-Gate MLP with Protein Complex Invariant Embedding Is a One-Shot Antibody Designer new file mode 100644 index 0000000000..e7fb81f8a0 --- /dev/null +++ b/data/2024/aaai/Cross-Gate MLP with Protein Complex Invariant Embedding Is a One-Shot Antibody Designer @@ -0,0 +1 @@ +Antibodies are crucial proteins produced by the immune system in response to foreign substances or antigens. The specificity of an antibody is determined by its complementarity-determining regions (CDRs), which are located in the variable domains of the antibody chains and form the antigen-binding site. Previous studies have utilized complex techniques to generate CDRs, but they suffer from inadequate geometric modeling. Moreover, the common iterative refinement strategies lead to inefficient inference. In this paper, we propose a simple yet effective model that can co-design 1D sequences and 3D structures of CDRs in a one-shot manner. To achieve this, we decouple the antibody CDR design problem into two stages: (i) geometric modeling of protein complex structures and (ii) sequence-structure co-learning. We develop a novel macromolecular structure invariant embedding, specifically for protein complexes, that captures both intra- and inter-component interactions among the backbone atoms, including Calpha, N, C, and O atoms, to achieve comprehensive geometric modeling. Then, we introduce a simple cross-gate MLP for sequence-structure co-learning, allowing sequence and structure representations to implicitly refine each other. This enables our model to design desired sequences and structures in a one-shot manner. Extensive experiments are conducted to evaluate our results at both the sequence and structure levels, which demonstrate that our model achieves superior performance compared to state-of-the-art antibody CDR design methods.
\ No newline at end of file diff --git a/data/2024/aaai/Cross-Layer and Cross-Sample Feature Optimization Network for Few-Shot Fine-Grained Image Classification b/data/2024/aaai/Cross-Layer and Cross-Sample Feature Optimization Network for Few-Shot Fine-Grained Image Classification new file mode 100644 index 0000000000..2880f02950 --- /dev/null +++ b/data/2024/aaai/Cross-Layer and Cross-Sample Feature Optimization Network for Few-Shot Fine-Grained Image Classification @@ -0,0 +1 @@ +Recently, a number of Few-Shot Fine-Grained Image Classification (FS-FGIC) methods have been proposed, but they primarily focus on better fine-grained feature extraction while overlooking two important issues. The first is how to extract discriminative features for Fine-Grained Image Classification tasks while reducing the trivial and non-generalizable sample-level noise introduced in this procedure, so as to overcome the over-fitting problem under the Few-Shot Learning setting. The second is how to achieve satisfactory feature matching between limited support and query samples with variable spatial positions and angles. To address these issues, we propose a novel Cross-layer and Cross-sample feature optimization Network for FS-FGIC, C2-Net for short. The proposed method consists of two main modules: the Cross-Layer Feature Refinement (CLFR) module and the Cross-Sample Feature Adjustment (CSFA) module. The CLFR module further refines the extracted features while integrating outputs from multiple layers to suppress sample-level feature noise interference. Additionally, the CSFA module addresses the feature mismatch between query and support samples through both channel activation and position matching operations. Extensive experiments have been conducted on five fine-grained benchmark datasets, and the results show that C2-Net outperforms other state-of-the-art methods by a significant margin in most cases. Our code is available at: https://github.com/zenith0923/C2-Net. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering b/data/2024/aaai/Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering new file mode 100644 index 0000000000..fff8fd4ab2 --- /dev/null +++ b/data/2024/aaai/Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering @@ -0,0 +1 @@ +Few-shot Visual Question Answering (VQA) realizes few-shot cross-modal learning, which is an emerging and challenging task in computer vision. Currently, most few-shot VQA methods are confined to simply extending few-shot classification methods to cross-modal tasks while ignoring the spatial distribution properties of multimodal features and cross-modal information interaction. To address this problem, we propose a novel Cross-modal feature Distribution Calibration Inference Network (CDCIN), in which a new concept named visual information entropy is proposed to realize multimodal feature distribution calibration via cross-modal information interaction for more effective few-shot VQA. Visual information entropy is a statistical variable that represents the spatial distribution of visual features guided by the question; it is aligned before and after the reasoning process by our proposed visual information entropy calibration module to mitigate redundant information and improve multi-modal features.
To further enhance the inference ability of cross-modal features, we additionally propose a novel pre-training method, where the reasoning sub-network of CDCIN is pretrained on the base class in a VQA classification paradigm and fine-tuned on the few-shot VQA datasets. Extensive experiments demonstrate that our proposed CDCIN achieves excellent performance on few-shot VQA and outperforms state-of-the-art methods on three widely used benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Modal Match for Language Conditioned 3D Object Grounding b/data/2024/aaai/Cross-Modal Match for Language Conditioned 3D Object Grounding new file mode 100644 index 0000000000..d50748b6cc --- /dev/null +++ b/data/2024/aaai/Cross-Modal Match for Language Conditioned 3D Object Grounding @@ -0,0 +1 @@ +Language conditioned 3D object grounding aims to find the object within the 3D scene mentioned by natural language descriptions, which mainly depends on the matching between the visual and natural language modalities. Considerable improvement in grounding performance is achieved by improving the multimodal fusion mechanism or bridging the gap between detection and matching. However, several mismatches are ignored, i.e., mismatch in local visual representation and global sentence representation, and mismatch in visual space and corresponding label word space. In this paper, we propose a cross-modal match method for 3D grounding from the perspective of mitigating these mismatches. Specifically, to match local visual features with the global description sentence, we propose a BEV (Bird’s-eye-view) based global information embedding module. It projects multiple object proposal features into the BEV, and the relations of different objects are accessed by the visual transformer, which can model both positions and features with long-range dependencies. To circumvent the mismatch in feature spaces of different modalities, we propose cross-modal consistency learning. It performs cross-modal consistency constraints to convert the visual feature space into the label word feature space, resulting in easier matching. Besides, we introduce a label distillation loss and a global distillation loss to drive the learning of these matches in a distillation manner. We evaluate our method in mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval b/data/2024/aaai/Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval new file mode 100644 index 0000000000..5cce15b247 --- /dev/null +++ b/data/2024/aaai/Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval @@ -0,0 +1 @@ +Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. 
Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method can consistently improve the performance of image-text retrieval and achieve new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling it to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Sentence Gloss Consistency for Continuous Sign Language Recognition b/data/2024/aaai/Cross-Sentence Gloss Consistency for Continuous Sign Language Recognition new file mode 100644 index 0000000000..d131a45ea8 --- /dev/null +++ b/data/2024/aaai/Cross-Sentence Gloss Consistency for Continuous Sign Language Recognition @@ -0,0 +1 @@ +Continuous sign language recognition (CSLR) aims to recognize gloss sequences from continuous sign videos. Recent works enhance the gloss representation consistency by mining correlations between visual and contextual modules within individual sentences. However, there still remain much richer correlations among glosses across different sentences. In this paper, we present a simple yet effective Cross-Sentence Gloss Consistency (CSGC), which enforces glosses belonging to the same category to be more consistent in representation than those belonging to different categories, across all training sentences. Specifically, in CSGC, a prototype is maintained for each gloss category and benefits the gloss discrimination in a contrastive way. Thanks to the well-distinguished gloss prototype, an auxiliary similarity classifier is devised to enhance the recognition clues, thus yielding more accurate results. Extensive experiments conducted on three CSLR datasets show that our proposed CSGC significantly boosts the performance of CSLR, surpassing existing state-of-the-art works by large margins (i.e., 1.6% on PHOENIX14, 2.4% on PHOENIX14-T, and 5.7% on CSL-Daily). \ No newline at end of file diff --git a/data/2024/aaai/CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues b/data/2024/aaai/CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues new file mode 100644 index 0000000000..ee12acda6f --- /dev/null +++ b/data/2024/aaai/CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues @@ -0,0 +1 @@ +Accurate identification of protein nucleic acid binding residues poses a significant challenge with important implications for various biological processes and drug design. Many typical computational methods for protein analysis rely on a single model that could ignore either the semantic context of the protein or the global 3D geometric information. Consequently, these approaches may result in incomplete or inaccurate protein analysis. To address the above issue, in this paper, we present CrossBind, a novel collaborative cross-modal approach for identifying binding residues by exploiting both protein geometric structure and its sequence prior knowledge extracted from a large-scale protein language model. 
Specifically, our multi modal approach leverages a contrastive learning technique and atom wise attention to capture the positional relationships between atoms and residues, thereby incorporating fine grained local geometric knowledge, for better binding residue prediction. Extensive experimental results demonstrate that our approach outperforms the next best state of the art methods, GraphSite and GraphBind, on DNA and RNA datasets by 10.8/17.3% in terms of the harmonic mean of precision and recall (F1 Score) and 11.9/24.8% in Matthews correlation coefficient (MCC), respectively. We release the code at https://github.com/BEAM-Labs/CrossBind. \ No newline at end of file diff --git a/data/2024/aaai/CrystalBox: Future-Based Explanations for Input-Driven Deep RL Systems b/data/2024/aaai/CrystalBox: Future-Based Explanations for Input-Driven Deep RL Systems new file mode 100644 index 0000000000..a6461c85fd --- /dev/null +++ b/data/2024/aaai/CrystalBox: Future-Based Explanations for Input-Driven Deep RL Systems @@ -0,0 +1 @@ +We present CrystalBox, a novel, model-agnostic, posthoc explainability framework for Deep Reinforcement Learning (DRL) controllers in the large family of input-driven environments which includes computer systems. We combine the natural decomposability of reward functions in input-driven environments with the explanatory power of decomposed returns. We propose an efficient algorithm to generate future-based explanations across both discrete and continuous control environments. Using applications such as adaptive bitrate streaming and congestion control, we demonstrate CrystalBox's capability to generate high-fidelity explanations. We further illustrate its higher utility across three practical use cases: contrastive explanations, network observability, and guided reward design, as opposed to prior explainability techniques that identify salient features. \ No newline at end of file diff --git a/data/2024/aaai/Cumulative Difference Learning VAE for Time-Series with Temporally Correlated Inflow-Outflow b/data/2024/aaai/Cumulative Difference Learning VAE for Time-Series with Temporally Correlated Inflow-Outflow new file mode 100644 index 0000000000..3c732bc6ca --- /dev/null +++ b/data/2024/aaai/Cumulative Difference Learning VAE for Time-Series with Temporally Correlated Inflow-Outflow @@ -0,0 +1 @@ +Time-series generation has crucial practical significance for decision-making under uncertainty. Existing methods have various limitations like accumulating errors over time, significantly impacting downstream tasks. We develop a novel generation method, DT-VAE, that incorporates generalizable domain knowledge, is mathematically justified, and significantly outperforms existing methods by mitigating error accumulation through a cumulative difference learning mechanism. We evaluate the performance of DT-VAE on several downstream tasks using both semi-synthetic and real time-series datasets, including benchmark datasets and our newly curated COVID-19 hospitalization datasets. The COVID-19 datasets enrich existing resources for time-series analysis. Additionally, we introduce Diverse Trend Preserving (DTP), a time-series clustering-based evaluation for direct and interpretable assessments of generated samples, serving as a valuable tool for evaluating time-series generative models. 
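The cumulative difference learning mechanism mentioned in the DT-VAE abstract above can be pictured with the toy decoder below: the network predicts per-step increments and the series is rebuilt by a cumulative sum from a known starting value, so per-step errors are not fed back autoregressively. This is only an illustrative sketch under assumed names and shapes (DiffDecoder, latent_dim, horizon), not the DT-VAE implementation.

# Illustrative cumulative-difference decoder: predict increments, then cumsum
# from the known initial value instead of feeding predictions back step by step.
import torch
import torch.nn as nn

class DiffDecoder(nn.Module):
    def __init__(self, latent_dim=16, horizon=24):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, horizon))

    def forward(self, z, x0):
        deltas = self.net(z)                                     # per-step increments, (batch, horizon)
        return x0.unsqueeze(-1) + torch.cumsum(deltas, dim=-1)   # reconstructed trajectory

decoder = DiffDecoder()
z = torch.randn(4, 16)      # latent codes, e.g. from a VAE encoder
x0 = torch.zeros(4)         # known starting values of each series
series = decoder(z, x0)     # (4, 24) generated time series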
\ No newline at end of file diff --git a/data/2024/aaai/Cumulative Regret Analysis of the Piyavskii-Shubert Algorithm and Its Variants for Global Optimization b/data/2024/aaai/Cumulative Regret Analysis of the Piyavskii-Shubert Algorithm and Its Variants for Global Optimization new file mode 100644 index 0000000000..0b6d3ccaca --- /dev/null +++ b/data/2024/aaai/Cumulative Regret Analysis of the Piyavskii-Shubert Algorithm and Its Variants for Global Optimization @@ -0,0 +1 @@ +We study the problem of global optimization, where we analyze the performance of the Piyavskii-Shubert algorithm and its variants. For any given time duration T, instead of the extensively studied simple regret (which is the difference of the losses between the best estimate up to T and the global minimum), we study the cumulative regret up to time T. For L-Lipschitz continuous functions, we show that the cumulative regret is O(L log T). For H-Lipschitz smooth functions, we show that the cumulative regret is O(H). We analytically extend our results for functions with Hölder continuous derivatives, which cover both the Lipschitz continuous and the Lipschitz smooth functions, individually. We further show that a simpler variant of the Piyavskii-Shubert algorithm performs just as well as the traditional variants for the Lipschitz continuous or the Lipschitz smooth functions. We further extend our results to broader classes of functions, and show that our algorithm efficiently determines its queries and achieves nearly minimax optimal (up to log factors) cumulative regret for general convex or even concave regularity conditions on the extrema of the objective (which encompasses many preceding regularities). We consider further extensions by investigating the performance of the Piyavskii-Shubert variants in scenarios with unknown regularity, noisy evaluations and multivariate domains. \ No newline at end of file diff --git a/data/2024/aaai/Curvature-Invariant Adversarial Attacks for 3D Point Clouds b/data/2024/aaai/Curvature-Invariant Adversarial Attacks for 3D Point Clouds new file mode 100644 index 0000000000..e6917d1182 --- /dev/null +++ b/data/2024/aaai/Curvature-Invariant Adversarial Attacks for 3D Point Clouds @@ -0,0 +1 @@ +Imperceptibility is one of the crucial requirements for adversarial examples. Previous adversarial attacks on 3D point cloud recognition suffer from noticeable outliers, resulting in low imperceptibility. We think that these drawbacks can be alleviated by taking the local curvature of the point cloud into consideration. Existing approaches introduce the local geometry distance into the attack objective function. However, their definition of the local geometry distance neglects the different perceptibility of distortions along different directions. In this paper, we aim to enhance the imperceptibility of adversarial attacks on 3D point cloud recognition by better preserving the local curvature of the original 3D point clouds. To this end, we propose the Curvature-Invariant Method (CIM), which directly regularizes the back-propagated gradient during the generation of adversarial point clouds based on two assumptions. Specifically, we first decompose the back-propagated gradients into the tangent plane and the normal direction. Then we directly reduce the gradient along the large curvature direction on the tangent plane and only keep the gradient along the negative normal direction. Comprehensive experimental comparisons confirm the superiority of our approach. 
Notably, our strategy can achieve 7.2% and 14.5% improvements in the Hausdorff distance and Gaussian curvature measurements of imperceptibility, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Curved Representation Space of Vision Transformers b/data/2024/aaai/Curved Representation Space of Vision Transformers new file mode 100644 index 0000000000..d1584bd345 --- /dev/null +++ b/data/2024/aaai/Curved Representation Space of Vision Transformers @@ -0,0 +1 @@ +Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs). However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs, while not being overconfident. This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by empirically investigating how the output of the penultimate layer moves in the representation space as the input data moves linearly within a small area. In particular, we show the following. (1) While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers show a nonlinear relationship for some data. For such data points, the output of Transformers moves in a curved trajectory as the input moves linearly. (2) When a data point is located in a curved region, it is hard to move it out of the decision region since the output moves along a curved trajectory instead of a straight line to the decision boundary, resulting in high robustness of Transformers. (3) If a data point is slightly modified to jump out of the curved region, the movements afterwards become linear and the output goes to the decision boundary directly. In other words, there does exist a decision boundary near the data, which is hard to find only because of the curved representation space. This explains the underconfident prediction of Transformers. Also, we examine mathematical properties of the attention operation that induce nonlinear response to linear perturbation. Finally, we share our additional findings regarding what contributes to the curved representation space of Transformers, and how the curvedness evolves during training. \ No newline at end of file diff --git a/data/2024/aaai/Customizing Language Model Responses with Contrastive In-Context Learning b/data/2024/aaai/Customizing Language Model Responses with Contrastive In-Context Learning new file mode 100644 index 0000000000..ae0bbf628d --- /dev/null +++ b/data/2024/aaai/Customizing Language Model Responses with Contrastive In-Context Learning @@ -0,0 +1,2 @@ +Large language models (LLMs) are becoming increasingly important for machine learning applications. However, it can be challenging to align LLMs with our intent, particularly when we want to generate content that is preferable to others or when we want the LLM to respond in a certain style or tone that is hard to describe. To address this challenge, we propose an approach that uses contrastive examples to better describe our intent. This involves providing positive examples that illustrate the true intent, along with negative examples that show what characteristics we want LLMs to avoid. The negative examples can be retrieved from labeled data, written by a human, or generated by the LLM itself. +Before generating an answer, we ask the model to analyze the examples to teach itself what to avoid. 
This reasoning step provides the model with the appropriate articulation of the user's need and guides it towards generating a better answer. We tested our approach on both synthesized and real-world datasets, including StackExchange and Reddit, and found that it significantly improves performance compared to standard few-shot prompting. \ No newline at end of file diff --git a/data/2024/aaai/CutFreq: Cut-and-Swap Frequency Components for Low-Level Vision Augmentation b/data/2024/aaai/CutFreq: Cut-and-Swap Frequency Components for Low-Level Vision Augmentation new file mode 100644 index 0000000000..c1b4b20005 --- /dev/null +++ b/data/2024/aaai/CutFreq: Cut-and-Swap Frequency Components for Low-Level Vision Augmentation @@ -0,0 +1 @@ +Low-level vision plays a crucial role in a wide range of imaging quality and image recognition applications. However, the limited size, quality, and diversity of datasets often pose significant challenges for low-level tasks. Data augmentation is the most effective and practical way of sample expansion, but the commonly used augmentation methods in high-level tasks bring limited improvement in low-level tasks due to boundary effects or non-realistic context information. In this paper, we propose the Cut-and-Swap Frequency Components (CutFreq) method for low-level vision, which aims to preserve high-level representations with directionality and improve image synthesis quality. Observing the significant frequency domain differences between reconstructed images and real ones, in CutFreq, we propose to transform the input and real images separately in the frequency domain, then define two stages for the model training process, and finally swap the specified frequency bands respectively and inversely transform to generate augmented samples. The experimental results show the superior performance of CutFreq on five low-level vision tasks. Moreover, we demonstrate the effectiveness of CutFreq in the low-data regime. Code is available at https://github.com/DreamerCCC/CutFreq. \ No newline at end of file diff --git a/data/2024/aaai/CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs b/data/2024/aaai/CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs new file mode 100644 index 0000000000..ba3c1a1567 --- /dev/null +++ b/data/2024/aaai/CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs @@ -0,0 +1 @@ +Building a skilled cybersecurity workforce is paramount to building a safer digital world. However, the diverse skill set, constantly emerging vulnerabilities, and deployment of new cyber threats make learning cybersecurity challenging. Traditional education methods struggle to cope with cybersecurity's rapidly evolving landscape and keep students engaged and motivated. Different studies on students' behaviors show that an interactive mode of education that engages learners through a question-answering system or dialogue is one of the most effective learning methodologies. There is a strong need to create advanced AI-enabled education tools to promote interactive learning in cybersecurity. Unfortunately, there are no publicly available standard question-answer datasets to build such systems for students and novice learners to learn cybersecurity concepts, tools, and techniques. 
The education course material and online question banks are unstructured and need to be validated and updated by domain experts, which is tedious when done manually. In this paper, we propose CyberGen, a novel unification of large language models (LLMs) and knowledge graphs (KG) to generate the questions and answers for cybersecurity automatically. Augmenting the structured knowledge from knowledge graphs in prompts improves factual reasoning and reduces hallucinations in LLMs. We used the knowledge triples from cybersecurity knowledge graphs (AISecKG) to design prompts for ChatGPT and generate questions and answers using different prompting techniques. Our question-answer dataset, CyberQ, contains around 4k pairs of questions and answers. A domain expert manually evaluated random samples for consistency and correctness. We train the generative model using the CyberQ dataset for the question answering task. \ No newline at end of file diff --git a/data/2024/aaai/Cycle Self-Refinement for Multi-Source Domain Adaptation b/data/2024/aaai/Cycle Self-Refinement for Multi-Source Domain Adaptation new file mode 100644 index 0000000000..98fda9fad0 --- /dev/null +++ b/data/2024/aaai/Cycle Self-Refinement for Multi-Source Domain Adaptation @@ -0,0 +1 @@ +Multi-source domain adaptation (MSDA) aims to transfer knowledge from multiple source domains to the unlabeled target domain. In this paper, we propose a cycle self-refinement domain adaptation method, which progressively attempts to learn the dominant transferable knowledge in each source domain in a cyclic manner. Specifically, several source-specific networks and a domain-ensemble network are adopted in the proposed method. The source-specific networks are adopted to provide the dominant transferable knowledge in each source domain for instance-level ensemble on predictions of the samples in the target domain. Then these samples with high-confidence ensemble predictions are adopted to refine the domain-ensemble network. Meanwhile, to guide each source-specific network to learn more dominant transferable knowledge, we force the features of the target domain from the domain-ensemble network and the features of each source domain from the corresponding source-specific network to be aligned with their predictions from the corresponding networks. Thus the adaptation ability of source-specific networks and the domain-ensemble network can be improved progressively. Extensive experiments on Office-31, Office-Home and DomainNet show that the proposed method outperforms the state-of-the-art methods for most tasks. \ No newline at end of file diff --git a/data/2024/aaai/Cycle-Consistency Learning for Captioning and Grounding b/data/2024/aaai/Cycle-Consistency Learning for Captioning and Grounding new file mode 100644 index 0000000000..c64d9595e5 --- /dev/null +++ b/data/2024/aaai/Cycle-Consistency Learning for Captioning and Grounding @@ -0,0 +1 @@ +We show that visual grounding and image captioning, which perform as two mutually inverse processes, can be bridged together for collaborative training through careful designs. By consolidating this idea, we introduce CyCo, a cyclic-consistent learning framework to ameliorate the independent training pipelines of visual grounding and image captioning. The proposed framework (1) allows the semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; (3) yields a general captioning model that can describe arbitrary image regions. 
Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one also exhibits competitive performance compared to the fully supervised counterparts. Our image captioning model has the capability to freely describe image regions and meanwhile shows impressive performance on prevalent captioning benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/CycleVTON: A Cycle Mapping Framework for Parser-Free Virtual Try-On b/data/2024/aaai/CycleVTON: A Cycle Mapping Framework for Parser-Free Virtual Try-On new file mode 100644 index 0000000000..37e493ac68 --- /dev/null +++ b/data/2024/aaai/CycleVTON: A Cycle Mapping Framework for Parser-Free Virtual Try-On @@ -0,0 +1 @@ +Image-based virtual try-on aims to transfer a target clothing item onto a specific person. A significant challenge is that arbitrarily matched clothing and person pairs lack the corresponding ground truth needed for supervised learning. A recent pioneering work leveraged an improved cycleGAN to enable one network to generate the desired image for another network during training. However, there is no difference in the result distribution before and after the clothing changes. Therefore, using two different networks is unnecessary and may even increase the difficulty of convergence. Furthermore, the introduced human parsing used to provide body structure information in the input also has a negative impact on the try-on result. How can a single network be employed for supervised learning while eliminating human parsing? To tackle these issues, we present a Cycle mapping Virtual Try-On Network (CycleVTON), which can produce photo-realistic try-on results by using a cycle mapping framework without the parser. In particular, we introduce a flow constraint loss to achieve supervised learning of arbitrarily matched clothing and person as inputs to the deformer, thus naturally mimicking the interaction between clothing and the human body. Additionally, we design a skin generation strategy that can adapt to the shape of the target clothing by dynamically adjusting the skin region, i.e., by first removing and then filling skin areas. Extensive experiments conducted on challenging benchmarks demonstrate that our proposed method exhibits superior performance compared to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/D3: A Methodological Exploration of Domain Division, Modeling, and Balance in Multi-Domain Recommendations b/data/2024/aaai/D3: A Methodological Exploration of Domain Division, Modeling, and Balance in Multi-Domain Recommendations new file mode 100644 index 0000000000..2468811ecd --- /dev/null +++ b/data/2024/aaai/D3: A Methodological Exploration of Domain Division, Modeling, and Balance in Multi-Domain Recommendations @@ -0,0 +1 @@ +To enhance the efficacy of multi-scenario services in industrial recommendation systems, multi-domain recommendation has become prominent, which entails simultaneous modeling of all domains through a unified model, effectively capturing commonalities and differences among them. However, current methods rely on manual domain partitioning, which overlooks the intricate domain relationships and the heterogeneity of different domains during joint optimization, hindering the integration of domain commonalities and differences. To address these challenges, this paper proposes a universal and flexible framework D3 aimed at optimizing the multi-domain recommendation pipeline from three key aspects. 
Firstly, an attention-based domain adaptation module is introduced to automatically identify and incorporate domain-sensitive features during training. Secondly, we propose a fusion gate module that enables the seamless integration of commonalities and diversities among domains, allowing for implicit characterization of intricate domain relationships. Lastly, we tackle the issue of joint optimization by deriving loss weights from two complementary viewpoints: domain complexity and domain specificity, alleviating inconsistencies among different domains during the training phase. Experiments on three public datasets demonstrate the effectiveness and superiority of our proposed framework. In addition, D3 has been implemented on a real-life, high-traffic internet platform catering to millions of users daily. \ No newline at end of file diff --git a/data/2024/aaai/DA-Net: A Disentangled and Adaptive Network for Multi-Source Cross-Lingual Transfer Learning b/data/2024/aaai/DA-Net: A Disentangled and Adaptive Network for Multi-Source Cross-Lingual Transfer Learning new file mode 100644 index 0000000000..6261b139de --- /dev/null +++ b/data/2024/aaai/DA-Net: A Disentangled and Adaptive Network for Multi-Source Cross-Lingual Transfer Learning @@ -0,0 +1 @@ +Multi-source cross-lingual transfer learning deals with the transfer of task knowledge from multiple labelled source languages to an unlabeled target language under the language shift. Existing methods typically focus on weighting the predictions produced by language-specific classifiers of different sources that follow a shared encoder. However, all source languages share the same encoder, which is updated by all these languages. The extracted representations inevitably contain different source languages' information, which may disturb the learning of the language-specific classifiers. Additionally, due to the language gap, language-specific classifiers trained with source labels are unable to make accurate predictions for the target language. Both facts impair the model's performance. To address these challenges, we propose a Disentangled and Adaptive Network (DA-Net). Firstly, we devise a feedback-guided collaborative disentanglement method that seeks to purify input representations of classifiers, thereby mitigating mutual interference from multiple sources. Secondly, we propose a class-aware parallel adaptation method that aligns class-level distributions for each source-target language pair, thereby alleviating the language gap of each language pair. Experimental results on three different tasks involving 38 languages validate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/DAG-Aware Variational Autoencoder for Social Propagation Graph Generation b/data/2024/aaai/DAG-Aware Variational Autoencoder for Social Propagation Graph Generation new file mode 100644 index 0000000000..edb938005d --- /dev/null +++ b/data/2024/aaai/DAG-Aware Variational Autoencoder for Social Propagation Graph Generation @@ -0,0 +1 @@ +Propagation models in social networks are critical, with extensive applications across various fields and downstream tasks. However, existing propagation models are often oversimplified, scenario-specific, and lack real-world user social attributes. These limitations, which detach the models from real-world analysis, lead to inaccurate representations of the propagation process in social networks. 
To address these issues, we propose a User Features Attention-based DAG-Aware Variational Autoencoder (DAVA) for propagation graph generation. First, nearly 1 million pieces of user attribute data are collected. Then DAVA can integrate the analysis of propagation graph topology and corresponding user attributes as prior knowledge. By leveraging a lightweight attention-based framework and a sliding window mechanism based on BFS permutations weighted by user influence, DAVA significantly enhances the ability to generate realistic, large-scale propagation data, yielding graph scales ten times greater than those produced by existing SOTA methods. Every module of DAVA is flexible and extensible, allowing easy substitution to suit other generation tasks. Additionally, we provide a comprehensive evaluation of DAVA; one focus is the effectiveness of generated data in improving the performance of downstream tasks. During the generation process, we discover the Credibility Erosion Effect by modifying the generation rules, revealing a social phenomenon in social network propagation. \ No newline at end of file diff --git a/data/2024/aaai/DALDet: Depth-Aware Learning Based Object Detection for Autonomous Driving b/data/2024/aaai/DALDet: Depth-Aware Learning Based Object Detection for Autonomous Driving new file mode 100644 index 0000000000..b20a14d0ab --- /dev/null +++ b/data/2024/aaai/DALDet: Depth-Aware Learning Based Object Detection for Autonomous Driving @@ -0,0 +1 @@ +3D object detection achieves good detection performance in autonomous driving. However, it requires substantial computational resources, which prevents its practical application. 2D object detection has less computational burden but lacks spatial and geometric information embedded in depth. Therefore, we present DALDet, an efficient depth-aware learning based 2D detector, achieving high-performance object detection for autonomous driving. We design an efficient one-stage detection framework and seamlessly integrate depth cues into the convolutional neural network by introducing depth-aware convolution and depth-aware average pooling, which effectively improve the detector's ability to perceive 3D space. Moreover, we propose a depth-guided loss function for training DALDet, which effectively improves the localization ability of the detector. Due to the use of the depth map, DALDet can also output the distance of the object, which is of great importance for driving applications such as obstacle avoidance. Extensive experiments demonstrate the superiority and efficiency of DALDet. In particular, our DALDet ranks 1st on both KITTI Car and Cyclist 2D detection test leaderboards among all 2D detectors with high efficiency as well as yielding competitive performance among many leading 3D detectors. Code will be available at https://github.com/hukefy/DALDet. \ No newline at end of file diff --git a/data/2024/aaai/DART: Dual-Modal Adaptive Online Prompting and Knowledge Retention for Test-Time Adaptation b/data/2024/aaai/DART: Dual-Modal Adaptive Online Prompting and Knowledge Retention for Test-Time Adaptation new file mode 100644 index 0000000000..ebbd705d4c --- /dev/null +++ b/data/2024/aaai/DART: Dual-Modal Adaptive Online Prompting and Knowledge Retention for Test-Time Adaptation @@ -0,0 +1 @@ +As an up-and-coming area, CLIP-based pre-trained vision-language models can readily facilitate downstream tasks in a zero-shot or few-shot fine-tuning manner. 
However, they still face critical challenges in test-time generalization due to the shifts between the training and test data distributions, hindering the further improvement of the performance. To address this crucial problem, the latest works have introduced Test-Time Adaptation (TTA) techniques to CLIP which dynamically learn text prompts using only test samples. However, their limited learning capacity, caused by overlooking visual modality information, and the underutilization of knowledge from previously seen test samples result in reduced performance. In this paper, we propose a novel Dual-modal Adaptive online prompting and knowledge ReTention method called DART to overcome these challenges. To increase the learning capacity, DART captures knowledge from each test sample by learning class-specific text prompts and instance-level image prompts. Additionally, to fully leverage the knowledge from previously seen test samples, DART utilizes dual-modal knowledge retention prompts to adaptively retain the acquired knowledge, thereby enhancing the predictions on subsequent test samples. Extensive experiments on various large-scale benchmarks demonstrate the effectiveness of our proposed DART against state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/DC-NAS: Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification b/data/2024/aaai/DC-NAS: Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification new file mode 100644 index 0000000000..ff4cff642b --- /dev/null +++ b/data/2024/aaai/DC-NAS: Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification @@ -0,0 +1 @@ +Neural architecture search-based multi-modal classification (NAS-MMC) methods can individually obtain the optimal classifier for different multi-modal data sets in an automatic manner. However, most existing NAS-MMC methods are dramatically time-consuming due to the requirement for training and evaluating enormous models. In this paper, we propose an efficient evolutionary-based NAS-MMC method called divide-and-conquer neural architecture search (DC-NAS). Specifically, the evolved population is first divided into k+1 sub-populations; k of these sub-populations evolve on k small-scale data sets, respectively, which are obtained by splitting the entire data set using the k-fold stratified sampling technique, while the remaining one evolves on the entire data set. To solve the sub-optimal fusion model problem caused by training on partial data, the two kinds of sub-populations, trained on partial data and on the entire data set, exchange the learned knowledge via two special knowledge bases. With the two techniques mentioned above, DC-NAS achieves both training time reduction and classification performance improvement. Experimental results show that DC-NAS achieves state-of-the-art results in terms of classification performance, training efficiency, and the number of model parameters compared with existing NAS-MMC methods on three popular multi-modal tasks, including multi-label movie genre classification, action recognition with RGB and body joints, and dynamic hand gesture recognition. 
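The divide-and-conquer split described in the DC-NAS abstract above can be approximated with a few lines of scikit-learn: k stratified folds give the k small-scale data sets for k sub-populations, while one extra sub-population keeps the full set. The helper below is an illustrative sketch under assumed names; population encoding, evolution, and the knowledge-base exchange are omitted.

# Sketch of the data-splitting step: k disjoint, label-stratified subsets plus the full set.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_subpopulation_data(X, y, k=3, seed=0):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    # each held-out fold becomes one small-scale dataset, preserving label proportions
    partial = [(X[idx], y[idx]) for _, idx in skf.split(X, y)]
    full = (X, y)                      # the remaining sub-population sees all data
    return partial + [full]

X = np.random.randn(300, 32)
y = np.random.randint(0, 3, size=300)
datasets = make_subpopulation_data(X, y, k=3)   # 4 datasets for the k+1 sub-populations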
\ No newline at end of file diff --git a/data/2024/aaai/DCLP: Neural Architecture Predictor with Curriculum Contrastive Learning b/data/2024/aaai/DCLP: Neural Architecture Predictor with Curriculum Contrastive Learning new file mode 100644 index 0000000000..ff4aa682e7 --- /dev/null +++ b/data/2024/aaai/DCLP: Neural Architecture Predictor with Curriculum Contrastive Learning @@ -0,0 +1 @@ +Neural predictors have shown great potential in the evaluation process of neural architecture search (NAS). However, current predictor-based approaches overlook the fact that training a predictor necessitates a considerable number of trained neural networks as the labeled training set, which is costly to obtain. Therefore, the critical issue in utilizing predictors for NAS is to train a high-performance predictor using as few trained neural networks as possible. Although some methods attempt to address this problem through unsupervised learning, they often result in inaccurate predictions. We argue that the unsupervised tasks intended for the common graph data are too challenging for neural networks, causing unsupervised training to be susceptible to performance crashes in NAS. To address this issue, we propose a CurricuLum-guided Contrastive Learning framework for neural Predictor (DCLP). Our method simplifies the contrastive task by designing a novel curriculum to enhance the stability of the unlabeled training data distribution during contrastive training. Specifically, we propose a scheduler that ranks the training data according to the contrastive difficulty of each data point and then inputs them to the contrastive learner in order. This approach concentrates the training data distribution and makes contrastive training more efficient. By using our method, the contrastive learner incrementally learns feature representations via unsupervised data on a smooth learning curve, avoiding performance crashes that may occur with excessively variable training data distributions. We experimentally demonstrate that DCLP has high accuracy and efficiency compared with existing predictors, and shows promising potential to discover superior architectures in various search spaces when combined with search strategies. Our code is available at: https://github.com/Zhengsh123/DCLP. \ No newline at end of file diff --git a/data/2024/aaai/DCV2I: A Practical Approach for Supporting Geographers' Visual Interpretation in Dune Segmentation with Deep Vision Models b/data/2024/aaai/DCV2I: A Practical Approach for Supporting Geographers' Visual Interpretation in Dune Segmentation with Deep Vision Models new file mode 100644 index 0000000000..d87219f908 --- /dev/null +++ b/data/2024/aaai/DCV2I: A Practical Approach for Supporting Geographers' Visual Interpretation in Dune Segmentation with Deep Vision Models @@ -0,0 +1 @@ +Visual interpretation is extremely important in human geography as the primary technique for geographers to use photograph data in identifying, classifying, and quantifying geographic and topological objects or regions. However, it is also time-consuming and requires overwhelming manual effort from professional geographers. This paper describes our interdisciplinary team's efforts in integrating computer vision models with geographers' visual image interpretation process to reduce their workload in interpreting images. Focusing on the dune segmentation task, we proposed an approach featuring a deep dune segmentation model to identify dunes and label their ranges in an automated way. 
By developing a tool to connect our model with ArcGIS, one of the most popular workbenches for visual interpretation, geographers can further refine the automatically-generated dune segmentation on images without learning any CV or deep learning techniques. Our approach thus realized a non-invasive change to geographers' visual interpretation routines, reducing their manual efforts while incurring minimal interruptions to the work routines and tools they are familiar with. Deployment with a leading Chinese geography research institution demonstrated the potential of our approach in supporting geographers in researching and solving drylands desertification. \ No newline at end of file diff --git a/data/2024/aaai/DDAE: Towards Deep Dynamic Vision BERT Pretraining b/data/2024/aaai/DDAE: Towards Deep Dynamic Vision BERT Pretraining new file mode 100644 index 0000000000..4235f11620 --- /dev/null +++ b/data/2024/aaai/DDAE: Towards Deep Dynamic Vision BERT Pretraining @@ -0,0 +1 @@ +Recently, masked image modeling (MIM) has demonstrated promising prospects in self-supervised representation learning. However, existing MIM frameworks recover all masked patches equivalently, ignoring that the reconstruction difficulty of different patches can vary sharply due to their diverse distance from visible patches. In this paper, we propose a novel deep dynamic supervision to enable MIM methods to dynamically reconstruct patches with different degrees of difficulty at different pretraining phases and depths of the model. Our deep dynamic supervision helps to provide more locality inductive bias for ViTs, especially in deep layers, which inherently makes up for the absence of a local prior in the self-attention mechanism. Built upon the deep dynamic supervision, we propose Deep Dynamic AutoEncoder (DDAE), a simple yet effective MIM framework that utilizes dynamic mechanisms for pixel regression and feature self-distillation simultaneously. Extensive experiments across a variety of vision tasks including ImageNet classification, semantic segmentation on ADE20K and object detection on COCO demonstrate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/DDViT: Double-Level Fusion Domain Adapter Vision Transformer (Student Abstract) b/data/2024/aaai/DDViT: Double-Level Fusion Domain Adapter Vision Transformer (Student Abstract) new file mode 100644 index 0000000000..a147acfe09 --- /dev/null +++ b/data/2024/aaai/DDViT: Double-Level Fusion Domain Adapter Vision Transformer (Student Abstract) @@ -0,0 +1 @@ +With the help of Vision transformers (ViTs), medical image segmentation was able to achieve outstanding performance. In particular, they overcome the limitation of convolutional neural networks (CNNs) which rely on local receptive fields. ViTs use self-attention mechanisms to consider relationships between all image pixels or patches simultaneously. However, they require large datasets for training and do not perform well at capturing low-level features. To that end, we propose DDViT, a novel ViT model that unites a CNN to alleviate data hunger for medical image segmentation, with two multi-scale feature representations. Significantly, our approach incorporates a ViT with a plug-in domain adapter (DA) and a Double-Level Fusion (DLF) technique, complemented by a mutual knowledge distillation paradigm, facilitating the seamless exchange of knowledge between a universal network and specialized domain-specific network branches. 
The DLF framework plays a pivotal role in our encoder-decoder architecture, combining the innovation of the TransFuse module with a robust CNN-based encoder. Extensive experimentation across diverse medical image segmentation datasets underscores the remarkable efficacy of DDViT when compared to alternative approaches based on CNNs and Transformer-based models. \ No newline at end of file diff --git a/data/2024/aaai/DGA-GNN: Dynamic Grouping Aggregation GNN for Fraud Detection b/data/2024/aaai/DGA-GNN: Dynamic Grouping Aggregation GNN for Fraud Detection new file mode 100644 index 0000000000..5a2c9583e9 --- /dev/null +++ b/data/2024/aaai/DGA-GNN: Dynamic Grouping Aggregation GNN for Fraud Detection @@ -0,0 +1,2 @@ +Fraud detection has increasingly become a prominent research field due to the dramatically increased incidents of fraud. The complex connections involving thousands, or even millions of nodes, present challenges for fraud detection tasks. Many researchers have developed various graph-based methods to detect fraud from these intricate graphs. However, those methods neglect two distinct characteristics of the fraud graph: the non-additivity of certain attributes and the distinguishability of grouped messages from neighbor nodes. +This paper introduces the Dynamic Grouping Aggregation Graph Neural Network (DGA-GNN) for fraud detection, which addresses these two characteristics by dynamically grouping attribute value ranges and neighbor nodes. In DGA-GNN, we initially propose the decision tree binning encoding to transform non-additive node attributes into bin vectors. This approach aligns well with the GNN’s aggregation operation and avoids nonsensical feature generation. Furthermore, we devise a feedback dynamic grouping strategy to classify graph nodes into two distinct groups and then employ a hierarchical aggregation. This method extracts more discriminative features for fraud detection tasks. Extensive experiments on five datasets suggest that our proposed method achieves a 3% ~ 16% improvement over existing SOTA methods. Code is available at https://github.com/AtwoodDuan/DGA-GNN. \ No newline at end of file diff --git a/data/2024/aaai/DGCLUSTER: A Neural Framework for Attributed Graph Clustering via Modularity Maximization b/data/2024/aaai/DGCLUSTER: A Neural Framework for Attributed Graph Clustering via Modularity Maximization new file mode 100644 index 0000000000..b61eeb167c --- /dev/null +++ b/data/2024/aaai/DGCLUSTER: A Neural Framework for Attributed Graph Clustering via Modularity Maximization @@ -0,0 +1 @@ +Graph clustering is a fundamental and challenging task in the field of graph mining where the objective is to group the nodes into clusters taking into consideration the topology of the graph. It has several applications in diverse domains spanning social network analysis, recommender systems, computer vision, and bioinformatics. In this work, we propose a novel method, DGCluster, which primarily optimizes the modularity objective using graph neural networks and scales linearly with the graph size. Our method does not require the number of clusters to be specified as a part of the input and can also leverage the availability of auxiliary node level information. We extensively test DGCluster on several real-world datasets of varying sizes, across multiple popular cluster quality metrics. Our approach consistently outperforms the state-of-the-art methods, demonstrating significant performance gains in almost all settings. 
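For readers unfamiliar with modularity maximization, the snippet below shows a standard differentiable soft-modularity objective that a GNN head could optimize over soft cluster assignments; it follows the usual Q = trace(C^T B C) / 2m form with B = A - d d^T / 2m, and is a generic sketch rather than necessarily the exact loss used by DGCluster.

# Differentiable soft-modularity loss over GNN-produced assignment logits (illustrative).
import torch

def soft_modularity(adj, assign_logits):
    C = torch.softmax(assign_logits, dim=-1)        # (n_nodes, n_clusters) soft assignments
    deg = adj.sum(dim=1)                            # node degrees
    two_m = adj.sum()                               # 2 * number of edges for an unweighted graph
    B = adj - torch.outer(deg, deg) / two_m         # modularity matrix
    return torch.trace(C.T @ B @ C) / two_m         # higher is better

adj = (torch.rand(50, 50) < 0.1).float()
adj = torch.triu(adj, 1); adj = adj + adj.T         # symmetric adjacency, no self-loops
logits = torch.randn(50, 4, requires_grad=True)     # would come from the GNN head
loss = -soft_modularity(adj, logits)                # minimizing the negative maximizes modularity
loss.backward()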
\ No newline at end of file diff --git a/data/2024/aaai/DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval b/data/2024/aaai/DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval new file mode 100644 index 0000000000..cac484ab2c --- /dev/null +++ b/data/2024/aaai/DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval @@ -0,0 +1 @@ +Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) The visual encoder could only encode frame-level features and failed to extract global-level general video information. (2) Equipping the visual and text encoder with separated prompts failed to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video in a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL. \ No newline at end of file diff --git a/data/2024/aaai/DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization b/data/2024/aaai/DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization new file mode 100644 index 0000000000..3454031b35 --- /dev/null +++ b/data/2024/aaai/DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization @@ -0,0 +1 @@ +Most reinforcement learning algorithms seek a single optimal strategy that solves a given task. However, it can often be valuable to learn a diverse set of solutions, for instance, to make an agent's interaction with users more engaging, or improve the robustness of a policy to an unexpected perturbation. We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm that discovers multiple strategies for solving a given task. Unlike prior work, it achieves this with a shared policy network trained over a single run. Specifically, we design an intrinsic reward based on an information-theoretic diversity objective. Our final objective alternately constrains the diversity of the strategies and the extrinsic reward. We solve the constrained optimization problem by casting it as a probabilistic inference task and use policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards, while discovering more diverse strategies, and often with better sample efficiency. 
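The information-theoretic diversity reward mentioned in the DGPO abstract is commonly implemented with a discriminator that predicts the latent strategy from visited states, as in the DIAYN-style sketch below. DGPO's exact objective and network shapes may differ; all names and dimensions here are illustrative assumptions.

# Illustrative intrinsic reward: log q(z|s) - log p(z), high when states reveal the strategy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StrategyDiscriminator(nn.Module):
    def __init__(self, state_dim, n_strategies):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_strategies))

    def forward(self, state):
        return self.net(state)                        # logits over latent strategies

def diversity_reward(disc, state, z, n_strategies):
    # gather the log-probability assigned to the strategy that actually generated the state
    logq = F.log_softmax(disc(state), dim=-1).gather(-1, z.unsqueeze(-1)).squeeze(-1)
    return logq - torch.log(torch.tensor(1.0 / n_strategies))   # subtract log of uniform prior

disc = StrategyDiscriminator(state_dim=8, n_strategies=4)
states = torch.randn(16, 8)
z = torch.randint(0, 4, (16,))
r_int = diversity_reward(disc, states, z, 4)          # added to the extrinsic reward signal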
\ No newline at end of file diff --git a/data/2024/aaai/DHGCN: Dynamic Hop Graph Convolution Network for Self-Supervised Point Cloud Learning b/data/2024/aaai/DHGCN: Dynamic Hop Graph Convolution Network for Self-Supervised Point Cloud Learning new file mode 100644 index 0000000000..515ccfdfe2 --- /dev/null +++ b/data/2024/aaai/DHGCN: Dynamic Hop Graph Convolution Network for Self-Supervised Point Cloud Learning @@ -0,0 +1 @@ +Recent works attempt to extend Graph Convolution Networks (GCNs) to point clouds for classification and segmentation tasks. These works tend to sample and group points to create smaller point sets locally and mainly focus on extracting local features through GCNs, while ignoring the relationship between point sets. In this paper, we propose the Dynamic Hop Graph Convolution Network (DHGCN) for explicitly learning the contextual relationships between the voxelized point parts, which are treated as graph nodes. Motivated by the intuition that the contextual information between point parts lies in the pairwise adjacent relationship, which can be depicted by the hop distance of the graph quantitatively, we devise a novel self-supervised part-level hop distance reconstruction task and design a novel loss function accordingly to facilitate training. In addition, we propose the Hop Graph Attention (HGA), which takes the learned hop distance as input for producing attention weights to allow edge features to contribute distinctively in aggregation. Eventually, the proposed DHGCN is a plug-and-play module that is compatible with point-based backbone networks. Comprehensive experiments on different backbones and tasks demonstrate that our self-supervised method achieves state-of-the-art performance. Our source codes are available at: https://github.com/Jinec98/DHGCN. \ No newline at end of file diff --git a/data/2024/aaai/DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection b/data/2024/aaai/DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection new file mode 100644 index 0000000000..358f412a75 --- /dev/null +++ b/data/2024/aaai/DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection @@ -0,0 +1 @@ +Vehicle-to-Everything (V2X) collaborative perception has recently gained significant attention due to its capability to enhance scene understanding by integrating information from various agents, e.g., vehicles, and infrastructure. However, current works often treat the information from each agent equally, ignoring the inherent domain gap caused by the utilization of different LiDAR sensors of each agent, thus leading to suboptimal performance. In this paper, we propose DI-V2X, that aims to learn Domain-Invariant representations through a new distillation framework to mitigate the domain discrepancy in the context of V2X 3D object detection. DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module. Specifically, DMA builds a domain-mixing 3D instance bank for the teacher and student models during training, resulting in aligned data representation. 
Next, PDD encourages the student models from different domains to gradually learn a domain-invariant feature representation towards the teacher, where the overlapping regions between agents are employed as guidance to facilitate the distillation process. Furthermore, DAF closes the domain gap between the students by incorporating calibration-aware domain-adaptive attention. Extensive experiments on the challenging DAIR-V2X and V2XSet benchmark datasets demonstrate that DI-V2X achieves remarkable performance, outperforming all the previous V2X models. Code is available at https://github.com/Serenos/DI-V2X. \ No newline at end of file diff --git a/data/2024/aaai/DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation b/data/2024/aaai/DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation new file mode 100644 index 0000000000..1431420b9d --- /dev/null +++ b/data/2024/aaai/DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation @@ -0,0 +1 @@ +Instruction-following is particularly crucial for large language models (LLMs) to support diverse user requests. While existing work has made progress in aligning LLMs with human preferences, evaluating their capabilities on instruction-following remains a challenge due to the complexity and diversity of real-world user instructions. While existing evaluation methods focus on general skills, they suffer from two main shortcomings, i.e., lack of fine-grained task-level evaluation and reliance on singular instruction expression. To address these problems, this paper introduces DINGO, a fine-grained and diverse instruction-following evaluation dataset that has two main advantages: (1) DINGO is based on a manually annotated, fine-grained and multi-level category tree with 130 nodes derived from real-world user requests; (2) DINGO includes diverse instructions, generated by both GPT-4 and human experts. Through extensive experiments, we demonstrate that DINGO can not only provide a more challenging and comprehensive evaluation for LLMs, but also provide task-level fine-grained directions to further improve LLMs. \ No newline at end of file diff --git a/data/2024/aaai/DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling b/data/2024/aaai/DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling new file mode 100644 index 0000000000..ba69b1cf36 --- /dev/null +++ b/data/2024/aaai/DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling @@ -0,0 +1 @@ +Many applications use computer vision to detect and count objects in massive image collections. However, automated methods may fail to deliver accurate counts, especially when the task is very difficult or requires a fast response time. For example, during disaster response, aid organizations aim to quickly count damaged buildings in satellite images to plan relief missions, but pre-trained building and damage detectors often perform poorly due to domain shifts. In such cases, there is a need for human-in-the-loop approaches to accurately count with minimal human effort. We propose DISCount -- a detector-based importance sampling framework for counting in large image collections. DISCount uses an imperfect detector and human screening to estimate low-variance unbiased counts. We propose techniques for counting over multiple spatial or temporal regions using a small amount of screening and estimate confidence intervals. 
This enables end-users to stop screening when estimates are sufficiently accurate, which is often the goal in real-world applications. We demonstrate our method with two applications: counting birds in radar imagery to understand responses to climate change, and counting damaged buildings in satellite imagery for damage assessment in regions struck by a natural disaster. On the technical side, we develop variance reduction techniques based on control variates and prove the (conditional) unbiasedness of the estimators. For the tasks we consider, DISCount leads to a 9-12x reduction in the labeling costs needed to obtain the same error rates as naive screening, and surpasses alternative covariate-based screening approaches. \ No newline at end of file diff --git a/data/2024/aaai/DIUSum: Dynamic Image Utilization for Multimodal Summarization b/data/2024/aaai/DIUSum: Dynamic Image Utilization for Multimodal Summarization new file mode 100644 index 0000000000..dff4a9d81c --- /dev/null +++ b/data/2024/aaai/DIUSum: Dynamic Image Utilization for Multimodal Summarization @@ -0,0 +1 @@ +Existing multimodal summarization approaches focus on fusing image features in the encoding process, ignoring the individualized needs for images when generating different summaries. However, whether intuitively or empirically, not all images can improve summary quality. Therefore, we propose a novel Dynamic Image Utilization framework for multimodal Summarization (DIUSum) to select and utilize valuable images for summarization. First, to predict whether an image helps produce a high-quality summary, we propose an image selector to score the usefulness of each image. Second, to dynamically utilize the multimodal information, we incorporate hard and soft guidance from the image selector. Under this guidance, the image information is plugged into the decoder to generate a summary. Experimental results show that DIUSum outperforms multiple strong baselines and achieves SOTA on two public multimodal summarization datasets. Further analysis demonstrates that the image selector can reflect how much the images improve summary quality. \ No newline at end of file diff --git a/data/2024/aaai/DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular Videos b/data/2024/aaai/DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular Videos new file mode 100644 index 0000000000..9b4f429380 --- /dev/null +++ b/data/2024/aaai/DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular Videos @@ -0,0 +1 @@ +Reconstructing a dynamic human with loose clothing is an important but difficult task. To address this challenge, we propose a method named DLCA-Recon to create human avatars from monocular videos. The distance from loose clothing to the underlying body rapidly changes in every frame when the human freely moves and acts. Previous methods lack effective geometric initialization and constraints for guiding the optimization of deformation to explain this dramatic change, resulting in discontinuous and incomplete reconstructed surfaces. To model the deformation more accurately, we propose to initialize an estimated 3D clothed human in the canonical space, as it is easier for deformation fields to learn from the clothed human than from SMPL. With both representations of explicit mesh and implicit SDF, we utilize the physical connection information between consecutive frames and propose a dynamic deformation field (DDF) to optimize deformation fields.
DDF accounts for contributive forces on loose clothing to enhance the interpretability of deformations and effectively capture the free movement of loose clothing. Moreover, we propagate SMPL skinning weights to each individual and refine pose and skinning weights during the optimization to improve the skinning transformation. Based on the more reasonable initialization and DDF, we can simulate real-world physics more accurately. Extensive experiments on public and our own datasets validate that our method can produce superior results for humans with loose clothing compared to SOTA methods. \ No newline at end of file diff --git a/data/2024/aaai/DME: Unveiling the Bias for Better Generalized Monocular Depth Estimation b/data/2024/aaai/DME: Unveiling the Bias for Better Generalized Monocular Depth Estimation new file mode 100644 index 0000000000..a3552de1c7 --- /dev/null +++ b/data/2024/aaai/DME: Unveiling the Bias for Better Generalized Monocular Depth Estimation @@ -0,0 +1 @@ +This paper aims to design monocular depth estimation models with better generalization abilities. To this end, we have conducted a quantitative analysis and discovered two important insights. First, the Simulation Correlation phenomenon, commonly seen in long-tailed classification problems, also exists in monocular depth estimation, indicating that the imbalanced depth distribution in training data may be the cause of limited generalization ability. Second, the imbalanced and long-tail distribution of depth values extends beyond the dataset scale, and also manifests within each individual image, further exacerbating the challenge of monocular depth estimation. Motivated by the above findings, we propose the Distance-aware Multi-Expert (DME) depth estimation model. Unlike prior methods that handle different depth ranges indiscriminately, DME adopts a divide-and-conquer philosophy where each expert is responsible for depth estimation of regions within a specific depth range. As such, the depth distribution seen by each expert is more uniform and can be more easily predicted. A pixel-level routing module is further designed and learned to stitch the predictions of all experts into the final depth map. Experiments show that DME achieves state-of-the-art performance on both NYU-Depth v2 and KITTI, and also delivers favorable zero-shot generalization capability on unseen datasets. \ No newline at end of file diff --git a/data/2024/aaai/DMMR: Cross-Subject Domain Generalization for EEG-Based Emotion Recognition via Denoising Mixed Mutual Reconstruction b/data/2024/aaai/DMMR: Cross-Subject Domain Generalization for EEG-Based Emotion Recognition via Denoising Mixed Mutual Reconstruction new file mode 100644 index 0000000000..18b5f23235 --- /dev/null +++ b/data/2024/aaai/DMMR: Cross-Subject Domain Generalization for EEG-Based Emotion Recognition via Denoising Mixed Mutual Reconstruction @@ -0,0 +1 @@ +Electroencephalography (EEG) has proven to be effective in emotion analysis. However, current methods struggle with individual variations, complicating the generalization of models trained on data from source subjects to unseen target subjects. To tackle this issue, we propose the Denoising Mixed Mutual Reconstruction (DMMR) model, which employs a two-stage approach of pre-training followed by fine-tuning.
During the pre-training phase, DMMR leverages self-supervised learning through a multi-decoder autoencoder, which encodes and reconstructs features of one subject, aiming to generate features resembling those from other subjects within the same category, thereby encouraging the encoder to learn subject-invariant features. We introduce a hidden-layer mixed data augmentation approach to mitigate the limitations posed by the scarcity of source data, thereby extending the method to a two-stage process. To bolster stability against noise, we incorporate a noise injection method, named “Time Steps Shuffling”, applied to the input data. During the fine-tuning phase, an emotion classifier is integrated to extract emotion-related features. Experimental accuracy on the SEED and SEED-IV datasets reached 88.27% (±5.62) and 72.70% (±8.01), respectively, demonstrating state-of-the-art and comparable performance, thereby showcasing the superiority of DMMR. The proposed data augmentation and noise injection methods were observed to complementarily enhance accuracy and stability, thus alleviating the aforementioned issues. \ No newline at end of file diff --git a/data/2024/aaai/DNIT: Enhancing Day-Night Image-to-Image Translation through Fine-Grained Feature Handling (Student Abstract) b/data/2024/aaai/DNIT: Enhancing Day-Night Image-to-Image Translation through Fine-Grained Feature Handling (Student Abstract) new file mode 100644 index 0000000000..730812bf13 --- /dev/null +++ b/data/2024/aaai/DNIT: Enhancing Day-Night Image-to-Image Translation through Fine-Grained Feature Handling (Student Abstract) @@ -0,0 +1 @@ +Existing image-to-image translation methods perform less satisfactorily in the "day-night" domain due to insufficient study of scene features. To address this problem, we propose DNIT, which performs fine-grained handling of features through a nighttime image preprocessing (NIP) module and an edge fusion detection (EFD) module. The NIP module enhances brightness while minimizing noise, facilitating the extraction of content and style features. Meanwhile, the EFD module utilizes two types of edge images as additional constraints to optimize the generator. Experimental results show that we can generate more realistic and higher-quality images compared to other methods, proving the effectiveness of our DNIT. \ No newline at end of file diff --git a/data/2024/aaai/DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding b/data/2024/aaai/DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding new file mode 100644 index 0000000000..ed46a3e770 --- /dev/null +++ b/data/2024/aaai/DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding @@ -0,0 +1 @@ +Point scene understanding is a challenging task on real-world scene point clouds, which aims at segmenting each object, estimating its pose, and reconstructing its mesh simultaneously. The recent state-of-the-art method first segments each object and then processes them independently with multiple stages for the different sub-tasks. This leads to a complex pipeline to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner.
Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries while taking their relationships into account. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to semantic information and geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to make full use of the supervision from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR. \ No newline at end of file diff --git a/data/2024/aaai/DOGE-Train: Discrete Optimization on GPU with End-to-End Training b/data/2024/aaai/DOGE-Train: Discrete Optimization on GPU with End-to-End Training new file mode 100644 index 0000000000..2a278fe073 --- /dev/null +++ b/data/2024/aaai/DOGE-Train: Discrete Optimization on GPU with End-to-End Training @@ -0,0 +1 @@ +We present a fast, scalable, data-driven approach for solving relaxations of 0-1 integer linear programs. We use a combination of graph neural networks (GNNs) and a Lagrange decomposition-based algorithm. We make the latter differentiable for end-to-end training and use GNNs to predict its algorithmic parameters. This allows us to retain the algorithm's theoretical properties, including dual feasibility and a guaranteed non-decrease in the lower bound, while improving it via training. We overcome suboptimal fixed points of the basic solver by additional non-parametric GNN update steps that maintain dual feasibility. For training, we use an unsupervised loss. We train on smaller problems and test on larger ones, showing strong generalization performance with a GNN comprising only around 10k parameters. Our solver achieves significantly faster performance and better dual objectives than its non-learned version, achieving close to optimal objective values of LP relaxations of very large structured prediction problems and on selected combinatorial ones. In particular, we achieve better objective values than specialized approximate solvers for specific problem classes while retaining their efficiency. Our solver has better any-time performance over a large time period compared to a commercial solver. \ No newline at end of file diff --git a/data/2024/aaai/DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction) b/data/2024/aaai/DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction) new file mode 100644 index 0000000000..df74839645 --- /dev/null +++ b/data/2024/aaai/DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction) @@ -0,0 +1 @@ +The Adam optimizer is a popular choice in contemporary deep learning due to its strong empirical performance. However, we observe that in privacy-sensitive scenarios, the traditional use of Differential Privacy (DP) with the Adam optimizer leads to sub-optimal performance on several tasks. We find that this performance degradation is due to a DP bias in Adam's second moment estimator, introduced by the addition of independent noise in the gradient computation to enforce DP guarantees. This DP bias leads to a different scaling for low-variance parameter updates that is inconsistent with the behavior of non-private Adam and with Adam's sign descent interpretation.
We propose the DP-AdamBC optimization algorithm, which corrects for the bias in the second moment estimation and retrieves the expected behaviour of Adam. Empirically, DP-AdamBC significantly improves the optimization performance of DP-Adam by up to 3.5% in final accuracy in image, text, and graph node classification tasks. \ No newline at end of file diff --git a/data/2024/aaai/DPA-P2PNet: Deformable Proposal-Aware P2PNet for Accurate Point-Based Cell Detection b/data/2024/aaai/DPA-P2PNet: Deformable Proposal-Aware P2PNet for Accurate Point-Based Cell Detection new file mode 100644 index 0000000000..fb60c067a0 --- /dev/null +++ b/data/2024/aaai/DPA-P2PNet: Deformable Proposal-Aware P2PNet for Accurate Point-Based Cell Detection @@ -0,0 +1 @@ +Point-based cell detection (PCD), which pursues high-performance cell sensing under low-cost data annotation, has garnered increased attention in computational pathology community. Unlike mainstream PCD methods that rely on intermediate density map representations, the Point-to-Point network (P2PNet) has recently emerged as an end-to-end solution for PCD, demonstrating impressive cell detection accuracy and efficiency. Nevertheless, P2PNet is limited to decoding from a single-level feature map due to the scale-agnostic property of point proposals, which is insufficient to leverage multi-scale information. Moreover, the spatial distribution of pre-set point proposals is biased from that of cells, leading to inaccurate cell localization. To lift these limitations, we present DPA-P2PNet in this work. The proposed method directly extracts multi-scale features for decoding according to the coordinates of point proposals on hierarchical feature maps. On this basis, we further devise deformable point proposals to mitigate the positional bias between proposals and potential cells to promote cell localization. Inspired by practical pathological diagnosis that usually combines high-level tissue structure and low-level cell morphology for accurate cell classification, we propose a multi-field-of-view (mFoV) variant of DPA-P2PNet to accommodate additional large FoV images with tissue information as model input. Finally, we execute the first self-supervised pre-training on immunohistochemistry histopathology image data and evaluate the suitability of four representative self-supervised methods on the PCD task. Experimental results on three benchmarks and a large-scale and real-world interval dataset demonstrate the superiority of our proposed models over the state-of-the-art counterparts. Codes and pre-trained weights are available at https://github.com/windygoo/DPA-P2PNet. \ No newline at end of file diff --git a/data/2024/aaai/DQSSA: A Quantum-Inspired Solution for Maximizing Influence in Online Social Networks (Student Abstract) b/data/2024/aaai/DQSSA: A Quantum-Inspired Solution for Maximizing Influence in Online Social Networks (Student Abstract) new file mode 100644 index 0000000000..3929908cdc --- /dev/null +++ b/data/2024/aaai/DQSSA: A Quantum-Inspired Solution for Maximizing Influence in Online Social Networks (Student Abstract) @@ -0,0 +1 @@ +Influence Maximization is the task of selecting optimal nodes maximising the influence spread in social networks. This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks. By discretizing meta-heuristic algorithms and infusing them with quantum-inspired enhancements, we address issues like premature convergence and low efficacy. 
The proposed method, guided by quantum principles, offers a promising solution for Influence Maximisation. Experiments on four real-world datasets reveal DQSSA's superior performance as compared to established cutting-edge algorithms. \ No newline at end of file diff --git a/data/2024/aaai/DR-Label: Label Deconstruction and Reconstruction of GNN Models for Catalysis Systems b/data/2024/aaai/DR-Label: Label Deconstruction and Reconstruction of GNN Models for Catalysis Systems new file mode 100644 index 0000000000..53f59e4dae --- /dev/null +++ b/data/2024/aaai/DR-Label: Label Deconstruction and Reconstruction of GNN Models for Catalysis Systems @@ -0,0 +1 @@ +Attaining the equilibrium geometry of a catalyst-adsorbate system is key to fundamentally assessing its effective properties, such as adsorption energy. While machine learning methods with advanced representation or supervision strategies have been applied to boost and guide the relaxation processes of catalysis systems, existing methods that produce linearly aggregated geometry predictions are susceptible to edge representations ambiguity, and are therefore vulnerable to graph variations. In this paper, we present a novel graph neural network (GNN) supervision and prediction strategy DR-Label. Our approach mitigates the multiplicity of solutions in edge representation and encourages model predictions that are independent of graph structural variations. DR-Label first Deconstructs finer-grained equilibrium state information to the model by projecting the node-level supervision signal to each edge. Reversely, the model Reconstructs a more robust equilibrium state prediction by converting edge-level predictions to node-level via a sphere-fitting algorithm. When applied to three fundamentally different models, DR-Label consistently enhanced performance. Leveraging the graph structure invariance of the DR-Label strategy, we further propose DRFormer, which applied explicit intermediate positional update and achieves a new state-of-the-art performance on the Open Catalyst 2020 (OC20) dataset and the Cu-based single-atom alloys CO adsorption (SAA) dataset. We expect our work to highlight vital principles for advancing geometric GNN models for catalysis systems and beyond. Our code is available at https://github.com/bowenwang77/DR-Label \ No newline at end of file diff --git a/data/2024/aaai/DRF: Improving Certified Robustness via Distributional Robustness Framework b/data/2024/aaai/DRF: Improving Certified Robustness via Distributional Robustness Framework new file mode 100644 index 0000000000..5abea6f378 --- /dev/null +++ b/data/2024/aaai/DRF: Improving Certified Robustness via Distributional Robustness Framework @@ -0,0 +1 @@ +Randomized smoothing (RS) has provided state-of-the-art (SOTA) certified robustness against adversarial perturbations for large neural networks. Among studies in this field, methods based on adversarial training (AT) achieve remarkably robust performance by applying adversarial examples to construct the smoothed classifier. These AT-based RS methods typically seek a pointwise adversary that generates the worst-case adversarial examples by perturbing each input independently. However, there are unexplored benefits to considering such adversarial robustness across the entire data distribution. To this end, we provide a novel framework called DRF, which connects AT-based RS methods with distributional robustness (DR), and show that these methods are special cases of their counterparts in our framework. 
Due to the advantages conferred by DR, our framework can control the trade-off between the clean accuracy and certified robustness of smoothed classifiers to a significant extent. Our experiments demonstrate that DRF can substantially improve the certified robustness of AT-based RS. \ No newline at end of file diff --git a/data/2024/aaai/DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning b/data/2024/aaai/DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning new file mode 100644 index 0000000000..adf8bc0147 --- /dev/null +++ b/data/2024/aaai/DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning @@ -0,0 +1 @@ +Class-incremental learning (CIL) under an exemplar-free constraint has presented a significant challenge. Existing methods adhering to this constraint are prone to catastrophic forgetting, far more so than replay-based techniques that retain access to past samples. In this paper, to solve the exemplar-free CIL problem, we propose a Dual-Stream Analytic Learning (DS-AL) approach. The DS-AL contains a main stream offering an analytical (i.e., closed-form) linear solution, and a compensation stream improving the inherent under-fitting limitation due to adopting linear mapping. The main stream redefines the CIL problem into a Concatenated Recursive Least Squares (C-RLS) task, allowing an equivalence between the CIL and its joint-learning counterpart. The compensation stream is governed by a Dual-Activation Compensation (DAC) module. This module re-activates the embedding with a different activation function from the main stream one, and seeks fitting compensation by projecting the embedding to the null space of the main stream's linear mapping. Empirical results demonstrate that the DS-AL, despite being an exemplar-free technique, delivers performance comparable with or better than that of replay-based methods across various datasets, including CIFAR-100, ImageNet-100 and ImageNet-Full. Additionally, the C-RLS' equivalent property allows the DS-AL to execute CIL in a phase-invariant manner. This is evidenced by a never-before-seen 500-phase CIL ImageNet task, which performs on a level identical to a 5-phase one. Our codes are available at https://github.com/ZHUANGHP/Analytic-continual-learning. \ No newline at end of file diff --git "a/data/2024/aaai/DSD\302\262: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?" "b/data/2024/aaai/DSD\302\262: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?" new file mode 100644 index 0000000000..c27b78065a --- /dev/null +++ "b/data/2024/aaai/DSD\302\262: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?" @@ -0,0 +1,3 @@ +Neoteric works have shown that modern deep learning models can exhibit a sparse double descent phenomenon. Indeed, as the sparsity of the model increases, the test performance first worsens since the model is overfitting the training data; then, the overfitting reduces, leading to an improvement in performance, and finally, the model begins to forget critical information, resulting in underfitting. Such a behavior prevents using traditional early stop criteria. + +In this work, we have three key contributions. First, we propose a learning framework that avoids such a phenomenon and improves generalization. Second, we introduce an entropy measure providing more insights into the insurgence of this phenomenon and enabling the use of traditional stop criteria. 
Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at https://github.com/VGCQ/DSD2. \ No newline at end of file diff --git a/data/2024/aaai/DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification b/data/2024/aaai/DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification new file mode 100644 index 0000000000..a2af54700f --- /dev/null +++ b/data/2024/aaai/DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification @@ -0,0 +1,3 @@ +Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks following their wide adoption in the computer vision domain. +Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored for the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer) that facilitates interactions across time, frequency, spatial, and channel dimensions. +The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new benchmarks for state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pretrained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available on https://github.com/ta012/DTFAT.git \ No newline at end of file diff --git a/data/2024/aaai/DTL: Disentangled Transfer Learning for Visual Recognition b/data/2024/aaai/DTL: Disentangled Transfer Learning for Visual Recognition new file mode 100644 index 0000000000..0c069f16e0 --- /dev/null +++ b/data/2024/aaai/DTL: Disentangled Transfer Learning for Visual Recognition @@ -0,0 +1 @@ +When pre-trained models become rapidly larger, the cost of fine-tuning on downstream tasks steadily increases, too. To economically fine-tune these models, parameter-efficient transfer learning (PETL) is proposed, which only tunes a tiny subset of trainable parameters to efficiently learn quality representations. However, current PETL methods are facing the dilemma that during training the GPU memory footprint is not effectively reduced as trainable parameters. PETL will likely fail, too, if the full fine-tuning encounters the out-of-GPU-memory issue. This phenomenon happens because trainable parameters from these methods are generally entangled with the backbone, such that a lot of intermediate states have to be stored in GPU memory for gradient propagation. To alleviate this problem, we introduce Disentangled Transfer Learning (DTL), which disentangles the trainable parameters from the backbone using a lightweight Compact Side Network (CSN). 
By progressively extracting task-specific information with a few low-rank linear mappings and appropriately adding the information back to the backbone, CSN effectively realizes knowledge transfer in various downstream tasks. We conducted extensive experiments to validate the effectiveness of our method. The proposed method not only reduces a large amount of GPU memory usage and trainable parameters, but also outperforms existing PETL methods by a significant margin in accuracy, achieving new state-of-the-art on several standard benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation b/data/2024/aaai/DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation new file mode 100644 index 0000000000..e5b5c26867 --- /dev/null +++ b/data/2024/aaai/DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation @@ -0,0 +1 @@ +Despite the great potential in capturing long-range dependency, one rarely-explored underlying issue of transformer in medical image segmentation is attention collapse, making it often degenerate into a bypass module in CNN-Transformer hybrid architectures. This is due to the high computational complexity of vision transformers requiring extensive training data while well-annotated medical image data is relatively limited, resulting in poor convergence. In this paper, we propose a plug-n-play transformer block with dynamic token merging, named DTMFormer, to avoid building long-range dependency on redundant and duplicated tokens and thus pursue better convergence. Specifically, DTMFormer consists of an attention-guided token merging (ATM) module to adaptively cluster tokens into fewer semantic tokens based on feature and dependency similarity and a light token reconstruction module to fuse ordinary and semantic tokens. In this way, as self-attention in ATM is calculated based on fewer tokens, DTMFormer is of lower complexity and more friendly to converge. Extensive experiments on publicly-available datasets demonstrate the effectiveness of DTMFormer working as a plug-n-play module for simultaneous complexity reduction and performance improvement. We believe it will inspire future work on rethinking transformers in medical image segmentation. Code: https://github.com/iam-nacl/DTMFormer. \ No newline at end of file diff --git a/data/2024/aaai/DUEL: Duplicate Elimination on Active Memory for Self-Supervised Class-Imbalanced Learning b/data/2024/aaai/DUEL: Duplicate Elimination on Active Memory for Self-Supervised Class-Imbalanced Learning new file mode 100644 index 0000000000..b6512814d1 --- /dev/null +++ b/data/2024/aaai/DUEL: Duplicate Elimination on Active Memory for Self-Supervised Class-Imbalanced Learning @@ -0,0 +1 @@ +Recent machine learning algorithms have been developed using well-curated datasets, which often require substantial cost and resources. On the other hand, the direct use of raw data often leads to overfitting towards frequently occurring class information. To address class imbalances cost-efficiently, we propose an active data filtering process during self-supervised pre-training in our novel framework, Duplicate Elimination (DUEL). This framework integrates an active memory inspired by human working memory and introduces distinctiveness information, which measures the diversity of the data in the memory, to optimize both the feature extractor and the memory. 
The DUEL policy, which replaces the most duplicated data with new samples, aims to enhance the distinctiveness information in the memory and thereby mitigate class imbalances. We validate the effectiveness of the DUEL framework in class-imbalanced environments, demonstrating its robustness and providing reliable results in downstream tasks. We also analyze the role of the DUEL policy in the training process through various metrics and visualizations. \ No newline at end of file diff --git a/data/2024/aaai/DVANet: Disentangling View and Action Features for Multi-View Action Recognition b/data/2024/aaai/DVANet: Disentangling View and Action Features for Multi-View Action Recognition new file mode 100644 index 0000000000..0bd2087d60 --- /dev/null +++ b/data/2024/aaai/DVANet: Disentangling View and Action Features for Multi-View Action Recognition @@ -0,0 +1 @@ +In this work, we present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video. When trying to classify action instances captured from multiple viewpoints, there is a higher degree of difficulty due to the difference in background, occlusion, and visibility of the captured action from different camera angles. To tackle the various problems introduced in multi-view action recognition, we propose a novel configuration of learnable transformer decoder queries, in conjunction with two supervised contrastive losses, to enforce the learning of action features that are robust to shifts in viewpoints. Our disentangled feature learning occurs in two stages: the transformer decoder uses separate queries to separately learn action and view information, which are then further disentangled using our two contrastive losses. We show that our model and method of training significantly outperforms all other uni-modal models on four multi-view action recognition datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD, and N-UCLA. Compared to previous RGB works, we see maximal improvements of 1.5%, 4.8%, 2.2%, and 4.8% on each dataset, respectively. Our code can be found here: https://github.com/NyleSiddiqui/MultiView_Actions \ No newline at end of file diff --git a/data/2024/aaai/DVSAI: Diverse View-Shared Anchors Based Incomplete Multi-View Clustering b/data/2024/aaai/DVSAI: Diverse View-Shared Anchors Based Incomplete Multi-View Clustering new file mode 100644 index 0000000000..0443e4e16d --- /dev/null +++ b/data/2024/aaai/DVSAI: Diverse View-Shared Anchors Based Incomplete Multi-View Clustering @@ -0,0 +1,2 @@ +In numerous real-world applications, it is quite common that sample information is partially available for some views due to machine breakdown or sensor failure, causing the problem of incomplete multi-view clustering (IMVC). While several IMVC approaches using view-shared anchors have successfully achieved pleasing performance improvement, (1) they generally construct anchors with only one dimension, which could deteriorate the multi-view diversity, bringing about serious information loss; (2) the constructed anchors are typically with a single size, which could not sufficiently characterize the distribution of the whole samples, leading to limited clustering performance. For generating view-shared anchors with multi-dimension and multi-size for IMVC, we design a novel framework called Diverse View-Shared Anchors based Incomplete multi-view clustering (DVSAI). Concretely, we associate each partial view with several potential spaces. 
+In each space, we enable anchors to communicate among views and generate the view-shared anchors with space-specific dimension and size. Consequently, spaces with various scales make the generated view-shared anchors enjoy diverse dimensions and sizes. Subsequently, we devise an integration scheme with linear computational and memory expenditures to integrate the outputted multi-scale unified anchor graphs such that running spectral algorithm generates the spectral embedding. Afterwards, we theoretically demonstrate that DVSAI owns linear time and space costs, thus well-suited for tackling large-size datasets. Finally, comprehensive experiments confirm the effectiveness and advantages of DVSAI. \ No newline at end of file diff --git a/data/2024/aaai/DanceAnyWay: Synthesizing Beat-Guided 3D Dances with Randomized Temporal Contrastive Learning b/data/2024/aaai/DanceAnyWay: Synthesizing Beat-Guided 3D Dances with Randomized Temporal Contrastive Learning new file mode 100644 index 0000000000..52e31f8713 --- /dev/null +++ b/data/2024/aaai/DanceAnyWay: Synthesizing Beat-Guided 3D Dances with Randomized Temporal Contrastive Learning @@ -0,0 +1 @@ +We present DanceAnyWay, a generative learning method to synthesize beat-guided dances of 3D human characters synchronized with music. Our method learns to disentangle the dance movements at the beat frames from the dance movements at all the remaining frames by operating at two hierarchical levels. At the coarser "beat" level, it encodes the rhythm, pitch, and melody information of the input music via dedicated feature representations only at the beat frames. It leverages them to synthesize the beat poses of the target dances using a sequence-to-sequence learning framework. At the finer "repletion" level, our method encodes similar rhythm, pitch, and melody information from all the frames of the input music via dedicated feature representations. It generates the full dance sequences by combining the synthesized beat and repletion poses and enforcing plausibility through an adversarial learning framework. Our training paradigm also enforces fine-grained diversity in the synthesized dances through a randomized temporal contrastive loss, which ensures different segments of the dance sequences have different movements and avoids motion freezing or collapsing to repetitive movements. We evaluate the performance of our approach through extensive experiments on the benchmark AIST++ dataset and observe improvements of about 7%-12% in motion quality metrics and 1.5%-4% in motion diversity metrics over the current baselines, respectively. We also conducted a user study to evaluate the visual quality of our synthesized dances. We noted that, on average, the samples generated by our method were about 9-48% more preferred by the participants and had a 4-27% better five-point Likert-scale score over the best available current baseline in terms of motion quality and synchronization. Our source code and project page are available at https://github.com/aneeshbhattacharya/DanceAnyWay. 
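As a hedged illustration of the randomized temporal contrastive idea described in the DanceAnyWay abstract above, the following minimal PyTorch-style sketch contrasts randomly sampled segments of a single synthesized motion sequence so that distinct segments remain distinguishable; the function name, the crop construction, and the InfoNCE formulation are our own assumptions for illustration, not the paper's exact loss.

import torch
import torch.nn.functional as F

def randomized_temporal_contrastive_loss(frame_emb, num_segments=4, crop_len=16, tau=0.1):
    # frame_emb: (T, D) per-frame pose embeddings of one synthesized dance,
    # with T assumed larger than crop_len + 2.
    # Two slightly shifted crops of the same random window form a positive pair;
    # crops from the other windows act as negatives, which discourages frozen
    # or repetitive motion across segments.
    T, _ = frame_emb.shape
    anchors, positives = [], []
    for _ in range(num_segments):
        s = torch.randint(0, T - crop_len - 2, (1,)).item()
        anchors.append(frame_emb[s:s + crop_len].mean(0))
        positives.append(frame_emb[s + 2:s + 2 + crop_len].mean(0))
    a = F.normalize(torch.stack(anchors), dim=-1)    # (num_segments, D)
    p = F.normalize(torch.stack(positives), dim=-1)  # (num_segments, D)
    logits = a @ p.t() / tau                         # diagonal entries are the positive pairs
    return F.cross_entropy(logits, torch.arange(num_segments))

In practice such a term would be added to the generator's other losses with a small weight; it only requires per-frame embeddings, so it is agnostic to the exact pose representation.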
\ No newline at end of file diff --git a/data/2024/aaai/DanceMVP: Self-Supervised Learning for Multi-Task Primitive-Based Dance Performance Assessment via Transformer Text Prompting b/data/2024/aaai/DanceMVP: Self-Supervised Learning for Multi-Task Primitive-Based Dance Performance Assessment via Transformer Text Prompting new file mode 100644 index 0000000000..ce52e65945 --- /dev/null +++ b/data/2024/aaai/DanceMVP: Self-Supervised Learning for Multi-Task Primitive-Based Dance Performance Assessment via Transformer Text Prompting @@ -0,0 +1 @@ +Dance is generally considered to be complex for most people, as it requires coordination of numerous body motions and accurate responses to the musical content and rhythm. Studies on automatic dance performance assessment could help people improve their sensorimotor skills and promote research in many fields, including human motion analysis and motion generation. Recent papers on dance performance assessment usually evaluate simple dance motions with a single task - estimating final performance scores. In this paper, we propose DanceMVP: multi-task dance performance assessment via text prompting that solves three related tasks - (i) dance vocabulary recognition, (ii) dance performance scoring and (iii) dance rhythm evaluation. In the pre-training phase, we contrastively learn the primitive-based features of complex dance motion and music using the InfoNCE loss. For the downstream task, we propose a transformer-based text prompter to perform multi-task evaluations for the three proposed assessment tasks. Also, we build a multimodal dance-music dataset named ImperialDance. The novelty of our ImperialDance is that it contains dance motions for diverse expertise levels and a significant amount of repeated dance sequences for the same choreography, allowing us to keep track of dance performance progression. Qualitative results show that our pre-trained feature representation can cluster dance pieces across different dance genres, choreographies, expertise levels and primitives, and generalizes well on both our dataset and other dance-music datasets. The downstream experiments demonstrate the robustness and improvement of our method over several ablations and baselines across all three tasks, as well as its ability to monitor users' dance level progression. \ No newline at end of file diff --git a/data/2024/aaai/Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification b/data/2024/aaai/Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification new file mode 100644 index 0000000000..222c8d5139 --- /dev/null +++ b/data/2024/aaai/Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification @@ -0,0 +1 @@ +Vision-language foundation models have been incredibly successful in a wide range of downstream computer vision tasks using adaptation methods. However, due to the high cost of obtaining pre-training datasets, pairs with weak image-text correlation exist in large numbers in the data. We call them weak-paired samples. Due to the limitations of these weak-paired samples, the pre-trained model is unable to mine all the knowledge from the pre-training data. The existing adaptation methods do not consider this missing knowledge, which may cause crucial task-related knowledge for the downstream tasks to be ignored. To address this issue, we propose a new adaptation framework called Data Adaptive Traceback (DAT).
Specifically, we utilize a zero-shot-based method to extract the subset of the pre-training data that is most relevant to the downstream task and use it to support that task. Furthermore, we adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning. We conduct extensive experiments showing that our proposed DAT approach meaningfully improves performance on various benchmark datasets over traditional adaptation methods. \ No newline at end of file diff --git a/data/2024/aaai/Data Augmented Graph Neural Networks for Personality Detection b/data/2024/aaai/Data Augmented Graph Neural Networks for Personality Detection new file mode 100644 index 0000000000..0095c929f3 --- /dev/null +++ b/data/2024/aaai/Data Augmented Graph Neural Networks for Personality Detection @@ -0,0 +1 @@ +Personality detection is a fundamental task for user psychology research. One of the biggest challenges in personality detection lies in the limited quantity of labeled data, which is collected by having users complete personality questionnaires, a process that is very time-consuming and labor-intensive. Most of the existing works are mainly devoted to learning rich representations of posts based on labeled data. However, they still suffer from the inherent weakness of limited labels, which potentially restricts the capability of the model to deal with unseen data. In this paper, we construct a heterogeneous personality graph for each labeled and unlabeled user and develop a novel psycholinguistic augmented graph neural network to detect personality in a semi-supervised manner, namely Semi-PerGCN. Specifically, our model first explores a supervised Personality Graph Neural Network (PGNN) to refine labeled user representations on the heterogeneous graph. For the remaining massive unlabeled users, we utilize the empirical psychological knowledge of the Linguistic Inquiry and Word Count (LIWC) lexicon for multi-view graph augmentation and apply unsupervised graph consistency constraints on the parameter-shared PGNN. During learning on the finite set of labeled users, noise-invariant learning on a large number of unlabeled users is combined to enhance the generalization ability. Extensive experiments on three real-world datasets, Youtube, PAN2015, and MyPersonality, demonstrate the effectiveness of our Semi-PerGCN in personality detection, especially in scenarios with limited labeled users. \ No newline at end of file diff --git a/data/2024/aaai/Data Disparity and Temporal Unavailability Aware Asynchronous Federated Learning for Predictive Maintenance on Transportation Fleets b/data/2024/aaai/Data Disparity and Temporal Unavailability Aware Asynchronous Federated Learning for Predictive Maintenance on Transportation Fleets new file mode 100644 index 0000000000..4a0326098d --- /dev/null +++ b/data/2024/aaai/Data Disparity and Temporal Unavailability Aware Asynchronous Federated Learning for Predictive Maintenance on Transportation Fleets @@ -0,0 +1 @@ +Predictive maintenance has emerged as a critical application in modern transportation, leveraging sensor data to proactively forecast potential damage using machine learning. However, privacy concerns limit data sharing, making federated learning an appealing approach to preserve data privacy. Nevertheless, challenges arise due to disparities in data distribution and temporal unavailability caused by individual usage patterns in transportation.
In this paper, we present a novel asynchronous federated learning approach to address system heterogeneity and facilitate machine learning for predictive maintenance on transportation fleets. The approach introduces a novel data-disparity-aware aggregation scheme and a federated early stopping method for training. To validate the effectiveness of our approach, we evaluate it on two independent real-world datasets from the transportation domain: 1) oil dilution prediction of car combustion engines and 2) remaining lifetime prediction of plane turbofan engines. Our experiments show that we reliably outperform five state-of-the-art baselines, including federated and classical machine learning models. Moreover, we show that our approach generalises to various prediction model architectures. \ No newline at end of file diff --git a/data/2024/aaai/Data Distribution Distilled Generative Model for Generalized Zero-Shot Recognition b/data/2024/aaai/Data Distribution Distilled Generative Model for Generalized Zero-Shot Recognition new file mode 100644 index 0000000000..f5cabf843f --- /dev/null +++ b/data/2024/aaai/Data Distribution Distilled Generative Model for Generalized Zero-Shot Recognition @@ -0,0 +1 @@ +In the realm of Zero-Shot Learning (ZSL), we address biases in Generalized Zero-Shot Learning (GZSL) models, which favor seen data. To counter this, we introduce an end-to-end generative GZSL framework called D3GZSL. This framework treats seen and synthesized unseen data as in-distribution and out-of-distribution data, respectively, for a more balanced model. D3GZSL comprises two core modules: in-distribution dual space distillation (ID2SD) and out-of-distribution batch distillation (O2DBD). ID2SD aligns teacher-student outcomes in embedding and label spaces, enhancing learning coherence. O2DBD introduces low-dimensional out-of-distribution representations per batch sample, capturing shared structures between seen and unseen categories. Our approach demonstrates its effectiveness across established GZSL benchmarks, seamlessly integrating into mainstream generative frameworks. Extensive experiments consistently showcase that D3GZSL elevates the performance of existing generative GZSL methods, underscoring its potential to refine zero-shot learning practices. The code is available at: https://github.com/PJBQ/D3GZSL.git \ No newline at end of file diff --git a/data/2024/aaai/Data Efficient Paradigms for Personalized Assessment of Black-Box Taskable AI Systems b/data/2024/aaai/Data Efficient Paradigms for Personalized Assessment of Black-Box Taskable AI Systems new file mode 100644 index 0000000000..0995e860cf --- /dev/null +++ b/data/2024/aaai/Data Efficient Paradigms for Personalized Assessment of Black-Box Taskable AI Systems @@ -0,0 +1 @@ +The vast diversity of internal designs of taskable black-box AI systems and their nuanced zones of safe functionality make it difficult for a layperson to use them without unintended side effects. My dissertation focuses on developing paradigms that enable a user to assess and understand the limits of an AI system's safe operability. We develop a personalized AI assessment module that lets an AI system execute instruction sequences in simulators and answer queries about these executions. Our results show that such a primitive query-response interface is sufficient to efficiently derive a user-interpretable model of a system's capabilities.
\ No newline at end of file diff --git a/data/2024/aaai/Data Poisoning to Fake a Nash Equilibria for Markov Games b/data/2024/aaai/Data Poisoning to Fake a Nash Equilibria for Markov Games new file mode 100644 index 0000000000..9701b0f778 --- /dev/null +++ b/data/2024/aaai/Data Poisoning to Fake a Nash Equilibria for Markov Games @@ -0,0 +1 @@ +We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Nash equilibrium. The unique Nash set is central to poisoning attacks because the attack is successful if and only if data poisoning pushes all plausible games inside it. The unique Nash set generalizes the reward polytope commonly used in inverse reinforcement learning to MARL. For zero-sum Markov games, both the unique Nash set and the set of plausible games induced by data are polytopes in the Q function space. We exhibit a linear program to efficiently compute the optimal poisoning attack. Our work sheds light on the structure of data poisoning attacks on offline MARL, a necessary step before one can design more robust MARL algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Data Roaming and Quality Assessment for Composed Image Retrieval b/data/2024/aaai/Data Roaming and Quality Assessment for Composed Image Retrieval new file mode 100644 index 0000000000..e3f5c89c8e --- /dev/null +++ b/data/2024/aaai/Data Roaming and Quality Assessment for Composed Image Retrieval @@ -0,0 +1,2 @@ +The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller than other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times larger than existing ones. Pre-training on our LaSCo shows a noteworthy improvement in performance, even in the zero-shot setting. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity in queries. +We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks like FashionIQ and CIRR. \ No newline at end of file diff --git a/data/2024/aaai/Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance b/data/2024/aaai/Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance new file mode 100644 index 0000000000..ddcbae200d --- /dev/null +++ b/data/2024/aaai/Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance @@ -0,0 +1,2 @@ +Pretrained large models, particularly large language models, have garnered increasing attention, as they have demonstrated remarkable abilities through contextual learning.
Pretrained large models are increasingly recognized as fundamental tools for solving various tasks. However, the substantial computational demands of large models have dissuaded most product teams and individuals from running them. In such scenarios, to leverage the exceptional performance of large models, one must solely depend on costly APIs, further burdening product teams and individuals. On the other hand, despite the overall inferior performance of small models compared to large models, there are certain distributions where small models can achieve comparable or even superior results. For instance, during training, small models may become trapped in a local optimum that is unique to certain distributions, leading to superior performance. Hence, we propose Data Shunt (DS), a general paradigm for the collaboration of small and large models. DS not only substantially reduces the cost associated with deploying large models but also effectively enhances overall performance. Specifically, DS determines the shunting direction by evaluating the confidence level of small models. When the confidence level falls below a specific threshold, the input data is forwarded to large models. To further leverage the advantages of the small and large models, we introduce Prompt Pruning (PP) and 2-Stage Confidence Distillation (2CD), which facilitate mutual collaboration, leading to better results and less cost. +The remarkable performance across diverse modalities and tasks demonstrates the superiority of the proposed DS over large models. For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, and DS achieves an accuracy of 95.64%, while the cost has been reduced to only 31.18%. The code for the proposed method is provided for research purposes at https://github.com/Anfeather/Data-Shunt. \ No newline at end of file diff --git a/data/2024/aaai/Data-Augmented Curriculum Graph Neural Architecture Search under Distribution Shifts b/data/2024/aaai/Data-Augmented Curriculum Graph Neural Architecture Search under Distribution Shifts new file mode 100644 index 0000000000..c40fb10951 --- /dev/null +++ b/data/2024/aaai/Data-Augmented Curriculum Graph Neural Architecture Search under Distribution Shifts @@ -0,0 +1 @@ +Graph neural architecture search (NAS) has achieved great success in designing architectures for graph data processing. However, distribution shifts pose great challenges for graph NAS, since the optimal searched architectures for the training graph data may fail to generalize to the unseen test graph data. The sole prior work tackles this problem by customizing architectures for each graph instance through learning graph structural information, but fails to consider data augmentation during training, which has been proven by existing works to improve generalization. In this paper, we propose Data-augmented Curriculum Graph Neural Architecture Search (DCGAS), which learns an architecture customizer with good generalizability to data under distribution shifts. Specifically, we design an embedding-guided data generator, which can generate sufficient graphs for training to help the model better capture graph structural information. In addition, we design a two-factor uncertainty-based curriculum weighting strategy, which can evaluate the importance of data in enabling the model to learn key information in the real-world distribution and reweight the data during training.
Experimental results on synthetic datasets and real datasets with distribution shifts demonstrate that our proposed method learns generalizable mappings and outperforms existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Data-Driven Discovery of Design Specifications (Student Abstract) b/data/2024/aaai/Data-Driven Discovery of Design Specifications (Student Abstract) new file mode 100644 index 0000000000..f94f19c275 --- /dev/null +++ b/data/2024/aaai/Data-Driven Discovery of Design Specifications (Student Abstract) @@ -0,0 +1 @@ +Ensuring a machine learning model’s trustworthiness is crucial to prevent potential harm. One way to foster trust is through the formal verification of the model’s adherence to essential design requirements. However, this approach relies on well-defined, application-domain-centric criteria with which to test the model, and such specifications may be cumbersome to collect in practice. We propose a data-driven approach for creating specifications to evaluate a trained model effectively. Implementing this framework allows us to prove that the model will exhibit safe behavior while minimizing the false-positive prediction rate. This strategy enhances predictive accuracy and safety, providing deeper insight into the model’s strengths and weaknesses, and promotes trust through a systematic approach. \ No newline at end of file diff --git a/data/2024/aaai/Data-Driven Knowledge-Aware Inference of Private Information in Continuous Double Auctions b/data/2024/aaai/Data-Driven Knowledge-Aware Inference of Private Information in Continuous Double Auctions new file mode 100644 index 0000000000..764050e06f --- /dev/null +++ b/data/2024/aaai/Data-Driven Knowledge-Aware Inference of Private Information in Continuous Double Auctions @@ -0,0 +1 @@ +Inferring the private information of humans from their strategic behavioral data is crucial and challenging. The main approach is first obtaining human behavior functions (which map public information and human private information to behavior), enabling subsequent inference of private information from observed behavior. Most existing studies rely on strong equilibrium assumptions to obtain behavior functions. Our work focuses on continuous double auctions, where multiple traders with heterogeneous rationalities and beliefs dynamically trade commodities and deriving equilibria is generally intractable. We develop a knowledge-aware machine learning-based framework to infer each trader's private cost vectors for producing different units of its commodity. Our key idea is to learn behavior functions by incorporating the statistical knowledge about private costs given the observed trader asking behavior across the population. Specifically, we first use a neural network to characterize each trader's behavior function. Second, we leverage the statistical knowledge to derive the posterior distribution of each trader's private costs given its observed asks. Third, through designing a novel loss function, we utilize the knowledge-based posterior distributions to guide the learning of the neural network. We conduct extensive experiments on a large experimental dataset, and demonstrate the superior performance of our framework over baselines in inferring the private information of humans. 
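The knowledge-aware inference framework above is described only at a high level, so the following rough sketch is a hypothetical illustration of how a population-level prior over private costs could guide the learning of a behavior function; the discretized cost grid, the Gaussian observation model, and the posterior-weighted squared error are assumptions of this sketch, not the paper's actual derivation.

import torch
import torch.nn.functional as F

def knowledge_guided_step(behavior_net, public_info, observed_ask, cost_grid, prior_logprobs, sigma=1.0):
    # behavior_net(public_info, cost) -> predicted ask, shape (B,).
    # cost_grid: (K,) candidate private-cost values; prior_logprobs: (K,) log prior over them.
    preds = torch.stack(
        [behavior_net(public_info, c.expand_as(observed_ask)) for c in cost_grid], dim=-1)  # (B, K)
    log_lik = -0.5 * ((preds - observed_ask.unsqueeze(-1)) / sigma) ** 2
    posterior = F.softmax(log_lik + prior_logprobs, dim=-1)  # (B, K) posterior over candidate costs
    # Posterior-weighted fitting error: the network is pulled toward candidate costs
    # that both explain the observed asks and agree with the statistical knowledge.
    loss = (posterior.detach() * (preds - observed_ask.unsqueeze(-1)) ** 2).sum(-1).mean()
    return loss, posterior

The returned posterior also serves as the inference output: its expectation over cost_grid gives a point estimate of a trader's private cost for the observed ask.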
\ No newline at end of file diff --git a/data/2024/aaai/Data-Driven Structural Fire Risk Prediction for City Properties b/data/2024/aaai/Data-Driven Structural Fire Risk Prediction for City Properties new file mode 100644 index 0000000000..d4563d6a8d --- /dev/null +++ b/data/2024/aaai/Data-Driven Structural Fire Risk Prediction for City Properties @@ -0,0 +1 @@ +Fire Departments conduct inspections to prevent fires, but it is unclear how to best allocate their limited inspection resources across the properties in a city. Currently, they use their intuition and experience to decide on which properties to inspect and lack a data-driven approach that could lead to a more principled use of inspection resources. The main contribution of this paper is to investigate such an approach, based on machine learning, which predicts a fire risk score for properties in a city from historical fire-incident data. These scores can then be used to help prioritize inspection resources toward higher-risk properties. We present a case study using data from a South Dakota fire department which contains information about properties in a city along with records of fire incidents. We use this data, consisting of more than 72,000 properties, to train a machine learning model to predict fire risk and evaluate its ability to rank the fire risk of properties in the city. We conduct and analyze experiments with variations of XGBoost, which is an algorithm well-suited to the challenges in this application, including missing data and a highly skewed class distribution. Our evaluation of the model-generated rankings, based on ranking metrics, shows that the model significantly outperforms random rankings and other natural baselines. We also analyze the feature importance computed for the models, which provides further insight into the model behavior. This model has been integrated into an interface for displaying the rankings across a city and is ready for beta testing. \ No newline at end of file diff --git a/data/2024/aaai/Data-Efficient Graph Learning b/data/2024/aaai/Data-Efficient Graph Learning new file mode 100644 index 0000000000..9902138c09 --- /dev/null +++ b/data/2024/aaai/Data-Efficient Graph Learning @@ -0,0 +1 @@ +My research strives to develop fundamental graph-centric learning algorithms to reduce the need for human supervision in low-resource scenarios. The focus is on achieving effective and reliable data-efficient learning on graphs, which can be summarized into three facets: (1) graph weakly-supervised learning; (2) graph few-shot learning; and (3) graph self-supervised learning. \ No newline at end of file diff --git a/data/2024/aaai/Data-Free Generalized Zero-Shot Learning b/data/2024/aaai/Data-Free Generalized Zero-Shot Learning new file mode 100644 index 0000000000..b6d673e7c5 --- /dev/null +++ b/data/2024/aaai/Data-Free Generalized Zero-Shot Learning @@ -0,0 +1 @@ +Deep learning models have the ability to extract rich knowledge from large-scale datasets. However, the sharing of data has become increasingly challenging due to concerns regarding data copyright and privacy. Consequently, this hampers the effective transfer of knowledge from existing data to novel downstream tasks and concepts. Zero-shot learning (ZSL) approaches aim to recognize new classes by transferring semantic knowledge learned from base classes. 
However, traditional generative ZSL methods often require access to real images from base classes and rely on manually annotated attributes, which presents challenges in terms of data restrictions and model scalability. To this end, this paper tackles a challenging and practical problem dubbed data-free zero-shot learning (DFZSL), where only a classifier pre-trained on CLIP features of the base classes is available for zero-shot classification. Specifically, we propose a generic framework for DFZSL, which consists of three main components. Firstly, to recover the virtual features of the base data, we model the CLIP features of base class images as samples from a von Mises-Fisher (vMF) distribution based on the pre-trained classifier. Secondly, we leverage the text features of CLIP as low-cost semantic information and propose a feature-language prompt tuning (FLPT) method to further align the virtual image features and textual features. Thirdly, we train a conditional generative model using the well-aligned virtual image features and corresponding semantic text features, enabling the generation of new-class features and achieving better zero-shot generalization. Our framework has been evaluated on five commonly used benchmarks for generalized ZSL, as well as 11 benchmarks for the base-to-new ZSL. The results demonstrate the superiority and effectiveness of our approach. Our code is available at https://github.com/ylong4/DFZSL. \ No newline at end of file diff --git a/data/2024/aaai/Data-Free Hard-Label Robustness Stealing Attack b/data/2024/aaai/Data-Free Hard-Label Robustness Stealing Attack new file mode 100644 index 0000000000..bc2f0e51ff --- /dev/null +++ b/data/2024/aaai/Data-Free Hard-Label Robustness Stealing Attack @@ -0,0 +1 @@ +The popularity of Machine Learning as a Service (MLaaS) has led to increased concerns about Model Stealing Attacks (MSA), which aim to craft a clone model by querying MLaaS. Currently, most research on MSA assumes that MLaaS can provide soft labels and that the attacker has a proxy dataset with a similar distribution. However, this fails to encapsulate the more practical scenario where only hard labels are returned by MLaaS and the data distribution remains elusive. Furthermore, most existing work focuses solely on stealing the model accuracy, neglecting the model robustness, while robustness is essential in security-sensitive scenarios, e.g., face-scan payment. Notably, improving model robustness often necessitates the use of expensive techniques such as adversarial training, thereby further making stealing robustness a more lucrative prospect. In response to these identified gaps, we introduce a novel Data-Free Hard-Label Robustness Stealing (DFHL-RS) attack in this paper, which enables the stealing of both model accuracy and robustness by simply querying hard labels of the target model without the help of any natural data. Comprehensive experiments demonstrate the effectiveness of our method. The clone model achieves a clean accuracy of 77.86% and a robust accuracy of 39.51% against AutoAttack, which are only 4.71% and 8.40% lower than the target model on the CIFAR-10 dataset, significantly exceeding the baselines. Our code is available at: https://github.com/LetheSec/DFHL-RS-Attack. 
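A toy sketch of the hard-label query-and-fit loop that such stealing attacks build on; the full DFHL-RS method additionally synthesizes query data and targets robustness, which is not shown here. The names target_model, clone, and x_synth are placeholders introduced only for illustration.
```python
import torch
import torch.nn.functional as F

def train_clone_step(clone, target_model, x_synth, optimizer):
    """One hard-label stealing step: query the target for labels only,
    then fit the clone with cross-entropy on those labels."""
    with torch.no_grad():
        # In a real MLaaS setting only the top-1 label is returned;
        # here argmax over logits stands in for that API response.
        hard_labels = target_model(x_synth).argmax(dim=1)
    optimizer.zero_grad()
    loss = F.cross_entropy(clone(x_synth), hard_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```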
\ No newline at end of file diff --git a/data/2024/aaai/DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models b/data/2024/aaai/DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models new file mode 100644 index 0000000000..c02de63678 --- /dev/null +++ b/data/2024/aaai/DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models @@ -0,0 +1 @@ +Dataset sanitization is a widely adopted proactive defense against poisoning-based backdoor attacks, aimed at filtering out and removing poisoned samples from training datasets. However, existing methods have shown limited efficacy in countering the ever-evolving trigger functions, and often lead to considerable degradation of benign accuracy. In this paper, we propose DataElixir, a novel sanitization approach tailored to purify poisoned datasets. We leverage diffusion models to eliminate trigger features and restore benign features, thereby turning the poisoned samples into benign ones. Specifically, with multiple iterations of the forward and reverse process, we extract intermediary images and their predicted labels for each sample in the original dataset. Then, we identify anomalous samples in terms of the presence of label transition of the intermediary images, detect the target label by quantifying distribution discrepancy, select their purified images considering pixel and feature distance, and determine their ground-truth labels by training a benign model. Experiments conducted on 9 popular attacks demonstrate that DataElixir effectively mitigates various complex attacks while exerting minimal impact on benign accuracy, surpassing the performance of baseline defense methods. \ No newline at end of file diff --git a/data/2024/aaai/De-biased Attention Supervision for Text Classification with Causality b/data/2024/aaai/De-biased Attention Supervision for Text Classification with Causality new file mode 100644 index 0000000000..2fab3cfa55 --- /dev/null +++ b/data/2024/aaai/De-biased Attention Supervision for Text Classification with Causality @@ -0,0 +1 @@ +In text classification models, while the unsupervised attention mechanism can enhance performance, it often produces attention distributions that are puzzling to humans, such as assigning high weight to seemingly insignificant conjunctions. Recently, numerous studies have explored Attention Supervision (AS) to guide the model toward more interpretable attention distributions. However, such AS can impact classification performance, especially in specialized domains. In this paper, we address this issue from a causality perspective. Firstly, we leverage the causal graph to reveal two biases in the AS: 1) bias caused by the label distribution of the dataset; 2) bias caused by words' different occurrence ranges, i.e., some words occur across labels while others occur only under a particular label. We then propose a novel De-biased Attention Supervision (DAS) method to eliminate these biases with causal techniques. Specifically, we adopt backdoor adjustment on the label-caused bias and reduce the word-caused bias by subtracting the direct causal effect of the word. Through extensive experiments on two professional text classification datasets (e.g., medicine and law), we demonstrate that our method achieves improved classification accuracy along with more coherent attention distributions. 
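The label-caused bias can be pictured with textbook backdoor adjustment, P(A | do(w)) = sum_y P(A | w, y) P(y). The sketch below is a generic numpy version of that marginalization, not the paper's exact estimator; the attention statistics and label prior are assumed inputs.
```python
import numpy as np

def backdoor_adjusted_attention(att_given_word_label, label_prior):
    """Generic backdoor adjustment: P(att | do(word)) = sum_y P(att | word, y) * P(y).

    att_given_word_label: (num_words, num_labels) mean attention of each word under each label.
    label_prior: (num_labels,) marginal label distribution of the dataset.
    Returns one debiased attention target per word."""
    return att_given_word_label @ label_prior
```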
\ No newline at end of file diff --git a/data/2024/aaai/DeRDaVa: Deletion-Robust Data Valuation for Machine Learning b/data/2024/aaai/DeRDaVa: Deletion-Robust Data Valuation for Machine Learning new file mode 100644 index 0000000000..bef7df4c7b --- /dev/null +++ b/data/2024/aaai/DeRDaVa: Deletion-Robust Data Valuation for Machine Learning @@ -0,0 +1 @@ +Data valuation is concerned with determining a fair valuation of data from data sources to compensate them or to identify training examples that are the most or least useful for predictions. With the rising interest in personal data ownership and data protection regulations, model owners will likely have to fulfil more data deletion requests. This raises issues that have not been addressed by existing works: Are the data valuation scores still fair with deletions? Must the scores be expensively recomputed? The answer is no. To avoid recomputations, we propose using our data valuation framework DeRDaVa upfront for valuing each data source's contribution to preserving robust model performance after anticipated data deletions. DeRDaVa can be efficiently approximated and will assign higher values to data that are more useful or less likely to be deleted. We further generalize DeRDaVa to Risk-DeRDaVa to cater to risk-averse/risk-seeking model owners who are concerned with the worst-case/best-case model utility. We also empirically demonstrate the practicality of our solutions. \ No newline at end of file diff --git a/data/2024/aaai/DeS3: Adaptive Attention-Driven Self and Soft Shadow Removal Using ViT Similarity b/data/2024/aaai/DeS3: Adaptive Attention-Driven Self and Soft Shadow Removal Using ViT Similarity new file mode 100644 index 0000000000..3d9ee4fb86 --- /dev/null +++ b/data/2024/aaai/DeS3: Adaptive Attention-Driven Self and Soft Shadow Removal Using ViT Similarity @@ -0,0 +1 @@ +Removing soft and self shadows that lack clear boundaries from a single image is still challenging. Self shadows are shadows that are cast on the object itself. Most existing methods rely on binary shadow masks, without considering the ambiguous boundaries of soft and self shadows. In this paper, we present DeS3, a method that removes hard, soft and self shadows based on adaptive attention and ViT similarity. Our novel ViT similarity loss utilizes features extracted from a pre-trained Vision Transformer. This loss helps guide the reverse sampling towards recovering scene structures. Our adaptive attention is able to differentiate shadow regions from the underlying objects, as well as shadow regions from the object casting the shadow. This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows. Different from existing methods that rely on constraints during the training phase, we incorporate the ViT similarity during the sampling stage. Our method outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows robustly. Specifically, our method outperforms the SOTA method by 16% in whole-image RMSE on the LRSS dataset. 
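A minimal sketch of a ViT-feature similarity loss of the kind the DeS3 abstract describes: features of the shadow-removed output and a reference image are compared with cosine similarity under a frozen ViT. The vit extractor is assumed to return one embedding per image; this is an illustrative stand-in, not the released DeS3 code.
```python
import torch
import torch.nn.functional as F

def vit_similarity_loss(vit, restored, reference):
    """Penalise dissimilarity between frozen-ViT features of the shadow-removed
    output and the reference image (assumed shape: (B, D) embeddings)."""
    with torch.no_grad():
        feat_ref = vit(reference)        # reference features, no gradient
    feat_out = vit(restored)             # gradients flow into the restoration model
    return 1.0 - F.cosine_similarity(feat_out, feat_ref, dim=-1).mean()
```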
\ No newline at end of file diff --git a/data/2024/aaai/Dealing with Numeric and Metric Time Constraints in PDDL3 via Compilation to Numeric Planning b/data/2024/aaai/Dealing with Numeric and Metric Time Constraints in PDDL3 via Compilation to Numeric Planning new file mode 100644 index 0000000000..2998378e4e --- /dev/null +++ b/data/2024/aaai/Dealing with Numeric and Metric Time Constraints in PDDL3 via Compilation to Numeric Planning @@ -0,0 +1,2 @@ +This paper studies an approach to planning with PDDL3 constraints involving mixed propositional and numeric conditions, as well as metric time constraints. +We show how the whole of PDDL3 with instantaneous actions can be compiled away into a numeric planning problem without PDDL3 constraints, enabling the use of any state-of-the-art numeric planner that is agnostic to the existence of PDDL3. Our solution exploits the concept of regression. In addition to a basic compilation, we present an optimized variant based on the observation that it is possible to make the compilation sensitive to the structure of the problem to solve; this can be done by reasoning on the interactions between the problem actions and the constraints. The resulting optimization substantially reduces the size of the planning task. We experimentally observe that our approach significantly outperforms existing state-of-the-art planners supporting the same class of constraints over known benchmark domains, establishing a new state-of-the-art planning system for PDDL3. \ No newline at end of file diff --git a/data/2024/aaai/Debiased Novel Category Discovering and Localization b/data/2024/aaai/Debiased Novel Category Discovering and Localization new file mode 100644 index 0000000000..c84bb27b7b --- /dev/null +++ b/data/2024/aaai/Debiased Novel Category Discovering and Localization @@ -0,0 +1 @@ +In recent years, object detection in deep learning has experienced rapid development. However, most existing object detection models perform well only on closed-set datasets, ignoring a large number of potential objects whose categories are not defined in the training set. These objects are often identified as background or incorrectly classified as pre-defined categories by the detectors. In this paper, we focus on the challenging problem of Novel Class Discovery and Localization (NCDL), aiming to train detectors that can detect the categories present in the training data, while also actively discovering, localizing, and clustering new categories. We analyze existing NCDL methods and identify the core issue: object detectors tend to be biased towards seen objects, and this leads to the neglect of unseen targets. To address this issue, we first propose a Debiased Region Mining (DRM) approach that combines a class-agnostic Region Proposal Network (RPN) and a class-aware RPN in a complementary manner. Additionally, we propose improving the representation network through semi-supervised contrastive learning by leveraging unlabeled data. Finally, we adopt a simple and efficient mini-batch K-means clustering method for novel class discovery. We conduct extensive experiments on the NCDL benchmark, and the results demonstrate that the proposed DRM approach significantly outperforms previous methods, establishing a new state-of-the-art. 
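The final clustering step of the NCDL pipeline can be illustrated with scikit-learn's MiniBatchKMeans over features of proposals that were not matched to known classes; the random feature array and the number of novel clusters below are assumptions made for the sketch.
```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in: embeddings of region proposals not assigned to any known class.
region_features = np.random.randn(10000, 256).astype(np.float32)

kmeans = MiniBatchKMeans(n_clusters=20, batch_size=1024, n_init=3, random_state=0)
novel_cluster_ids = kmeans.fit_predict(region_features)  # pseudo-labels for novel categories
```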
\ No newline at end of file diff --git a/data/2024/aaai/Debiasing Multimodal Sarcasm Detection with Contrastive Learning b/data/2024/aaai/Debiasing Multimodal Sarcasm Detection with Contrastive Learning new file mode 100644 index 0000000000..11ac0a9855 --- /dev/null +++ b/data/2024/aaai/Debiasing Multimodal Sarcasm Detection with Contrastive Learning @@ -0,0 +1 @@ +Despite commendable achievements made by existing work, prevailing multimodal sarcasm detection studies rely more on textual content than on visual information. This unavoidably induces spurious correlations between textual words and labels, thereby significantly hindering the models' generalization capability. To address this problem, we define the task of out-of-distribution (OOD) multimodal sarcasm detection, which aims to evaluate models' generalizability when the word distribution is different in training and testing settings. Moreover, we propose a novel debiasing multimodal sarcasm detection framework with contrastive learning, which aims to mitigate the harmful effect of biased textual factors for robust OOD generalization. In particular, we first design counterfactual data augmentation to construct the positive samples with dissimilar word biases and negative samples with similar word biases. Subsequently, we devise an adapted debiasing contrastive learning mechanism to empower the model to learn robust task-relevant features and alleviate the adverse effect of biased words. Extensive experiments show the superiority of the proposed framework. \ No newline at end of file diff --git a/data/2024/aaai/DeblurSR: Event-Based Motion Deblurring under the Spiking Representation b/data/2024/aaai/DeblurSR: Event-Based Motion Deblurring under the Spiking Representation new file mode 100644 index 0000000000..c0d77bf76a --- /dev/null +++ b/data/2024/aaai/DeblurSR: Event-Based Motion Deblurring under the Spiking Representation @@ -0,0 +1 @@ +We present DeblurSR, a novel motion deblurring approach that converts a blurry image into a sharp video. DeblurSR utilizes event data to compensate for motion ambiguities and exploits the spiking representation to parameterize the sharp output video as a mapping from time to intensity. Our key contribution, the Spiking Representation (SR), is inspired by the neuromorphic principles determining how biological neurons communicate with each other in living organisms. We discuss why the spikes can represent sharp edges and how the spiking parameters are interpreted from the neuromorphic perspective. DeblurSR has higher output quality and requires fewer computing resources than state-of-the-art event-based motion deblurring methods. We additionally show that our approach easily extends to video super-resolution when combined with recent advances in implicit neural representation. \ No newline at end of file diff --git a/data/2024/aaai/Decentralized Gradient-Free Methods for Stochastic Non-smooth Non-convex Optimization b/data/2024/aaai/Decentralized Gradient-Free Methods for Stochastic Non-smooth Non-convex Optimization new file mode 100644 index 0000000000..a776ec5e61 --- /dev/null +++ b/data/2024/aaai/Decentralized Gradient-Free Methods for Stochastic Non-smooth Non-convex Optimization @@ -0,0 +1 @@ +We consider decentralized gradient-free optimization of minimizing Lipschitz continuous functions that satisfy neither the smoothness nor the convexity assumption. 
We propose two novel gradient-free algorithms, the Decentralized Gradient-Free Method (DGFM) and its variant, the Decentralized Gradient-Free Method+ (DGFM+). Based on the techniques of randomized smoothing and gradient tracking, DGFM requires the computation of the zeroth-order oracle of a single sample in each iteration, making it less demanding in terms of computational resources for individual computing nodes. Theoretically, DGFM achieves a complexity of O(d^(3/2)δ^(-1)ε^(-4)) for obtaining a (δ,ε)-Goldstein stationary point. DGFM+, an advanced version of DGFM, incorporates variance reduction to further improve the convergence behavior. It samples a mini-batch at each iteration and periodically draws a larger batch of data, which improves the complexity to O(d^(3/2)δ^(-1)ε^(-3)). Moreover, experimental results underscore the empirical advantages of our proposed algorithms when applied to real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Decentralized Monte Carlo Tree Search for Partially Observable Multi-Agent Pathfinding b/data/2024/aaai/Decentralized Monte Carlo Tree Search for Partially Observable Multi-Agent Pathfinding new file mode 100644 index 0000000000..302e102a82 --- /dev/null +++ b/data/2024/aaai/Decentralized Monte Carlo Tree Search for Partially Observable Multi-Agent Pathfinding @@ -0,0 +1 @@ +The Multi-Agent Pathfinding (MAPF) problem involves finding a set of conflict-free paths for a group of agents confined to a graph. In typical MAPF scenarios, the graph and the agents' starting and ending vertices are known beforehand, allowing the use of centralized planning algorithms. However, in this study, we focus on the decentralized MAPF setting, where the agents may observe the other agents only locally and are restricted in communications with each other. Specifically, we investigate the lifelong variant of MAPF, where new goals are continually assigned to the agents upon completion of previous ones. Drawing inspiration from the successful AlphaZero approach, we propose a decentralized multi-agent Monte Carlo Tree Search (MCTS) method for MAPF tasks. Our approach utilizes the agent's observations to recreate the intrinsic Markov decision process, which is then used for planning with a version of neural MCTS tailored for multi-agent tasks. The experimental results show that our approach outperforms state-of-the-art learnable MAPF solvers. The source code is available at https://github.com/AIRI-Institute/mats-lp. \ No newline at end of file diff --git a/data/2024/aaai/Decentralized Scheduling with QoS Constraints: Achieving O(1) QoS Regret of Multi-Player Bandits b/data/2024/aaai/Decentralized Scheduling with QoS Constraints: Achieving O(1) QoS Regret of Multi-Player Bandits new file mode 100644 index 0000000000..d830e103ee --- /dev/null +++ b/data/2024/aaai/Decentralized Scheduling with QoS Constraints: Achieving O(1) QoS Regret of Multi-Player Bandits @@ -0,0 +1 @@ +We consider a decentralized multi-player multi-armed bandit (MP-MAB) problem where players cannot observe the actions and rewards of other players and no explicit communication or coordination between players is possible. Prior studies mostly focus on maximizing the sum of rewards of the players over time. However, maximizing the total reward may lead to imbalanced rewards among players, resulting in poor Quality of Service (QoS) for some players. 
In contrast, our objective is to let each player n achieve a predetermined expected average reward over time, i.e., achieving a predetermined level of QoS. We develop a novel decentralized MP-MAB algorithm to accomplish this objective by leveraging the methodology of randomized matching. We prove that our decentralized algorithm can ensure that all players have an O(1) QoS regret. We also reveal an analogy between our MP-MAB model and online wireless queuing systems, which builds a connection between QoS in MP-MAB learning and stability in queuing theory. \ No newline at end of file diff --git a/data/2024/aaai/Decentralized Sum-of-Nonconvex Optimization b/data/2024/aaai/Decentralized Sum-of-Nonconvex Optimization new file mode 100644 index 0000000000..2f1df0cc9c --- /dev/null +++ b/data/2024/aaai/Decentralized Sum-of-Nonconvex Optimization @@ -0,0 +1 @@ +We consider the optimization problem of minimizing the sum-of-nonconvex function, i.e., a convex function that is the average of nonconvex components. The existing stochastic algorithms for such a problem only focus on a single machine and the centralized scenario. In this paper, we study the sum-of-nonconvex optimization in the decentralized setting. We present a new theoretical analysis of the PMGT-SVRG algorithm for this problem and prove the linear convergence of this approach. However, the convergence rate of the PMGT-SVRG algorithm has a linear dependency on the condition number, which is undesirable for ill-conditioned problems. To remedy this issue, we propose an accelerated stochastic decentralized first-order algorithm by incorporating the techniques of acceleration, gradient tracking, and multi-consensus mixing into the SVRG algorithm. The convergence rate of the proposed method has a square-root dependency on the condition number. The numerical experiments validate the theoretical guarantee of our proposed algorithms on both synthetic and real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation b/data/2024/aaai/Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation new file mode 100644 index 0000000000..7fe87ae6ef --- /dev/null +++ b/data/2024/aaai/Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation @@ -0,0 +1 @@ +Understanding and accurately explaining compatibility relationships between fashion items is a challenging problem in the burgeoning domain of AI-driven outfit recommendations. Present models, while making strides in this area, still occasionally fall short, offering explanations that can be elementary and repetitive. This work aims to address these shortcomings by introducing the Pair Fashion Explanation (PFE) dataset, a unique resource that has been curated to illuminate these compatibility relationships. Furthermore, we propose an innovative two-stage pipeline model that leverages this dataset. Fine-tuning on this dataset allows the model to generate explanations that convey the compatibility relationships between items. Our experiments showcase the model's potential in crafting descriptions that are knowledgeable, aligned with ground-truth matching correlations, and that produce understandable and informative descriptions, as assessed by both automatic metrics and human evaluation. 
\ No newline at end of file diff --git a/data/2024/aaai/Deciphering Raw Data in Neuro-Symbolic Learning with Provable Guarantees b/data/2024/aaai/Deciphering Raw Data in Neuro-Symbolic Learning with Provable Guarantees new file mode 100644 index 0000000000..64db8168ee --- /dev/null +++ b/data/2024/aaai/Deciphering Raw Data in Neuro-Symbolic Learning with Provable Guarantees @@ -0,0 +1 @@ +Neuro-symbolic hybrid systems are promising for integrating machine learning and symbolic reasoning, where perception models are facilitated with information inferred from a symbolic knowledge base through logical reasoning. Despite empirical evidence showing the ability of hybrid systems to learn accurate perception models, the theoretical understanding of learnability is still lacking. Hence, it remains unclear why a hybrid system succeeds for a specific task and when it may fail given a different knowledge base. In this paper, we introduce a novel way of characterising supervision signals from a knowledge base, and establish a criterion for determining the knowledge’s efficacy in facilitating successful learning. This, for the first time, allows us to address the two questions above by inspecting the knowledge base under investigation. Our analysis suggests that many knowledge bases satisfy the criterion, thus enabling effective learning, while some fail to satisfy it, indicating potential failures. Comprehensive experiments confirm the utility of our criterion on benchmark tasks. \ No newline at end of file diff --git a/data/2024/aaai/Decision-Making for Land Conservation: A Derivative-Free Optimization Framework with Nonlinear Inputs b/data/2024/aaai/Decision-Making for Land Conservation: A Derivative-Free Optimization Framework with Nonlinear Inputs new file mode 100644 index 0000000000..a4140c38d4 --- /dev/null +++ b/data/2024/aaai/Decision-Making for Land Conservation: A Derivative-Free Optimization Framework with Nonlinear Inputs @@ -0,0 +1,6 @@ +Protected areas (PAs) are designated spaces where human activities are restricted to preserve critical habitats. Decision-makers are challenged with balancing a trade-off of financial feasibility with ecological benefit when establishing PAs. Given the long-term ramifications of these decisions and the constantly shifting environment, it is crucial that PAs are carefully selected with long-term viability in mind. + +Using AI tools like simulation and optimization is common for designating PAs, but current decision models are primarily linear. In this paper, we propose a derivative-free optimization framework paired with a nonlinear component, population viability analysis (PVA). Formulated as a mixed integer nonlinear programming (MINLP) problem, our model allows for linear and nonlinear inputs. Connectivity, competition, crowding, and other similar concerns are handled by the PVA software, rather than expressed as constraints of the optimization model. In addition, we present numerical results that serve as a proof of concept, showing our models yield PAs with similar expected risk to that of preserving every parcel in a habitat, but at a significantly lower cost. + +The overall goal is to promote interdisciplinary work by providing a new mathematical programming tool for conservationists that allows for nonlinear inputs and can be paired with existing ecological software. The code and data are available at +https://github.com/cassiebuhler/conservation-dfo. 
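A toy derivative-free search in the spirit of the conservation framework above: parcel selection is a binary vector, the population viability analysis is treated as a black box returning an extinction-risk score, and a simple random local search trades acquisition cost against risk. The pva_risk function, parcel costs, and the risk weighting are hypothetical stand-ins, not the authors' model or data.
```python
import numpy as np

rng = np.random.default_rng(0)
n_parcels = 50
parcel_costs = rng.uniform(1.0, 10.0, n_parcels)      # hypothetical acquisition costs

def pva_risk(selection):
    """Placeholder for a black-box population viability analysis call:
    returns an extinction-risk score for the protected-area configuration."""
    protected = selection.sum()
    return np.exp(-0.15 * protected) + 0.01 * rng.standard_normal()

def objective(selection, risk_weight=100.0):
    # derivative-free objective: money spent plus weighted black-box risk
    return parcel_costs @ selection + risk_weight * pva_risk(selection)

# simple random local search (one bit flip per iteration) as the DFO loop
best = rng.integers(0, 2, n_parcels)
best_val = objective(best)
for _ in range(2000):
    cand = best.copy()
    cand[rng.integers(n_parcels)] ^= 1                 # flip one parcel in or out
    val = objective(cand)
    if val < best_val:
        best, best_val = cand, val
print(best_val, int(best.sum()), "parcels selected")
```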
\ No newline at end of file diff --git a/data/2024/aaai/Decoding AI's Nudge: A Unified Framework to Predict Human Behavior in AI-Assisted Decision Making b/data/2024/aaai/Decoding AI's Nudge: A Unified Framework to Predict Human Behavior in AI-Assisted Decision Making new file mode 100644 index 0000000000..77be16c203 --- /dev/null +++ b/data/2024/aaai/Decoding AI's Nudge: A Unified Framework to Predict Human Behavior in AI-Assisted Decision Making @@ -0,0 +1,3 @@ +With the rapid development of AI-based decision aids, different forms of AI assistance have been increasingly integrated into the human decision making processes. To best support humans in decision making, it is essential to quantitatively understand how diverse forms of AI assistance influence humans' decision making behavior. To this end, much of the current research focuses on the end-to-end prediction of human behavior using ``black-box'' models, often lacking interpretations of the nuanced ways in which AI assistance impacts the human decision making process. +Meanwhile, methods that prioritize the interpretability of human behavior predictions are often tailored for one specific form of AI assistance, making adaptations to other forms of assistance difficult. In this paper, we propose a computational framework that can provide an interpretable characterization of the influence of different forms of AI assistance on decision makers in AI-assisted decision making. By conceptualizing AI assistance as the ``nudge'' in human decision making processes, our approach centers around modelling how different forms of AI assistance modify humans' strategy in weighing different information in making their decisions. Evaluations on behavior data collected from real human decision makers +show that the proposed framework outperforms various baselines in accurately predicting human behavior in AI-assisted decision making. Based on the proposed framework, we further provide insights into how individuals with different cognitive styles are nudged by AI assistance differently. \ No newline at end of file diff --git a/data/2024/aaai/Decoding Global Preferences: Temporal and Cooperative Dependency Modeling in Multi-Agent Preference-Based Reinforcement Learning b/data/2024/aaai/Decoding Global Preferences: Temporal and Cooperative Dependency Modeling in Multi-Agent Preference-Based Reinforcement Learning new file mode 100644 index 0000000000..f9073cec39 --- /dev/null +++ b/data/2024/aaai/Decoding Global Preferences: Temporal and Cooperative Dependency Modeling in Multi-Agent Preference-Based Reinforcement Learning @@ -0,0 +1,2 @@ +Designing accurate reward functions for reinforcement learning (RL) has long been challenging. Preference-based RL (PbRL) offers a promising approach by using human preferences +to train agents, eliminating the need for manual reward design. While successful in single-agent tasks, extending PbRL to complex multi-agent scenarios is nontrivial. Existing PbRL methods lack the capacity to comprehensively capture both temporal and cooperative aspects, leading to inadequate reward functions. This work introduces an advanced multi-agent preference learning framework that effectively addresses these limitations. Based on a cascading Transformer architecture, our approach captures both temporal and cooperative dependencies, alleviating issues related to reward uniformity and intricate interactions among agents. 
Experimental results demonstrate substantial performance improvements in multi-agent cooperative tasks, and the reconstructed reward function closely resembles expert-defined reward functions. The source code is available at https://github.com/catezi/MAPT. \ No newline at end of file diff --git a/data/2024/aaai/Decomposing Constraint Networks for Calculating c-Representations b/data/2024/aaai/Decomposing Constraint Networks for Calculating c-Representations new file mode 100644 index 0000000000..ce8c29c858 --- /dev/null +++ b/data/2024/aaai/Decomposing Constraint Networks for Calculating c-Representations @@ -0,0 +1 @@ +It is well-known from probability theory that network-based methods like Bayesian networks constitute remarkable frameworks for efficient probabilistic reasoning. In this paper, we focus on qualitative default reasoning based on Spohn’s ranking functions for which network-based methods have not yet been studied satisfactorily. With constraint networks, we develop a framework for iterative calculations of c-representations, a family of ranking models of conditional belief bases which show outstanding properties from a commonsense and formal point of view, that are characterized by assigning possible worlds a degree of implausibility via penalizing the falsification of conditionals. Constraint networks unveil the dependencies among these penalty points (and hence among the conditionals) and make it possible to compute the penalty points locally on so-called safe sub-bases. As an application of our framework, we show that skeptical c-inferences can be drawn locally from safe sub-bases without losing validity. \ No newline at end of file diff --git a/data/2024/aaai/Decomposing Semantic Shifts for Composed Image Retrieval b/data/2024/aaai/Decomposing Semantic Shifts for Composed Image Retrieval new file mode 100644 index 0000000000..5adf9af958 --- /dev/null +++ b/data/2024/aaai/Decomposing Semantic Shifts for Composed Image Retrieval @@ -0,0 +1 @@ +Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image. However, most existing methods focus on the composition learning of text and reference images and oversimplify the text as a description, neglecting the inherent structure and the user's shifting intention of the texts. As a result, these methods typically take shortcuts that disregard the visual cue of the reference images. To address this issue, we reconsider the text as instructions and propose a Semantic Shift Network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image. Specifically, SSN explicitly decomposes the instructions into two components: degradation and upgradation, where the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representations to retrieve the desired target image. The experimental results show that the proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance. The code is available at https://github.com/starxing-yuu/SSN. 
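A schematic sketch of the two-step shift described in the SSN abstract above: the instruction is split into a "degradation" part that strips reference-specific content down to a visual prototype and an "upgradation" part that enriches the prototype toward the target embedding. The encoders, dimensions, and the additive composition are assumptions made for illustration only.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512  # assumed embedding size

# hypothetical stand-ins for the image encoder and the two shift branches
img_encoder = nn.Linear(2048, dim)
degrade_net = nn.Linear(768, dim)   # maps "degradation" text features to a shift
upgrade_net = nn.Linear(768, dim)   # maps "upgradation" text features to a shift

def semantic_shift(ref_img_feat, degrade_text_feat, upgrade_text_feat):
    ref = img_encoder(ref_img_feat)
    prototype = ref - degrade_net(degrade_text_feat)            # step 1: strip reference-specific content
    target_query = prototype + upgrade_net(upgrade_text_feat)   # step 2: add the requested modification
    return F.normalize(target_query, dim=-1)                    # query used to retrieve the target image

# toy usage with random features
query = semantic_shift(torch.randn(1, 2048), torch.randn(1, 768), torch.randn(1, 768))
```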
\ No newline at end of file diff --git a/data/2024/aaai/Decomposing Temporal Equilibrium Strategy for Coordinated Distributed Multi-Agent Reinforcement Learning b/data/2024/aaai/Decomposing Temporal Equilibrium Strategy for Coordinated Distributed Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..f4acad47bd --- /dev/null +++ b/data/2024/aaai/Decomposing Temporal Equilibrium Strategy for Coordinated Distributed Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +The increasing demands for system complexity and robustness have prompted the integration of temporal logic into Multi-Agent Reinforcement Learning (MARL) to address tasks with non-Markovian properties. However, incorporating non-Markovian properties introduces additional computational complexities, as agents are required to integrate historical data into their decision-making process. Also, optimizing strategies within a multi-agent environment presents significant challenges due to the exponential growth of the state space with the number of agents. In this study, we introduce an innovative hierarchical MARL framework that synthesizes temporal equilibrium strategies through parity games and subsequently encodes them as individual reward machines for MARL coordination. More specifically, we reduce the strategy synthesis problem into an emptiness problem concerning parity games with optimized states and transitions. Following this synthesis step, the temporal equilibrium strategy is decomposed into individual reward machines for decentralized MARL. Theoretical proofs are provided to verify the consistency of the Nash equilibrium between the parallel composition of decomposed strategies and the original strategy. Empirical evidence confirms the efficacy of the proposed synthesis technique, showcasing its ability to reduce state space compared to the state-of-the-art tool. Furthermore, our study highlights the superior performance of the distributed MARL paradigm over centralized approaches when deploying decomposed strategies. \ No newline at end of file diff --git a/data/2024/aaai/Decompositions in Compositional Translation of LTLf to DFA (Student Abstract) b/data/2024/aaai/Decompositions in Compositional Translation of LTLf to DFA (Student Abstract) new file mode 100644 index 0000000000..562e459b5a --- /dev/null +++ b/data/2024/aaai/Decompositions in Compositional Translation of LTLf to DFA (Student Abstract) @@ -0,0 +1 @@ +Prior compositional methods in LTLf to DFA conversion have focussed on improving the composition phase. In this work, we examine improvements to the decomposition phase that result in overall improvements in LTLf to DFA translation. Our work is based on reducing the structure of the underlying Abstract Syntax Tree (AST) of a formula such that the new AST results in fewer composition operations. \ No newline at end of file diff --git a/data/2024/aaai/Decouple Content and Motion for Conditional Image-to-Video Generation b/data/2024/aaai/Decouple Content and Motion for Conditional Image-to-Video Generation new file mode 100644 index 0000000000..0ccf97575f --- /dev/null +++ b/data/2024/aaai/Decouple Content and Motion for Conditional Image-to-Video Generation @@ -0,0 +1 @@ +The goal of conditional image-to-video (cI2V) generation is to create a believable new video by beginning with the condition, i.e., one image and text. The previous cI2V generation methods conventionally perform in RGB pixel space, with limitations in modeling motion consistency and visual continuity. 
Additionally, the efficiency of generating videos in pixel space is quite low. In this paper, we propose a novel approach to address these challenges by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions. Specifically, we predict temporal motions, which include motion vectors and residuals, based on a 3D-UNet diffusion model. By explicitly modeling temporal motions and warping them to the starting image, we improve the temporal consistency of generated videos. This results in a reduction of spatial redundancy, emphasizing temporal details. Our proposed method achieves performance improvements by disentangling content and motion, all without introducing new structural complexities to the model. Extensive experiments on various datasets confirm our approach's superior performance over the majority of state-of-the-art methods in both effectiveness and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/Decoupled Contrastive Learning for Long-Tailed Recognition b/data/2024/aaai/Decoupled Contrastive Learning for Long-Tailed Recognition new file mode 100644 index 0000000000..de3d205789 --- /dev/null +++ b/data/2024/aaai/Decoupled Contrastive Learning for Long-Tailed Recognition @@ -0,0 +1,2 @@ +Supervised Contrastive Loss (SCL) is popular in visual representation learning. + Given an anchor image, SCL pulls together two types of positive samples, i.e., its augmentation and other images from the same class, while pushing negative images apart to optimize the learned embedding. In the scenario of long-tailed recognition, where the number of samples in each class is imbalanced, treating the two types of positive samples equally leads to biased optimization of the intra-category distance. In addition, the similarity relationships among negative samples, which are ignored by SCL, also present meaningful semantic cues. To improve the performance on long-tailed recognition, this paper addresses those two issues of SCL by decoupling the training objective. Specifically, it decouples the two types of positives in SCL and optimizes their relations toward different objectives to alleviate the influence of the imbalanced dataset. We further propose a patch-based self-distillation to transfer knowledge from head to tail classes to relieve the under-representation of tail classes. It uses patch-based features to mine shared visual patterns among different instances and leverages a self-distillation procedure to transfer such knowledge. Experiments on different long-tailed classification benchmarks demonstrate the superiority of our method. For instance, it achieves 57.7% top-1 accuracy on the ImageNet-LT dataset. Combined with the ensemble-based method, the performance can be further boosted to 59.7%, which substantially outperforms many recent works. Our code will be released. \ No newline at end of file diff --git a/data/2024/aaai/Decoupled Optimisation for Long-Tailed Visual Recognition b/data/2024/aaai/Decoupled Optimisation for Long-Tailed Visual Recognition new file mode 100644 index 0000000000..c3e68e5908 --- /dev/null +++ b/data/2024/aaai/Decoupled Optimisation for Long-Tailed Visual Recognition @@ -0,0 +1,2 @@ +When training on a long-tailed dataset, conventional learning algorithms tend to exhibit a bias towards classes with a larger sample size. 
Our investigation has revealed that this biased learning tendency originates from the model parameters, which are trained to disproportionately contribute to the classes characterised by their sample size (e.g., many, medium, and few classes). +To balance the overall parameter contribution across all classes, we investigate the importance of each model parameter to the learning of different class groups, and propose a multistage parameter Decouple and Optimisation (DO) framework that decouples parameters into different groups with each group learning a specific portion of classes. To optimise the parameter learning, we apply different training objectives with a collaborative optimisation step to learn complementary information about each class group. Extensive experiments on long-tailed datasets, including CIFAR100, Places-LT, ImageNet-LT, and iNaturalist 2018, show that our framework achieves competitive performance compared to the state-of-the-art. \ No newline at end of file diff --git a/data/2024/aaai/Decoupled Textual Embeddings for Customized Image Generation b/data/2024/aaai/Decoupled Textual Embeddings for Customized Image Generation new file mode 100644 index 0000000000..81567fe9f0 --- /dev/null +++ b/data/2024/aaai/Decoupled Textual Embeddings for Customized Image Generation @@ -0,0 +1 @@ +Customized text-to-image generation, which aims to learn user-specified concepts with a few images, has drawn significant attention recently. However, existing methods usually suffer from overfitting issues and entangle the subject-unrelated information (e.g., background and pose) with the learned concept, limiting the potential to compose the concept into new scenes. To address these issues, we propose DETEX, a novel approach that learns a disentangled concept embedding for flexible customized text-to-image generation. Unlike conventional methods that learn a single concept embedding from the given images, our DETEX represents each image using multiple word embeddings during training, i.e., a learnable image-shared subject embedding and several image-specific subject-unrelated embeddings. To decouple irrelevant attributes (i.e., background and pose) from the subject embedding, we further present several attribute mappers that encode each image as several image-specific subject-unrelated embeddings. To encourage these unrelated embeddings to capture the irrelevant information, we incorporate them with corresponding attribute words and propose a joint training strategy to facilitate the disentanglement. During inference, we only use the subject embedding for image generation, while selectively using image-specific embeddings to retain image-specific attributes. Extensive experiments demonstrate that the subject embedding obtained by our method can faithfully represent the target concept, while showing superior editability compared to the state-of-the-art methods. Our code will be available at https://github.com/PrototypeNx/DETEX. 
To tackle the challenges of dataset bias and domain domination, numerous MDL approaches have been proposed from the perspectives of seeking commonalities by aligning distributions to reduce the domain gap or reserving differences by implementing domain-specific towers, gates, and even experts. MDL models are becoming more and more complex with sophisticated network architectures or loss functions, introducing extra parameters and enlarging computation costs. In this paper, we propose a frustratingly easy and hyperparameter-free multi-domain learning method named Decoupled Training (D-Train). D-Train is a tri-phase general-to-specific training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multiple heads, and finally fine-tunes the heads by fixing the backbone, enabling decoupled training to achieve domain independence. Despite its extraordinary simplicity and efficiency, D-Train performs remarkably well in extensive evaluations of various datasets from standard benchmarks to applications of satellite imagery and recommender systems. \ No newline at end of file diff --git a/data/2024/aaai/Decoupling Degradations with Recurrent Network for Video Restoration in Under-Display Camera b/data/2024/aaai/Decoupling Degradations with Recurrent Network for Video Restoration in Under-Display Camera new file mode 100644 index 0000000000..d11914bddd --- /dev/null +++ b/data/2024/aaai/Decoupling Degradations with Recurrent Network for Video Restoration in Under-Display Camera @@ -0,0 +1 @@ +Under-display camera (UDC) systems are the foundation of full-screen display devices in which the lens mounts under the display. The pixel array of light-emitting diodes used for display diffracts and attenuates incident light, causing various degradations as the light intensity changes. Unlike general video restoration, which recovers video by treating different degradation factors equally, video restoration for UDC systems is more challenging in that it concerns removing diverse degradations over time while preserving temporal consistency. In this paper, we introduce a novel video restoration network, called D2RNet, specifically designed for UDC systems. It employs a set of Decoupling Attention Modules (DAM) that effectively separate the various video degradation factors. More specifically, a soft mask generation function is proposed to decompose each frame into flare and haze based on the diffraction arising from incident light of different intensities, followed by the proposed flare and haze removal components that leverage long- and short-term feature learning to handle the respective degradations. Such a design offers a targeted and effective solution to eliminating various types of degradation in UDC systems. We further extend our design to multiple scales to overcome the scale changes of degradation that often occur in long-range videos. To demonstrate the superiority of D2RNet, we propose a large-scale UDC video benchmark by gathering HDR videos and generating realistically degraded videos using the point spread function measured by a commercial UDC system. Extensive quantitative and qualitative evaluations demonstrate the superiority of D2RNet compared to other state-of-the-art video restoration and UDC image restoration methods. 
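One way to picture a soft mask that splits a UDC frame into flare-dominated and haze-dominated components is an intensity-driven sigmoid gate. The threshold, temperature, and crude luminance proxy below are illustrative guesses, not the learned mask generation function proposed in D2RNet.
```python
import torch

def soft_flare_haze_split(frame, threshold=0.7, temperature=0.05):
    """frame: (B, 3, H, W) in [0, 1]. Bright regions are attributed to flare,
    the remainder to haze, via a smooth (soft) mask rather than a hard cut."""
    luminance = frame.mean(dim=1, keepdim=True)                    # crude luminance proxy
    mask = torch.sigmoid((luminance - threshold) / temperature)    # soft mask in (0, 1)
    flare_component = mask * frame
    haze_component = (1.0 - mask) * frame
    return flare_component, haze_component, mask
```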
\ No newline at end of file diff --git a/data/2024/aaai/Decoupling Representation and Knowledge for Few-Shot Intent Classification and Slot Filling b/data/2024/aaai/Decoupling Representation and Knowledge for Few-Shot Intent Classification and Slot Filling new file mode 100644 index 0000000000..8a60933546 --- /dev/null +++ b/data/2024/aaai/Decoupling Representation and Knowledge for Few-Shot Intent Classification and Slot Filling @@ -0,0 +1 @@ +Few-shot intent classification and slot filling are important but challenging tasks due to the scarcity of finely labeled data. Therefore, current works first train a model on source domains with sufficiently labeled data, and then transfer the model to target domains where only rarely labeled data is available. However, transferring experience as a whole usually suffers from gaps that exist between source domains and target domains. For instance, transferring domain-specific-knowledge-related experience is difficult. To tackle this problem, we propose a new method that explicitly decouples the transferring of general-semantic-representation-related experience and domain-specific-knowledge-related experience. Specifically, for domain-specific-knowledge-related experience, we design two modules to capture the intent-slot relation and the slot-slot relation, respectively. Extensive experiments on the Snips and FewJoint datasets show that our method achieves state-of-the-art performance. The method improves the joint accuracy metric from 27.72% to 42.20% in the 1-shot setting, and from 46.54% to 60.79% in the 5-shot setting. \ No newline at end of file diff --git a/data/2024/aaai/Decoupling User Relationships Guides Information Diffusion Prediction (Student Abstract) b/data/2024/aaai/Decoupling User Relationships Guides Information Diffusion Prediction (Student Abstract) new file mode 100644 index 0000000000..b3b36897a4 --- /dev/null +++ b/data/2024/aaai/Decoupling User Relationships Guides Information Diffusion Prediction (Student Abstract) @@ -0,0 +1 @@ +Information diffusion prediction is a critical task for many social network applications. However, current methods are mainly limited in one key aspect: the user relationships behind resharing behaviors are complex and entangled. To address this issue, we propose MHGFormer, a novel multi-channel hypergraph transformer framework, to better decouple complex user relations and obtain fine-grained user representations. First, we employ designed triangular motifs to decouple user relations into three hypergraphs at different levels. Second, a position-aware hypergraph transformer is used to refine user relations and obtain high-quality user representations. Extensive experiments conducted on two social datasets demonstrate that MHGFormer outperforms state-of-the-art diffusion models across several settings. \ No newline at end of file diff --git a/data/2024/aaai/Deep Contrastive Graph Learning with Clustering-Oriented Guidance b/data/2024/aaai/Deep Contrastive Graph Learning with Clustering-Oriented Guidance new file mode 100644 index 0000000000..dcb734a962 --- /dev/null +++ b/data/2024/aaai/Deep Contrastive Graph Learning with Clustering-Oriented Guidance @@ -0,0 +1 @@ +Graph Convolutional Network (GCN) has exhibited remarkable potential in improving graph-based clustering. To handle the general clustering scenario without a prior graph, these models estimate an initial graph beforehand to apply GCN. 
Throughout the literature, we have witnessed that 1) most models focus on the initial graph while neglecting the original features, so the discriminability of the learned representation may be corrupted by a low-quality initial graph; 2) the training procedure lacks effective clustering guidance, which may lead to the incorporation of clustering-irrelevant information into the learned graph. To tackle these problems, the Deep Contrastive Graph Learning (DCGL) model is proposed for general data clustering. Specifically, we establish a pseudo-siamese network, which incorporates an auto-encoder with GCN to emphasize both the graph structure and the original features. On this basis, feature-level contrastive learning is introduced to enhance the discriminative capacity, and the relationship between samples and centroids is employed as the clustering-oriented guidance. Afterward, a two-branch graph learning mechanism is designed to extract the local and global structural relationships, which are further embedded into a unified graph under the cluster-level contrastive guidance. Experimental results on several benchmark datasets demonstrate the superiority of DCGL against state-of-the-art algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Deep Copula-Based Survival Analysis for Dependent Censoring with Identifiability Guarantees b/data/2024/aaai/Deep Copula-Based Survival Analysis for Dependent Censoring with Identifiability Guarantees new file mode 100644 index 0000000000..ff3d8f00c4 --- /dev/null +++ b/data/2024/aaai/Deep Copula-Based Survival Analysis for Dependent Censoring with Identifiability Guarantees @@ -0,0 +1 @@ +Censoring is the central problem in survival analysis, where either the time-to-event (for instance, death) or the time-to-censoring (such as loss of follow-up) is observed for each sample. The majority of existing machine learning-based survival analysis methods assume that survival is conditionally independent of censoring given a set of covariates; an assumption that cannot be verified since only marginal distributions are available from the data. The existence of dependent censoring, along with the inherent bias in current estimators, has been demonstrated in a variety of applications, accentuating the need for a more nuanced approach. However, existing methods that adjust for dependent censoring require practitioners to specify the ground truth copula. This requirement poses a significant challenge for practical applications, as model misspecification can lead to substantial bias. In this work, we propose a flexible deep learning-based survival analysis method that simultaneously accommodates dependent censoring and eliminates the requirement for specifying the ground truth copula. We theoretically prove the identifiability of our model under a broad family of copulas and survival distributions. Experimental results from a wide range of datasets demonstrate that our approach successfully discerns the underlying dependency structure and significantly reduces survival estimation bias when compared to existing methods. 
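The copula view of dependent censoring can be illustrated with a Clayton copula: the joint survival of the event time T and the censoring time C can be coupled through a copula applied to the marginal survival functions. The sketch below only evaluates that coupling for hypothetical marginal survival values; the paper's deep parameterization and likelihood are not reproduced.
```python
import numpy as np

def clayton_copula(u, v, theta=1.0):
    """Clayton copula C_theta(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Hypothetical marginal survival probabilities at some time t:
S_T, S_C = 0.7, 0.6
joint_survival = clayton_copula(S_T, S_C, theta=2.0)   # P(T > t, C > t) under the copula
print(joint_survival)
```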
\ No newline at end of file diff --git a/data/2024/aaai/Deep Hierarchical Video Compression b/data/2024/aaai/Deep Hierarchical Video Compression new file mode 100644 index 0000000000..b48612dc51 --- /dev/null +++ b/data/2024/aaai/Deep Hierarchical Video Compression @@ -0,0 +1 @@ +Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors to predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common testing videos and demonstrates computational friendliness with a much smaller memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding, which is favored in networked video applications with packet loss. \ No newline at end of file diff --git a/data/2024/aaai/Deep Homography Estimation for Visual Place Recognition b/data/2024/aaai/Deep Homography Estimation for Visual Place Recognition new file mode 100644 index 0000000000..3195abdca8 --- /dev/null +++ b/data/2024/aaai/Deep Homography Estimation for Visual Place Recognition @@ -0,0 +1 @@ +Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, hierarchical VPR methods have received considerable attention due to their trade-off between accuracy and efficiency. They usually first use global features to retrieve candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting a homography, which is time-consuming and non-differentiable. This forces existing methods to compromise by training the network only for global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits a homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels, which can also be jointly trained with the backbone network to help it extract features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. It is also more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
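To make concrete the geometric verification step that the Deep Homography Estimation (DHE) abstract above seeks to replace, here is a minimal sketch of the conventional RANSAC-based re-ranking used by hierarchical VPR pipelines: fit a homography to matched local features and score each retrieved candidate by its inlier count. This is the baseline idea only, not the paper's learnable DHE network; the matched keypoint arrays are hypothetical inputs, and OpenCV is assumed to be available.

```python
import numpy as np
import cv2

def inlier_count(query_pts, candidate_pts, reproj_thresh=3.0):
    """Score a candidate image by the number of RANSAC inliers of a fitted homography.

    query_pts, candidate_pts: (N, 2) arrays of matched local-feature coordinates
    (how the matches are obtained is out of scope for this sketch).
    """
    if len(query_pts) < 4:  # a homography needs at least 4 correspondences
        return 0
    _, mask = cv2.findHomography(
        query_pts.astype(np.float32),
        candidate_pts.astype(np.float32),
        cv2.RANSAC,
        reproj_thresh,
    )
    return 0 if mask is None else int(mask.sum())

def rerank(matches_per_candidate):
    """matches_per_candidate: dict mapping candidate name -> (query_pts, candidate_pts)."""
    scores = {name: inlier_count(q, c) for name, (q, c) in matches_per_candidate.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The abstract's point is that this RANSAC step is slow and non-differentiable, which is why DHE replaces it with a network that can be trained jointly with the backbone.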
\ No newline at end of file diff --git a/data/2024/aaai/Deep Incomplete Multi-View Learning Network with Insufficient Label Information b/data/2024/aaai/Deep Incomplete Multi-View Learning Network with Insufficient Label Information new file mode 100644 index 0000000000..82c1f3b8f1 --- /dev/null +++ b/data/2024/aaai/Deep Incomplete Multi-View Learning Network with Insufficient Label Information @@ -0,0 +1 @@ +Due to the effectiveness of integrating semantic consensus and complementary information across different views, multi-view classification methods have attracted much attention in recent years. However, multi-view data often suffers from both missing view features and insufficient label information, which significantly decreases the performance of traditional multi-view classification methods in practice. Learning under such a simultaneous lack of features and labels is crucial but rarely studied. To tackle these problems, in this paper we propose a novel Deep Incomplete Multi-view Learning Network (DIMvLN) that incorporates graph networks and semi-supervised learning. Specifically, DIMvLN first designs deep graph networks to effectively recover missing data, assigning pseudo-labels to large amounts of unlabeled instances and refining the incomplete feature information. Meanwhile, to enhance the label information, a novel pseudo-label generation strategy with similarity constraints on unlabeled instances is proposed to exploit additional supervisory information and guide the completion module to preserve more semantic information of absent multi-view data. Besides, we design view-specific representation extractors with the autoencoder structure and a contrastive loss to learn high-level semantic representations for each view, promote cross-view consistency and augment the separability between different categories. Finally, extensive experimental results demonstrate the effectiveness of our DIMvLN, attaining noteworthy performance improvements compared to state-of-the-art competitors on several public benchmark datasets. Code will be available at GitHub. \ No newline at end of file diff --git a/data/2024/aaai/Deep Learning for Style Transfer and Experimentation with Audio Effects and Music Creation b/data/2024/aaai/Deep Learning for Style Transfer and Experimentation with Audio Effects and Music Creation new file mode 100644 index 0000000000..cdf06c85e5 --- /dev/null +++ b/data/2024/aaai/Deep Learning for Style Transfer and Experimentation with Audio Effects and Music Creation @@ -0,0 +1 @@ +Recent advancements in deep learning have the potential to transform the process of writing and creating music. Models that can capture and analyze higher-level representations of music and audio can serve to change the field of digital signal processing. In this statement, I propose a set of Music+AI methods that serve to assist with the writing of melodies, the modelling and transferring of timbres, the application of a wide variety of audio effects, including research into experimental audio effects, and the production of audio samples using style transfers. Writing and producing music is a tedious task that is notably difficult to become proficient in, as many tools for creating music both cost significant sums of money and require long-term commitments to study. An all-encompassing framework for music processing would make the process much more accessible and simple, and would allow human art to advance alongside technology.
\ No newline at end of file diff --git a/data/2024/aaai/Deep Learning on Graphs: A Data-Centric Exploration b/data/2024/aaai/Deep Learning on Graphs: A Data-Centric Exploration new file mode 100644 index 0000000000..209ec08371 --- /dev/null +++ b/data/2024/aaai/Deep Learning on Graphs: A Data-Centric Exploration @@ -0,0 +1 @@ +Many learning tasks in Artificial Intelligence (AI) require dealing with graph data, ranging from biology and chemistry to finance and education. As powerful deep learning tools for graphs, graph neural networks (GNNs) have demonstrated remarkable performance in various graph-related applications. Despite the significant accomplishments of GNNs, recent studies have highlighted that their efficiency and effectiveness face significant challenges such as adversarial robustness and scalability, which are fundamentally linked to data. While major attention has been devoted to improving GNNs from the model perspective, the potential of directly enhancing data has often been overlooked. It underscores a critical gap in GNN research---while model improvements are undoubtedly important, we also need to recognize and address the data-related factors contributing to the challenges. Hence, my research is to investigate solutions for these challenges from the data perspective, employing strategies such as data characterization, reduction, augmentation, transformation, and detection. \ No newline at end of file diff --git a/data/2024/aaai/Deep Reinforcement Learning for Communication Networks b/data/2024/aaai/Deep Reinforcement Learning for Communication Networks new file mode 100644 index 0000000000..a67b603428 --- /dev/null +++ b/data/2024/aaai/Deep Reinforcement Learning for Communication Networks @@ -0,0 +1 @@ +This research explores optimizing communication tasks with (Multi-Agent) Reinforcement Learning (RL/MARL) in Point-to-Point and Group Communication (GC) networks. The study initially applied RL for Congestion Control in networks with dynamic link properties, yielding competitive results. Then, it focused on the challenge of effective message dissemination in GC networks, by framing a novel game-theoretic formulation and designing methods to solve the task based on MARL and Graph Convolution. Future research will deepen the exploration of MARL in GC. This will contribute to both academic knowledge and practical advancements in the next generation of communication protocols. \ No newline at end of file diff --git a/data/2024/aaai/Deep Reinforcement Learning for Early Diagnosis of Lung Cancer b/data/2024/aaai/Deep Reinforcement Learning for Early Diagnosis of Lung Cancer new file mode 100644 index 0000000000..dc8e7bccb6 --- /dev/null +++ b/data/2024/aaai/Deep Reinforcement Learning for Early Diagnosis of Lung Cancer @@ -0,0 +1 @@ +Lung cancer remains the leading cause of cancer-related death worldwide, and early diagnosis of lung cancer is critical for improving the survival rate of patients. Performing annual low-dose computed tomography (LDCT) screening among high-risk populations is the primary approach for early diagnosis. However, after each screening, whether to continue monitoring (with follow-up screenings) or to order a biopsy for diagnosis remains a challenging decision to make. Continuing with follow-up screenings may lead to delayed diagnosis but ordering a biopsy without sufficient evidence incurs unnecessary risk and cost. In this paper, we tackle the problem by an optimal stopping approach. 
Our proposed algorithm, called EarlyStop-RL, utilizes the structure of the Snell envelope for optimal stopping, and model-free deep reinforcement learning for making diagnosis decisions. Through evaluating our algorithm on a commonly used clinical trial dataset (the National Lung Screening Trial), we demonstrate that EarlyStop-RL has the potential to greatly enhance risk assessment and early diagnosis of lung cancer, surpassing the performance of two widely adopted clinical models, namely the Lung-RADS and the Brock model. \ No newline at end of file diff --git a/data/2024/aaai/Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation b/data/2024/aaai/Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation new file mode 100644 index 0000000000..7e0ddf1999 --- /dev/null +++ b/data/2024/aaai/Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation @@ -0,0 +1 @@ +Most Graph Convolutional Network-based 3D human pose estimation (HPE) methods address single-view 3D HPE and rely on fixed spatial graphs, suffering from key problems such as depth ambiguity, insufficient feature representation, or limited receptive fields. To address these issues, we propose a multi-view 3D HPE framework based on a deep semantic graph transformer, which adaptively learns and fuses significant multi-view semantic features of human nodes to improve 3D HPE performance. First, we propose a deep semantic graph transformer encoder to enrich spatial feature information. It deeply mines the position, spatial structure, and skeletal edge knowledge of joints and dynamically learns their correlations. Then, we build a progressive multi-view spatial-temporal feature fusion framework to mitigate joint depth uncertainty. To enhance the pose spatial representation, deep spatial semantic features are exchanged and fused across different viewpoints during monocular feature extraction. Furthermore, long-range temporal dependencies are modeled, and spatial-temporal information from all viewpoints is fused to intermediately supervise the depth. Extensive experiments on three 3D HPE benchmarks show that our method achieves state-of-the-art results. It can effectively enhance pose features, mitigate depth ambiguity in single-view 3D HPE, and improve 3D HPE performance without providing camera parameters. Codes and models are available at https://github.com/z0911k/SGraFormer. \ No newline at end of file diff --git a/data/2024/aaai/Deep Structural Knowledge Exploitation and Synergy for Estimating Node Importance Value on Heterogeneous Information Networks b/data/2024/aaai/Deep Structural Knowledge Exploitation and Synergy for Estimating Node Importance Value on Heterogeneous Information Networks new file mode 100644 index 0000000000..d9369c653e --- /dev/null +++ b/data/2024/aaai/Deep Structural Knowledge Exploitation and Synergy for Estimating Node Importance Value on Heterogeneous Information Networks @@ -0,0 +1 @@ +The classic problem of node importance estimation has been conventionally studied with homogeneous network topology analysis. To deal with practical network heterogeneity, a few recent methods employ graph neural models to automatically learn diverse sources of information. However, the major concern is that their fully adaptive learning process may lead to insufficient information exploration, effectively reducing the problem to isolated node value prediction with underperformance and limited interpretability.
In this work, we propose a novel learning framework namely SKES. Different from previous automatic learning designs, SKES exploits heterogeneous structural knowledge to enrich the informativeness of node representations. Then based on a sufficiently uninformative reference, SKES estimates the importance value for any input node, by quantifying its informativeness disparity against the reference. This establishes an interpretable node importance computation paradigm. Furthermore, SKES dives deep into the understanding that "nodes with similar characteristics are prone to have similar importance values" whilst guaranteeing that such informativeness disparity between any different nodes is orderly reflected by the embedding distance of their associated latent features. Extensive experiments on three widely-evaluated benchmarks demonstrate the performance superiority of SKES over several recent competing methods. \ No newline at end of file diff --git a/data/2024/aaai/Deep Unfolded Network with Intrinsic Supervision for Pan-Sharpening b/data/2024/aaai/Deep Unfolded Network with Intrinsic Supervision for Pan-Sharpening new file mode 100644 index 0000000000..35421a1658 --- /dev/null +++ b/data/2024/aaai/Deep Unfolded Network with Intrinsic Supervision for Pan-Sharpening @@ -0,0 +1 @@ +Existing deep pan-sharpening methods lack the learning of complementary information between PAN and MS modalities in the intermediate layers, and exhibit low interpretability due to their black-box designs. To this end, an interpretable deep unfolded network with intrinsic supervision for pan-sharpening is proposed. Building upon the observation degradation process, it formulates the pan-sharpening task as a variational model minimization with spatial consistency prior and spectral projection prior. The former prior requires a joint component decomposition of PAN and MS images to extract intrinsic features. By being supervised in the intermediate layers, it can selectively provide high-frequency information for spatial enhancement. The latter prior constrains the intensity correlation between MS and PAN images derived from physical observations, so as to improve spectral fidelity. To further enhance the transparency of network design, we develop an iterative solution algorithm following the half-quadratic splitting to unfold the deep model. It rigorously adheres to the variational model, significantly enhancing the interpretability behind network design and efficiently alternating the optimization of the network. Extensive experiments demonstrate the advantages of our method compared to state-of-the-arts, showcasing its remarkable generalization capability to real-world scenes. Our code is publicly available at https://github.com/Baixuzx7/DISPNet. \ No newline at end of file diff --git a/data/2024/aaai/Deep Variational Incomplete Multi-View Clustering: Exploring Shared Clustering Structures b/data/2024/aaai/Deep Variational Incomplete Multi-View Clustering: Exploring Shared Clustering Structures new file mode 100644 index 0000000000..f684640528 --- /dev/null +++ b/data/2024/aaai/Deep Variational Incomplete Multi-View Clustering: Exploring Shared Clustering Structures @@ -0,0 +1 @@ +Incomplete multi-view clustering (IMVC) aims to reveal shared clustering structures within multi-view data, where only partial views of the samples are available. 
Existing IMVC methods primarily suffer from two issues: 1) Imputation-based methods inevitably introduce inaccurate imputations, which in turn degrade clustering performance; 2) Imputation-free methods are susceptible to unbalanced information among views and fail to fully exploit shared information. To address these issues, we propose a novel method based on variational autoencoders. Specifically, we adopt multiple view-specific encoders to extract information from each view and utilize the Product-of-Experts approach to efficiently aggregate information to obtain the common representation. To enhance the shared information in the common representation, we introduce a coherence objective to mitigate the influence of information imbalance. By incorporating the Mixture-of-Gaussians prior information into the latent representation, our proposed method is able to learn the common representation with clustering-friendly structures. Extensive experiments on four datasets show that our method achieves competitive clustering performance compared with state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving b/data/2024/aaai/DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving new file mode 100644 index 0000000000..c214e8ebd8 --- /dev/null +++ b/data/2024/aaai/DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving @@ -0,0 +1 @@ +Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports the direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident dataset includes 57K annotated frames and 285K annotated samples, approximately 7 times more than the large-scale nuScenes dataset with 40k annotated samples. In addition, we propose a new task, end-to-end motion and accident prediction, which can be used to directly evaluate the accident prediction ability for different autonomous driving algorithms. Furthermore, for each scenario, we set four vehicles along with one infrastructure to record data, thus providing diverse viewpoints for accident scenarios and enabling V2X (vehicle-to-everything) research on perception and prediction tasks. Finally, we present a baseline V2X model named V2XFormer that demonstrates superior performance for motion and accident prediction and 3D object detection compared to the single-vehicle model. \ No newline at end of file diff --git a/data/2024/aaai/DeepBern-Nets: Taming the Complexity of Certifying Neural Networks Using Bernstein Polynomial Activations and Precise Bound Propagation b/data/2024/aaai/DeepBern-Nets: Taming the Complexity of Certifying Neural Networks Using Bernstein Polynomial Activations and Precise Bound Propagation new file mode 100644 index 0000000000..98b4f11092 --- /dev/null +++ b/data/2024/aaai/DeepBern-Nets: Taming the Complexity of Certifying Neural Networks Using Bernstein Polynomial Activations and Precise Bound Propagation @@ -0,0 +1 @@ +Formal certification of Neural Networks (NNs) is crucial for ensuring their safety, fairness, and robustness. Unfortunately, on the one hand, sound and complete certification algorithms of ReLU-based NNs do not scale to large-scale NNs. 
On the other hand, incomplete certification algorithms are easier to compute, but they result in loose bounds that deteriorate with the depth of the NN, which diminishes their effectiveness. In this paper, we ask the following question: can we replace the ReLU activation function with one that opens the door to incomplete certification algorithms that are easy to compute but can produce tight bounds on the NN's outputs? We introduce DeepBern-Nets, a class of NNs with activation functions based on Bernstein polynomials instead of the commonly used ReLU activation. Bernstein polynomials are smooth and differentiable functions with desirable properties such as the so-called range enclosure and subdivision properties. We design a novel Interval Bound Propagation (IBP) algorithm, called Bern-IBP, to efficiently compute tight bounds on DeepBern-Nets' outputs. Our approach leverages the properties of Bernstein polynomials to improve the tractability of neural network certification tasks while maintaining the accuracy of the trained networks. We conduct experiments in adversarial robustness and reachability analysis settings to assess the effectiveness of the approach. Our proposed framework achieves high certified accuracy for adversarially-trained NNs, which is often a challenging task for certifiers of ReLU-based NNs. This work establishes Bernstein polynomial activations as a promising alternative for improving NN certification tasks across various NN applications. \ No newline at end of file diff --git a/data/2024/aaai/DeepBranchTracer: A Generally-Applicable Approach to Curvilinear Structure Reconstruction Using Multi-Feature Learning b/data/2024/aaai/DeepBranchTracer: A Generally-Applicable Approach to Curvilinear Structure Reconstruction Using Multi-Feature Learning new file mode 100644 index 0000000000..254e9ab038 --- /dev/null +++ b/data/2024/aaai/DeepBranchTracer: A Generally-Applicable Approach to Curvilinear Structure Reconstruction Using Multi-Feature Learning @@ -0,0 +1 @@ +Curvilinear structures, which include line-like continuous objects, are fundamental geometrical elements in image-based applications. Reconstructing these structures from images constitutes a pivotal research area in computer vision. However, the complex topology and ambiguous image evidence render this process a challenging task. In this paper, we introduce DeepBranchTracer, a novel method that learns both external image features and internal geometric characteristics to reconstruct curvilinear structures. Firstly, we formulate curvilinear structure extraction as a geometric attribute estimation problem. Then, a curvilinear structure feature learning network is designed to extract essential branch attributes, including the image features of centerline and boundary, and the geometric features of direction and radius. Finally, utilizing a multi-feature fusion tracing strategy, our model iteratively traces the entire branch by integrating the extracted image and geometric features. We extensively evaluated our model on both 2D and 3D datasets, demonstrating its superior performance over existing segmentation and reconstruction methods in terms of accuracy and continuity.
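The DeepBern-Nets abstract above relies on the range enclosure property of Bernstein polynomials: on [0, 1], a polynomial written in the Bernstein basis is bounded by the minimum and maximum of its coefficients. The sketch below checks this property numerically; it illustrates the property only, not the Bern-IBP algorithm, and the coefficients are arbitrary example values.

```python
import numpy as np
from math import comb

def bernstein_eval(coeffs, x):
    """Evaluate a degree-n polynomial given by its Bernstein coefficients on [0, 1]."""
    n = len(coeffs) - 1
    basis = np.array([comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1)])
    return float(np.dot(coeffs, basis))

coeffs = np.array([0.2, -1.0, 0.7, 1.5])   # arbitrary Bernstein coefficients
lower, upper = coeffs.min(), coeffs.max()  # range enclosure: bounds come for free

xs = np.linspace(0.0, 1.0, 1001)
values = np.array([bernstein_eval(coeffs, x) for x in xs])
assert lower - 1e-9 <= values.min() and values.max() <= upper + 1e-9
print(f"enclosure [{lower}, {upper}] contains observed range "
      f"[{values.min():.3f}, {values.max():.3f}]")
```

This cheap coefficient-based bound is what makes tight interval bound propagation through Bernstein activations plausible, in contrast to ReLU networks where incomplete bounds loosen with depth.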
\ No newline at end of file diff --git a/data/2024/aaai/DeepCalliFont: Few-Shot Chinese Calligraphy Font Synthesis by Integrating Dual-Modality Generative Models b/data/2024/aaai/DeepCalliFont: Few-Shot Chinese Calligraphy Font Synthesis by Integrating Dual-Modality Generative Models new file mode 100644 index 0000000000..5adf07e72d --- /dev/null +++ b/data/2024/aaai/DeepCalliFont: Few-Shot Chinese Calligraphy Font Synthesis by Integrating Dual-Modality Generative Models @@ -0,0 +1 @@ +Few-shot font generation, especially for Chinese calligraphy fonts, is a challenging and ongoing problem. With the help of prior knowledge that is mainly based on glyph consistency assumptions, some recently proposed methods can synthesize high-quality Chinese glyph images. However, glyphs in calligraphy font styles often do not meet these assumptions. To address this problem, we propose a novel model, DeepCalliFont, for few-shot Chinese calligraphy font synthesis by integrating dual-modality generative models. Specifically, the proposed model consists of image synthesis and sequence generation branches, generating consistent results via a dual-modality representation learning strategy. The two modalities (i.e., glyph images and writing sequences) are properly integrated using a feature recombination module and a rasterization loss function. Furthermore, a new pre-training strategy is adopted to improve the performance by exploiting large amounts of uni-modality data. Both qualitative and quantitative experiments have been conducted to demonstrate the superiority of our method to other state-of-the-art approaches in the task of few-shot Chinese calligraphy font synthesis. The source code can be found at https://github.com/lsflyt-pku/DeepCalliFont. \ No newline at end of file diff --git a/data/2024/aaai/DeepSaDe: Learning Neural Networks That Guarantee Domain Constraint Satisfaction b/data/2024/aaai/DeepSaDe: Learning Neural Networks That Guarantee Domain Constraint Satisfaction new file mode 100644 index 0000000000..9e0ec41830 --- /dev/null +++ b/data/2024/aaai/DeepSaDe: Learning Neural Networks That Guarantee Domain Constraint Satisfaction @@ -0,0 +1 @@ +As machine learning models, specifically neural networks, are becoming increasingly popular, there are concerns regarding their trustworthiness, specially in safety-critical applications, e.g. actions of an autonomous vehicle must be safe. There are approaches that can train neural networks where such domain requirements are enforced as constraints, but they either cannot guarantee that the constraint will be satisfied by all possible predictions (even on unseen data) or they are limited in the type of constraints that can be enforced. In this paper, we present an approach to train neural networks which can enforce a wide variety of constraints and guarantee that the constraint is satisfied by all possible predictions. The approach builds on earlier work where learning linear models is formulated as a constraint satisfaction problem (CSP). To make this idea applicable to neural networks, two crucial new elements are added: constraint propagation over the network layers, and weight updates based on a mix of gradient descent and CSP solving. Evaluation on various machine learning tasks demonstrates that our approach is flexible enough to enforce a wide variety of domain constraints and is able to guarantee them in neural networks. 
\ No newline at end of file diff --git a/data/2024/aaai/DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing b/data/2024/aaai/DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing new file mode 100644 index 0000000000..074e17b0f5 --- /dev/null +++ b/data/2024/aaai/DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing @@ -0,0 +1 @@ +Recent advances in deep learning models come at the price of formidable training costs. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost ($3.7K if rented on Azure), while still maintaining 95% of model quality compared to the baseline with full data and cost ($46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and tune, enabling us to easily apply it and verify its benefit on additional tasks including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning. \ No newline at end of file diff --git a/data/2024/aaai/Defeasible Normative Reasoning: A Proof-Theoretic Integration of Logical Argumentation b/data/2024/aaai/Defeasible Normative Reasoning: A Proof-Theoretic Integration of Logical Argumentation new file mode 100644 index 0000000000..2e1da4df1a --- /dev/null +++ b/data/2024/aaai/Defeasible Normative Reasoning: A Proof-Theoretic Integration of Logical Argumentation @@ -0,0 +1 @@ +We present a novel computational approach to resolving conflicts among norms by nonmonotonic normative reasoning (in constrained I/O logics). Our approach extends standard sequent-based proof systems and makes them more adequate for nonmonotonic reasoning by adding to the sequents annotations that keep track of what is known about the defeasible status of the derived sequents. This makes transparent the reasons according to which norms should be applicable or inapplicable, and accordingly the sequents that make use of such norms are accepted or retracted. We also show that this proof-theoretic method has tight links to the semantics of formal argumentation frameworks.
The outcome of this paper is thus a threefold characterization result that relates, in the context of nonmonotonic normative reasoning, three traditional ingredients of AI-based reasoning methods: maximally consistent sets of premises (in constrained I/O logics), derived sequents (which are accepted in corresponding annotated sequent calculi), and logical arguments (that belong to the grounded extensions of the induced logical argumentation frameworks). \ No newline at end of file diff --git a/data/2024/aaai/Defog Artificial Intelligence Glasses: Neural Networks for the Imperfect Real World b/data/2024/aaai/Defog Artificial Intelligence Glasses: Neural Networks for the Imperfect Real World new file mode 100644 index 0000000000..a55a8bbe1b --- /dev/null +++ b/data/2024/aaai/Defog Artificial Intelligence Glasses: Neural Networks for the Imperfect Real World @@ -0,0 +1 @@ +This research investigates the generalization capabilities of neural networks in deep learning when applied to real-world scenarios where data often contains imperfections, focusing on their adaptability to both noisy and non-noisy scenarios for image retrieval tasks. Our study explores approaches to preserve all available data, regardless of quality, for diverse tasks. The evaluation of results varies per task, due to the ultimate goal of developing a technique to extract relevant information while disregarding noise in the final network design for each specific task. The aim is to enhance accessibility and efficiency of AI across diverse tasks, particularly for individuals or countries with limited resources, lacking access to high-quality data. The dedication is directed towards fostering inclusivity and unlocking the potential of AI for wide-spread societal benefit. \ No newline at end of file diff --git a/data/2024/aaai/Defying Imbalanced Forgetting in Class Incremental Learning b/data/2024/aaai/Defying Imbalanced Forgetting in Class Incremental Learning new file mode 100644 index 0000000000..8e6c8aa55b --- /dev/null +++ b/data/2024/aaai/Defying Imbalanced Forgetting in Class Incremental Learning @@ -0,0 +1 @@ +We observe a high level of imbalance in the accuracy of different learned classes in the same old task for the first time. This intriguing phenomenon, discovered in replay-based Class Incremental Learning (CIL), highlights the imbalanced forgetting of learned classes, as their accuracy is similar before the occurrence of catastrophic forgetting. This discovery remains previously unidentified due to the reliance on average incremental accuracy as the measurement for CIL, which assumes that the accuracy of classes within the same task is similar. However, this assumption is invalid in the face of catastrophic forgetting. Further empirical studies indicate that this imbalanced forgetting is caused by conflicts in representation between semantically similar old and new classes. These conflicts are rooted in the data imbalance present in replay-based CIL methods. Building on these insights, we propose CLass-Aware Disentanglement (CLAD) as a means to predict the old classes that are more likely to be forgotten and enhance their accuracy. Importantly, CLAD can be seamlessly integrated into existing CIL methods. Extensive experiments demonstrate that CLAD consistently improves current replay-based methods, resulting in performance gains of up to 2.56%. 
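As a small illustration of the measurement behind the Defying Imbalanced Forgetting abstract above, the sketch below computes per-class accuracy inside a single old task; a large spread between classes is the imbalance that average incremental accuracy hides. The labels are synthetic and the use of Python/NumPy is an assumption made for the example.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, classes):
    """Accuracy of each class separately within one task."""
    return {c: float((y_pred[y_true == c] == c).mean()) for c in classes}

# Synthetic predictions for one old task (classes 0 and 1) after learning new classes.
rng = np.random.default_rng(0)
y_true = np.array([0] * 100 + [1] * 100)
y_pred = np.concatenate([
    np.where(rng.uniform(size=100) < 0.9, 0, 2),  # class 0: mostly remembered
    np.where(rng.uniform(size=100) < 0.4, 1, 2),  # class 1: mostly forgotten
])

acc = per_class_accuracy(y_true, y_pred, classes=[0, 1])
print(acc, "spread:", max(acc.values()) - min(acc.values()))
```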
\ No newline at end of file diff --git a/data/2024/aaai/Delegation-Relegation for Boolean Matrix Factorization b/data/2024/aaai/Delegation-Relegation for Boolean Matrix Factorization new file mode 100644 index 0000000000..c0af286e7d --- /dev/null +++ b/data/2024/aaai/Delegation-Relegation for Boolean Matrix Factorization @@ -0,0 +1,2 @@ +The Boolean Matrix Factorization (BMF) problem aims to represent an n×m Boolean matrix as the Boolean product of two matrices of small rank k, where the product is computed using Boolean algebra operations. However, finding a BMF of minimum rank is known to be NP-hard, posing challenges for heuristic algorithms and exact approaches in terms of the rank found and computation time, particularly as matrix size or the number of entries equal to 1 grows. +In this paper, we present a new approach to simplifying the matrix to be factorized by reducing the number of 1-entries, which allows us to directly recover a Boolean factorization of the original matrix from its simplified version. We introduce two types of simplification: one that performs numerous simplifications without preserving the original rank and another that performs fewer simplifications but guarantees that an optimal BMF on the simplified matrix yields an optimal BMF on the original matrix. Furthermore, our experiments show that our approach outperforms existing exact BMF algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Deletion-Robust Submodular Maximization with Knapsack Constraints b/data/2024/aaai/Deletion-Robust Submodular Maximization with Knapsack Constraints new file mode 100644 index 0000000000..958d6e5e57 --- /dev/null +++ b/data/2024/aaai/Deletion-Robust Submodular Maximization with Knapsack Constraints @@ -0,0 +1 @@ +Submodular maximization algorithms have found wide applications in various fields such as data summarization, recommendation systems, and active learning. In recent years, deletion-robust submodular maximization algorithms have garnered attention due to their significant implications in scenarios where some data points may be removed due to user preferences or privacy concerns, such as in recommendation systems and influence maximization. In this paper, we study the fundamental problem of submodular maximization with knapsack constraints and propose a robust streaming algorithm for it. To the best of our knowledge, our algorithm is the first to solve this problem for non-monotone submodular functions and can achieve an approximation ratio of 1/(6.82+2.63d)-ϵ under a near-optimal summary size of O(k+r), where k denotes the maximum cardinality of any feasible solution, d denotes the number of knapsack constraints, and r is the robustness parameter. For monotone submodular functions, our algorithm can achieve an approximation ratio of 1/(2+2d)-ϵ under a near-optimal summary size of O(k+r), significantly improving upon the best-known ratio of Ω((1/d-ϵ)^2). The empirical performance of our algorithm is extensively evaluated in several applications including influence maximization and recommendation systems, and the experimental results demonstrate the effectiveness of our algorithm.
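To fix notation for the Delegation-Relegation abstract above, the Boolean product used in BMF is C[i, j] = OR over l of (A[i, l] AND B[l, j]), so a rank-k factorization represents an n×m Boolean matrix by an n×k and a k×m factor. The sketch below only illustrates this definition on a made-up example; it is not the paper's simplification algorithm.

```python
import numpy as np

def boolean_product(A, B):
    """Boolean matrix product over {0, 1}: C[i, j] = OR_l (A[i, l] AND B[l, j])."""
    return ((A.astype(int) @ B.astype(int)) > 0).astype(int)

# A toy rank-2 factorization of a 4x5 Boolean matrix (entries are made up).
A = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 0]])
B = np.array([[1, 1, 0, 0, 1],
              [0, 1, 1, 0, 0]])

X = boolean_product(A, B)   # the matrix represented exactly by the rank-2 factors
print(X)

# Exact BMF asks for the smallest k such that some n x k and k x m Boolean factors
# reproduce X under this product; deciding that minimum rank is NP-hard.
```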
\ No newline at end of file diff --git a/data/2024/aaai/Delivering Inflated Explanations b/data/2024/aaai/Delivering Inflated Explanations new file mode 100644 index 0000000000..f0b608d8b9 --- /dev/null +++ b/data/2024/aaai/Delivering Inflated Explanations @@ -0,0 +1,2 @@ +In the quest for Explainable Artificial Intelligence (XAI), one of the questions that frequently arises given a decision made by an AI system is, ``why was the decision made in this way?'' Formal approaches to explainability build a formal model of the AI system and use this to reason about the properties of the system. Given a set of feature values for an instance to be explained, and a resulting decision, a formal abductive explanation is a set of features such that, if they take their given values, the decision will always be the same. This explanation is useful: it shows that only some features were used in making the final decision. But it is narrow: it only shows that if the selected features take their given values, the decision is unchanged. It is possible that some features may change values and still lead to the same decision. In this paper we formally define inflated explanations, where an explanation is a set of features and, for each feature, a set of values (always including the value of the instance being explained), such that the decision will remain unchanged for any of the values allowed for any of the features in the (inflated) abductive explanation. +Inflated formal explanations are more informative than common abductive explanations since, e.g., they allow us to see whether the exact value of a feature is important or whether it could be any nearby value. Overall, they allow us to better understand the role of each feature in the decision. We show that we can compute inflated explanations at not much greater cost than abductive explanations, and that we can extend duality results for abductive explanations also to inflated explanations. \ No newline at end of file diff --git a/data/2024/aaai/Delving into Multimodal Prompting for Fine-Grained Visual Classification b/data/2024/aaai/Delving into Multimodal Prompting for Fine-Grained Visual Classification new file mode 100644 index 0000000000..016512d49b --- /dev/null +++ b/data/2024/aaai/Delving into Multimodal Prompting for Fine-Grained Visual Classification @@ -0,0 +1 @@ +Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompt scheme and a multimodal adaptation scheme. The former includes a Subcategory-specific Vision Prompt (SsVP) and a Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language.
The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC. \ No newline at end of file diff --git a/data/2024/aaai/Demystifying Algorithmic Fairness in an Uncertain World b/data/2024/aaai/Demystifying Algorithmic Fairness in an Uncertain World new file mode 100644 index 0000000000..dc8e641ea0 --- /dev/null +++ b/data/2024/aaai/Demystifying Algorithmic Fairness in an Uncertain World @@ -0,0 +1 @@ +Significant progress in the field of fair machine learning (ML) has been made to counteract algorithmic discrimination against marginalized groups. However, fairness remains an active research area that is far from settled. One key bottleneck is the implicit assumption that environments, where ML is developed and deployed, are certain and reliable. In a world that is characterized by volatility, uncertainty, complexity, and ambiguity, whether what has been developed in algorithmic fairness can still serve its purpose is far from obvious. In this talk, I will first discuss how to improve algorithmic fairness under two kinds of predictive uncertainties, i.e., aleatoric uncertainty (i.e., randomness and ambiguity in the data) and epistemic uncertainty (i.e., a lack of data or knowledge), respectively. The former regards historical bias reflected in the data and the latter corresponds to the bias perpetuated or amplified during model training due to lack of data or knowledge. In particular, the first work studies pushing the fairness-utility trade-off through aleatoric uncertainty, and the second work investigates fair few-shot learning. The last work introduces coverage-based fairness that ensures different groups enjoy identical treatment and receive equal coverage. \ No newline at end of file diff --git a/data/2024/aaai/DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning b/data/2024/aaai/DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning new file mode 100644 index 0000000000..0d21b8e593 --- /dev/null +++ b/data/2024/aaai/DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning @@ -0,0 +1 @@ +Contrastive-learning-based methods have dominated sentence representation learning. These methods regularize the representation space by pulling similar sentence representations closer and pushing away the dissimilar ones and have been proven effective in various NLP tasks, e.g., semantic textual similarity (STS) tasks. However, it is challenging for these methods to learn fine-grained semantics as they only learn from the inter-sentence perspective, i.e., their supervision signal comes from the relationship between data samples. In this work, we propose a novel denoising objective that inherits from another perspective, i.e., the intra-sentence perspective. By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form. 
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks, standing up well in comparison to contrastive-learning-based methods. Notably, the proposed intra-sentence denoising objective complements existing inter-sentence contrastive methodologies and can be integrated with them to further enhance performance. Our code is available at https://github.com/xinghaow99/DenoSent. \ No newline at end of file diff --git a/data/2024/aaai/Dense Projection for Anomaly Detection b/data/2024/aaai/Dense Projection for Anomaly Detection new file mode 100644 index 0000000000..8e3e003cc6 --- /dev/null +++ b/data/2024/aaai/Dense Projection for Anomaly Detection @@ -0,0 +1 @@ +This work presents a novel method called dense projection for unsupervised anomaly detection (DPAD). The main idea is maximizing the local density of (normal) training data and then determining whether a test sample is anomalous or not by evaluating its density. Specifically, DPAD uses a deep neural network to learn locally dense representations of normal data. Since density estimation is computationally expensive, we minimize the local distances of the representations in an iterative reweighting manner, where the weights are updated adaptively and the parameters are regularized to avoid model collapse (all representations collapsing to a single point). Compared with many state-of-the-art methods of anomaly detection, our DPAD does not rely on any assumption about the distribution or spatial structure of the normal data and representations. Moreover, we provide theoretical guarantees for the effectiveness of DPAD. The experiments show that our method DPAD is effective not only in traditional one-class classification problems but also in scenarios with complex normal data composed of multiple classes. \ No newline at end of file diff --git a/data/2024/aaai/Density Matters: Improved Core-Set for Active Domain Adaptive Segmentation b/data/2024/aaai/Density Matters: Improved Core-Set for Active Domain Adaptive Segmentation new file mode 100644 index 0000000000..ee2758eda9 --- /dev/null +++ b/data/2024/aaai/Density Matters: Improved Core-Set for Active Domain Adaptive Segmentation @@ -0,0 +1 @@ +Active domain adaptation has emerged as a solution to balance the expensive annotation cost and the performance of trained models in semantic segmentation. However, existing works usually ignore the correlation between selected samples and their local context in feature space, which leads to inferior usage of annotation budgets. In this work, we revisit the theoretical bound of the classical Core-set method and identify that the performance is closely related to the local sample distribution around selected samples. To estimate the density of local samples efficiently, we introduce a local proxy estimator with Dynamic Masked Convolution and develop a Density-aware Greedy algorithm to optimize the bound. Extensive experiments demonstrate the superiority of our approach. Moreover, with very few labels, our scheme achieves comparable performance to the fully supervised counterpart.
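For context on the Density Matters abstract above, the classical Core-set selection it revisits is usually implemented as k-center greedy: repeatedly pick the sample farthest from everything already selected. The sketch below shows that baseline strategy only (no density awareness, no Dynamic Masked Convolution); the random features are stand-ins for real embeddings.

```python
import numpy as np

def kcenter_greedy(features, budget, seed=0):
    """Classical Core-set (k-center) greedy selection; returns indices of chosen samples."""
    rng = np.random.default_rng(seed)
    n = len(features)
    selected = [int(rng.integers(n))]
    # Distance of every sample to its nearest selected sample so far.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        idx = int(min_dist.argmax())          # farthest-first choice
        selected.append(idx)
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[idx], axis=1))
    return selected

feats = np.random.default_rng(1).normal(size=(1000, 16))
print(kcenter_greedy(feats, budget=10))
```

The paper's observation is that this farthest-first rule ignores how densely populated the neighborhood of each selected sample is, which is what its density-aware variant corrects.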
\ No newline at end of file diff --git a/data/2024/aaai/Dependency Structure-Enhanced Graph Attention Networks for Event Detection b/data/2024/aaai/Dependency Structure-Enhanced Graph Attention Networks for Event Detection new file mode 100644 index 0000000000..c8b3c620c7 --- /dev/null +++ b/data/2024/aaai/Dependency Structure-Enhanced Graph Attention Networks for Event Detection @@ -0,0 +1,2 @@ +Existing models on event detection share three-fold limitations, including (1) insufficient consideration of the structures between dependency relations, (2) limited exploration of the directed-edge semantics, and (3) issues in strengthening the event core arguments. To tackle these problems, we propose a dependency structure-enhanced event detection framework. In addition to the traditional token dependency parsing tree, denoted as TDG, our model considers the dependency edges in it as new nodes and constructs a dependency relation graph (DRG). DRG allows the embedding representations of dependency relations to be updated as nodes rather than edges in a graph neural network. +Moreover, the levels of core argument nodes in the two graphs are adjusted by dependency relation types in TDG to enhance their status. Subsequently, the two graphs are further encoded and jointly trained in graph attention networks (GAT). Importantly, we design an interaction strategy of node embedding for the two graphs and refine the attention coefficient computational method to encode the semantic meaning of directed edges. Extensive experiments are conducted to validate the effectiveness of our method, and the results confirm its superiority over the state-of-the-art baselines. Our model outperforms the best benchmark with the F1 score increased by 3.5 and 3.4 percentage points on ACE2005 English and Chinese corpus. \ No newline at end of file diff --git a/data/2024/aaai/Deploying ADVISER: Impact and Lessons from Using Artificial Intelligence for Child Vaccination Uptake in Nigeria b/data/2024/aaai/Deploying ADVISER: Impact and Lessons from Using Artificial Intelligence for Child Vaccination Uptake in Nigeria new file mode 100644 index 0000000000..d3aa619a96 --- /dev/null +++ b/data/2024/aaai/Deploying ADVISER: Impact and Lessons from Using Artificial Intelligence for Child Vaccination Uptake in Nigeria @@ -0,0 +1 @@ +More than 5 million children under five years die from largely preventable or treatable medical conditions every year, with an overwhelmingly large proportion of deaths occurring in underdeveloped countries with low vaccination uptake. One of the United Nations' sustainable development goals (SDG 3) aims to end preventable deaths of newborns and children under five years of age. We focus on Nigeria, where the rate of infant mortality is appalling. In particular, low vaccination uptake in Nigeria is a major driver of more than 2,000 daily deaths of children under the age of five years. In this paper, we describe our collaboration with government partners in Nigeria to deploy ADVISER: AI-Driven Vaccination Intervention Optimiser. The framework, based on an integer linear program that seeks to maximize the cumulative probability of successful vaccination, is the first successful deployment of an AI-enabled toolchain for optimizing the allocation of health interventions in Nigeria. In this paper, we provide a background of the ADVISER framework and present results, lessons, and success stories of deploying ADVISER to more than 13,000 families in the state of Oyo, Nigeria. 
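The ADVISER abstract above describes the allocation as an integer linear program maximizing the cumulative probability of successful vaccination. A toy version of that kind of formulation is sketched below with the PuLP modeling library; the probabilities, costs, budget, and the at-most-one-intervention-per-family constraint are illustrative assumptions, not the deployed model.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

# Hypothetical success probabilities p[i][j] of intervention j for family i.
p = [[0.30, 0.55, 0.70],
     [0.20, 0.45, 0.65],
     [0.50, 0.60, 0.75]]
cost = [1.0, 2.5, 4.0]   # per-intervention cost (made-up units)
budget = 6.0
families, interventions = range(len(p)), range(len(cost))

prob = LpProblem("vaccination_intervention_allocation", LpMaximize)
x = {(i, j): LpVariable(f"x_{i}_{j}", cat="Binary") for i in families for j in interventions}

# Objective: maximize the cumulative probability of successful vaccination.
prob += lpSum(p[i][j] * x[i, j] for i in families for j in interventions)
# Each family receives at most one intervention.
for i in families:
    prob += lpSum(x[i, j] for j in interventions) <= 1
# Total cost must stay within the budget.
prob += lpSum(cost[j] * x[i, j] for i in families for j in interventions) <= budget

prob.solve()
print([(i, j) for (i, j), var in x.items() if var.value() == 1])
```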
\ No newline at end of file diff --git a/data/2024/aaai/Depression Detection via Capsule Networks with Contrastive Learning b/data/2024/aaai/Depression Detection via Capsule Networks with Contrastive Learning new file mode 100644 index 0000000000..48c23a66d2 --- /dev/null +++ b/data/2024/aaai/Depression Detection via Capsule Networks with Contrastive Learning @@ -0,0 +1 @@ +Depression detection is a challenging and crucial task in psychological illness diagnosis. Utilizing online user posts to predict whether a user suffers from depression seems an effective and promising direction. However, existing methods suffer from either poor interpretability brought by the black-box models or underwhelming performance caused by the completely separate two-stage model structure. To alleviate these limitations, we propose a novel capsule network integrated with contrastive learning for depression detection (DeCapsNet). The highlights of DeCapsNet can be summarized as follows. First, it extracts symptom capsules from user posts by leveraging meticulously designed symptom descriptions, and then distills them into class-indicative depression capsules. The overall workflow is in an explicit hierarchical reasoning manner and can be well interpreted by the Patient Health Questionnaire-9 (PHQ9), which is one of the most widely adopted questionnaires for depression diagnosis. Second, it integrates with contrastive learning, which can facilitate the embeddings from the same class to be pulled closer, while simultaneously pushing the embeddings from different classes apart. In addition, by adopting the end-to-end training strategy, it does not necessitate additional data annotation, and mitigates the potential adverse effects from the upstream task to the downstream task. Extensive experiments on three widely-used datasets show that in both within-dataset and cross-dataset scenarios our proposed method outperforms other strong baselines significantly. \ No newline at end of file diff --git a/data/2024/aaai/Depth-Guided Robust and Fast Point Cloud Fusion NeRF for Sparse Input Views b/data/2024/aaai/Depth-Guided Robust and Fast Point Cloud Fusion NeRF for Sparse Input Views new file mode 100644 index 0000000000..950d895a76 --- /dev/null +++ b/data/2024/aaai/Depth-Guided Robust and Fast Point Cloud Fusion NeRF for Sparse Input Views @@ -0,0 +1 @@ +Novel-view synthesis with sparse input views is important for real-world applications like AR/VR and autonomous driving. Recent methods have integrated depth information into NeRFs for sparse input synthesis, leveraging depth prior for geometric and spatial understanding. However, most existing works tend to overlook inaccuracies within depth maps and have low time efficiency. To address these issues, we propose a depth-guided robust and fast point cloud fusion NeRF for sparse inputs. We perceive radiance fields as an explicit voxel grid of features. A point cloud is constructed for each input view, characterized within the voxel grid using matrices and vectors. We accumulate the point cloud of each input view to construct the fused point cloud of the entire scene. Each voxel determines its density and appearance by referring to the point cloud of the entire scene. Through point cloud fusion and voxel grid fine-tuning, inaccuracies in depth values are refined or substituted by those from other views. Moreover, our method can achieve faster reconstruction and greater compactness through effective vector-matrix decomposition. 
Experimental results underline the superior performance and time efficiency of our approach compared to state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Descanning: From Scanned to the Original Images with a Color Correction Diffusion Model b/data/2024/aaai/Descanning: From Scanned to the Original Images with a Color Correction Diffusion Model new file mode 100644 index 0000000000..06f536a75b --- /dev/null +++ b/data/2024/aaai/Descanning: From Scanned to the Original Images with a Color Correction Diffusion Model @@ -0,0 +1 @@ +A significant volume of analog information, i.e., documents and images, has been digitized in the form of scanned copies for storing, sharing, and/or analyzing in the digital world. However, the quality of such content is severely degraded by various distortions caused by printing, storing, and scanning processes in the physical world. Although restoring high-quality content from scanned copies has become an indispensable task for many products, it has not been systematically explored, and to the best of our knowledge, no public datasets are available. In this paper, we define this problem as Descanning and introduce a new high-quality and large-scale dataset named DESCAN-18K. It contains 18K pairs of original and scanned images collected in the wild containing multiple complex degradations. In order to eliminate such complex degradations, we propose a new image restoration model called DescanDiffusion consisting of a color encoder that corrects the global color degradation and a conditional denoising diffusion probabilistic model (DDPM) that removes local degradations. To further improve the generalization ability of DescanDiffusion, we also design a synthetic data generation scheme by reproducing prominent degradations in scanned images. We demonstrate that our DescanDiffusion outperforms other baselines including commercial restoration products, objectively and subjectively, via comprehensive experiments and analyses. \ No newline at end of file diff --git a/data/2024/aaai/Designing Biological Sequences without Prior Knowledge Using Evolutionary Reinforcement Learning b/data/2024/aaai/Designing Biological Sequences without Prior Knowledge Using Evolutionary Reinforcement Learning new file mode 100644 index 0000000000..c182c4f04d --- /dev/null +++ b/data/2024/aaai/Designing Biological Sequences without Prior Knowledge Using Evolutionary Reinforcement Learning @@ -0,0 +1 @@ +Designing novel biological sequences with desired properties is a significant challenge in biological science because of the extremely large search space. The traditional design process usually involves multiple rounds of costly wet lab evaluations. To reduce the need for expensive wet lab experiments, machine learning methods are used to aid in designing biological sequences. However, the limited availability of biological sequences with known properties hinders the training of machine learning models, significantly restricting their applicability and performance. To fill this gap, we present ERLBioSeq, an Evolutionary Reinforcement Learning algorithm for BIOlogical SEQuence design. ERLBioSeq leverages the capability of reinforcement learning to learn without prior knowledge and the potential of evolutionary algorithms to enhance the exploration of reinforcement learning in the large search space of biological sequences.
Additionally, to enhance the efficiency of biological sequence design, we developed a predictor for sequence screening in the biological sequence design process, which incorporates both the local and global sequence information. We evaluated the proposed method on three main types of biological sequence design tasks, including the design of DNA, RNA, and protein. The results demonstrate that the proposed method achieves significant improvement compared to the existing state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Detect Any Keypoints: An Efficient Light-Weight Few-Shot Keypoint Detector b/data/2024/aaai/Detect Any Keypoints: An Efficient Light-Weight Few-Shot Keypoint Detector new file mode 100644 index 0000000000..06dcc5bbe8 --- /dev/null +++ b/data/2024/aaai/Detect Any Keypoints: An Efficient Light-Weight Few-Shot Keypoint Detector @@ -0,0 +1 @@ +Recently, prompt-based models have become popular across various language and vision tasks. Following that trend, we perform few-shot keypoint detection (FSKD) by detecting any keypoints in a query image, given the prompts formed by support images and keypoints. FSKD can be applied to detecting keypoints and poses of diverse animal species. In order to maintain the flexibility of detecting a varying number of keypoints, existing FSKD approaches modulate the query feature map per support keypoint, then detect the corresponding keypoint from each modulated feature via a detection head. Such a separation of modulation-detection makes the model heavy and slow when the number of keypoints increases. To overcome this issue, we design a novel light-weight detector which combines modulation and detection into one step, with the goal of reducing the computational cost without a drop in performance. Moreover, to bridge the large domain shift of keypoints between seen and unseen species, we further improve our model with mean feature based contrastive learning to align keypoint distributions, resulting in better keypoint representations for FSKD. Compared to the state of the art, our light-weight detector reduces the number of parameters by 50%, training/test time by 50%, and achieves a 5.62% accuracy gain on 1-shot novel keypoint detection on the Animal pose dataset. Our model is also robust to the number of keypoints and saves memory when evaluating a large number of keypoints (e.g., 1000) per episode. \ No newline at end of file diff --git a/data/2024/aaai/Detecting AI-Generated Code Assignments Using Perplexity of Large Language Models b/data/2024/aaai/Detecting AI-Generated Code Assignments Using Perplexity of Large Language Models new file mode 100644 index 0000000000..00e7451916 --- /dev/null +++ b/data/2024/aaai/Detecting AI-Generated Code Assignments Using Perplexity of Large Language Models @@ -0,0 +1 @@ +Large language models like ChatGPT can generate human-like code, posing challenges for programming education as students may be tempted to misuse them on assignments. However, there are currently no robust detectors designed specifically to identify AI-generated code. This is an issue that needs to be addressed to maintain academic integrity while allowing proper utilization of language models. Previous work has explored different approaches to detect AI-generated text, including watermarks, feature analysis, and fine-tuning language models. In this paper, we address the challenge of determining whether a student's code assignment was generated by a language model. 
First, our proposed method identifies AI-generated code by leveraging targeted masking perturbation paired with comprehensive scoring. Rather than applying a random mask, areas of the code with higher perplexity are more intensely masked. Second, we utilize a fine-tuned CodeBERT to fill in the masked portions, producing subtly modified samples. Then, we integrate the overall perplexity, variation of code line perplexity, and burstiness into a unified score. In this scoring scheme, a higher rank for the original code suggests it's more likely to be AI-generated. This approach stems from the observation that AI-generated code typically has lower perplexity. Therefore, perturbations often exert minimal influence on it. Conversely, sections of human-composed code that the model struggles to understand can see their perplexity reduced by such perturbations. Our method outperforms current open-source and commercial text detectors. Specifically, it improves detection of code submissions generated by OpenAI's text-davinci-003, raising the average AUC from 0.56 (GPTZero baseline) to 0.87 for our detector. \ No newline at end of file diff --git a/data/2024/aaai/Detecting and Preventing Hallucinations in Large Vision Language Models b/data/2024/aaai/Detecting and Preventing Hallucinations in Large Vision Language Models new file mode 100644 index 0000000000..d288561efd --- /dev/null +++ b/data/2024/aaai/Detecting and Preventing Hallucinations in Large Vision Language Models @@ -0,0 +1 @@ +Instruction tuned Large Vision Language Models (LVLMs) have significantly advanced in generalizing across a diverse set of multi-modal tasks, especially for Visual Question Answering (VQA). However, generating detailed responses that are visually grounded is still a challenging task for these models. We find that even the current state-of-the-art LVLMs (InstructBLIP) still contain a staggering 30 percent of hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a Multimodal Hallucination Detection Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained annotations on VQA examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. Unlike previous work that only considers object hallucination, we additionally annotate both entity descriptions and relationships that are unfaithful. To demonstrate the potential of this dataset for hallucination prevention, we optimize InstructBLIP through our novel Fine-grained Direct Preference Optimization (FDPO). We also train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling (RS). We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55% respectively. We also find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively, and has a strong correlation with human-evaluated accuracy scores. The dataset is available at https://github.com/hendryx-scale/mhal-detect. 
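The perplexity-based scoring described in the code-detection abstract above can be pictured with a small sketch. Everything below is illustrative only: it assumes per-line perplexities have already been computed with some code language model, and the equal weighting of the three signals and the use of a percentile rank are assumptions, not the paper's exact formulation.

```python
import numpy as np

def unified_score(line_perplexities):
    """Combine overall perplexity, per-line variation, and burstiness into one score.
    The equal weighting here is an illustrative choice, not the paper's."""
    ppl = np.asarray(line_perplexities, dtype=float)
    overall = ppl.mean()                       # overall perplexity
    variation = ppl.std()                      # variation of code-line perplexity
    burstiness = variation / (overall + 1e-8)  # spread relative to the mean
    return overall + variation + burstiness

def rank_of_original(original_ppl, perturbed_ppl_list):
    """Percentile rank of the original submission's score among its perturbed
    variants; the abstract uses the original's position in such a ranking as
    the detection signal."""
    original = unified_score(original_ppl)
    scores = [unified_score(p) for p in perturbed_ppl_list]
    return sum(original >= s for s in scores) / len(scores)
```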
\ No newline at end of file diff --git a/data/2024/aaai/Detection and Defense of Unlearnable Examples b/data/2024/aaai/Detection and Defense of Unlearnable Examples new file mode 100644 index 0000000000..bb33645df3 --- /dev/null +++ b/data/2024/aaai/Detection and Defense of Unlearnable Examples @@ -0,0 +1 @@ +Privacy preservation has become increasingly critical with the emergence of social media. Unlearnable examples have been proposed to avoid leaking personal information on the Internet by degrading the generalization abilities of deep learning models. However, our study reveals that unlearnable examples are easily detectable. We provide theoretical results on the linear separability of certain unlearnable poisoned datasets and simple network-based detection methods that can identify all existing unlearnable examples, as demonstrated by extensive experiments. Detectability of unlearnable examples with simple networks motivates us to design a novel defense method. We propose using stronger data augmentations coupled with adversarial noise generated by simple networks to degrade the detectability and thus provide effective defense against unlearnable examples at a lower cost. Adversarial training with large budgets is a widely used defense method against unlearnable examples. We establish quantitative criteria between the poison and adversarial budgets, which determine the existence of robust unlearnable examples or the failure of the adversarial defense. \ No newline at end of file diff --git a/data/2024/aaai/Detection-Based Intermediate Supervision for Visual Question Answering b/data/2024/aaai/Detection-Based Intermediate Supervision for Visual Question Answering new file mode 100644 index 0000000000..c78934080e --- /dev/null +++ b/data/2024/aaai/Detection-Based Intermediate Supervision for Visual Question Answering @@ -0,0 +1 @@ +Recently, neural module networks (NMNs) have yielded ongoing success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose the complex question into several sub-tasks using instance-modules from the reasoning paths of that question and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered due to sketchy modeling of intermediate supervisions. For instance, (1) the prior assumption that each instance-module refers to only one grounded object overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervisions may introduce noise signals as the bounding box overlap issue might guide the model's focus towards irrelevant objects. To address these issues, a novel method, Detection-based Intermediate Supervision (DIS), is proposed, which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances the consistency in answering compositional questions and their sub-questions. Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches. 
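As a rough illustration of the linear-separability observation in the unlearnable-examples abstract above, a plain linear probe on raw pixels can already serve as a detection signal. The sketch below is a minimal, generic probe under that assumption; it is not the authors' procedure, and the random arrays are only placeholders for a possibly poisoned dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_separability_probe(images, labels, max_iter=100):
    """Fit a simple logistic-regression classifier on raw pixels.
    Near-perfect training accuracy from such a weak model suggests the labels
    are predictable from shortcut perturbations rather than semantic content."""
    x = images.reshape(len(images), -1).astype(np.float32)
    clf = LogisticRegression(max_iter=max_iter).fit(x, labels)
    return clf.score(x, labels)  # training accuracy as the detection signal

# Placeholder usage with random data standing in for a candidate dataset.
rng = np.random.default_rng(0)
acc = linear_separability_probe(rng.normal(size=(1000, 8, 8)), rng.integers(0, 10, 1000))
print(f"probe training accuracy: {acc:.2f}")
```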
\ No newline at end of file diff --git a/data/2024/aaai/Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer with Adaptive Channel Expansion b/data/2024/aaai/Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer with Adaptive Channel Expansion new file mode 100644 index 0000000000..a941c563a2 --- /dev/null +++ b/data/2024/aaai/Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer with Adaptive Channel Expansion @@ -0,0 +1 @@ +Vignetting commonly occurs as a degradation in images resulting from factors such as lens design, improper lens hood usage, and limitations in camera sensors. This degradation affects image details and color accuracy, and presents challenges in computational photography. Existing vignetting removal algorithms predominantly rely on ideal physics assumptions and hand-crafted parameters, resulting in the ineffective removal of irregular vignetting and suboptimal results. Moreover, the substantial lack of real-world vignetting datasets hinders the objective and comprehensive evaluation of vignetting removal. To address these challenges, we present VigSet, a pioneering dataset for vignetting removal. VigSet includes 983 pairs of both vignetting and vignetting-free high-resolution (over 4k) real-world images under various conditions. In addition, we introduce DeVigNet, a novel frequency-aware Transformer architecture designed for vignetting removal. Through the Laplacian Pyramid decomposition, we propose the Dual Aggregated Fusion Transformer to handle global features and remove vignetting in the low-frequency domain. Additionally, we propose the Adaptive Channel Expansion Module to enhance details in the high-frequency domain. The experiments demonstrate that the proposed model outperforms existing state-of-the-art methods. The code, models, and dataset are available at https://github.com/CXH-Research/DeVigNet. \ No newline at end of file diff --git a/data/2024/aaai/DexFuncGrasp: A Robotic Dexterous Functional Grasp Dataset Constructed from a Cost-Effective Real-Simulation Annotation System b/data/2024/aaai/DexFuncGrasp: A Robotic Dexterous Functional Grasp Dataset Constructed from a Cost-Effective Real-Simulation Annotation System new file mode 100644 index 0000000000..35fceea764 --- /dev/null +++ b/data/2024/aaai/DexFuncGrasp: A Robotic Dexterous Functional Grasp Dataset Constructed from a Cost-Effective Real-Simulation Annotation System @@ -0,0 +1 @@ +A robot grasp dataset is the basis for designing a robot's grasp generation model. Compared with building grasp datasets for low-DOF grippers, it is harder to do so for high-DOF dexterous robot hands. Most current datasets meet the needs of generating stable grasps, but they are not suitable for dexterous hands to complete human-like functional grasps, such as grasping the handle of a cup or pressing the button of a flashlight, so as to enable robots to complete subsequent functional manipulation actions autonomously, and there is no dataset with functional grasp pose annotations at present. This paper develops a unique Cost-Effective Real-Simulation Annotation System by leveraging natural hand actions. The system is able to capture a functional grasp of a dexterous hand in a simulated environment assisted by human demonstration in the real world. By using this system, dexterous grasp data can be collected efficiently and cost-effectively. Finally, we construct the first dexterous functional grasp dataset with rich pose annotations. 
A Functional Grasp Synthesis Model is also provided to validate the effectiveness of the proposed system and dataset. Our project page is: https://hjlllll.github.io/DFG/. \ No newline at end of file diff --git a/data/2024/aaai/DiDA: Disambiguated Domain Alignment for Cross-Domain Retrieval with Partial Labels b/data/2024/aaai/DiDA: Disambiguated Domain Alignment for Cross-Domain Retrieval with Partial Labels new file mode 100644 index 0000000000..1a92cec0b4 --- /dev/null +++ b/data/2024/aaai/DiDA: Disambiguated Domain Alignment for Cross-Domain Retrieval with Partial Labels @@ -0,0 +1 @@ +Driven by generative AI and the Internet, there is an increasing availability of a wide variety of images, leading to the significant and popular task of cross-domain image retrieval. To reduce annotation costs and increase performance, this paper focuses on an untouched but challenging problem, i.e., cross-domain image retrieval with partial labels (PCIR). Specifically, PCIR faces great challenges due to the ambiguous supervision signal and the domain gap. To address these challenges, we propose a novel method called disambiguated domain alignment (DiDA) for cross-domain retrieval with partial labels. In detail, DiDA elaborates a novel prototype-score unitization learning mechanism (PSUL) to extract common discriminative representations by simultaneously disambiguating the partial labels and narrowing the domain gap. Additionally, DiDA proposes a prototype-based domain alignment mechanism (PBDA) to further bridge the inherent cross-domain discrepancy. Attributed to PSUL and PBDA, our DiDA effectively excavates domain-invariant discrimination for cross-domain image retrieval. We demonstrate the effectiveness of DiDA through comprehensive experiments on three benchmarks, comparing it to existing state-of-the-art methods. Code available: https://github.com/lhrrrrrr/DiDA. \ No newline at end of file diff --git a/data/2024/aaai/DiG-In-GNN: Discriminative Feature Guided GNN-Based Fraud Detector against Inconsistencies in Multi-Relation Fraud Graph b/data/2024/aaai/DiG-In-GNN: Discriminative Feature Guided GNN-Based Fraud Detector against Inconsistencies in Multi-Relation Fraud Graph new file mode 100644 index 0000000000..a80f780092 --- /dev/null +++ b/data/2024/aaai/DiG-In-GNN: Discriminative Feature Guided GNN-Based Fraud Detector against Inconsistencies in Multi-Relation Fraud Graph @@ -0,0 +1 @@ +Fraud detection on multi-relation graphs aims to identify fraudsters in graphs. Graph Neural Network (GNN) models leverage graph structures to pass messages from neighbors to the target nodes, thereby enriching the representations of those target nodes. However, feature and structural inconsistency in the graph, owing to fraudsters' camouflage behaviors, diminish the suspiciousness of fraud nodes which hinders the effectiveness of GNN-based models. In this work, we propose DiG-In-GNN, Discriminative Feature Guided GNN against Inconsistency, to dig into graphs for fraudsters. Specifically, we use multi-scale contrastive learning from the perspective of the neighborhood subgraph where the target node is located to generate guidance nodes to cope with the feature inconsistency. Then, guided by the guidance nodes, we conduct fine-grained neighbor selection through reinforcement learning for each neighbor node to precisely filter nodes that can enhance the message passing and therefore alleviate structural inconsistency. 
Finally, the two modules are integrated together to obtain discriminable representations of the nodes. Experiments on three fraud detection datasets demonstrate the superiority of the proposed method DiG-In-GNN, which obtains up to 20.73% improvement over previous state-of-the-art methods. Our code can be found at https://github.com/GraphBerry/DiG-In-GNN. \ No newline at end of file diff --git "a/data/2024/aaai/DiSCO: Diffusion Schr\303\266dinger Bridge for Molecular Conformer Optimization" "b/data/2024/aaai/DiSCO: Diffusion Schr\303\266dinger Bridge for Molecular Conformer Optimization" new file mode 100644 index 0000000000..bf1a516b56 --- /dev/null +++ "b/data/2024/aaai/DiSCO: Diffusion Schr\303\266dinger Bridge for Molecular Conformer Optimization" @@ -0,0 +1 @@ +The generation of energetically optimal 3D molecular conformers is crucial in cheminformatics and drug discovery. While deep generative models have been utilized for direct generation in Euclidean space, this approach encounters challenges, including the complexity of navigating a vast search space. Recent generative models that implement simplifications to circumvent these challenges have achieved state-of-the-art results, but this simplified approach unavoidably creates a gap between the generated conformers and the ground-truth conformational landscape. To bridge this gap, we introduce DiSCO: Diffusion Schrödinger Bridge for Molecular Conformer Optimization, a novel diffusion framework that enables direct learning of nonlinear diffusion processes in prior-constrained Euclidean space for the optimization of 3D molecular conformers. Through the incorporation of an SE(3)-equivariant Schrödinger bridge, we establish the roto-translational equivariance of the generated conformers. Our framework is model-agnostic and offers an easily implementable solution for the post hoc optimization of conformers produced by any generation method. Through comprehensive evaluations and analyses, we establish the strengths of our framework, substantiating the application of the Schrödinger bridge for molecular conformer optimization. First, our approach consistently outperforms four baseline approaches, producing conformers with higher diversity and improved quality. Then, we show that the intermediate conformers generated during our diffusion process exhibit valid and chemically meaningful characteristics. We also demonstrate the robustness of our method when starting from conformers of diverse quality, including those unseen during training. Lastly, we show that the precise generation of low-energy conformers via our framework helps in enhancing the downstream prediction of molecular properties. The code is available at https://github.com/Danyeong-Lee/DiSCO. \ No newline at end of file diff --git a/data/2024/aaai/Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach b/data/2024/aaai/Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach new file mode 100644 index 0000000000..a6cd16d7b3 --- /dev/null +++ b/data/2024/aaai/Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach @@ -0,0 +1 @@ +Invariant representation learning (IRL) encourages the prediction from invariant causal features to labels deconfounded from the environments, advancing the technical roadmap of out-of-distribution (OOD) generalization. 
Despite the attention it has received, a recent theoretical result verified that some causal features recovered by IRL methods merely appear domain-invariant in the training environments but fail in unseen domains. This fake invariance severely endangers OOD generalization, since the trustworthiness of the objective cannot be diagnosed and existing causal remedies cannot rectify it. In this paper, we review an IRL family (InvRat) under the Partially and Fully Informative Invariant Feature Structural Causal Models (PIIF SCM / FIIF SCM), respectively, to certify its weaknesses in representing fake invariant features, and then unify their causal diagrams to propose the ReStructured SCM (RS-SCM). RS-SCM can ideally rebuild the spurious and the fake invariant features simultaneously. Given this, we further develop an approach based on conditional mutual information with respect to the RS-SCM to rigorously rectify the spurious and fake invariant effects. It can be easily implemented by a small feature selection subnet introduced in the IRL family, which is alternately optimized to achieve our goal. Experiments verify the superiority of our approach in combating the fake invariance issue across a variety of OOD generalization benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Generation for Few-Shot Learning b/data/2024/aaai/Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Generation for Few-Shot Learning new file mode 100644 index 0000000000..e0ac9a0b10 --- /dev/null +++ b/data/2024/aaai/Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Generation for Few-Shot Learning @@ -0,0 +1 @@ +The prompt-based paradigm for pre-trained language models (PLMs) has succeeded substantially in few-shot natural language processing (NLP) tasks. However, prior discrete prompt optimization methods require expert knowledge to design the base prompt set and identify high-quality prompts, which is costly, inefficient, and subjective. Meanwhile, existing continuous prompt optimization methods improve the performance by learning the ideal prompts through the gradient information of PLMs, whose high computational cost and low readability and generalizability are often concerning. To address the research gap, we propose a Dialogue-comprised Policy-gradient-based Discrete Prompt Optimization (DP_2O) method. We first design a multi-round dialogue alignment strategy based on GPT-4 to generate a readable prompt set. Furthermore, we propose an efficient prompt screening metric to identify high-quality prompts with linear complexity. Finally, we construct a reinforcement learning (RL) framework based on policy gradients to match the prompts to inputs optimally. By training a policy network with only 0.62M parameters on the tasks in the few-shot setting, DP_2O outperforms the state-of-the-art (SOTA) method by 1.52% in accuracy on average on four open-source datasets. Moreover, subsequent experiments also demonstrate that DP_2O has good universality, robustness and generalization ability. 
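The policy-gradient prompt-matching step named in the DP_2O abstract above can be pictured with a toy REINFORCE loop. Everything in this sketch is a stand-in: the candidate prompts, the 64-dimensional input embedding, and the reward function are assumptions replacing the GPT-4-generated prompt set and the downstream PLM evaluation that the abstract only names.

```python
import torch
import torch.nn as nn

# Hypothetical candidate prompt set and a tiny policy network over it.
candidate_prompts = ["Classify the sentiment:", "Is this review positive?", "Sentiment:"]
policy = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, len(candidate_prompts)))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward_fn(prompt_idx, x):
    # Placeholder reward: in the paper's setting this would reflect whether the
    # PLM answers correctly when candidate_prompts[prompt_idx] is prepended to x.
    return torch.rand(())

for step in range(100):
    x = torch.randn(64)                          # stand-in for an input embedding
    dist = torch.distributions.Categorical(logits=policy(x))
    action = dist.sample()                       # pick a prompt for this input
    reward = reward_fn(action.item(), x)
    loss = -dist.log_prob(action) * reward       # REINFORCE objective (no baseline)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```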
\ No newline at end of file diff --git a/data/2024/aaai/Dialogues Are Not Just Text: Modeling Cognition for Dialogue Coherence Evaluation b/data/2024/aaai/Dialogues Are Not Just Text: Modeling Cognition for Dialogue Coherence Evaluation new file mode 100644 index 0000000000..2c9d967629 --- /dev/null +++ b/data/2024/aaai/Dialogues Are Not Just Text: Modeling Cognition for Dialogue Coherence Evaluation @@ -0,0 +1,2 @@ +The generation of logically coherent dialogues by humans relies on underlying cognitive abilities. Based on this, we redefine the dialogue coherence evaluation process, combining cognitive judgment with the basic text to achieve a more human-like evaluation. We propose a novel dialogue evaluation framework based on Dialogue Cognition Graph (DCGEval) to implement the fusion by in-depth interaction between cognition modeling and text modeling. The proposed Abstract Meaning Representation (AMR) based graph structure called DCG aims to uniformly model four dialogue cognitive abilities. Specifically, core-semantic cognition is modeled by converting the utterance into an AMR graph, which can extract essential semantic information without redundancy. The temporal and role cognition are modeled by establishing logical relationships among the different AMR graphs. Finally, the commonsense knowledge from ConceptNet is fused to express commonsense cognition. Experiments demonstrate the necessity of modeling human cognition for +dialogue evaluation, and our DCGEval presents stronger correlations with human judgments compared to other state-of-the-art evaluation metrics. \ No newline at end of file diff --git a/data/2024/aaai/DifAttack: Query-Efficient Black-Box Adversarial Attack via Disentangled Feature Space b/data/2024/aaai/DifAttack: Query-Efficient Black-Box Adversarial Attack via Disentangled Feature Space new file mode 100644 index 0000000000..d5d879882b --- /dev/null +++ b/data/2024/aaai/DifAttack: Query-Efficient Black-Box Adversarial Attack via Disentangled Feature Space @@ -0,0 +1 @@ +This work investigates efficient score-based black-box adversarial attacks with a high Attack Success Rate (\textbf{ASR}) and good generalizability. We design a novel attack method based on a hierarchical DIsentangled Feature space, called \textbf{DifAttack++}, which differs significantly from the existing ones operating over the entire feature space. Specifically, DifAttack++ firstly disentangles an image's latent feature into an Adversarial Feature (\textbf{AF}) and a Visual Feature (\textbf{VF}) via an autoencoder equipped with our specially designed Hierarchical Decouple-Fusion (\textbf{HDF}) module, where the AF dominates the adversarial capability of an image, while the VF largely determines its visual appearance. We train such two autoencoders for the clean and adversarial image domains (i.e., cross-domain) respectively to achieve image reconstructions and feature disentanglement, by using pairs of clean images and their Adversarial Examples (\textbf{AE}s) generated from available surrogate models via white-box attack methods. Eventually, in the black-box attack stage, DifAttack++ iteratively optimizes the AF according to the query feedback from the victim model until a successful AE is generated, while keeping the VF unaltered. Extensive experimental results demonstrate that our DifAttack++ leads to superior ASR and query efficiency than state-of-the-art methods, meanwhile exhibiting much better visual quality of AEs. The code is available at https://github.com/csjunjun/DifAttack.git. 
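The query-feedback loop in the DifAttack++ abstract above can be illustrated, very loosely, by a generic score-based random search over a latent adversarial feature. The decode and query_score callables below are hypothetical placeholders for the paper's autoencoder and victim-model interface, and the accept-if-better rule is a simplification, not the paper's optimizer.

```python
import numpy as np

def score_based_latent_attack(decode, query_score, z_adv, steps=200, sigma=0.05, seed=0):
    """Random-search sketch: perturb the latent adversarial feature and keep a
    candidate only when the (assumed) victim-model feedback improves."""
    rng = np.random.default_rng(seed)
    best_score = query_score(decode(z_adv))
    for _ in range(steps):
        candidate = z_adv + sigma * rng.standard_normal(z_adv.shape)
        score = query_score(decode(candidate))
        if score > best_score:        # higher score = more adversarial (assumption)
            z_adv, best_score = candidate, score
    return z_adv, best_score

# Toy usage in a 16-dimensional latent space with placeholder callables.
z0 = np.zeros(16)
decode = lambda z: z                         # identity decoder as a placeholder
query_score = lambda img: float(img.sum())   # placeholder victim feedback
z_star, s = score_based_latent_attack(decode, query_score, z0)
```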
\ No newline at end of file diff --git a/data/2024/aaai/DiffAIL: Diffusion Adversarial Imitation Learning b/data/2024/aaai/DiffAIL: Diffusion Adversarial Imitation Learning new file mode 100644 index 0000000000..c7c932335a --- /dev/null +++ b/data/2024/aaai/DiffAIL: Diffusion Adversarial Imitation Learning @@ -0,0 +1 @@ +Imitation learning aims to solve the problem of defining reward functions in real-world decision-making tasks. The current popular approach is the Adversarial Imitation Learning (AIL) framework, which matches expert state-action occupancy measures to obtain a surrogate reward for forward reinforcement learning. However, the traditional discriminator is a simple binary classifier and doesn't learn an accurate distribution, which may result in failing to identify expert-level state-action pairs induced by the policy interacting with the environment. To address this issue, we propose a method named diffusion adversarial imitation learning (DiffAIL), which introduces the diffusion model into the AIL framework. Specifically, DiffAIL models the state-action pairs as unconditional diffusion models and uses diffusion loss as part of the discriminator's learning objective, which enables the discriminator to capture better expert demonstrations and improve generalization. Experimentally, the results show that our method achieves state-of-the-art performance and significantly surpasses expert demonstration on two benchmark tasks, including the standard state-action setting and state-only settings. \ No newline at end of file diff --git a/data/2024/aaai/DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception b/data/2024/aaai/DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception new file mode 100644 index 0000000000..9822e126cb --- /dev/null +++ b/data/2024/aaai/DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception @@ -0,0 +1 @@ +BEV perception is of great importance in the field of autonomous driving, serving as the cornerstone of planning, controlling, and motion prediction. The quality of the BEV feature highly affects the performance of BEV perception. However, taking the noises in camera parameters and LiDAR scans into consideration, we usually obtain BEV representation with harmful noises. Diffusion models naturally have the ability to denoise noisy samples to the ideal data, which motivates us to utilize the diffusion model to get a better BEV representation. In this work, we propose an end-to-end framework, named DiffBEV, to exploit the potential of diffusion model to generate a more comprehensive BEV representation. To the best of our knowledge, we are the first to apply diffusion model to BEV perception. In practice, we design three types of conditions to guide the training of the diffusion model which denoises the coarse samples and refines the semantic feature in a progressive way. What's more, a cross-attention module is leveraged to fuse the context of BEV feature and the semantic content of conditional diffusion model. DiffBEV achieves a 25.9% mIoU on the nuScenes dataset, which is 6.2% higher than the best-performing existing approach. Quantitative and qualitative results on multiple benchmarks demonstrate the effectiveness of DiffBEV in BEV semantic segmentation and 3D object detection tasks. 
\ No newline at end of file diff --git a/data/2024/aaai/DiffRAW: Leveraging Diffusion Model to Generate DSLR-Comparable Perceptual Quality sRGB from Smartphone RAW Images b/data/2024/aaai/DiffRAW: Leveraging Diffusion Model to Generate DSLR-Comparable Perceptual Quality sRGB from Smartphone RAW Images new file mode 100644 index 0000000000..f96d356aea --- /dev/null +++ b/data/2024/aaai/DiffRAW: Leveraging Diffusion Model to Generate DSLR-Comparable Perceptual Quality sRGB from Smartphone RAW Images @@ -0,0 +1 @@ +Deriving DSLR-quality sRGB images from smartphone RAW images has become a compelling challenge due to discernible detail disparity, color mapping instability, and spatial misalignment in RAW-sRGB data pairs. We present DiffRAW, a novel method that incorporates the diffusion model for the first time in learning RAW-to-sRGB mappings. By leveraging the diffusion model, our approach effectively learns the high-quality detail distribution of DSLR images, thereby enhancing the details of output images. Simultaneously, we use the RAW image as a diffusion condition to maintain image structure information such as contours and textures. To mitigate the interference caused by the color and spatial misalignment in training data pairs, we embed a color-position preserving condition within DiffRAW, ensuring that the output images do not exhibit color biases and pixel shift issues. To accelerate the inference process of DiffRAW, we designed the Domain Transform Diffusion Method, an efficient diffusion process with its corresponding reverse process. The Domain Transform Diffusion Method can reduce the required inference steps for diffusion model-based image restoration/enhancement algorithms while enhancing the quality of the generated images. Through evaluations on the ZRR dataset, DiffRAW consistently demonstrates state-of-the-art performance across all perceptual quality metrics (e.g., LPIPS, FID, MUSIQ), while achieving comparable results in PSNR and SSIM. \ No newline at end of file diff --git a/data/2024/aaai/DiffSED: Sound Event Detection with Denoising Diffusion b/data/2024/aaai/DiffSED: Sound Event Detection with Denoising Diffusion new file mode 100644 index 0000000000..df0fad041b --- /dev/null +++ b/data/2024/aaai/DiffSED: Sound Event Detection with Denoising Diffusion @@ -0,0 +1 @@ +Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions in the elegant Transformer decoder framework. Doing so enables the model to generate accurate event boundaries from even noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with 40+% faster convergence in training. 
Code: https://github.com/Surrey-UPLab/DiffSED \ No newline at end of file diff --git a/data/2024/aaai/Differentiable Auxiliary Learning for Sketch Re-Identification b/data/2024/aaai/Differentiable Auxiliary Learning for Sketch Re-Identification new file mode 100644 index 0000000000..ed0470f24b --- /dev/null +++ b/data/2024/aaai/Differentiable Auxiliary Learning for Sketch Re-Identification @@ -0,0 +1 @@ +Sketch re-identification (Re-ID) seeks to match pedestrians' photos from surveillance videos with corresponding sketches. However, we observe that existing works still have two critical limitations: (i) cross- and intra-modality discrepancies hinder the extraction of modality-shared features, (ii) standard triplet loss fails to constrain latent feature distribution in each modality with inadequate samples. To overcome the above issues, we propose a differentiable auxiliary learning network (DALNet) to explore a robust auxiliary modality for Sketch Re-ID. Specifically, for (i) we construct an auxiliary modality by using a dynamic auxiliary generator (DAG) to bridge the gap between sketch and photo modalities. The auxiliary modality highlights the described person in photos to mitigate background clutter and learns sketch style through style refinement. Moreover, a modality interactive attention module (MIA) is presented to align the features and learn the invariant patterns of two modalities by auxiliary modality. To address (ii), we propose a multi-modality collaborative learning scheme (MMCL) to align the latent distribution of three modalities. An intra-modality circle loss in MMCL brings learned global and modality-shared features of the same identity closer in the case of insufficient samples within each modality. Extensive experiments verify the superior performance of our DALNet over the state-of-the-art methods for Sketch Re-ID, and the generalization in sketch-based image retrieval and sketch-photo face recognition tasks. \ No newline at end of file diff --git a/data/2024/aaai/Diffusion Language-Shapelets for Semi-supervised Time-Series Classification b/data/2024/aaai/Diffusion Language-Shapelets for Semi-supervised Time-Series Classification new file mode 100644 index 0000000000..83653c8950 --- /dev/null +++ b/data/2024/aaai/Diffusion Language-Shapelets for Semi-supervised Time-Series Classification @@ -0,0 +1 @@ +Semi-supervised time-series classification could effectively alleviate the issue of lacking labeled data. However, existing approaches usually ignore model interpretability, making it difficult for humans to understand the principles behind the predictions of a model. Shapelets are a set of discriminative subsequences that show high interpretability in time series classification tasks. Shapelet learning-based methods have demonstrated promising classification performance. Unfortunately, without enough labeled data, the shapelets learned by existing methods are often poorly discriminative, and even dissimilar to any subsequence of the original time series. To address this issue, we propose the Diffusion Language-Shapelets model (DiffShape) for semi-supervised time series classification. In DiffShape, a self-supervised diffusion learning mechanism is designed, which uses real subsequences as a condition. This helps to increase the similarity between the learned shapelets and real subsequences by using a large amount of unlabeled data. 
Furthermore, we introduce a contrastive language-shapelets learning strategy that improves the discriminability of the learned shapelets by incorporating the natural language descriptions of the time series. Experiments have been conducted on the UCR time series archive, and the results reveal that the proposed DiffShape method achieves state-of-the-art performance and exhibits superior interpretability over baselines. \ No newline at end of file diff --git a/data/2024/aaai/DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection b/data/2024/aaai/DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection new file mode 100644 index 0000000000..49cb413eea --- /dev/null +++ b/data/2024/aaai/DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection @@ -0,0 +1 @@ +Limited by the encoder-decoder architecture, learning-based edge detectors usually have difficulty predicting edge maps that satisfy both correctness and crispness. With the recent success of the diffusion probabilistic model (DPM), we found it is especially suitable for accurate and crisp edge detection since the denoising process is directly applied to the original image size. Therefore, we propose the first diffusion model for the task of general edge detection, which we call DiffusionEdge. To avoid expensive computational resources while retaining the final performance, we apply DPM in the latent space and enable the classic cross-entropy loss which is uncertainty-aware in pixel level to directly optimize the parameters in latent space in a distillation manner. We also adopt a decoupled architecture to speed up the denoising process and propose a corresponding adaptive Fourier filter to adjust the latent features of specific frequencies. With all the technical designs, DiffusionEdge can be stably trained with limited resources, predicting crisp and accurate edge maps with much fewer augmentation strategies. Extensive experiments on four edge detection benchmarks demonstrate the superiority of DiffusionEdge both in correctness and crispness. On the NYUDv2 dataset, compared to the second best, we increase the ODS, OIS (without post-processing) and AC by 30.2%, 28.1% and 65.1%, respectively. Code: https://github.com/GuHuangAI/DiffusionEdge. \ No newline at end of file diff --git a/data/2024/aaai/DiffusionTrack: Diffusion Model for Multi-Object Tracking b/data/2024/aaai/DiffusionTrack: Diffusion Model for Multi-Object Tracking new file mode 100644 index 0000000000..e475b59c8e --- /dev/null +++ b/data/2024/aaai/DiffusionTrack: Diffusion Model for Multi-Object Tracking @@ -0,0 +1 @@ +Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite the success of these approaches, they also suffer from common problems, such as harmful global or local inconsistency, poor trade-off between robustness and model complexity, and lack of flexibility in different scenes within the same video. In this paper we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker's effectiveness, enabling it to discriminate between various objects. 
During the training stage, paired object boxes diffuse from paired ground-truth boxes to random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. In inference, the model refines a set of paired randomly generated boxes to the detection and tracking results in a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods. Code is available at https://github.com/RainBowLuoCS/DiffusionTrack. \ No newline at end of file diff --git a/data/2024/aaai/Digital Twin-Driven Teat Localization and Shape Identification for Dairy Cow (Student Abstract) b/data/2024/aaai/Digital Twin-Driven Teat Localization and Shape Identification for Dairy Cow (Student Abstract) new file mode 100644 index 0000000000..abc700808f --- /dev/null +++ b/data/2024/aaai/Digital Twin-Driven Teat Localization and Shape Identification for Dairy Cow (Student Abstract) @@ -0,0 +1 @@ +Dairy owners invest heavily to keep their animals healthy. There is good reason to hope that technologies such as computer vision and artificial intelligence (AI) could reduce costs, yet obstacles arise when adapting these advanced tools to farming environments. In this work, we applied AI tools to dairy cow teat localization and teat shape classification, obtaining a model that achieves a mean average precision of 0.783. This digital twin-driven approach is intended as a first step towards automating and accelerating the detection and treatment of hyperkeratosis, mastitis, and other medical conditions that significantly burden the dairy industry. \ No newline at end of file diff --git a/data/2024/aaai/Direct Amortized Likelihood Ratio Estimation b/data/2024/aaai/Direct Amortized Likelihood Ratio Estimation new file mode 100644 index 0000000000..f412cdc002 --- /dev/null +++ b/data/2024/aaai/Direct Amortized Likelihood Ratio Estimation @@ -0,0 +1 @@ +We introduce a new amortized likelihood ratio estimator for likelihood-free simulation-based inference (SBI). Our estimator is simple to train and estimates the likelihood ratio using a single forward pass of the neural estimator. Our approach directly computes the likelihood ratio between two competing parameter sets which is different from the previous approach of comparing two neural network output values. We refer to our model as the direct neural ratio estimator (DNRE). As part of introducing the DNRE, we derive a corresponding Monte Carlo estimate of the posterior. We benchmark our new ratio estimator and compare to previous ratio estimators in the literature. We show that our new ratio estimator often outperforms these previous approaches. As a further contribution, we introduce a new derivative estimator for likelihood ratio estimators that enables us to compare likelihood-free Hamiltonian Monte Carlo (HMC) with random-walk Metropolis-Hastings (MH). We show that HMC is equally competitive, which has not been previously shown. Finally, we include a novel real-world application of SBI by using our neural ratio estimator to design a quadcopter. Code is available at https://github.com/SRI-CSL/dnre. 
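For context on the DNRE abstract above, the conventional amortized ratio estimator it contrasts with can be sketched as a binary classifier trained to separate joint samples (x, theta) from marginal samples (x, theta'), whose logit then approximates the log likelihood-to-evidence ratio. The toy Gaussian simulator and uniform prior below are assumptions chosen for illustration; this is the baseline formulation, not the paper's direct estimator.

```python
import torch
import torch.nn as nn

def simulate(theta):
    # Toy simulator: observation = parameter + Gaussian noise.
    return theta + 0.5 * torch.randn_like(theta)

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    theta = torch.rand(256, 1) * 4 - 2               # prior U(-2, 2)
    x = simulate(theta)
    theta_marg = theta[torch.randperm(len(theta))]   # shuffle to break the pairing
    joint = torch.cat([x, theta], dim=1)
    marginal = torch.cat([x, theta_marg], dim=1)
    logits = torch.cat([net(joint), net(marginal)])
    labels = torch.cat([torch.ones(len(joint), 1), torch.zeros(len(marginal), 1)])
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, net(torch.cat([x, theta], dim=1)) approximates log p(x|theta)/p(x).
```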
\ No newline at end of file diff --git a/data/2024/aaai/Direct May Not Be the Best: An Incremental Evolution View of Pose Generation b/data/2024/aaai/Direct May Not Be the Best: An Incremental Evolution View of Pose Generation new file mode 100644 index 0000000000..fb3b57ce9c --- /dev/null +++ b/data/2024/aaai/Direct May Not Be the Best: An Incremental Evolution View of Pose Generation @@ -0,0 +1 @@ +Pose diversity is an inherent representative characteristic of 2D images. Due to the 3D to 2D projection mechanism, there is evident content discrepancy among distinct pose images. This is the main obstacle hindering pose transformation research. To deal with this challenge, we propose a fine-grained, incremental-evolution-centered pose generation framework, rather than the traditional direct one-to-one generation. Since the proposed approach bypasses the theoretical difficulty of directly modeling dramatic non-linear variation, the incurred content distortion and blurring can be effectively constrained, while the various individual pose details, especially clothing texture, can be precisely maintained. To systematically guide the evolution course, both global and incremental evolution constraints are elaborately designed and merged into the overall framework, and a novel triple-path knowledge fusion structure is devised to take full advantage of all available valuable knowledge for high-quality pose synthesis. In addition, our framework can generate a series of valuable by-products, namely the various intermediate poses. Extensive experiments have been conducted to verify the effectiveness of the proposed approach. Code is available at https://github.com/Xiaofei-CN/Incremental-Evolution-Pose-Generation. \ No newline at end of file diff --git a/data/2024/aaai/Directed Diffusion: Direct Control of Object Placement through Attention Guidance b/data/2024/aaai/Directed Diffusion: Direct Control of Object Placement through Attention Guidance new file mode 100644 index 0000000000..e9f3b45d59 --- /dev/null +++ b/data/2024/aaai/Directed Diffusion: Direct Control of Object Placement through Attention Guidance @@ -0,0 +1 @@ +Text-guided diffusion models such as DALLE-2, Imagen, and Stable Diffusion are able to generate an effectively endless variety of images given only a short text prompt describing the desired image content. In many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects such as characters in specified positional relationships. The missing capability to ``direct'' the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of objects denoted by those words, we introduce an optimization objective that produces ``activation'' at desired positions in these cross-attention maps. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. Directed Diffusion provides easy high-level positional control over multiple objects, while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. 
Moreover, it requires only a few lines to implement. \ No newline at end of file diff --git "a/data/2024/aaai/Direction-Aware Video Demoir\303\251ing with Temporal-Guided Bilateral Learning" "b/data/2024/aaai/Direction-Aware Video Demoir\303\251ing with Temporal-Guided Bilateral Learning" new file mode 100644 index 0000000000..0a07818895 --- /dev/null +++ "b/data/2024/aaai/Direction-Aware Video Demoir\303\251ing with Temporal-Guided Bilateral Learning" @@ -0,0 +1 @@ +Moiré patterns occur when capturing images or videos on screens, severely degrading the quality of the captured images or videos. Despite recent progress, existing video demoiréing methods neglect the physical characteristics and formation process of moiré patterns, significantly limiting the effectiveness of video recovery. This paper presents a unified framework, DTNet, a direction-aware and temporal-guided bilateral learning network for video demoiréing. DTNet effectively incorporates the process of moiré pattern removal, alignment, color correction, and detail refinement. Our proposed DTNet comprises two primary stages: Frame-level Direction-aware Demoiréing and Alignment (FDDA) and Tone and Detail Refinement (TDR). In FDDA, we employ multiple directional DCT modes to perform the moiré pattern removal process in the frequency domain, effectively detecting the prominent moiré edges. Then, coarse- and fine-grained alignment is applied to the demoiréd features to facilitate the utilization of neighboring information. In TDR, we propose a temporal-guided bilateral learning pipeline to mitigate the degradation of color and details caused by the moiré patterns while preserving the restored frequency information in FDDA. Guided by the aligned temporal features from FDDA, the affine transformations for the recovery of the ultimate clean frames are learned in TDR. Extensive experiments demonstrate that our video demoiréing method outperforms state-of-the-art approaches by 2.3 dB in PSNR, and also delivers a superior visual experience. \ No newline at end of file diff --git a/data/2024/aaai/Dirichlet-Based Prediction Calibration for Learning with Noisy Labels b/data/2024/aaai/Dirichlet-Based Prediction Calibration for Learning with Noisy Labels new file mode 100644 index 0000000000..38ed44cd0f --- /dev/null +++ b/data/2024/aaai/Dirichlet-Based Prediction Calibration for Learning with Noisy Labels @@ -0,0 +1 @@ +Learning with noisy labels can significantly hinder the generalization performance of deep neural networks (DNNs). Existing approaches address this issue through loss correction or example selection methods. However, these methods often rely on the model's predictions obtained from the softmax function, which can be over-confident and unreliable. In this study, we identify the translation invariance of the softmax function as the underlying cause of this problem and propose the \textit{Dirichlet-based Prediction Calibration} (DPC) method as a solution. Our method introduces a calibrated softmax function that breaks the translation invariance by incorporating a suitable constant in the exponent term, enabling more reliable model predictions. To ensure stable model training, we leverage a Dirichlet distribution to assign probabilities to predicted labels and introduce a novel evidence deep learning (EDL) loss. 
The proposed loss function encourages positive and sufficiently large logits for the given label, while penalizing negative and small logits for other labels, leading to more distinct logits and facilitating better example selection based on a large-margin criterion. Through extensive experiments on diverse benchmark datasets, we demonstrate that DPC achieves state-of-the-art performance. The code is available at https://github.com/chenchenzong/DPC. \ No newline at end of file diff --git a/data/2024/aaai/Discerning Temporal Difference Learning b/data/2024/aaai/Discerning Temporal Difference Learning new file mode 100644 index 0000000000..fa9cf5e40f --- /dev/null +++ b/data/2024/aaai/Discerning Temporal Difference Learning @@ -0,0 +1 @@ +Temporal difference learning (TD) is a foundational concept in reinforcement learning (RL), aimed at efficiently assessing a policy's value function. TD(λ), a potent variant, incorporates a memory trace to distribute the prediction error into the historical context. However, this approach often neglects the significance of historical states and the relative importance of propagating the TD error, influenced by challenges such as visitation imbalance or outcome noise. To address this, we propose a novel TD algorithm named discerning TD learning (DTD), which allows flexible emphasis functions—predetermined or adapted during training—to allocate efforts effectively across states. We establish the convergence properties of our method within a specific class of emphasis functions and showcase its promising potential for adaptation to deep RL contexts. Empirical results underscore that employing a judicious emphasis function not only improves value estimation but also expedites learning across diverse scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Discovering Agents (Abstract Reprint) b/data/2024/aaai/Discovering Agents (Abstract Reprint) new file mode 100644 index 0000000000..24b4fa9495 --- /dev/null +++ b/data/2024/aaai/Discovering Agents (Abstract Reprint) @@ -0,0 +1 @@ +Causal models of agents have been used to analyse the safety aspects of machine learning systems. But identifying agents is non-trivial – often the causal model is just assumed by the modeller without much justification – and modelling failures can lead to mistakes in the safety analysis. This paper proposes the first formal causal definition of agents – roughly that agents are systems that would adapt their policy if their actions influenced the world in a different way. From this we derive the first causal discovery algorithm for discovering the presence of agents from empirical data, given a set of variables and under certain assumptions. We also provide algorithms for translating between causal models and game-theoretic influence diagrams. We demonstrate our approach by resolving some previous confusions caused by incorrect causal modelling of agents. \ No newline at end of file diff --git a/data/2024/aaai/Discovering Heterogeneous Causal Effects in Relational Data b/data/2024/aaai/Discovering Heterogeneous Causal Effects in Relational Data new file mode 100644 index 0000000000..fbc6a7779e --- /dev/null +++ b/data/2024/aaai/Discovering Heterogeneous Causal Effects in Relational Data @@ -0,0 +1 @@ +Causal inference in relational data should account for the non-IID nature of the data and the interference phenomenon, which occurs when a unit's outcome is influenced by the treatments or outcomes of others. 
Existing solutions to causal inference under interference consider either homogeneous influence from peers or specific heterogeneous influence contexts (e.g., local neighborhood structure). This thesis investigates causal reasoning in relational data and the automated discovery of heterogeneous causal effects under arbitrary heterogeneous peer influence contexts and effect modification. \ No newline at end of file diff --git a/data/2024/aaai/Discovering Sequential Patterns with Predictable Inter-event Delays b/data/2024/aaai/Discovering Sequential Patterns with Predictable Inter-event Delays new file mode 100644 index 0000000000..4def908941 --- /dev/null +++ b/data/2024/aaai/Discovering Sequential Patterns with Predictable Inter-event Delays @@ -0,0 +1,2 @@ +Summarizing sequential data with serial episodes allows non-trivial insight into the data generating process. Existing methods penalize gaps in pattern occurrences equally, regardless of where in the pattern these occur. This results in a strong bias against patterns with long inter-event delays, and, in addition, means that regularity in terms of delays is neither rewarded nor discovered---even though both aspects provide key insight. +In this paper we tackle both these problems by explicitly modeling inter-event delay distributions. That is, we are not only interested in discovering the patterns, but also in describing how many time steps typically occur between their individual events. We formalize the problem in terms of the Minimum Description Length principle, by which we say the best set of patterns is the one that compresses the data best. The resulting optimization problem does not lend itself to exact optimization, and hence we propose Hopper to heuristically mine high-quality patterns. Extensive experiments show that Hopper efficiently recovers the ground truth, discovers meaningful patterns from real-world data, and outperforms existing methods in discovering long-delay patterns. \ No newline at end of file diff --git a/data/2024/aaai/Discrepancy and Uncertainty Aware Denoising Knowledge Distillation for Zero-Shot Cross-Lingual Named Entity Recognition b/data/2024/aaai/Discrepancy and Uncertainty Aware Denoising Knowledge Distillation for Zero-Shot Cross-Lingual Named Entity Recognition new file mode 100644 index 0000000000..e32e6b6635 --- /dev/null +++ b/data/2024/aaai/Discrepancy and Uncertainty Aware Denoising Knowledge Distillation for Zero-Shot Cross-Lingual Named Entity Recognition @@ -0,0 +1,5 @@ +Knowledge distillation-based approaches have recently yielded state-of-the-art (SOTA) results for cross-lingual NER tasks in zero-shot scenarios. +These approaches typically employ a teacher network trained with the labelled source (rich-resource) language to infer pseudo-soft labels for the unlabelled target (zero-shot) language, and force a student network to approximate these pseudo labels to achieve knowledge transfer. +However, previous works have rarely discussed the issue of pseudo-label noise caused by the source-target language gap, which can mislead the training of the student network and result in negative knowledge transfer. +This paper proposes a discrepancy- and uncertainty-aware Denoising Knowledge Distillation model (DenKD) to tackle this issue. +Specifically, DenKD uses a discrepancy-aware denoising representation learning method to optimize the class representations of the target language produced by the teacher network, thus enhancing the quality of pseudo labels and reducing noisy predictions. 
Further, DenKD employs an uncertainty-aware denoising method to quantify the pseudo-label noise and adjust the focus of the student network on different samples during knowledge distillation, thereby mitigating the noise's adverse effects. We conduct extensive experiments on 28 languages including 4 languages not covered by the pre-trained models, and the results demonstrate the effectiveness of our DenKD. \ No newline at end of file diff --git a/data/2024/aaai/Discrete Cycle-Consistency Based Unsupervised Deep Graph Matching b/data/2024/aaai/Discrete Cycle-Consistency Based Unsupervised Deep Graph Matching new file mode 100644 index 0000000000..21b8d63a95 --- /dev/null +++ b/data/2024/aaai/Discrete Cycle-Consistency Based Unsupervised Deep Graph Matching @@ -0,0 +1 @@ +We contribute to the sparsely populated area of unsupervised deep graph matching with application to keypoint matching in images. Contrary to the standard supervised approach, our method does not require ground truth correspondences between keypoint pairs. Instead, it is self-supervised by enforcing consistency of matchings between images of the same object category. As the matching and the consistency loss are discrete, their derivatives cannot be straightforwardly used for learning. We address this issue in a principled way by building our method upon the recent results on black-box differentiation of combinatorial solvers. This makes our method exceptionally flexible, as it is compatible with arbitrary network architectures and combinatorial solvers. Our experimental evaluation suggests that our technique sets a new state-of-the-art for unsupervised graph matching. \ No newline at end of file diff --git a/data/2024/aaai/Discretionary Trees: Understanding Street-Level Bureaucracy via Machine Learning b/data/2024/aaai/Discretionary Trees: Understanding Street-Level Bureaucracy via Machine Learning new file mode 100644 index 0000000000..dae70e66cf --- /dev/null +++ b/data/2024/aaai/Discretionary Trees: Understanding Street-Level Bureaucracy via Machine Learning @@ -0,0 +1 @@ +Street-level bureaucrats interact directly with people on behalf of government agencies to perform a wide range of functions, including, for example, administering social services and policing. A key feature of street-level bureaucracy is that the civil servants, while tasked with implementing agency policy, are also granted significant discretion in how they choose to apply that policy in individual cases. Using that discretion could be beneficial, as it allows for exceptions to policies based on human interactions and evaluations, but it could also allow biases and inequities to seep into important domains of societal resource allocation. In this paper, we use machine learning techniques to understand street-level bureaucrats' behavior. We leverage a rich dataset that combines demographic and other information on households with information on which homelessness interventions they were assigned during a period when assignments were not formulaic. We find that caseworker decisions in this time are highly predictable overall, and some, but not all of this predictivity can be captured by simple decision rules. We theorize that the decisions not captured by the simple decision rules can be considered applications of caseworker discretion. These discretionary decisions are far from random in both the characteristics of such households and in terms of the outcomes of the decisions. 
Caseworkers typically only apply discretion to households that would be considered less vulnerable. When they do apply discretion to assign households to more intensive interventions, the marginal benefits to those households are significantly higher than would be expected if the households were chosen at random; there is no similar reduction in marginal benefit to households that are discretionarily allocated less intensive interventions, suggesting that caseworkers are using their knowledge and experience to improve outcomes for households experiencing homelessness. \ No newline at end of file diff --git a/data/2024/aaai/Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression b/data/2024/aaai/Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression new file mode 100644 index 0000000000..e5d1bdca5c --- /dev/null +++ b/data/2024/aaai/Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression @@ -0,0 +1 @@ +Uncertainty quantification is critical for deploying deep neural networks (DNNs) in real-world applications. An Auxiliary Uncertainty Estimator (AuxUE) is one of the most effective means to estimate the uncertainty of the main task prediction without modifying the main task model. To be considered robust, an AuxUE must be capable of maintaining its performance and triggering higher uncertainties while encountering Out-of-Distribution (OOD) inputs, i.e., to provide robust aleatoric and epistemic uncertainty. However, for vision regression tasks, current AuxUE designs are mainly adopted for aleatoric uncertainty estimates, and AuxUE robustness has not been explored. In this work, we propose a generalized AuxUE scheme for more robust uncertainty quantification on regression tasks. Concretely, to achieve a more robust aleatoric uncertainty estimation, different distribution assumptions are considered for heteroscedastic noise, and Laplace distribution is finally chosen to approximate the prediction error. For epistemic uncertainty, we propose a novel solution named Discretization-Induced Dirichlet pOsterior (DIDO), which models the Dirichlet posterior on the discretized prediction error. Extensive experiments on age estimation, monocular depth estimation, and super-resolution tasks show that our proposed method can provide robust uncertainty estimates in the face of noisy inputs and that it can be scalable to both image-level and pixel-wise tasks. \ No newline at end of file diff --git a/data/2024/aaai/Discriminative Forests Improve Generative Diversity for Generative Adversarial Networks b/data/2024/aaai/Discriminative Forests Improve Generative Diversity for Generative Adversarial Networks new file mode 100644 index 0000000000..8f45e7a79e --- /dev/null +++ b/data/2024/aaai/Discriminative Forests Improve Generative Diversity for Generative Adversarial Networks @@ -0,0 +1 @@ +Improving the diversity of Artificial Intelligence Generated Content (AIGC) is one of the fundamental problems in the theory of generative models such as generative adversarial networks (GANs). Previous studies have demonstrated that the discriminator in GANs should have high capacity and robustness to achieve the diversity of generated data. However, a discriminator with high capacity tends to overfit and guide the generator toward collapsed equilibrium. 
In this study, we propose a novel discriminative forest GAN, named Forest-GAN, that replaces the discriminator to improve the capacity and robustness for modeling statistics in real-world data distribution. A discriminative forest is composed of multiple independent discriminators built on bootstrapped data. We prove that a discriminative forest has a generalization error bound, which is determined by the strength of individual discriminators and the correlations among them. Hence, a discriminative forest can provide very large capacity without any risk of overfitting, which subsequently improves the generative diversity. With the discriminative forest framework, we significantly improved the performance of AutoGAN with a new record FID of 19.27 from 30.71 on STL10 and improved the performance of StyleGAN2-ADA with a new record FID of 6.87 from 9.22 on LSUN-cat. \ No newline at end of file diff --git a/data/2024/aaai/Discriminatively Fuzzy Multi-View K-means Clustering with Local Structure Preserving b/data/2024/aaai/Discriminatively Fuzzy Multi-View K-means Clustering with Local Structure Preserving new file mode 100644 index 0000000000..64e2407c2d --- /dev/null +++ b/data/2024/aaai/Discriminatively Fuzzy Multi-View K-means Clustering with Local Structure Preserving @@ -0,0 +1 @@ +Multi-view K-means clustering successfully generalizes K-means from single-view to multi-view, and obtains excellent clustering performance. In every view, it makes each data point close to the center of the corresponding cluster. However, multi-view K-means only considers the compactness of each cluster, but ignores the separability of different clusters, which is of great importance to producing a good clustering result. In this paper, we propose Discriminatively Fuzzy Multi-view K-means clustering with Local Structure Preserving (DFMKLS). On the basis of minimizing the distance between each data point and the center of the corresponding cluster, DFMKLS separates clusters by maximizing the distance between the centers of pairwise clusters. DFMKLS also relaxes its objective by introducing the idea of fuzzy clustering, which calculates the probability that a data point belongs to each cluster. Considering multi-view K-means mainly focuses on the global information of the data, to efficiently use the local information, we integrate the local structure preserving into the framework of DFMKLS. The effectiveness of DFMKLS is evaluated on benchmark multi-view datasets. It obtains superior performances than state-of-the-art multi-view clustering methods, including multi-view K-means. \ No newline at end of file diff --git a/data/2024/aaai/Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser b/data/2024/aaai/Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser new file mode 100644 index 0000000000..c88d1c9f47 --- /dev/null +++ b/data/2024/aaai/Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser @@ -0,0 +1 @@ +Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing the 3D joint coordinates from the 2D pose sequence. 
Although some methods decompose the task into bone length and bone direction prediction based on the human anatomical skeleton to explicitly incorporate more human body prior constraints, the performance of these methods is significantly lower than that of the SOTA diffusion-based methods. This can be attributed to the tree structure of the human skeleton. Direct application of the disentangled method could amplify the accumulation of hierarchical errors, propagating through each hierarchy. Meanwhile, the hierarchical information has not been fully explored by the previous methods. To address these problems, a Disentangled Diffusion-based 3D human Pose Estimation method with Hierarchical Spatial and Temporal Denoiser is proposed, termed DDHPose. In our approach: (1) We disentangle the 3d pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior. A disentanglement loss is proposed to supervise diffusion model learning. (2) For the reverse process, we propose Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modelling of each joint. Our HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT). HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT utilizes information from both the joint and its hierarchical adjacent joints to explore the hierarchical temporal correlations among joints. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the SOTA disentangled-based, non-disentangled based, and probabilistic approaches by 10.0%, 2.0%, and 1.3%, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Disentangled Partial Label Learning b/data/2024/aaai/Disentangled Partial Label Learning new file mode 100644 index 0000000000..93615c4035 --- /dev/null +++ b/data/2024/aaai/Disentangled Partial Label Learning @@ -0,0 +1 @@ +Partial label learning (PLL) induces a multi-class classifier from training examples each associated with a set of candidate labels, among which only one is valid. The formation of real-world data typically arises from heterogeneous entanglement of series latent explanatory factors, which are considered intrinsic properties for discriminating between different patterns. Though learning disentangled representation is expected to facilitate label disambiguation for partial-label (PL) examples, few existing works were dedicated to addressing this issue. In this paper, we make the first attempt towards disentangled PLL and propose a novel approach named TERIAL, which makes predictions according to derived disentangled representation of instances and label embeddings. The TERIAL approach formulates the PL examples as an undirected bipartite graph where instances are only connected with their candidate labels, and employs a tailored neighborhood routing mechanism to yield disentangled representation of nodes in the graph. Specifically, the proposed routing mechanism progressively infers the explanatory factors that contribute to the edge between adjacent nodes and augments the representation of the central node with factor-aware embedding information propagated from specific neighbors simultaneously via iteratively analyzing the promising subspace clusters formed by the node and its neighbors. 
The estimated labeling confidence matrix is also introduced to accommodate unreliable links owing to the inherent ambiguity of PLL. Moreover, we theoretically prove that the neighborhood routing mechanism will converge to the point estimate that maximizes the marginal likelihood of observed PL training examples. Comprehensive experiments over various datasets demonstrate that our approach outperforms state-of-the-art counterparts. \ No newline at end of file diff --git a/data/2024/aaai/Disentanglement-Guided Spatial-Temporal Graph Neural Network for Metro Flow Forecasting (Student Abstract) b/data/2024/aaai/Disentanglement-Guided Spatial-Temporal Graph Neural Network for Metro Flow Forecasting (Student Abstract) new file mode 100644 index 0000000000..8a29ed6435 --- /dev/null +++ b/data/2024/aaai/Disentanglement-Guided Spatial-Temporal Graph Neural Network for Metro Flow Forecasting (Student Abstract) @@ -0,0 +1 @@ +In recent intelligent transportation applications, metro flow forecasting has received much attention from researchers. Most prior works endeavor to explore spatial or temporal dependencies while ignoring the key characteristic patterns underlying historical flows, e.g., trend and periodicity. Although multiple-granularity distillation or spatial dependency correlation can promote flow estimation, the potential noise and spatial dynamics remain under-explored. To this end, we propose a novel Disentanglement-Guided Spatial-Temporal Graph Neural Network or DGST to address the above concerns. It contains a Disentanglement Pre-training procedure for characteristic pattern disentanglement learning, a Characteristic Pattern Prediction for different future characteristic explorations, and a Spatial-Temporal Correlation for spatial-temporal dynamic learning. Experiments on a real-world dataset demonstrate the superiority of our DGST. \ No newline at end of file diff --git a/data/2024/aaai/Disjoint Partial Enumeration without Blocking Clauses b/data/2024/aaai/Disjoint Partial Enumeration without Blocking Clauses new file mode 100644 index 0000000000..aca11e8da4 --- /dev/null +++ b/data/2024/aaai/Disjoint Partial Enumeration without Blocking Clauses @@ -0,0 +1,2 @@ +A basic algorithm for enumerating disjoint propositional models (disjoint AllSAT) is based on adding blocking clauses incrementally, ruling out previously found models. On the one hand, blocking clauses have the potential to reduce the number of generated models exponentially, as they can handle partial models. On the other hand, the introduction of a large number of blocking clauses affects memory consumption and drastically slows down unit propagation. + We propose a new approach that allows for enumerating disjoint partial models with no need for blocking clauses by integrating: Conflict-Driven Clause-Learning (CDCL), Chronological Backtracking (CB), and methods for shrinking models (Implicant Shrinking). Experiments clearly show the benefits of our novel approach. 
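For reference, the blocking-clause baseline that the Disjoint Partial Enumeration abstract above improves upon can be sketched in a few lines. This is only an illustration of that baseline, not the paper's CDCL/Chronological-Backtracking/Implicant-Shrinking method; it assumes the python-sat (pysat) package, and the solver choice and toy formula are hypothetical.

# Sketch of the blocking-clause AllSAT baseline (assumes: pip install python-sat).
# Each found model is ruled out by adding its negation as a blocking clause,
# which is exactly the clause growth the paper's approach avoids.
from pysat.solvers import Glucose3

def enumerate_models(cnf_clauses):
    models = []
    with Glucose3(bootstrap_with=cnf_clauses) as solver:
        while solver.solve():
            model = solver.get_model()                   # total assignment, e.g. [1, -2, 3]
            models.append(model)
            solver.add_clause([-lit for lit in model])   # block this exact model
    return models

# Toy formula: (x1 or x2) and (not x1 or x3)
print(enumerate_models([[1, 2], [-1, 3]]))

Because every blocking clause negates a full model, the clause database grows with the number of models found; this is the memory and unit-propagation cost that motivates enumerating disjoint partial models without blocking clauses.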
\ No newline at end of file diff --git a/data/2024/aaai/Dissenting Explanations: Leveraging Disagreement to Reduce Model Overreliance b/data/2024/aaai/Dissenting Explanations: Leveraging Disagreement to Reduce Model Overreliance new file mode 100644 index 0000000000..2989cc231a --- /dev/null +++ b/data/2024/aaai/Dissenting Explanations: Leveraging Disagreement to Reduce Model Overreliance @@ -0,0 +1 @@ +While modern explanation methods have been shown to be inconsistent and contradictory, the explainability of black-box models nevertheless remains desirable. When the role of explanations extends from understanding models to aiding decision making, the semantics of explanations is not always fully understood – to what extent do explanations ``explain” a decision and to what extent do they merely advocate for a decision? Can we help humans gain insights from explanations accompanying correct predictions and not over-rely on incorrect predictions advocated for by explanations? With this perspective in mind, we introduce the notion of dissenting explanations: conflicting predictions with accompanying explanations. We first explore the advantage of dissenting explanations in the setting of model multiplicity, where multiple models with similar performance may have different predictions. Through a human study on the task of identifying deceptive reviews, we demonstrate that dissenting explanations reduce overreliance on model predictions, without reducing overall accuracy. Motivated by the utility of dissenting explanations we present both global and local methods for their generation. \ No newline at end of file diff --git a/data/2024/aaai/DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition b/data/2024/aaai/DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition new file mode 100644 index 0000000000..45f44d270f --- /dev/null +++ b/data/2024/aaai/DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition @@ -0,0 +1 @@ +The utilization of multi-modal sensor data in visual place recognition (VPR) has demonstrated enhanced performance compared to single-modal counterparts. Nonetheless, integrating additional sensors comes with elevated costs and may not be feasible for systems that demand lightweight operation, thereby impacting the practical deployment of VPR. To address this issue, we resort to knowledge distillation, which empowers single-modal students to learn from cross-modal teachers without introducing additional sensors during inference. Despite the notable advancements achieved by current distillation approaches, the exploration of feature relationships remains an under-explored area. In order to tackle the challenge of cross-modal distillation in VPR, we present DistilVPR, a novel distillation pipeline for VPR. We propose leveraging feature relationships from multiple agents, including self-agents and cross-agents for teacher and student neural networks. Furthermore, we integrate various manifolds, characterized by different space curvatures for exploring feature relationships. This approach enhances the diversity of feature relationships, including Euclidean, spherical, and hyperbolic relationship modules, thereby enhancing the overall representational capacity. The experiments demonstrate that our proposed pipeline achieves state-of-the-art performance compared to other distillation baselines. We also conduct necessary ablation studies to show design effectiveness. 
The code is released at: https://github.com/sijieaaa/DistilVPR \ No newline at end of file diff --git a/data/2024/aaai/Distilling Autoregressive Models to Obtain High-Performance Non-autoregressive Solvers for Vehicle Routing Problems with Faster Inference Speed b/data/2024/aaai/Distilling Autoregressive Models to Obtain High-Performance Non-autoregressive Solvers for Vehicle Routing Problems with Faster Inference Speed new file mode 100644 index 0000000000..496f602a52 --- /dev/null +++ b/data/2024/aaai/Distilling Autoregressive Models to Obtain High-Performance Non-autoregressive Solvers for Vehicle Routing Problems with Faster Inference Speed @@ -0,0 +1 @@ +Neural construction models have shown promising performance for Vehicle Routing Problems (VRPs) by adopting either the Autoregressive (AR) or Non-Autoregressive (NAR) learning approach. While AR models produce high-quality solutions, they generally have a high inference latency due to their sequential generation nature. Conversely, NAR models generate solutions in parallel with a low inference latency but generally exhibit inferior performance. In this paper, we propose a generic Guided Non-Autoregressive Knowledge Distillation (GNARKD) method to obtain high-performance NAR models having a low inference latency. GNARKD removes the constraint of sequential generation in AR models while preserving the learned pivotal components in the network architecture to obtain the corresponding NAR models through knowledge distillation. We evaluate GNARKD by applying it to three widely adopted AR models to obtain NAR VRP solvers for both synthesized and real-world instances. The experimental results demonstrate that GNARKD significantly reduces the inference time (4-5 times faster) with an acceptable performance drop (2-3%). To the best of our knowledge, this study is the first of its kind to obtain NAR VRP solvers from AR ones through knowledge distillation. \ No newline at end of file diff --git a/data/2024/aaai/Distilling Reliable Knowledge for Instance-Dependent Partial Label Learning b/data/2024/aaai/Distilling Reliable Knowledge for Instance-Dependent Partial Label Learning new file mode 100644 index 0000000000..19c80b627e --- /dev/null +++ b/data/2024/aaai/Distilling Reliable Knowledge for Instance-Dependent Partial Label Learning @@ -0,0 +1 @@ +Partial label learning (PLL) refers to the classification task where each training instance is ambiguously annotated with a set of candidate labels. Despite substantial advancements in tackling this challenge, limited attention has been devoted to a more specific and realistic setting, denoted as instance-dependent partial label learning (IDPLL). Within this context, the assignment of partial labels depends on the distinct features of individual instances, rather than being random. In this paper, we initiate an exploration into a self-distillation framework for this problem, driven by the proven effectiveness and stability of this framework. Nonetheless, a crucial shortfall is identified: the foundational assumption central to IDPLL, involving what we term partial label knowledge, which stipulates that candidate labels should exhibit superior confidence compared to non-candidates, is not fully upheld within the distillation process. To address this challenge, we introduce DIRK, a novel distillation approach that leverages a rectification process to DIstill Reliable Knowledge, while concurrently preserving informative fine-grained label confidence. 
In addition, to harness the rectified confidence to its fullest potential, we propose a knowledge-based representation refinement module, seamlessly integrated into the DIRK framework. This module effectively transmits the essence of similarity knowledge from the label space to the feature space, thereby amplifying representation learning and subsequently engendering marked improvements in model performance. Experiments and analysis on multiple datasets validate the rationality and superiority of our proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Distributed Manifold Hashing for Image Set Classification and Retrieval b/data/2024/aaai/Distributed Manifold Hashing for Image Set Classification and Retrieval new file mode 100644 index 0000000000..f87cebcf69 --- /dev/null +++ b/data/2024/aaai/Distributed Manifold Hashing for Image Set Classification and Retrieval @@ -0,0 +1 @@ +Conventional image set methods typically learn from image sets stored in one location. However, in real-world applications, image sets are often distributed or collected across different positions. Learning from such distributed image sets presents a challenge that has not been studied thus far. Moreover, efficiency is seldom addressed in large-scale image set applications. To fulfill these gaps, this paper proposes Distributed Manifold Hashing (DMH), which models distributed image sets as a connected graph. DMH employs Riemannian manifold to effectively represent each image set and further suggests learning hash code for each image set to achieve efficient computation and storage. DMH is formally formulated as a distributed learning problem with local consistency constraint on global variables among neighbor nodes, and can be optimized in parallel. Extensive experiments on three benchmark datasets demonstrate that DMH achieves highly competitive accuracies in a distributed setting and provides faster classification and retrieval than state-of-the-arts. \ No newline at end of file diff --git a/data/2024/aaai/Distribution Matching for Multi-Task Learning of Classification Tasks: A Large-Scale Study on Faces & Beyond b/data/2024/aaai/Distribution Matching for Multi-Task Learning of Classification Tasks: A Large-Scale Study on Faces & Beyond new file mode 100644 index 0000000000..a87eeecb7d --- /dev/null +++ b/data/2024/aaai/Distribution Matching for Multi-Task Learning of Classification Tasks: A Large-Scale Study on Faces & Beyond @@ -0,0 +1 @@ +Multi-Task Learning (MTL) is a framework, where multiple related tasks are learned jointly and benefit from a shared representation space, or parameter transfer. To provide sufficient learning support, modern MTL uses annotated data with full, or sufficiently large overlap across tasks, i.e., each input sample is annotated for all, or most of the tasks. However, collecting such annotations is prohibitive in many real applications, and cannot benefit from datasets available for individual tasks. In this work, we challenge this setup and show that MTL can be successful with classification tasks with little, or non-overlapping annotations, or when there is big discrepancy in the size of labeled data per task. We explore task-relatedness for co-annotation and co-training, and propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching. 
To demonstrate the general applicability of our method, we conducted diverse case studies in the domains of affective computing, face recognition, species recognition, and shopping item classification using nine datasets. Our large-scale study of affective tasks for basic expression recognition and facial action unit detection illustrates that our approach is network agnostic and brings large performance improvements compared to the state-of-the-art in both tasks and across all studied databases. In all case studies, we show that co-training via task-relatedness is advantageous and prevents negative transfer (which occurs when MT model's performance is worse than that of at least one single-task model). \ No newline at end of file diff --git a/data/2024/aaai/Distribution-Conditioned Adversarial Variational Autoencoder for Valid Instrumental Variable Generation b/data/2024/aaai/Distribution-Conditioned Adversarial Variational Autoencoder for Valid Instrumental Variable Generation new file mode 100644 index 0000000000..8d00a718ab --- /dev/null +++ b/data/2024/aaai/Distribution-Conditioned Adversarial Variational Autoencoder for Valid Instrumental Variable Generation @@ -0,0 +1 @@ +Instrumental variables (IVs), widely applied in economics and healthcare, enable consistent counterfactual prediction in the presence of hidden confounding factors, effectively addressing endogeneity issues. The prevailing IV-based counterfactual prediction methods typically rely on the availability of valid IVs (satisfying Relevance, Exclusivity, and Exogeneity), a requirement which often proves elusive in real-world scenarios. Various data-driven techniques are being developed to create valid IVs (or representations of IVs) from a pool of IV candidates. However, most of these techniques still necessitate the inclusion of valid IVs within the set of candidates. This paper proposes a distribution-conditioned adversarial variational autoencoder to tackle this challenge. Specifically: 1) for Relevance and Exclusivity, we deduce the corresponding evidence lower bound following the Bayesian network structure and build the variational autoencoder; accordingly, 2) for Exogeneity , we design an adversarial game to encourage latent factors originating from the marginal distribution, compelling the independence between IVs and other outcome-related factors. Extensive experimental results validate the effectiveness, stability and generality of our proposed model in generating valid IV factors in the absence of valid IV candidates. \ No newline at end of file diff --git a/data/2024/aaai/Distributional Off-Policy Evaluation for Slate Recommendations b/data/2024/aaai/Distributional Off-Policy Evaluation for Slate Recommendations new file mode 100644 index 0000000000..c5044a1be2 --- /dev/null +++ b/data/2024/aaai/Distributional Off-Policy Evaluation for Slate Recommendations @@ -0,0 +1 @@ +Recommendation strategies are typically evaluated by using previously logged data, employing off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but the estimation of the entire performance distribution remains elusive. 
Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along the axes of risk and fairness that employ metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures. \ No newline at end of file diff --git a/data/2024/aaai/Divergence-Guided Simultaneous Speech Translation b/data/2024/aaai/Divergence-Guided Simultaneous Speech Translation new file mode 100644 index 0000000000..e9dc53bc4c --- /dev/null +++ b/data/2024/aaai/Divergence-Guided Simultaneous Speech Translation @@ -0,0 +1 @@ +To achieve high-quality translation with low latency, a Simultaneous Speech Translation (SimulST) system relies on a policy module to decide whether to translate immediately or wait for additional streaming input, along with a translation model capable of effectively handling partial speech input. Prior research has tackled these components separately, either using ``wait-k'' policies based on fixed-length segments or detected word boundaries, or dynamic policies based on different strategies (e.g., meaningful units), while employing offline models for prefix-to-prefix translation. In this paper, we propose Divergence-Guided Simultaneous Speech Translation (DiG-SST), a tightly integrated approach focusing on both translation quality and latency for streaming input. Specifically, we introduce a simple yet effective prefix-based strategy for training translation models with partial speech input, and develop an adaptive policy that makes read/write decisions for the translation model based on the expected divergence in translation distributions resulting from future input. Our experiments on multiple translation directions of the MuST-C benchmark demonstrate that our approach achieves a better trade-off between translation quality and latency compared to existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Diverse Person: Customize Your Own Dataset for Text-Based Person Search b/data/2024/aaai/Diverse Person: Customize Your Own Dataset for Text-Based Person Search new file mode 100644 index 0000000000..0d86a99e3d --- /dev/null +++ b/data/2024/aaai/Diverse Person: Customize Your Own Dataset for Text-Based Person Search @@ -0,0 +1 @@ +Text-based person search is a challenging task aimed at locating specific target pedestrians through text descriptions. Recent advancements have been made in this field, but there remains a deficiency in datasets tailored for text-based person search. The creation of new, real-world datasets is hindered by concerns such as the risk of pedestrian privacy leakage and the substantial costs of annotation. In this paper, we introduce a framework, named Diverse Person (DP), to achieve efficient and high-quality text-based person search data generation without involving privacy concerns. 
Specifically, we propose to leverage available images of clothing and accessories as reference attribute images to edit the original dataset images through diffusion models. Additionally, we employ a Large Language Model (LLM) to produce annotations that are both high in quality and stylistically consistent with those found in real-world datasets. Extensive experimental results demonstrate that the baseline models trained with our DP can achieve new state-of-the-art results on three public datasets, with performance improvements up to 4.82%, 2.15%, and 2.28% on CUHK-PEDES, ICFG-PEDES, and RSTPReid in terms of Rank-1 accuracy, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Diverse Yet Biased: Towards Mitigating Biases in Generative AI (Student Abstract) b/data/2024/aaai/Diverse Yet Biased: Towards Mitigating Biases in Generative AI (Student Abstract) new file mode 100644 index 0000000000..f5510e2f5b --- /dev/null +++ b/data/2024/aaai/Diverse Yet Biased: Towards Mitigating Biases in Generative AI (Student Abstract) @@ -0,0 +1 @@ +Generative Artificial Intelligence (AI) has garnered significant attention for its remarkable ability to generate text, images, and other forms of content. However, an inherent and increasingly concerning issue within generative AI systems is bias. These AI models often exhibit an Anglo-centric bias and tend to overlook the importance of diversity. This can be attributed to their training on extensive datasets sourced from the internet, which inevitably inherit the biases present in those data sources. Employing these datasets leads to AI-generated content that mirrors and perpetuates existing biases, encompassing various aspects such as gender, ethnic and cultural stereotypes. Addressing bias in generative AI is a complex challenge that necessitates substantial efforts. In order to tackle this issue, we propose a methodology for constructing moderately sized datasets with a social inclination. These datasets can be employed to rectify existing imbalances in datasets or to train models to generate socially inclusive material. Additionally, we present preliminary findings derived from training our model on these socially inclined datasets. \ No newline at end of file diff --git a/data/2024/aaai/Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation b/data/2024/aaai/Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation new file mode 100644 index 0000000000..69892d99fe --- /dev/null +++ b/data/2024/aaai/Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation @@ -0,0 +1 @@ +We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. 
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/. \ No newline at end of file diff --git a/data/2024/aaai/Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration b/data/2024/aaai/Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration new file mode 100644 index 0000000000..f73d824ccf --- /dev/null +++ b/data/2024/aaai/Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration @@ -0,0 +1 @@ +In recent years, following the success of text guided image generation, text guided 3D generation has gained increasing attention among researchers. Dreamfusion is a notable approach that enhances generation quality by utilizing 2D text guided diffusion models and introducing SDS loss, a technique for distilling 2D diffusion model information to train 3D models. However, the SDS loss has two major limitations that hinder its effectiveness. Firstly, when given a text prompt, the SDS loss struggles to produce diverse content. Secondly, during training, SDS loss may cause the generated content to overfit and collapse, limiting the model's ability to learn intricate texture details. To overcome these challenges, we propose a novel approach called Noise Recalibration algorithm. By incorporating this technique, we can generate 3D content with significantly greater diversity and stunning details. Our approach offers a promising solution to the limitations of SDS loss. \ No newline at end of file diff --git a/data/2024/aaai/Diversity-Authenticity Co-constrained Stylization for Federated Domain Generalization in Person Re-identification b/data/2024/aaai/Diversity-Authenticity Co-constrained Stylization for Federated Domain Generalization in Person Re-identification new file mode 100644 index 0000000000..5408d81008 --- /dev/null +++ b/data/2024/aaai/Diversity-Authenticity Co-constrained Stylization for Federated Domain Generalization in Person Re-identification @@ -0,0 +1 @@ +This paper tackles the problem of federated domain generalization in person re-identification (FedDG re-ID), aiming to learn a model generalizable to unseen domains with decentralized source domains. Previous methods mainly focus on preventing local overfitting. However, the direction of diversifying local data through stylization for model training is largely overlooked. This direction is popular in domain generalization but will encounter two issues under federated scenario: (1) Most stylization methods require the centralization of multiple domains to generate novel styles but this is not applicable under decentralized constraint. (2) The authenticity of generated data cannot be ensured especially given limited local data, which may impair the model optimization. 
To solve these two problems, we propose the Diversity-Authenticity Co-constrained Stylization (DACS), which can generate diverse and authentic data for learning robust local model. Specifically, we deploy a style transformation model on each domain to generate novel data with two constraints: (1) A diversity constraint is designed to increase data diversity, which enlarges the Wasserstein distance between the original and transformed data; (2) An authenticity constraint is proposed to ensure data authenticity, which enforces the transformed data to be easily/hardly recognized by the local-side global/local model. Extensive experiments demonstrate the effectiveness of the proposed DACS and show that DACS achieves state-of-the-art performance for FedDG re-ID. \ No newline at end of file diff --git a/data/2024/aaai/Divide and Conquer: Hybrid Pre-training for Person Search b/data/2024/aaai/Divide and Conquer: Hybrid Pre-training for Person Search new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Divide-and-Aggregate Learning for Evaluating Performance on Unlabeled Data b/data/2024/aaai/Divide-and-Aggregate Learning for Evaluating Performance on Unlabeled Data new file mode 100644 index 0000000000..8482afa0a5 --- /dev/null +++ b/data/2024/aaai/Divide-and-Aggregate Learning for Evaluating Performance on Unlabeled Data @@ -0,0 +1 @@ +Artificial Intelligence (AI) models have become an integral part of modern society, significantly improving human lives. However, ensuring the reliability and safety of these models is of paramount importance. One critical aspect is the continuous monitoring and verification of model performance to prevent any potential risks. Real-time online evaluation of AI models is necessary to maintain their effectiveness and mitigate any harm caused by performance degradation. The traditional approach to model evaluation involves supervised methods that rely on manual labeling to compare results with model predictions. Unfortunately, this method is not suitable for online model monitoring due to its inherent lag and high cost. While there have been attempts to explore free-label model evaluation, these approaches often consider only the global features of the entire dataset. Additionally, they can only perform model evaluation based on a single dimension of model confidence or features. In this paper, we propose a novel approach called Divide-and-Aggregate Learning (DAL) for unsupervised model evaluation. Our method addresses the limitations of previous approaches by dividing the output of the model into buckets, capturing local information of the distribution. We then aggregate this local information to obtain global information and further represent the relationship between the distribution and model performance. Importantly, our method can simultaneously handle the confidence distribution and feature distribution of the model output. Extensive experiments have been conducted to demonstrate the effectiveness of our DAL model. The results show that our approach outperforms previous methods on four widely used datasets. We will make our source code publicly available. 
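To make the divide-and-aggregate idea in the abstract above concrete, a rough sketch is given below. It is not the authors' DAL implementation; the use of max-softmax confidence, the bucket count, and the linear regressor are all assumptions for illustration, and the actual method also handles feature distributions alongside confidences.

# Illustrative divide-and-aggregate sketch: bucket confidences (local info),
# aggregate the normalized histogram (global info), and regress to accuracy.
import numpy as np
from sklearn.linear_model import LinearRegression

def divide_and_aggregate(probs, n_buckets=10):
    # probs: (N, C) softmax outputs on an unlabeled set -> (n_buckets,) summary vector
    confidences = probs.max(axis=1)
    hist, _ = np.histogram(confidences, bins=n_buckets, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def fit_evaluator(meta_sets):
    # meta_sets: list of (softmax outputs on a labeled set, measured accuracy) pairs
    X = np.stack([divide_and_aggregate(p) for p, _ in meta_sets])
    y = np.array([acc for _, acc in meta_sets])
    return LinearRegression().fit(X, y)

def estimate_accuracy(evaluator, probs_unlabeled):
    return float(evaluator.predict(divide_and_aggregate(probs_unlabeled)[None, :])[0])

The bucketing step captures local structure of the output distribution, and the learned regressor plays the role of the relationship between that distribution and model performance described in the abstract.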
\ No newline at end of file diff --git a/data/2024/aaai/DocFormerv2: Local Features for Document Understanding b/data/2024/aaai/DocFormerv2: Local Features for Document Understanding new file mode 100644 index 0000000000..80ef3bdd0b --- /dev/null +++ b/data/2024/aaai/DocFormerv2: Local Features for Document Understanding @@ -0,0 +1 @@ +We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions), e.g., extracting information from a form, VQA for documents, and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer which takes vision, language and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically, i.e., two novel document tasks on the encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed to ensure that the pre-training encourages local-feature alignment between multiple modalities. When evaluated on nine challenging datasets, DocFormerv2 shows state-of-the-art performance over strong baselines on all of them, e.g., TabFact (+4.3%), InfoVQA (+1.4%), FUNSD (+1.0%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene-text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLI and Flamingo) on these tasks. Extensive ablations show that due to its novel pre-training tasks, DocFormerv2 understands multiple modalities better than prior art in VDU. \ No newline at end of file diff --git a/data/2024/aaai/DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding b/data/2024/aaai/DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding new file mode 100644 index 0000000000..c9f0348aa7 --- /dev/null +++ b/data/2024/aaai/DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding @@ -0,0 +1,10 @@ +Multimodal Sarcasm Understanding (MSU) has a wide range of applications in the news field, such as public opinion analysis and forgery detection. +However, existing MSU benchmarks and approaches usually focus on sentence-level MSU. +In document-level news, sarcasm clues are sparse or small and are often concealed in long text. +Moreover, compared to sentence-level comments like tweets, which mainly focus on only a few trends or hot topics (e.g., sports events), content in the news is considerably diverse. +Models created for sentence-level MSU may fail to capture sarcasm clues in document-level news. +To fill this gap, we present a comprehensive benchmark for Document-level Multimodal Sarcasm Understanding (DocMSU). +Our dataset contains 102,588 pieces of news with text-image pairs, covering 9 diverse topics such as health, business, etc. +The proposed large-scale and diverse DocMSU significantly facilitates the research of document-level MSU in real-world scenarios. +To take on the new challenges posed by DocMSU, we introduce a fine-grained sarcasm comprehension method to properly align the pixel-level image features with word-level textual features in documents. +Experiments demonstrate the effectiveness of our method, showing that it can serve as a baseline approach to the challenging DocMSU. 
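The fine-grained alignment step mentioned in the DocMSU abstract above can be pictured as word-to-pixel cross-attention. The sketch below is a generic illustration of that idea rather than the authors' architecture; the feature dimension, the number of heads, and the single attention layer are assumptions.

# Generic word-to-pixel cross-attention sketch in PyTorch (dimensions are assumptions).
import torch
import torch.nn as nn

class WordPixelAligner(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, word_feats, pixel_feats):
        # word_feats:  (B, T, dim)    word-level textual features
        # pixel_feats: (B, H*W, dim)  flattened pixel-level image features
        aligned, weights = self.attn(query=word_feats, key=pixel_feats, value=pixel_feats)
        return aligned, weights      # each word attends to the image regions it aligns with

aligner = WordPixelAligner()
aligned, weights = aligner(torch.randn(2, 50, 256), torch.randn(2, 196, 256))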
\ No newline at end of file diff --git a/data/2024/aaai/DocNLC: A Document Image Enhancement Framework with Normalized and Latent Contrastive Representation for Multiple Degradations b/data/2024/aaai/DocNLC: A Document Image Enhancement Framework with Normalized and Latent Contrastive Representation for Multiple Degradations new file mode 100644 index 0000000000..430a287baf --- /dev/null +++ b/data/2024/aaai/DocNLC: A Document Image Enhancement Framework with Normalized and Latent Contrastive Representation for Multiple Degradations @@ -0,0 +1,2 @@ +Document Image Enhancement (DIE) remains challenging due to the prevalence of multiple degradations in document images captured by cameras. In this paper, we address an interesting question: can the performance of pre-trained models and downstream DIE models be improved if they are bootstrapped using different degradation types of the same semantic samples and their high-dimensional features with ambiguous inter-class distance? To this end, we propose an effective contrastive learning paradigm for DIE — a Document image enhancement framework with Normalization and Latent Contrast (DocNLC). While existing DIE methods focus on eliminating one type of degradation, DocNLC considers the relationship between different types of degradation while utilizing both direct and latent contrasts to constrain content consistency, thus achieving a unified treatment of multiple types of degradation. Specifically, we devise a latent contrastive learning module to enforce explicit decorrelation of the normalized representations of different degradation types and to minimize the redundancy between them. Comprehensive experiments show that our method outperforms state-of-the-art DIE models in both pre-training and fine-tuning stages +on four publicly available independent datasets. In addition, we discuss the potential benefits of DocNLC for downstream tasks. Our code is released at https://github.com/RylonW/DocNLC \ No newline at end of file diff --git a/data/2024/aaai/Does Any AI-Based Activity Contribute to Develop AI Conception? A Case Study with Italian Fifth and Sixth Grade Classes b/data/2024/aaai/Does Any AI-Based Activity Contribute to Develop AI Conception? A Case Study with Italian Fifth and Sixth Grade Classes new file mode 100644 index 0000000000..56090f9af5 --- /dev/null +++ b/data/2024/aaai/Does Any AI-Based Activity Contribute to Develop AI Conception? A Case Study with Italian Fifth and Sixth Grade Classes @@ -0,0 +1,13 @@ +Artificial Intelligence is undoubtedly becoming pervasive in everyone's everyday life. +In this setting, developing a correct conception of AI from childhood is not only a need to +be addressed in educational curricula, but also a children's right. + +Accordingly, several initiatives at national and international levels aim at promoting AI +and emerging technology literacy, supported also by a proliferation in the literature +of learning courses covering a variety of topics, learning objectives and targeted ages. +Schools are therefore pushed to introduce innovative activities for children in their +curricula. + +In this paper, we report the results of a case study where we tested the contribution +of an AI block-based course to developing computational thinking and an understanding of human +and AI minds in fifth and sixth grade children. \ No newline at end of file diff --git a/data/2024/aaai/Does Few-Shot Learning Suffer from Backdoor Attacks? b/data/2024/aaai/Does Few-Shot Learning Suffer from Backdoor Attacks? 
new file mode 100644 index 0000000000..adebc0cf1e --- /dev/null +++ b/data/2024/aaai/Does Few-Shot Learning Suffer from Backdoor Attacks? @@ -0,0 +1 @@ +The field of few-shot learning (FSL) has shown promising results in scenarios where training data is limited, but its vulnerability to backdoor attacks remains largely unexplored. We explore this topic by first evaluating the performance of existing backdoor attack methods in few-shot learning scenarios. Unlike in standard supervised learning, existing backdoor attack methods fail to perform an effective attack in FSL due to two main issues. Firstly, the model tends to overfit to either benign features or trigger features, causing a tough trade-off between attack success rate and benign accuracy. Secondly, due to the small number of training samples, the dirty label or visible trigger in the support set can be easily detected by victims, which reduces the stealthiness of attacks. It might seem that FSL can survive backdoor attacks. However, in this paper, we propose the Few-shot Learning Backdoor Attack (FLBA) to show that FSL can still be vulnerable to backdoor attacks. Specifically, we first generate a trigger to maximize the gap between poisoned and benign features. It enables the model to learn both benign and trigger features, which solves the problem of overfitting. To make it more stealthy, we hide the trigger by optimizing two types of imperceptible perturbation, namely attractive and repulsive perturbation, instead of attaching the trigger directly. Once we obtain the perturbations, we can poison all samples in the benign support set into a hidden poisoned support set and fine-tune the model on it. Our method demonstrates a high Attack Success Rate (ASR) in FSL tasks with different few-shot learning paradigms while preserving clean accuracy and maintaining stealthiness. This study reveals that few-shot learning still suffers from backdoor attacks, and its security should be given attention. \ No newline at end of file diff --git a/data/2024/aaai/Does Robin Hood Use a Lightsaber?: Automated Planning for Storytelling b/data/2024/aaai/Does Robin Hood Use a Lightsaber?: Automated Planning for Storytelling new file mode 100644 index 0000000000..db76a6e7b5 --- /dev/null +++ b/data/2024/aaai/Does Robin Hood Use a Lightsaber?: Automated Planning for Storytelling @@ -0,0 +1 @@ +Humans have been using stories to entertain, educate, and persuade audiences for centuries. The advent of modern AI tools in the form of Large Language Models (LLMs) such as ChatGPT continues to fulfill this purpose. However, while recent work has shown that LLMs can successfully be used for narrative generation, they lack coherence and can be prone to repetition and stilted language. Automated Planning can therefore be combined with Natural Language text generation to create narratives (stories) that are logical, coherent, and believable. A planning model provides scaffolding to an LLM so that the LLM's language generation is context-dependent, in order to allow users to create more coherent, logical, and believable stories in a variety of domains. 
\ No newline at end of file diff --git a/data/2024/aaai/Domain Engineering to Represent Human Behavior Using Multi-Agent Planning and Inductive Methodologies b/data/2024/aaai/Domain Engineering to Represent Human Behavior Using Multi-Agent Planning and Inductive Methodologies new file mode 100644 index 0000000000..b7f4e9912e --- /dev/null +++ b/data/2024/aaai/Domain Engineering to Represent Human Behavior Using Multi-Agent Planning and Inductive Methodologies @@ -0,0 +1 @@ +This research combines multi agent planning, the psycholinguistics of question asking, procedural grounded theory, and hierarchical task networks to represent domains for automated planning. \ No newline at end of file diff --git a/data/2024/aaai/Domain Generalizable Person Search Using Unreal Dataset b/data/2024/aaai/Domain Generalizable Person Search Using Unreal Dataset new file mode 100644 index 0000000000..2ab3c9d28b --- /dev/null +++ b/data/2024/aaai/Domain Generalizable Person Search Using Unreal Dataset @@ -0,0 +1,6 @@ +Collecting and labeling real datasets to train the person search networks not only requires a lot of time and effort, but also accompanies privacy issues. +The weakly-supervised and unsupervised domain adaptation methods have been proposed to alleviate the labeling burden for target datasets, however, their generalization capability is limited. +We introduce a novel person search method based on the domain generalization framework, that uses an automatically labeled unreal dataset only for training but is applicable to arbitrary unseen real datasets. +To alleviate the domain gaps when transferring the knowledge from the unreal source dataset to the real target datasets, we estimate the fidelity of person instances which is then used to train the end-to-end network adaptively. +Moreover, we devise a domain-invariant feature learning scheme to encourage the network to suppress the domain-related features. +Experimental results demonstrate that the proposed method provides the competitive performance to existing person search methods even though it is applicable to arbitrary unseen datasets without any prior knowledge and re-training burdens. \ No newline at end of file diff --git a/data/2024/aaai/Domain Generalization with Vital Phase Augmentation b/data/2024/aaai/Domain Generalization with Vital Phase Augmentation new file mode 100644 index 0000000000..38843ff516 --- /dev/null +++ b/data/2024/aaai/Domain Generalization with Vital Phase Augmentation @@ -0,0 +1 @@ +Deep neural networks have shown remarkable performance in image classification. However, their performance significantly deteriorates with corrupted input data. Domain generalization methods have been proposed to train robust models against out-of-distribution data. Data augmentation in the frequency domain is one of such approaches that enable a model to learn phase features to establish domain-invariant representations. This approach changes the amplitudes of the input data while preserving the phases. However, using fixed phases leads to susceptibility to phase fluctuations because amplitudes and phase fluctuations commonly occur in out-of-distribution. In this study, to address this problem, we introduce an approach using finite variation of the phases of input data rather than maintaining fixed phases. Based on the assumption that the degree of domain-invariant features varies for each phase, we propose a method to distinguish phases based on this degree. 
In addition, we propose a method called vital phase augmentation (VIPAug) that applies the variation to the phases differently according to the degree of domain-invariant features of the given phases. The model depends more on the vital phases that contain more domain-invariant features to attain robustness to amplitude and phase fluctuations. We present experimental evaluations of our proposed approach, which exhibited improved performance for both clean and corrupted data. VIPAug achieved SOTA performance on the benchmark CIFAR-10 and CIFAR-100 datasets, as well as near-SOTA performance on the ImageNet-100 and ImageNet datasets. Our code is available at https://github.com/excitedkid/vipaug. \ No newline at end of file diff --git a/data/2024/aaai/Domain Invariant Learning for Gaussian Processes and Bayesian Exploration b/data/2024/aaai/Domain Invariant Learning for Gaussian Processes and Bayesian Exploration new file mode 100644 index 0000000000..0469d6fde5 --- /dev/null +++ b/data/2024/aaai/Domain Invariant Learning for Gaussian Processes and Bayesian Exploration @@ -0,0 +1 @@ +Out-of-distribution (OOD) generalization has long been a challenging problem that remains largely unsolved. Gaussian processes (GP), as popular probabilistic model classes, especially in the small-data regime, are presumed to have strong OOD generalization abilities. Surprisingly, their OOD generalization abilities have been under-explored compared with other lines of GP research. In this paper, we identify that GP is not free from the problem and propose a domain invariant learning algorithm for Gaussian processes (DIL-GP) with a min-max optimization on the likelihood. DIL-GP discovers the heterogeneity in the data and forces invariance across partitioned subsets of data. We further extend DIL-GP to improve Bayesian optimization's adaptability to changing environments. Numerical experiments demonstrate the superiority of DIL-GP for predictions on several synthetic and real-world datasets. We further demonstrate the effectiveness of the DIL-GP Bayesian optimization method on a PID parameter tuning experiment for a quadrotor. The full version and source code are available at: https://github.com/Billzxl/DIL-GP. \ No newline at end of file diff --git a/data/2024/aaai/Domain-Controlled Prompt Learning b/data/2024/aaai/Domain-Controlled Prompt Learning new file mode 100644 index 0000000000..90d46a7a38 --- /dev/null +++ b/data/2024/aaai/Domain-Controlled Prompt Learning @@ -0,0 +1 @@ +Large pre-trained vision-language models, such as CLIP, have shown remarkable generalization capabilities across various tasks when appropriate text prompts are provided. However, adapting these models to specific domains, like remote sensing images (RSIs), medical images, etc., remains unexplored and challenging. Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms, leading to suboptimal performance due to the misinterpretation of specific images in natural image patterns. To tackle this dilemma, we propose Domain-Controlled Prompt Learning for these specific domains. Specifically, a large-scale specific domain foundation model (LSDM) is first introduced to provide essential specific-domain knowledge. Using lightweight neural networks, we transfer this knowledge into domain biases, which control both the visual and language branches to obtain domain-adaptive prompts through direct incorporation.
Simultaneously, to overcome the existing overfitting challenge, we propose a novel noise-adding strategy, without extra trainable parameters, to help the model escape suboptimal solutions through global domain oscillation. Experimental results show our method achieves state-of-the-art performance on specific-domain image recognition datasets. Our code is available at https://github.com/caoql98/DCPL. \ No newline at end of file diff --git a/data/2024/aaai/Domain-Hallucinated Updating for Multi-Domain Face Anti-spoofing b/data/2024/aaai/Domain-Hallucinated Updating for Multi-Domain Face Anti-spoofing new file mode 100644 index 0000000000..2c7d75f93e --- /dev/null +++ b/data/2024/aaai/Domain-Hallucinated Updating for Multi-Domain Face Anti-spoofing @@ -0,0 +1,10 @@ +Multi-Domain Face Anti-Spoofing (MD-FAS) is a practical setting that aims to update models on new domains using only novel data while ensuring that the knowledge acquired from previous domains is not forgotten. +Prior methods utilize the responses from models to represent the previous domain knowledge or map the different domains into separated feature spaces to prevent forgetting. +However, due to domain gaps, the responses of new data are not as accurate as those of previous data. +Also, without the supervision of previous data, separated feature spaces might be destroyed by new domains while updating, leading to catastrophic forgetting. +Inspired by the challenges posed by the lack of previous data, we solve this issue from a new standpoint that generates hallucinated previous data for updating the FAS model. +To this end, we propose a novel Domain-Hallucinated Updating (DHU) framework to facilitate the hallucination of data. +Specifically, a Domain Information Explorer learns representative domain information of the previous domains. +Then, a Domain Information Hallucination module transfers the new domain data to pseudo-previous domain ones. +Moreover, a Hallucinated Features Joint Learning module is proposed to asymmetrically align the new and pseudo-previous data for real samples via dual levels to learn more generalized features, promoting the results on all domains. +Our experimental results and visualizations demonstrate that the proposed method outperforms state-of-the-art competitors in terms of effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Double Auction on Diffusion Network b/data/2024/aaai/Double Auction on Diffusion Network new file mode 100644 index 0000000000..0c733c9e57 --- /dev/null +++ b/data/2024/aaai/Double Auction on Diffusion Network @@ -0,0 +1 @@ +Mechanism design on social networks has attracted extensive attention recently. The goal is to design mechanisms to incentivize participants to invite more participants via their social networks, and the challenge is that the participants are competitors. Various mechanisms have been proposed for single-/multiple-unit auctions, but it has been shown that it is challenging to design such mechanisms for more complex settings. We move this forward to investigate a double auction on a network where each trader (a buyer or a seller) can link to other buyers and sellers. Incentivizing invitation is more difficult than in multi-unit one-sided auctions, because there are two different roles: a buyer (seller) may seem happy to invite a seller (buyer), but the invited seller (buyer) may in turn invite another buyer (seller) to compete with the original buyer (seller).
To combat this, we propose a solution called dynamic trade reduction (DTR), which also guarantees a non-negative revenue for the market owner. Interestingly, our solution is also applicable to the multi-unit one-sided auction when there is only one seller, who links only to buyers on the network. We believe that the principle of our solution has the potential to be extended to design the multi-item one-sided auction. \ No newline at end of file diff --git a/data/2024/aaai/Double Buffers CEM-TD3: More Efficient Evolution and Richer Exploration b/data/2024/aaai/Double Buffers CEM-TD3: More Efficient Evolution and Richer Exploration new file mode 100644 index 0000000000..caeb77c99d --- /dev/null +++ b/data/2024/aaai/Double Buffers CEM-TD3: More Efficient Evolution and Richer Exploration @@ -0,0 +1 @@ +CEM-TD3 is a combination scheme using the simple cross-entropy method (CEM) and Twin Delayed Deep Deterministic policy gradient (TD3), and it achieves a satisfactory trade-off between performance and sample efficiency. However, we find that CEM-TD3 cannot fully address the low efficiency of policy search caused by CEM, and the policy gradient learning introduced by TD3 will weaken the diversity of individuals in the population. In this paper, we propose Double Buffers CEM-TD3 (DBCEM-TD3) that optimizes both CEM and TD3. For CEM, DBCEM-TD3 maintains an actor buffer to store the population required for evolution. In each iteration, it only needs to generate a small number of actors to replace the poor actors in the policy buffer to achieve more efficient evolution. The fitness of individuals in the actor buffer decreases exponentially with time, which can avoid premature convergence of the mean actor. For TD3, DBCEM-TD3 maintains a critic buffer with the same number of critics as the number of actors generated in each iteration, and each critic is trained independently by sampling from the shared replay buffer. In each iteration, each newly generated actor uses different critics to guide learning. This ensures more diverse behaviors among the learned actors, enabling richer experiences to be collected during the evaluation phase. We conduct experimental evaluations on five continuous control tasks provided by OpenAI Gym. DBCEM-TD3 outperforms CEM-TD3, TD3, and other classic off-policy reinforcement learning algorithms in terms of performance and sample efficiency. \ No newline at end of file diff --git a/data/2024/aaai/Double-Bounded Optimal Transport for Advanced Clustering and Classification b/data/2024/aaai/Double-Bounded Optimal Transport for Advanced Clustering and Classification new file mode 100644 index 0000000000..626dcd6632 --- /dev/null +++ b/data/2024/aaai/Double-Bounded Optimal Transport for Advanced Clustering and Classification @@ -0,0 +1 @@ +Optimal transport (OT) is attracting increasing attention in machine learning. It aims to transport a source distribution to a target one at minimal cost. In its vanilla form, the source and target distributions are predetermined, which contrasts with the real-world case involving undetermined targets. In this paper, we propose Doubly Bounded Optimal Transport (DB-OT), which assumes that the target distribution is restricted within two boundaries instead of a fixed one, thus giving more freedom for the transport to find solutions. Based on the entropic regularization of DB-OT, three scaling-based algorithms are devised for calculating the optimal solution.
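The DB-OT abstract above keeps the target marginal between two bounds and solves the entropically regularized problem with scaling iterations. Below is a generic Sinkhorn-style sketch of that idea, assuming a clipped column scaling as the way to respect the bounds; it illustrates the principle and is not necessarily one of the paper's three algorithms.

import numpy as np

def db_ot_sinkhorn(C, r, lo, hi, eps=0.05, iters=500):
    """Entropic OT where column (target) masses only need to lie in [lo, hi].

    Rows are matched to the source distribution r exactly; each column scaling is
    clipped so its mass stays inside the two bounds (an inactive bound corresponds
    to a unit scaling, as in standard Sinkhorn with an equality constraint removed).
    """
    K = np.exp(-C / eps)
    a = np.ones(C.shape[0])
    b = np.ones(C.shape[1])
    for _ in range(iters):
        a = r / (K @ b)                      # enforce exact row marginals
        s = K.T @ a                          # current (unscaled) column masses
        b = np.clip(1.0, lo / s, hi / s)     # keep column masses within [lo, hi]
    return a[:, None] * K * b[None, :]

rng = np.random.default_rng(0)
C = rng.random((4, 5))
r = np.full(4, 0.25)
P = db_ot_sinkhorn(C, r, lo=np.full(5, 0.1), hi=np.full(5, 0.3))
print(P.sum(axis=0))   # column masses lie within [0.1, 0.3]
print(P.sum(axis=1))   # row masses approach r as the iterations converge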
We also show that our DB-OT is helpful for barycenter-based clustering, which can avoid the excessive concentration of samples in a single cluster. Then we further develop DB-OT techniques for long-tailed classification, which is an emerging and open problem. We first propose a connection between OT and classification: in the classification task, training involves optimizing the Inverse OT to learn the representations, while testing involves optimizing the OT for predictions. With this OT perspective, we apply DB-OT to improve the loss, and the Balanced Softmax is shown as a special case. Then we apply DB-OT for inference in the testing process. Even with vanilla Softmax-trained features, our experiments show that our method can achieve good results with our improved inference scheme in the testing stage. \ No newline at end of file diff --git a/data/2024/aaai/Double-Descent Curves in Neural Networks: A New Perspective Using Gaussian Processes b/data/2024/aaai/Double-Descent Curves in Neural Networks: A New Perspective Using Gaussian Processes new file mode 100644 index 0000000000..f3f9a13912 --- /dev/null +++ b/data/2024/aaai/Double-Descent Curves in Neural Networks: A New Perspective Using Gaussian Processes @@ -0,0 +1 @@ +Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters which is less than the number of data points, but then descends again in the overparameterized regime. In this paper, we use techniques from random matrix theory to characterize the spectral distribution of the empirical feature covariance matrix as a width-dependent perturbation of the spectrum of the neural network Gaussian process (NNGP) kernel, thus establishing a novel connection between the NNGP literature and the random matrix theory literature in the context of neural networks. Our analytical expressions allow us to explore the generalisation behavior of the corresponding kernel and GP regression. Furthermore, they offer a new interpretation of double-descent in terms of the discrepancy between the width-dependent empirical kernel and the width-independent NNGP kernel. \ No newline at end of file diff --git a/data/2024/aaai/Double-Layer Hybrid-Label Identification Feature Selection for Multi-View Multi-Label Learning b/data/2024/aaai/Double-Layer Hybrid-Label Identification Feature Selection for Multi-View Multi-Label Learning new file mode 100644 index 0000000000..053ce1a412 --- /dev/null +++ b/data/2024/aaai/Double-Layer Hybrid-Label Identification Feature Selection for Multi-View Multi-Label Learning @@ -0,0 +1 @@ +Multi-view multi-label feature selection aims to select informative features where the data are collected from multiple sources with multiple interdependent class labels. For fully exploiting multi-view information, most prior works mainly focus on the common part in the ideal circumstance. However, the inconsistent part hidden in each view, including noises and specific elements, may affect the quality of mapping between labels and feature representations. Meanwhile, ignoring the specific part might lead to a suboptimal result, as each label is supposed to possess specific characteristics of its own.
To deal with these two problems in multi-view multi-label feature selection, we propose a unified loss function with a fully splitting structure that decomposes the observed labels into hybrid labels, that is, common labels, view-to-all specific labels, and noisy labels; the view-to-all specific labels are further split into several specific labels for each view. The proposed method simultaneously considers the consistency and complementarity of different views. Through exploring the feature weights of hybrid labels, the mapping relationships between labels and features can be established sequentially based on their attributes. Additionally, the interrelatedness among hybrid labels is also investigated and injected into the function. For the specific labels of each view, we construct a novel regularization paradigm incorporating logic operations. Finally, the convergence of the result is proved after applying the multiplicative update rules. Experiments on six datasets demonstrate the effectiveness and superiority of our method compared with the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Doubly Perturbed Task Free Continual Learning b/data/2024/aaai/Doubly Perturbed Task Free Continual Learning new file mode 100644 index 0000000000..88d8b1a326 --- /dev/null +++ b/data/2024/aaai/Doubly Perturbed Task Free Continual Learning @@ -0,0 +1 @@ +Task-free online continual learning (TF-CL) is a challenging problem where the model incrementally learns tasks without explicit task information. Although training with the entire data from the past, present, and future is considered the gold standard, naive approaches in TF-CL that use only the current samples may conflict with learning from future samples, leading to catastrophic forgetting and poor plasticity. Thus, a proactive consideration of unseen future samples in TF-CL becomes imperative. Motivated by this intuition, we propose a novel TF-CL framework considering future samples and show that injecting adversarial perturbations on both input data and decision-making is effective. Then, we propose a novel method named Doubly Perturbed Continual Learning (DPCL) to efficiently implement these input and decision-making perturbations. Specifically, for input perturbation, we propose an approximate perturbation method that injects noise into the input data as well as the feature vector and then interpolates the two perturbed samples. For decision-making process perturbation, we devise multiple stochastic classifiers. We also investigate a memory management scheme and learning rate scheduling reflecting our proposed double perturbations. We demonstrate that our proposed method outperforms the state-of-the-art baseline methods by large margins on various TF-CL benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Dr. R.O. Bott Will See You Now: Exploring AI for Wellbeing with Middle School Students b/data/2024/aaai/Dr. R.O. Bott Will See You Now: Exploring AI for Wellbeing with Middle School Students new file mode 100644 index 0000000000..1e9672ac97 --- /dev/null +++ b/data/2024/aaai/Dr. R.O. Bott Will See You Now: Exploring AI for Wellbeing with Middle School Students @@ -0,0 +1 @@ +Artificial Intelligence (AI) is permeating almost every area of society, reshaping how many people, including youth, navigate the world. Despite the increased presence of AI, most people lack a baseline knowledge of how AI works.
Moreover, social barriers often hinder equal access to AI courses, perpetuating disparities in participation in the field. To address this, it is crucial to design AI curricula that are effective, inclusive, and relevant, especially to learners from backgrounds that are historically excluded from working in tech. In this paper, we present AI for Wellbeing, a curriculum where students explore conversational AI and the ethical considerations around using it to promote wellbeing. We specifically designed content, educator materials, and educational technologies to meet the interests and needs of students and educators from diverse backgrounds. We piloted AI for Wellbeing in a 5-day virtual workshop with middle school teachers and students. Then, using a mixed-methods approach, we analyzed students' work and teachers' feedback. Our results suggest that the curriculum content and design effectively engaged students, enabling them to implement meaningful AI projects for wellbeing. We hope that the design of this curriculum and insights from our evaluation will inspire future efforts to create culturally relevant K-12 AI curricula. \ No newline at end of file diff --git a/data/2024/aaai/DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency b/data/2024/aaai/DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency new file mode 100644 index 0000000000..156085cc0a --- /dev/null +++ b/data/2024/aaai/DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency @@ -0,0 +1 @@ +The combination of electronic health records (EHR) and medical images is crucial for clinicians in making diagnoses and forecasting prognoses. Strategically fusing these two data modalities has great potential to improve the accuracy of machine learning models in clinical prediction tasks. However, the asynchronous and complementary nature of EHR and medical images presents unique challenges. Missing modalities due to clinical and administrative factors are inevitable in practice, and the significance of each data modality varies depending on the patient and the prediction target, resulting in inconsistent predictions and suboptimal model performance. To address these challenges, we propose DrFuse to achieve effective clinical multi-modal fusion. It tackles the missing modality issue by disentangling the features shared across modalities and those unique within each modality. Furthermore, we address the modal inconsistency issue via a disease-wise attention layer that produces the patient- and disease-wise weighting for each modality to make the final prediction. We validate the proposed method using real-world large-scale datasets, MIMIC-IV and MIMIC-CXR. Experimental results show that the proposed method significantly outperforms the state-of-the-art models. 
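The DrFuse abstract above describes a disease-wise attention layer that weighs the EHR and imaging modalities per patient and per prediction target. Below is a toy PyTorch sketch of such a gating scheme; the dimensions and the sigmoid gate are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class DiseaseWiseFusion(nn.Module):
    """For every target disease, a learned gate decides how much to trust the
    EHR representation versus the imaging representation of the current patient."""

    def __init__(self, dim: int, num_diseases: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, num_diseases)   # patient- and disease-wise weights
        self.ehr_head = nn.Linear(dim, num_diseases)
        self.img_head = nn.Linear(dim, num_diseases)

    def forward(self, ehr_feat, img_feat):
        w = torch.sigmoid(self.gate(torch.cat([ehr_feat, img_feat], dim=-1)))
        return w * self.ehr_head(ehr_feat) + (1 - w) * self.img_head(img_feat)

model = DiseaseWiseFusion(dim=64, num_diseases=14)
out = model(torch.randn(8, 64), torch.randn(8, 64))
print(out.shape)  # (8, 14) per-disease logits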
\ No newline at end of file diff --git a/data/2024/aaai/DreamIdentity: Enhanced Editability for Efficient Face-Identity Preserved Image Generation b/data/2024/aaai/DreamIdentity: Enhanced Editability for Efficient Face-Identity Preserved Image Generation new file mode 100644 index 0000000000..a8f22d4beb --- /dev/null +++ b/data/2024/aaai/DreamIdentity: Enhanced Editability for Efficient Face-Identity Preserved Image Generation @@ -0,0 +1 @@ +While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centric images, an intractable problem is how to preserve the face identity and follow the text prompts simultaneously for conditioned input face images and texts. Despite existing encoder-based methods achieving high efficiency and decent face similarity, the generated image often fails to follow the textual prompts. To ease this editability issue, we present DreamIdentity to learn edit-friendly and accurate face-identity representations in the word embedding space. Specifically, we propose self-augmented editability learning to enhance the editability of the projected embedding, which is achieved by constructing pairs of generated celebrity faces and edited celebrity images for training, aiming at transferring the mature editability of off-the-shelf text-to-image models on celebrities to unseen identities. Furthermore, we design a novel dedicated face-identity encoder to learn an accurate representation of human faces, which applies multi-scale ID-aware features followed by a multi-embedding projector to generate the pseudo words in the text embedding space directly. Extensive experiments show that our method can generate more text-coherent and ID-preserved images with negligible time overhead compared to the standard text-to-image generation process. \ No newline at end of file diff --git a/data/2024/aaai/DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models b/data/2024/aaai/DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models new file mode 100644 index 0000000000..e5565de52a --- /dev/null +++ b/data/2024/aaai/DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models @@ -0,0 +1,7 @@ +Recent progress in large-scale text-to-image models has yielded remarkable accomplishments, finding various applications in the art domain. +However, expressing unique characteristics of an artwork (e.g., brushwork, color tone, or composition) with text prompts alone may encounter limitations due to the inherent constraints of verbal description. +To this end, we introduce DreamStyler, a novel framework designed for artistic image synthesis, proficient in both text-to-image synthesis and style transfer. +DreamStyler optimizes a multi-stage textual embedding with a context-aware text prompt, resulting in prominent image quality. +In addition, with content and style guidance, DreamStyler exhibits flexibility to accommodate a range of style references. +Experimental results demonstrate its superior performance across multiple scenarios, suggesting its promising potential in artistic product creation.
+Project page: https://nmhkahn.github.io/dreamstyler/ \ No newline at end of file diff --git a/data/2024/aaai/Dual Mapping of 2D StyleGAN for 3D-Aware Image Generation and Manipulation (Student Abstract) b/data/2024/aaai/Dual Mapping of 2D StyleGAN for 3D-Aware Image Generation and Manipulation (Student Abstract) new file mode 100644 index 0000000000..8bd072a81f --- /dev/null +++ b/data/2024/aaai/Dual Mapping of 2D StyleGAN for 3D-Aware Image Generation and Manipulation (Student Abstract) @@ -0,0 +1 @@ +3D-aware GANs successfully solve the problem of 3D-consistent generation and furthermore provide a 3D shape of the generated object. However, the application of the volume renderer disturbs the disentanglement of the latent space, which makes it difficult to manipulate 3D-aware GANs and lowers the image quality of style-based generators. In this work, we devise a dual-mapping framework to make the generated images of a pretrained 2D StyleGAN consistent in 3D space. We utilize a tri-plane representation to estimate the 3D shape of the generated object and two mapping networks to bridge the latent space of StyleGAN and the 3D tri-plane space. Our method does not alter the parameters of the pretrained generator, which means the interpretability of the latent space is preserved for various image manipulations. Experiments show that our method lifts the 3D awareness of pretrained 2D StyleGAN to 3D-aware GANs and outperforms the 3D-aware GANs in controllability and image quality. \ No newline at end of file diff --git a/data/2024/aaai/Dual Self-Paced Cross-Modal Hashing b/data/2024/aaai/Dual Self-Paced Cross-Modal Hashing new file mode 100644 index 0000000000..eb147e404f --- /dev/null +++ b/data/2024/aaai/Dual Self-Paced Cross-Modal Hashing @@ -0,0 +1 @@ +Cross-modal hashing (CMH) is an efficient technique to retrieve relevant data across different modalities, such as images, texts, and videos, which has attracted more and more attention due to its low storage cost and fast query speed. Although existing CMH methods have achieved remarkable progress, almost all of them treat all samples of varying difficulty levels without discrimination, thus leaving them vulnerable to noise or outliers. Based on this observation, we reveal and study the dual difficulty levels implied in cross-modal hashing learning, i.e., instance-level and feature-level difficulty. To address this problem, we propose a novel Dual Self-Paced Cross-Modal Hashing (DSCMH) that mimics human cognitive learning to learn hashing from "easy" to "hard" at both the instance and feature levels, thereby embracing robustness against noise/outliers. Specifically, our DSCMH assigns weights to each instance and feature to measure their difficulty or reliability, and then uses these weights to automatically filter out the noisy and irrelevant data points in the original space. By gradually increasing the weights during training, our method can focus on more instances and features from "easy" to "hard" in training, thus mitigating the adverse effects of noise or outliers. Extensive experiments are conducted on three widely-used benchmark datasets to demonstrate the effectiveness and robustness of the proposed DSCMH over 12 state-of-the-art CMH methods.
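The DSCMH abstract above relies on self-paced weights that admit easy examples first and gradually include harder ones. Below is a minimal sketch of that weighting idea, assuming a quantile-based threshold schedule for illustration rather than the paper's actual self-paced regularizer.

import numpy as np

def self_paced_weights(losses, age, max_age):
    """Binary self-paced weights: an example is kept only if its current loss falls
    below a threshold that grows as training progresses, so learning moves from
    "easy" to "hard" examples."""
    threshold = np.quantile(losses, min(1.0, 0.3 + 0.7 * age / max_age))
    return (losses <= threshold).astype(float)

losses = np.array([0.2, 1.5, 0.4, 3.0, 0.1])
for epoch in (0, 5, 10):
    print(epoch, self_paced_weights(losses, age=epoch, max_age=10))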
\ No newline at end of file diff --git a/data/2024/aaai/Dual-Level Curriculum Meta-Learning for Noisy Few-Shot Learning Tasks b/data/2024/aaai/Dual-Level Curriculum Meta-Learning for Noisy Few-Shot Learning Tasks new file mode 100644 index 0000000000..b125f3f797 --- /dev/null +++ b/data/2024/aaai/Dual-Level Curriculum Meta-Learning for Noisy Few-Shot Learning Tasks @@ -0,0 +1,2 @@ +Few-shot learning (FSL) is essential in many practical applications. However, the limited training examples make the models more vulnerable to label noise, which can lead to poor generalization capability. To address this critical challenge, we propose a curriculum meta-learning model that employs a novel dual-level class-example sampling strategy to create a robust curriculum for adaptive task distribution formulation and robust model training. The dual-level framework proposes a heuristic class sampling criterion that measures pairwise class boundary complexity to form a class curriculum; it uses effective example sampling through an under-trained proxy model to form an example curriculum. By utilizing both class-level and example-level information, our approach is more robust to handle limited training data and noisy labels that commonly occur in few-shot learning tasks. +The model has efficient convergence behavior, which is verified through rigorous convergence analysis. Additionally, we establish a novel error bound through a hierarchical PAC-Bayesian analysis for curriculum meta-learning under noise. We conduct extensive experiments that demonstrate the effectiveness of our framework in outperforming existing noisy few-shot learning methods under various few-shot classification benchmarks. Our code is available at https://github.com/ritmininglab/DCML. \ No newline at end of file diff --git a/data/2024/aaai/Dual-Perspective Knowledge Enrichment for Semi-supervised 3D Object Detection b/data/2024/aaai/Dual-Perspective Knowledge Enrichment for Semi-supervised 3D Object Detection new file mode 100644 index 0000000000..f24a7f8c7d --- /dev/null +++ b/data/2024/aaai/Dual-Perspective Knowledge Enrichment for Semi-supervised 3D Object Detection @@ -0,0 +1 @@ +Semi-supervised 3D object detection is a promising yet under-explored direction to reduce data annotation costs, especially for cluttered indoor scenes. A few prior works, such as SESS and 3DIoUMatch, attempt to solve this task by utilizing a teacher model to generate pseudo-labels for unlabeled samples. However, the availability of unlabeled samples in the 3D domain is relatively limited compared to its 2D counterpart due to the greater effort required to collect 3D data. Moreover, the loose consistency regularization in SESS and restricted pseudo-label selection strategy in 3DIoUMatch lead to either low-quality supervision or a limited amount of pseudo labels. To address these issues, we present a novel Dual-Perspective Knowledge Enrichment approach named DPKE for semi-supervised 3D object detection. Our DPKE enriches the knowledge of limited training data, particularly unlabeled data, from two perspectives: data-perspective and feature-perspective. Specifically, from the data-perspective, we propose a class-probabilistic data augmentation method that augments the input data with additional instances based on the varying distribution of class probabilities. 
Our DPKE achieves feature-perspective knowledge enrichment by designing a geometry-aware feature matching method that regularizes feature-level similarity between object proposals from the student and teacher models. Extensive experiments on the two benchmark datasets demonstrate that our DPKE achieves superior performance over existing state-of-the-art approaches under various label ratio conditions. The source code and models will be made available to the public. \ No newline at end of file diff --git a/data/2024/aaai/Dual-Prior Augmented Decoding Network for Long Tail Distribution in HOI Detection b/data/2024/aaai/Dual-Prior Augmented Decoding Network for Long Tail Distribution in HOI Detection new file mode 100644 index 0000000000..00d8412856 --- /dev/null +++ b/data/2024/aaai/Dual-Prior Augmented Decoding Network for Long Tail Distribution in HOI Detection @@ -0,0 +1 @@ +Human object interaction detection aims at localizing human-object pairs and recognizing their interactions. Trapped by the long-tailed distribution of the data, existing HOI detection methods often have difficulty recognizing the tail categories. Many approaches try to improve the recognition of HOI tasks by utilizing external knowledge (e.g. pre-trained visual-language models). However, these approaches mainly utilize external knowledge at the HOI combination level and achieve limited improvement in the tail categories. In this paper, we propose a dual-prior augmented decoding network by decomposing the HOI task into two sub-tasks: human-object pair detection and interaction recognition. For each subtask, we leverage external knowledge to enhance the model's ability at a finer granularity. Specifically, we acquire the prior candidates from an external classifier and embed them to assist the subsequent decoding process. Thus, the long-tail problem is mitigated from a coarse-to-fine level with the corresponding external knowledge. Our approach outperforms existing state-of-the-art models in various settings and significantly boosts the performance on the tail HOI categories. The source code is available at https://github.com/PRIS-CV/DP-ADN. \ No newline at end of file diff --git a/data/2024/aaai/Dual-View Whitening on Pre-trained Text Embeddings for Sequential Recommendation b/data/2024/aaai/Dual-View Whitening on Pre-trained Text Embeddings for Sequential Recommendation new file mode 100644 index 0000000000..d031d2f649 --- /dev/null +++ b/data/2024/aaai/Dual-View Whitening on Pre-trained Text Embeddings for Sequential Recommendation @@ -0,0 +1 @@ +Recent advances in sequential recommendation models have demonstrated the efficacy of integrating pre-trained text embeddings with item ID embeddings to achieve superior performance. However, our study takes a unique perspective by exclusively focusing on the untapped potential of text embeddings, obviating the need for ID embeddings. We begin by implementing a pre-processing strategy known as whitening, which effectively transforms the anisotropic semantic space of pre-trained text embeddings into an isotropic Gaussian distribution. Comprehensive experiments reveal that applying whitening to pre-trained text embeddings in sequential recommendation models significantly enhances performance. Yet, a full whitening operation might break the potential manifold of items with similar text semantics. 
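The dual-view whitening abstract above maps anisotropic pre-trained text embeddings to an (approximately) isotropic distribution. Below is a small numpy sketch of standard PCA whitening; the keep parameter is an assumed, simplified stand-in for the "relaxed" whitening view, whose exact form in the paper may differ.

import numpy as np

def whiten(embeddings, eps=1e-8, keep=1.0):
    """Centre the embeddings and rescale along the principal directions so the
    covariance becomes (close to) the identity. keep < 1 softens the rescaling."""
    mu = embeddings.mean(axis=0, keepdims=True)
    x = embeddings - mu
    cov = x.T @ x / len(x)
    u, s, _ = np.linalg.svd(cov)
    scale = 1.0 / np.sqrt(s + eps)
    if keep < 1.0:
        scale = keep * scale + (1.0 - keep)   # partially relaxed whitening
    return x @ u @ np.diag(scale)

emb = np.random.randn(1000, 64) * np.linspace(0.1, 5.0, 64)   # anisotropic toy embeddings
white = whiten(emb)
print(np.round(np.cov(white.T)[:3, :3], 2))                    # close to the identity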
To retain the original semantics while benefiting from the isotropy of the whitened text features, we propose a Dual-view Whitening method for Sequential Recommendation (DWSRec), which leverages both fully whitened and relaxed whitened item representations as dual views for effective recommendations. We further examine the advantages of our approach through both empirical and theoretical analyses. Experiments on three public benchmark datasets show that DWSRec outperforms state-of-the-art methods for sequential recommendation. \ No newline at end of file diff --git a/data/2024/aaai/Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging b/data/2024/aaai/Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging new file mode 100644 index 0000000000..7c13d14a90 --- /dev/null +++ b/data/2024/aaai/Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging @@ -0,0 +1 @@ +The coded aperture snapshot spectral imaging (CASSI) system is an effective approach to hyperspectral snapshot compressive imaging. The core issue of CASSI is to solve the inverse problem for the reconstruction of the hyperspectral image (HSI). In recent years, Transformer-based methods have achieved promising performance in HSI reconstruction. However, capturing both long-range dependencies and local information while ensuring reasonable computational costs remains a challenging problem. In this paper, we propose a Transformer-based HSI reconstruction method called dual-window multiscale Transformer (DWMT), which is a coarse-to-fine process, reconstructing the global properties of HSI with the long-range dependencies. In our method, we propose a novel U-Net architecture using a dual-branch encoder to refine pixel information and full-scale skip connections to fuse different features, enhancing the extraction of fine-grained features. Meanwhile, we design a novel self-attention mechanism called dual-window multiscale multi-head self-attention (DWM-MSA), which utilizes two different-sized windows to compute self-attention, capturing the long-range dependencies in a local region at different scales to improve the reconstruction performance. We also propose a novel position embedding method for Transformers, named con-abs position embedding (CAPE), which effectively enhances the positional information of the HSIs. Extensive experiments on both simulated and real data are conducted to demonstrate the superior performance, stability, and generalization ability of our DWMT. Code of this project is at https://github.com/chenx2000/DWMT. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Budget Throttling in Repeated Second-Price Auctions b/data/2024/aaai/Dynamic Budget Throttling in Repeated Second-Price Auctions new file mode 100644 index 0000000000..766b45d9a0 --- /dev/null +++ b/data/2024/aaai/Dynamic Budget Throttling in Repeated Second-Price Auctions @@ -0,0 +1,10 @@ +In today's online advertising markets, a crucial requirement for an advertiser is to control her total expenditure within a time horizon under some budget. +Among various budget control methods, throttling has emerged as a popular choice, managing an advertiser's total expenditure by selecting only a subset of auctions to participate in. +This paper provides a theoretical panorama of a single advertiser's dynamic budget throttling process in repeated second-price auctions.
+We first establish a lower bound on the regret and an upper bound on the asymptotic competitive ratio for any throttling algorithm, respectively, when the advertiser's values are stochastic and adversarial. +Regarding the algorithmic side, we propose the OGD-CB algorithm, which guarantees a near-optimal expected regret with stochastic values. +On the other hand, when values are adversarial, we prove that this algorithm also reaches the upper bound on the asymptotic competitive ratio. +We further compare throttling with pacing, another widely adopted budget control method, in repeated second-price auctions. +In the stochastic case, we demonstrate that pacing is generally superior to throttling for the advertiser, supporting the well-known result that pacing is asymptotically optimal in this scenario. +However, in the adversarial case, we give an exciting result indicating that throttling is also an asymptotically optimal dynamic bidding strategy. +Our results bridge the gaps in theoretical research of throttling in repeated auctions and comprehensively reveal the ability of this popular budget-smoothing strategy. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Feature Pruning and Consolidation for Occluded Person Re-identification b/data/2024/aaai/Dynamic Feature Pruning and Consolidation for Occluded Person Re-identification new file mode 100644 index 0000000000..cdaab6c4cb --- /dev/null +++ b/data/2024/aaai/Dynamic Feature Pruning and Consolidation for Occluded Person Re-identification @@ -0,0 +1 @@ +Occluded person re-identification (ReID) is a challenging problem due to contamination from occluders. Existing approaches address the issue with prior knowledge cues, such as human body key points and semantic segmentations, which easily fail in the presence of heavy occlusion and other humans as occluders. In this paper, we propose a feature pruning and consolidation (FPC) framework to circumvent explicit human structure parsing. The framework mainly consists of a sparse encoder, a multi-view feature matching module, and a feature consolidation decoder. Specifically, the sparse encoder drops less important image tokens, mostly related to background noise and occluders, solely based on correlation within the class token attention. Subsequently, the matching stage relies on the preserved tokens produced by the sparse encoder to identify k-nearest neighbors in the gallery by measuring the image and patch-level combined similarity. Finally, we use the feature consolidation module to compensate for pruned features using the identified neighbors, recovering essential information while disregarding disturbance from noise and occlusion. Experimental results demonstrate the effectiveness of our proposed framework on occluded, partial, and holistic Re-ID datasets. In particular, our method outperforms state-of-the-art results by at least 8.6% mAP and 6.0% Rank-1 accuracy on the challenging Occluded-Duke dataset. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Knowledge Injection for AIXI Agents b/data/2024/aaai/Dynamic Knowledge Injection for AIXI Agents new file mode 100644 index 0000000000..dcb6075fb8 --- /dev/null +++ b/data/2024/aaai/Dynamic Knowledge Injection for AIXI Agents @@ -0,0 +1 @@ +Prior approximations of AIXI, a Bayesian optimality notion for general reinforcement learning, can only approximate AIXI's Bayesian environment model using an a-priori defined set of models.
This is a fundamental source of epistemic uncertainty for the agent in settings where the existence of systematic bias in the predefined model class cannot be resolved by simply collecting more data from the environment. We address this issue in the context of Human-AI teaming by considering a setup where additional knowledge for the agent, in the form of new candidate models, arrives from a human operator in an online fashion. We introduce a new agent called DynamicHedgeAIXI that maintains an exact Bayesian mixture over dynamically changing sets of models via a time-adaptive prior constructed from a variant of the Hedge algorithm. The DynamicHedgeAIXI agent is the richest direct approximation of AIXI known to date and comes with good performance guarantees. Experimental results on epidemic control on contact networks validate the agent's practical utility. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Reactive Spiking Graph Neural Network b/data/2024/aaai/Dynamic Reactive Spiking Graph Neural Network new file mode 100644 index 0000000000..33d3794583 --- /dev/null +++ b/data/2024/aaai/Dynamic Reactive Spiking Graph Neural Network @@ -0,0 +1 @@ +Spiking Graph Neural Networks are emerging tools for analyzing graph data with low energy consumption and a degree of biological fidelity. Existing methods directly integrate same-reactive spiking neurons into graph neural networks for processing propagated graphs. However, such same-reactive neurons lack the biological functionality of the brain's dynamic-reactive neurons, limiting the model's expressiveness. Meanwhile, only limited long-range neighbor information can be extracted from the few-step propagated graph, restricting the discrimination of graph spiking embeddings. Inspired by the dynamic cognition in the brain, we propose a Dynamic Reactive Spiking Graph Neural Network that enhances the model's expressive ability with higher biological fidelity. Specifically, we design dynamic reactive spiking neurons to process spiking graph inputs, which have unique optimizable thresholds to spontaneously explore dynamic reactive states between neurons. Moreover, discriminative graph positional spikes are learned and integrated adaptively into spiking outputs through our neurons, thereby exploring long-range neighbors more thoroughly. Finally, with the dynamic reactive mechanism and learnable positional integration, we obtain a powerful model with high biological fidelity and low energy consumption. Experiments on various domain-related datasets demonstrate the effectiveness of our model. Our code is available at https://github.com/hzhao98/DRSGNN. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation b/data/2024/aaai/Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation new file mode 100644 index 0000000000..e7130fb6f7 --- /dev/null +++ b/data/2024/aaai/Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation @@ -0,0 +1 @@ +We study reinforcement learning (RL) in episodic MDPs with adversarial full-information losses and an unknown transition. Instead of the classical static regret, we adopt dynamic regret as the performance measure, which benchmarks the learner's performance against changing policies, making it more suitable for non-stationary environments.
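The DynamicHedgeAIXI abstract above maintains a Bayesian mixture over a changing set of candidate models through a time-adaptive prior built from a Hedge variant. Below is a minimal sketch of the underlying Hedge weight update; the way mass is assigned to a newly injected model is an illustrative assumption, not the paper's construction.

import numpy as np

def hedge_update(weights, losses, eta=0.5):
    """One Hedge step: exponentially down-weight models by their loss and renormalise."""
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

weights = np.array([0.5, 0.5])
weights = hedge_update(weights, losses=np.array([1.0, 0.2]))

# a new candidate model arrives from the human operator: give it 10% of the mass
weights = np.append(0.9 * weights, 0.1)
print(weights, weights.sum())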
The primary challenge is to handle the uncertainties of the unknown transition and the unknown non-stationarity of environments simultaneously. We propose a general framework to decouple the two sources of uncertainties and show that the dynamic regret bound naturally decomposes into two terms, one due to constructing confidence sets to handle the unknown transition and the other due to choosing sub-optimal policies under the unknown non-stationarity. To this end, we first employ a two-layer online ensemble structure to handle the adaptation error due to the unknown non-stationarity, which is model-agnostic. Subsequently, we instantiate the framework for three fundamental MDP models, including tabular MDPs, linear MDPs, and linear mixture MDPs, and present corresponding approaches to control the exploration error due to the unknown transition. We provide corresponding dynamic regret guarantees and show they are optimal in terms of the number of episodes K and the non-stationarity P̄ᴋ by establishing matching lower bounds. To the best of our knowledge, this is the first work that achieves a dynamic regret with optimal dependence on K and P̄ᴋ without prior knowledge about the non-stationarity for adversarial MDPs with unknown transition. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Semantic-Based Spatial Graph Convolution Network for Skeleton-Based Human Action Recognition b/data/2024/aaai/Dynamic Semantic-Based Spatial Graph Convolution Network for Skeleton-Based Human Action Recognition new file mode 100644 index 0000000000..c675b88e36 --- /dev/null +++ b/data/2024/aaai/Dynamic Semantic-Based Spatial Graph Convolution Network for Skeleton-Based Human Action Recognition @@ -0,0 +1 @@ +Graph convolutional networks (GCNs) have attracted great attention and achieved remarkable performance in skeleton-based action recognition. However, most of the previous works are designed to refine skeleton topology without considering the types of different joints and edges, making them unable to represent the semantic information. In this paper, we propose a dynamic semantic-based graph convolution network (DS-GCN) for skeleton-based human action recognition, in which the joint and edge types are encoded in the skeleton topology in an implicit way. Specifically, two semantic modules, the joint type-aware adaptive topology and the edge type-aware adaptive topology, are proposed. Combining the proposed semantic modules with temporal convolution, a powerful framework named DS-GCN is developed for skeleton-based action recognition. Extensive experiments on two datasets, NTU-RGB+D and Kinetics-400, show that the proposed semantic modules are general enough to be utilized in various backbones for boosting recognition accuracy. Meanwhile, the proposed DS-GCN notably outperforms state-of-the-art methods. The code is released at https://github.com/davelailai/DS-GCN \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Spiking Graph Neural Networks b/data/2024/aaai/Dynamic Spiking Graph Neural Networks new file mode 100644 index 0000000000..bf442c7949 --- /dev/null +++ b/data/2024/aaai/Dynamic Spiking Graph Neural Networks @@ -0,0 +1 @@ +The integration of Spiking Neural Networks (SNNs) and Graph Neural Networks (GNNs) is gradually attracting attention due to the low power consumption and high efficiency in processing the non-Euclidean data represented by graphs.
However, as a common problem, dynamic graph representation learning faces challenges such as high complexity and large memory overheads. Current work often replaces Recurrent Neural Networks (RNNs) with SNNs, using binary features instead of continuous ones for efficient training, which overlooks graph structure information and leads to the loss of details during propagation. Additionally, optimizing dynamic spiking models typically requires the propagation of information across time steps, which increases memory requirements. To address these challenges, we present a framework named Dynamic Spiking Graph Neural Networks (Dy-SIGN). To mitigate the information loss problem, Dy-SIGN propagates early-layer information directly to the last layer for information compensation. To accommodate the memory requirements, we apply implicit differentiation on the equilibrium state, which does not rely on the exact reverse of the forward computation. While traditional implicit differentiation methods are usually used for static situations, Dy-SIGN extends them to the dynamic graph setting. Extensive experiments on three large-scale real-world dynamic graph datasets validate the effectiveness of Dy-SIGN on dynamic node classification tasks with lower computational costs. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning b/data/2024/aaai/Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning new file mode 100644 index 0000000000..2629a6f68e --- /dev/null +++ b/data/2024/aaai/Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning @@ -0,0 +1 @@ +Continual learning (CL) has shown promising results and comparable performance to learning at once in a fully supervised manner. However, CL strategies typically require a large number of labeled samples, making their real-life deployment challenging. In this work, we focus on semi-supervised continual learning (SSCL), where the model progressively learns from partially labeled data with unknown categories. We provide a comprehensive analysis of SSCL and demonstrate that unreliable distributions of unlabeled data lead to unstable training and refinement of the progressing stages. This problem severely impacts the performance of SSCL. To address the limitations, we propose a novel approach called Dynamic Sub-Graph Distillation (DSGD) for semi-supervised continual learning, which leverages both semantic and structural information to achieve more stable knowledge distillation on unlabeled data and exhibits robustness against distribution bias. Firstly, we formalize a general model of structural distillation and design a dynamic graph construction for the continual learning process. Next, we define a structure distillation vector and design a dynamic sub-graph distillation algorithm, which enables end-to-end training and adaptability to scale up tasks. The entire proposed method is adaptable to various CL methods and supervision settings. Finally, experiments conducted on three datasets, CIFAR10, CIFAR100, and ImageNet-100, with varying supervision ratios, demonstrate the effectiveness of our proposed approach in mitigating the catastrophic forgetting problem in semi-supervised continual learning scenarios. Our code is available: https://github.com/fanyan0411/DSGD.
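The DSGD abstract above distills structural information on unlabeled data by matching relations among samples rather than individual predictions. Below is a generic PyTorch sketch of such structure-level distillation over a batch-level similarity graph; it only illustrates the principle and is not the paper's dynamic sub-graph construction.

import torch
import torch.nn.functional as F

def structural_distillation_loss(student_feats, teacher_feats, tau=0.1):
    """Build a soft similarity graph over the batch for both student and teacher
    features and make the student's neighbourhood structure match the teacher's."""
    def sim_graph(x):
        x = F.normalize(x, dim=-1)
        return F.softmax(x @ x.t() / tau, dim=-1)

    s, t = sim_graph(student_feats), sim_graph(teacher_feats)
    return F.kl_div(s.log(), t, reduction="batchmean")

loss = structural_distillation_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())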
\ No newline at end of file diff --git a/data/2024/aaai/Dynamic Tangled Derivative Logic of Metric Spaces b/data/2024/aaai/Dynamic Tangled Derivative Logic of Metric Spaces new file mode 100644 index 0000000000..7c166a04ff --- /dev/null +++ b/data/2024/aaai/Dynamic Tangled Derivative Logic of Metric Spaces @@ -0,0 +1 @@ +Dynamical systems are abstract models of interaction between space and time. They are often used in fields such as physics and engineering to understand complex processes, but due to their general nature, they have also found applications in studying computational processes, interaction in multi-agent systems, machine learning algorithms, and other computer-science-related phenomena. In the vast majority of applications, a dynamical system consists of the action of a continuous 'transition function' on a metric space. In this work, we consider decidable formal systems for reasoning about such structures. Spatial logics can be traced back to the 1940s, but our work follows a more dynamic turn that these logics have taken due to two recent developments: the study of the topological mu-calculus, and the integration of linear temporal logic with logics based on the Cantor derivative. In this paper, we combine dynamic topological logics based on the Cantor derivative and the 'next point in time' operators with an expressively complete fixed point operator to produce a combination of the topological mu-calculus with linear temporal logic. We show that the resulting logics are decidable and have a natural axiomatisation. Moreover, we prove that these logics are complete for interpretations on the Cantor space, the rational numbers, and subspaces thereof. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Weighted Combiner for Mixed-Modal Image Retrieval b/data/2024/aaai/Dynamic Weighted Combiner for Mixed-Modal Image Retrieval new file mode 100644 index 0000000000..b47d9fe610 --- /dev/null +++ b/data/2024/aaai/Dynamic Weighted Combiner for Mixed-Modal Image Retrieval @@ -0,0 +1 @@ +Mixed-Modal Image Retrieval (MMIR), as a flexible search paradigm, has attracted wide attention. However, previous approaches always achieve limited performance because two critical factors are seriously overlooked. 1) The contribution of image and text modalities is different, but they are incorrectly treated equally. 2) There exist inherent labeling noises in describing users' intentions with text in web datasets from diverse real-world scenarios, giving rise to overfitting. We propose a Dynamic Weighted Combiner (DWC) to tackle the above challenges, which includes three merits. First, we propose an Editable Modality De-equalizer (EMD) that takes into account the contribution disparity between modalities, containing two modality feature editors and an adaptive weighted combiner. Second, to alleviate labeling noises and data bias, we propose a dynamic soft-similarity label generator (SSG) to implicitly improve noisy supervision. Finally, to bridge modality gaps and facilitate similarity learning, we propose a CLIP-based mutual enhancement module alternately trained by a mixed-modality contrastive loss. Extensive experiments verify that our proposed model significantly outperforms state-of-the-art methods on real-world datasets. The source code is available at https://github.com/fuxianghuang1/DWC.
\ No newline at end of file diff --git a/data/2024/aaai/E2E-AT: A Unified Framework for Tackling Uncertainty in Task-Aware End-to-End Learning b/data/2024/aaai/E2E-AT: A Unified Framework for Tackling Uncertainty in Task-Aware End-to-End Learning new file mode 100644 index 0000000000..38d88acf24 --- /dev/null +++ b/data/2024/aaai/E2E-AT: A Unified Framework for Tackling Uncertainty in Task-Aware End-to-End Learning @@ -0,0 +1 @@ +Successful machine learning involves a complete pipeline of data, model, and downstream applications. Instead of treating them separately, there has been a prominent increase of attention within the constrained optimization (CO) and machine learning (ML) communities towards combining prediction and optimization models. The so-called end-to-end (E2E) learning captures the task-based objective for which the predictions will ultimately be used in decision making. Although a large variety of E2E algorithms have been presented, it has not been fully investigated how to systematically address uncertainties involved in such models. Most of the existing work considers the uncertainties of ML in the input space and improves robustness through adversarial training. We extend this idea to E2E learning and prove that there is a robustness certification procedure by solving augmented integer programming. Furthermore, we show that neglecting the uncertainty of COs during training causes a new trigger for generalization errors. To include all these components, we propose a unified framework that covers the uncertainties emerging in both the input feature space of the ML models and the COs. The framework is described as a robust optimization problem and is practically solved via end-to-end adversarial training (E2E-AT). Finally, the performance of E2E-AT is evaluated by a real-world end-to-end power system operation problem, including load forecasting and sequential scheduling tasks. \ No newline at end of file diff --git a/data/2024/aaai/E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning b/data/2024/aaai/E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning new file mode 100644 index 0000000000..3196808f0c --- /dev/null +++ b/data/2024/aaai/E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning @@ -0,0 +1 @@ +Bio-inspired event cameras, or dynamic vision sensors, are capable of asynchronously capturing per-pixel brightness changes (called event-streams) with high temporal resolution and high dynamic range. However, the non-structural spatial-temporal event-streams make it challenging to provide intuitive visualization with rich semantic information for human vision. This calls for events-to-video (E2V) solutions that take event-streams as input and generate high-quality video frames for intuitive visualization. However, current solutions are predominantly data-driven, without considering the prior knowledge of the underlying statistics relating event-streams and video frames. They rely heavily on the non-linearity and generalization capability of deep neural networks and thus struggle to reconstruct detailed textures when the scenes are complex. In this work, we propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events. This approach leverages a model-aided deep learning framework, underpinned by a theory-inspired E2V model, which is meticulously derived from the fundamental imaging principles of event cameras.
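The E2HQV abstract above grounds its theory-inspired E2V model in the fundamental imaging principle of event cameras: an event signals that the log brightness at a pixel has changed by a fixed contrast threshold. Below is a minimal numpy sketch of this classical event-integration relation; the paper's actual model and its learned correction terms are not reproduced here.

import numpy as np

def integrate_events(log_ref, events, threshold=0.2):
    """Accumulate events on top of a reference log-image: each event at pixel (x, y)
    with polarity p in {-1, +1} contributes a log-brightness change of p * threshold."""
    log_img = log_ref.copy()
    for x, y, p in events:                 # events as (x, y, polarity)
        log_img[y, x] += p * threshold
    return np.exp(log_img)

log_ref = np.log(np.full((4, 4), 0.5))
events = [(0, 0, +1), (0, 0, +1), (3, 3, -1)]
print(np.round(integrate_events(log_ref, events), 3))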
To deal with the issue of state-reset in the recurrent components of E2HQV, we also design a temporal shift embedding module to further improve the quality of the video frames. Comprehensive evaluations on real-world event camera datasets validate our approach, with E2HQV notably outperforming state-of-the-art approaches, e.g., surpassing the second best by over 40% on some evaluation metrics. \ No newline at end of file diff --git a/data/2024/aaai/EAN: An Efficient Attention Module Guided by Normalization for Deep Neural Networks b/data/2024/aaai/EAN: An Efficient Attention Module Guided by Normalization for Deep Neural Networks new file mode 100644 index 0000000000..660f4a24a5 --- /dev/null +++ b/data/2024/aaai/EAN: An Efficient Attention Module Guided by Normalization for Deep Neural Networks @@ -0,0 +1,2 @@ +Deep neural networks (DNNs) have achieved remarkable success in various fields, and two powerful techniques, feature normalization and attention mechanisms, have been widely used to enhance model performance. However, they are usually considered as two separate approaches or combined in a simplistic manner. +In this paper, we investigate the intrinsic relationship between feature normalization and attention mechanisms and propose an Efficient Attention module guided by Normalization, dubbed EAN. Instead of using costly fully-connected layers for attention learning, EAN leverages the strengths of feature normalization and incorporates an Attention Generation (AG) unit to re-calibrate features. The proposed AG unit exploits the normalization component as a measure of the importance of distinct features and generates an attention mask using GroupNorm, L2 Norm, and Adaptation operations. By combining a grouping strategy, the AG unit, and an aggregation step, EAN offers a unified module that harnesses the advantages of both normalization and attention while maintaining minimal computational overhead. Furthermore, EAN serves as a plug-and-play module that can be seamlessly integrated with classic backbone architectures. Extensive quantitative evaluations on various visual tasks demonstrate that EAN achieves highly competitive performance compared to current state-of-the-art attention methods while sustaining lower model complexity. \ No newline at end of file diff --git a/data/2024/aaai/EAT: Towards Long-Tailed Out-of-Distribution Detection b/data/2024/aaai/EAT: Towards Long-Tailed Out-of-Distribution Detection new file mode 100644 index 0000000000..d410e97a2b --- /dev/null +++ b/data/2024/aaai/EAT: Towards Long-Tailed Out-of-Distribution Detection @@ -0,0 +1 @@ +Despite recent advancements in out-of-distribution (OOD) detection, most current studies assume a class-balanced in-distribution training dataset, which is rarely the case in real-world scenarios. This paper addresses the challenging task of long-tailed OOD detection, where the in-distribution data follows a long-tailed class distribution. The main difficulty lies in distinguishing OOD data from samples belonging to the tail classes, as the ability of a classifier to detect OOD instances is not strongly correlated with its accuracy on the in-distribution classes. To overcome this issue, we propose two simple ideas: (1) Expanding the in-distribution class space by introducing multiple abstention classes. This approach allows us to build a detector with clear decision boundaries by training on OOD data using virtual labels.
(2) Augmenting the context-limited tail classes by overlaying images onto the context-rich OOD data. This technique encourages the model to pay more attention to the discriminative features of the tail classes. We provide a clue for separating in-distribution and OOD data by analyzing gradient noise. Through extensive experiments, we demonstrate that our method outperforms the current state-of-the-art on various benchmark datasets. Moreover, our method can be used as an add-on for existing long-tail learning approaches, significantly enhancing their OOD detection performance. Code is available at: https://github.com/Stomach-ache/Long-Tailed-OOD-Detection. \ No newline at end of file diff --git a/data/2024/aaai/ECHO-GL: Earnings Calls-Driven Heterogeneous Graph Learning for Stock Movement Prediction b/data/2024/aaai/ECHO-GL: Earnings Calls-Driven Heterogeneous Graph Learning for Stock Movement Prediction new file mode 100644 index 0000000000..5006cb8eb4 --- /dev/null +++ b/data/2024/aaai/ECHO-GL: Earnings Calls-Driven Heterogeneous Graph Learning for Stock Movement Prediction @@ -0,0 +1 @@ +Stock movement prediction plays an important role in quantitative trading. Despite advances in models that enhance stock movement prediction by incorporating stock relations, these models face two limitations: they construct either insufficient or static stock relations, and thus fail to effectively capture the complex dynamic stock relations shaped by various factors in the ever-changing financial market. To tackle the above limitations, we propose ECHO-GL, a novel stock movement prediction model based on stock relations derived from earnings calls. ECHO-GL not only constructs comprehensive stock relations by exploiting the rich semantic information in the earnings calls but also captures the movement signals between related stocks based on multimodal and heterogeneous graph learning. Moreover, ECHO-GL customizes learnable stock stochastic processes based on the post-earnings announcement drift (PEAD) phenomenon to generate the temporal stock price trajectory, which can be easily plugged into any investment strategy with different time horizons to meet investment demands. Extensive experiments on two financial datasets demonstrate the effectiveness of ECHO-GL on stock price movement prediction tasks, with high prediction accuracy and trading profitability. \ No newline at end of file diff --git a/data/2024/aaai/EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction b/data/2024/aaai/EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction new file mode 100644 index 0000000000..0e85290ec1 --- /dev/null +++ b/data/2024/aaai/EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction @@ -0,0 +1,8 @@ +Motion prediction is a crucial task in autonomous driving, and one of its major challenges lies in the multimodality of future behaviors. +Many successful works have utilized mixture models, which require identification of positive mixture components and correspondingly fall into two main lines: prediction-based and anchor-based matching. +The prediction clustering phenomenon in prediction-based matching makes it difficult to pick representative trajectories for downstream tasks, while anchor-based matching suffers from limited regression capability.
+In this paper, we introduce a novel paradigm, named Evolving and Distinct Anchors (EDA), to define the positive and negative components for multimodal motion prediction based on mixture models. +We enable anchors to evolve and redistribute themselves under specific scenes for an enlarged regression capacity. +Furthermore, we select distinct anchors before matching them with the ground truth, which results in impressive scoring performance. +Our approach enhances all metrics compared to the baseline MTR, particularly with a notable relative reduction of 13.5% in Miss Rate, resulting in state-of-the-art performance on the Waymo Open Motion Dataset. +Appendix and code are available at https://github.com/Longzhong-Lin/EDA. \ No newline at end of file diff --git a/data/2024/aaai/EG-NAS: Neural Architecture Search with Fast Evolutionary Exploration b/data/2024/aaai/EG-NAS: Neural Architecture Search with Fast Evolutionary Exploration new file mode 100644 index 0000000000..e0a808dacc --- /dev/null +++ b/data/2024/aaai/EG-NAS: Neural Architecture Search with Fast Evolutionary Exploration @@ -0,0 +1 @@ +Differentiable Architecture Search (DARTS) has achieved a rapid search for excellent architectures by optimizing architecture parameters through gradient descent. However, this efficiency comes with a significant challenge: the risk of premature convergence to local optima, resulting in subpar performance. To address this issue, we propose a novel and effective method called Evolutionary Gradient-Based Neural Architecture Search (EG-NAS). Our approach combines the strengths of both gradient descent and evolutionary strategies, allowing for the exploration of various optimization directions during the architecture search process. To begin with, we continue to employ gradient descent for updating network parameters to ensure efficiency. Subsequently, to mitigate the risk of premature convergence, we introduce an evolutionary strategy with global search capabilities to optimize the architecture parameters. By leveraging the best of both worlds, our method strikes a balance between efficient exploration and exploitation of the search space. Moreover, we redefine the fitness function to not only consider accuracy but also account for individual similarity. This inclusion enhances the diversity and accuracy of the optimized directions identified by the evolutionary strategy. Extensive experiments on various datasets and search spaces demonstrate that EG-NAS achieves highly competitive performance at significantly lower search costs compared to state-of-the-art methods. The code is available at https://github.com/caicaicheng/EG-NAS. \ No newline at end of file diff --git a/data/2024/aaai/EMGAN: Early-Mix-GAN on Extracting Server-Side Model in Split Federated Learning b/data/2024/aaai/EMGAN: Early-Mix-GAN on Extracting Server-Side Model in Split Federated Learning new file mode 100644 index 0000000000..68c3d574dd --- /dev/null +++ b/data/2024/aaai/EMGAN: Early-Mix-GAN on Extracting Server-Side Model in Split Federated Learning @@ -0,0 +1 @@ +Split Federated Learning (SFL) is an emerging edge-friendly version of Federated Learning (FL), where clients process a small portion of the entire model. While SFL was considered resistant to Model Extraction Attacks (MEAs) by design, a recent work shows that this is not necessarily the case. In general, gradient-based MEAs are not effective on a target model that is changing, as is the case in training-from-scratch applications.
In this work, we propose a strong MEA during the SFL training phase. The proposed Early-Mix-GAN (EMGAN) attack effectively exploits gradient queries regardless of data assumptions. EMGAN adopts three key components to address the problem of inconsistent gradients. Specifically, it employs (i) an Early-learner approach for better adaptability, (ii) a Multi-GAN approach that introduces randomness into generator training to mitigate mode collapse, and (iii) ProperMix to effectively augment the limited amount of synthetic data for a better approximation of the target domain data distribution. EMGAN achieves excellent results in extracting server-side models. With only 50 training samples, EMGAN successfully extracts a 5-layer server-side model of VGG-11 on CIFAR-10, with only 7% lower accuracy than the target model. With zero training data, the extracted model achieves 81.3% accuracy, which is significantly better than the 45.5% accuracy of the model extracted by the SoTA method. The code is available at "https://github.com/zlijingtao/SFL-MEA". \ No newline at end of file diff --git a/data/2024/aaai/EPSD: Early Pruning with Self-Distillation for Efficient Model Compression b/data/2024/aaai/EPSD: Early Pruning with Self-Distillation for Efficient Model Compression new file mode 100644 index 0000000000..26d266096e --- /dev/null +++ b/data/2024/aaai/EPSD: Early Pruning with Self-Distillation for Efficient Model Compression @@ -0,0 +1 @@ +Neural network compression techniques, such as knowledge distillation (KD) and network pruning, have received increasing attention. Recent work `Prune, then Distill' reveals that a pruned student-friendly teacher network can benefit the performance of KD. However, the conventional teacher-student pipeline, which entails cumbersome pre-training of the teacher and complicated compression steps, makes pruning with KD less efficient. In addition to compressing models, recent compression techniques also emphasize the aspect of efficiency. Early pruning demands significantly less computational cost in comparison to conventional pruning methods, as it does not require a large pre-trained model. Likewise, a special case of KD, known as self-distillation (SD), is more efficient since it requires no pre-training or student-teacher pair selection. This inspires us to combine early pruning with SD for efficient model compression. In this work, we propose the framework named Early Pruning with Self-Distillation (EPSD), which identifies and preserves distillable weights in early pruning for a given SD task. EPSD efficiently combines early pruning and self-distillation in a two-step process, maintaining the pruned network's trainability for compression. Instead of a simple combination of pruning and SD, EPSD enables the pruned network to favor SD by keeping more distillable weights before training, ensuring better distillation of the pruned network. We demonstrate that EPSD improves the training of pruned networks, supported by visual and quantitative analyses. Our evaluation covers diverse benchmarks (CIFAR-10/100, Tiny-ImageNet, full ImageNet, CUB-200-2011, and Pascal VOC), with EPSD outperforming advanced pruning and SD techniques.
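To make the two ingredients in the EPSD abstract above concrete, the sketch below combines a generic early-pruning step (magnitude pruning at initialization, not EPSD's criterion for distillable weights) with a generic self-distillation loss (KL divergence against the network's own soft predictions from an earlier snapshot); all names and hyperparameters are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def magnitude_prune_at_init(model: nn.Module, sparsity: float = 0.8) -> None:
    """Generic early pruning: zero the smallest-magnitude weights before training.
    (A simple stand-in; EPSD instead selects 'distillable' weights.)"""
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            w = m.weight.data
            k = int(w.numel() * sparsity)
            if k < 1:
                continue
            thresh = w.abs().flatten().kthvalue(k).values
            mask = (w.abs() > thresh).float()
            m.register_buffer("prune_mask", mask)  # reapply after each optimizer step
            w.mul_(mask)

def self_distill_loss(logits, labels, own_soft_targets, alpha=0.5, temperature=4.0):
    """Cross-entropy plus KL against the model's own earlier soft predictions."""
    ce = F.cross_entropy(logits, labels)
    kd = F.kl_div(
        F.log_softmax(logits / temperature, dim=-1),
        F.softmax(own_soft_targets / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kd
```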
\ No newline at end of file diff --git a/data/2024/aaai/ERL-TD: Evolutionary Reinforcement Learning Enhanced with Truncated Variance and Distillation Mutation b/data/2024/aaai/ERL-TD: Evolutionary Reinforcement Learning Enhanced with Truncated Variance and Distillation Mutation new file mode 100644 index 0000000000..af4db615ad --- /dev/null +++ b/data/2024/aaai/ERL-TD: Evolutionary Reinforcement Learning Enhanced with Truncated Variance and Distillation Mutation @@ -0,0 +1 @@ +Recently, an emerging research direction called Evolutionary Reinforcement Learning (ERL) has been proposed, which combines evolutionary algorithms with reinforcement learning (RL) to tackle sequential decision-making tasks. However, recently proposed ERL algorithms often suffer from two challenges: the inaccuracy of policy estimation caused by the overestimation bias in RL and the insufficiency of exploration caused by inefficient mutations. To alleviate these problems, we propose an Evolutionary Reinforcement Learning algorithm enhanced with Truncated variance and Distillation mutation, called ERL-TD. We utilize multiple Q-networks to provide more accurate evaluations of state-action pairs, and the variance of these evaluations is used to control the overestimation bias in RL. Moreover, we propose a new distillation mutation to provide a promising mutation direction, unlike traditional mutations that generate a large number of random solutions. We evaluate ERL-TD on the continuous control benchmarks from the OpenAI Gym and DeepMind Control Suite. The experiments show that ERL-TD achieves excellent performance and outperforms all baseline RL algorithms on the test suites. \ No newline at end of file diff --git a/data/2024/aaai/ESG Accountability Made Easy: DocQA at Your Service b/data/2024/aaai/ESG Accountability Made Easy: DocQA at Your Service new file mode 100644 index 0000000000..b2c0566650 --- /dev/null +++ b/data/2024/aaai/ESG Accountability Made Easy: DocQA at Your Service @@ -0,0 +1 @@ +We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines, including document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io. \ No newline at end of file diff --git a/data/2024/aaai/ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation b/data/2024/aaai/ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation new file mode 100644 index 0000000000..a87631f16d --- /dev/null +++ b/data/2024/aaai/ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation @@ -0,0 +1 @@ +Applying Reinforcement Learning (RL) to sequence generation models enables the direct optimization of long-term rewards (e.g., BLEU and human feedback), but typically requires large-scale sampling over a space of action sequences.
This poses a computational challenge in practical sequence generation problems, such as machine translation, where we often deal with a large action space (e.g., a vocabulary) and long action sequences (e.g., translations). In this work, we introduce two-stage sampling and dynamic sampling approaches to improve sampling efficiency when training sequence generation models via RL. We experiment with our approaches on traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate our approaches in RL from human feedback (RLHF) by training a large language model with a reward model. Experimental results show that our efficient sampling-based RL, referred to as ESRL, outperforms all baselines in terms of both training efficiency and memory consumption. Notably, ESRL yields consistent performance gains over the strong REINFORCE, minimum risk training, and proximal policy optimization methods. The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/examples/esrl. \ No newline at end of file diff --git a/data/2024/aaai/ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations b/data/2024/aaai/ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations new file mode 100644 index 0000000000..b4eb63e688 --- /dev/null +++ b/data/2024/aaai/ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations @@ -0,0 +1 @@ +Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference and journal papers. Segmenting ETDs will allow researchers to study sectional content. Readers can navigate to particular pages of interest to discover and explore the content buried in these long documents. Most existing frameworks for document page classification are designed for general documents and perform poorly on ETDs. In this paper, we propose ETDPC. Its backbone is a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. To overcome the challenge of imbalanced labeled samples, we augmented data for minority categories and employed a hierarchical classifier. ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 of 0.84 -- 0.96 for 9 out of 13 categories. We also demonstrate its data efficiency. The code and data can be found on GitHub (https://github.com/lamps-lab/ETDMiner/tree/master/etd_segmentation). \ No newline at end of file diff --git a/data/2024/aaai/EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE b/data/2024/aaai/EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE new file mode 100644 index 0000000000..24f484f8bb --- /dev/null +++ b/data/2024/aaai/EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE @@ -0,0 +1 @@ +Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is a unified multimodal Transformer pre-trained solely with one unified pre-training task.
Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify the pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 4x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval. \ No newline at end of file diff --git a/data/2024/aaai/Early Detection of Extreme Storm Tide Events Using Multimodal Data Processing b/data/2024/aaai/Early Detection of Extreme Storm Tide Events Using Multimodal Data Processing new file mode 100644 index 0000000000..327808490c --- /dev/null +++ b/data/2024/aaai/Early Detection of Extreme Storm Tide Events Using Multimodal Data Processing @@ -0,0 +1 @@ +Sea-level rise is a well-known consequence of climate change. Several studies have estimated the social and economic impact of the increase in extreme flooding. An efficient way to mitigate its consequences is the development of a flood alert and prediction system based on high-resolution numerical models and robust sensing networks. However, current models use various simplifying assumptions that compromise accuracy to ensure solvability within a reasonable timeframe, hindering more regular and cost-effective forecasts for various locations along the shoreline. To address these issues, this work proposes a hybrid model for multimodal data processing that combines physics-based numerical simulations, data obtained from a network of sensors, and satellite images to provide refined wave and sea-surface height forecasts, with real results obtained in a critical location within the Port of Santos (the largest port in Latin America). Our approach exhibits faster convergence than data-driven models while achieving more accurate predictions. Moreover, the model handles irregularly sampled time series and missing data without the need for complex preprocessing mechanisms or data imputation while keeping low computational costs through a combination of time encoding, recurrent neural networks, and graph neural networks. Enabling raw sensor data to be easily combined with existing physics-based models opens up new possibilities for accurate extreme storm tide event forecasting systems that enhance community safety and aid policymakers in their decision-making processes.
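The storm-tide abstract above mentions handling irregularly sampled sensor series through time encoding combined with recurrent (and graph) networks. One common way to realize the time-encoding part, shown only as an assumed illustrative sketch rather than the paper's architecture (module names, sizes, and the single-target head are hypothetical), is to encode the gap between consecutive observations and feed it to a GRU alongside the sensor readings:

```python
import torch
import torch.nn as nn

class TimeGapEncoder(nn.Module):
    """Sinusoidal encoding of the (possibly irregular) gap between observations."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.register_buffer("freqs", torch.logspace(0, -3, dim // 2))

    def forward(self, delta_t: torch.Tensor) -> torch.Tensor:
        # delta_t: (batch, seq_len) time since the previous observation, e.g. in hours.
        angles = delta_t.unsqueeze(-1) * self.freqs
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class IrregularGRU(nn.Module):
    """GRU over sensor readings concatenated with their time-gap encodings."""
    def __init__(self, n_features: int, time_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.time_enc = TimeGapEncoder(time_dim)
        self.rnn = nn.GRU(n_features + time_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # e.g., a predicted sea-surface height

    def forward(self, x: torch.Tensor, delta_t: torch.Tensor) -> torch.Tensor:
        z = torch.cat([x, self.time_enc(delta_t)], dim=-1)
        _, h = self.rnn(z)
        return self.head(h[-1])

model = IrregularGRU(n_features=3)
x = torch.randn(2, 10, 3)   # 2 stations, 10 irregular readings, 3 sensor channels
dt = torch.rand(2, 10)      # hours since the previous reading
print(model(x, dt).shape)   # torch.Size([2, 1])
```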
\ No newline at end of file diff --git a/data/2024/aaai/EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading b/data/2024/aaai/EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading new file mode 100644 index 0000000000..157ccdd66e --- /dev/null +++ b/data/2024/aaai/EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading @@ -0,0 +1 @@ +High-frequency trading (HFT) uses computer algorithms to make trading decisions on short time scales (e.g., second-level) and is widely used in the Cryptocurrency (Crypto) market (e.g., Bitcoin). Reinforcement learning (RL) in financial research has shown stellar performance on many quantitative trading tasks. However, most methods focus on low-frequency trading, e.g., day-level, and cannot be directly applied to HFT because of two challenges. First, RL for HFT involves dealing with extremely long trajectories (e.g., 2.4 million steps per month), which are hard to optimize and evaluate. Second, the dramatic price fluctuations and market trend changes of Crypto make existing algorithms fail to maintain satisfactory performance. To tackle these challenges, we propose an Efficient hieArchical Reinforcement learNing method for High Frequency Trading (EarnHFT), a novel three-stage hierarchical RL framework for HFT. In stage I, we compute a Q-teacher, i.e., the optimal action value based on dynamic programming, to enhance the performance and training efficiency of second-level RL agents. In stage II, we construct a pool of diverse RL agents for different market trends, distinguished by return rates, where hundreds of RL agents are trained with different preferences of return rates and only a tiny fraction of them are selected into the pool based on their profitability. In stage III, we train a minute-level router that dynamically picks a second-level agent from the pool to achieve stable performance across different markets. Through extensive experiments in various market trends on Crypto markets in a high-fidelity simulation trading environment, we demonstrate that EarnHFT significantly outperforms 6 state-of-the-art baselines on 6 popular financial criteria, exceeding the runner-up by 30% in profitability. \ No newline at end of file diff --git a/data/2024/aaai/EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering b/data/2024/aaai/EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering new file mode 100644 index 0000000000..d5fdf6cafa --- /dev/null +++ b/data/2024/aaai/EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering @@ -0,0 +1 @@ +Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects and comprehensive reasoning. Based on city planning needs, we develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded. As objects are the basis for complex relational reasoning, we propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way. To preserve refined spatial locations and semantics, SOBA leverages a segmentation network for object semantics generation.
The object-guided attention aggregates object interior features via pseudo masks, and bidirectional cross-attention further models object external relations hierarchically. To optimize object counting, we propose a numerical difference loss that dynamically adds difference penalties, unifying the classification and regression tasks. Experimental results show that SOBA outperforms both advanced general and remote sensing methods. We believe this dataset and framework provide a strong benchmark for complex analysis in Earth vision. The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA. \ No newline at end of file diff --git a/data/2024/aaai/Earthfarsser: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model b/data/2024/aaai/Earthfarsser: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model new file mode 100644 index 0000000000..98bd966d0a --- /dev/null +++ b/data/2024/aaai/Earthfarsser: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model @@ -0,0 +1 @@ +Efficiently modeling spatio-temporal (ST) physical processes and observations presents a challenging problem for the deep learning community. Many recent studies have concentrated on meticulously reconciling various advantages, leading to models that are neither simple nor practical. To address this issue, this paper presents a systematic study of the existing shortcomings faced by off-the-shelf models, including lack of local fidelity, poor prediction performance over long time-steps, low scalability, and inefficiency. To systematically address the aforementioned problems, we propose EarthFarseer, a concise framework that combines parallel local convolutions and global Fourier-based transformer architectures, enabling it to dynamically capture local-global spatial interactions and dependencies. EarthFarseer also incorporates multi-scale fully convolutional and Fourier architectures to efficiently and effectively capture the temporal evolution. Our proposal demonstrates strong adaptability across various tasks and datasets, with fast convergence and better local fidelity in long time-step predictions. Extensive experiments and visualizations on eight human-society and natural physical datasets demonstrate the state-of-the-art performance of EarthFarseer. We release our code at https://github.com/easylearningscores/EarthFarseer. \ No newline at end of file diff --git a/data/2024/aaai/EasyTS: The Express Lane to Long Time Series Forecasting b/data/2024/aaai/EasyTS: The Express Lane to Long Time Series Forecasting new file mode 100644 index 0000000000..73d2441ff0 --- /dev/null +++ b/data/2024/aaai/EasyTS: The Express Lane to Long Time Series Forecasting @@ -0,0 +1 @@ +Responding to the escalating interest in long-term forecasting within the industry, we introduce EasyTS, a comprehensive toolkit engineered to streamline data collection, analysis, and model creation procedures. EasyTS acts as a unified solution, driving progress in long-term time series forecasting. The platform provides effortless access to various time series datasets, including a newly open-sourced multi-scenario dataset in the electricity domain. Integrated visualization and analysis tools help unveil inherent data features and relationships. EasyTS facilitates a user-friendly model validation approach with versatile evaluation criteria. This toolkit allows researchers to compare their models proficiently against renowned benchmarks.
With our ongoing commitment to expanding our dataset collection and enhancing toolkit functionalities, we aspire to contribute significantly to the time series forecasting domain. Code is available at this repository: https://github.com/EdgeBigBang/EasyTS.git. \ No newline at end of file diff --git a/data/2024/aaai/EcomGPT: Instruction-Tuning Large Language Models with Chain-of-Task Tasks for E-commerce b/data/2024/aaai/EcomGPT: Instruction-Tuning Large Language Models with Chain-of-Task Tasks for E-commerce new file mode 100644 index 0000000000..d78385219d --- /dev/null +++ b/data/2024/aaai/EcomGPT: Instruction-Tuning Large Language Models with Chain-of-Task Tasks for E-commerce @@ -0,0 +1,2 @@ +Recently, instruction-following Large Language Models (LLMs), represented by ChatGPT, have exhibited exceptional performance in general Natural Language Processing (NLP) tasks. However, the unique characteristics of E-commerce data pose significant challenges to general LLMs. An LLM tailored specifically for E-commerce scenarios, possessing robust cross-dataset/task generalization capabilities, is a pressing necessity. To solve this issue, in this work we propose the first E-commerce instruction dataset, EcomInstruct, with a total of 2.5 million instruction examples. EcomInstruct scales up the data size and task diversity by constructing atomic tasks from basic E-commerce data types, such as product information and user reviews. Atomic tasks are defined as intermediate tasks implicitly involved in solving a final task, which we also call Chain-of-Task tasks. We developed EcomGPT +with different parameter scales by training the backbone model BLOOMZ with EcomInstruct. Benefiting from the fundamental semantic understanding capabilities acquired from the Chain-of-Task tasks, EcomGPT exhibits excellent zero-shot generalization capabilities. Extensive experiments and human evaluations demonstrate that EcomGPT outperforms ChatGPT in terms of cross-dataset/task generalization on E-commerce tasks. EcomGPT will be publicly available at https://github.com/Alibaba-NLP/EcomGPT. \ No newline at end of file diff --git a/data/2024/aaai/Editing Language Model-Based Knowledge Graph Embeddings b/data/2024/aaai/Editing Language Model-Based Knowledge Graph Embeddings new file mode 100644 index 0000000000..9fa9fc0b1c --- /dev/null +++ b/data/2024/aaai/Editing Language Model-Based Knowledge Graph Embeddings @@ -0,0 +1 @@ +Recent decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, making them difficult to modify after deployment without re-training. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. This task is designed to facilitate rapid, data-efficient updates to KG embeddings without compromising the performance of other aspects. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines, demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hypernetwork to edit/add facts. Our comprehensive experimental results reveal that KGEditor excels in updating specific facts without impacting the overall performance, even when faced with limited training resources.
Code and datasets will be available at https://github.com/AnonymousForPapers/DeltaKG. \ No newline at end of file diff --git a/data/2024/aaai/Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches b/data/2024/aaai/Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches new file mode 100644 index 0000000000..4fe3ca5a06 --- /dev/null +++ b/data/2024/aaai/Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches @@ -0,0 +1 @@ +The selection of the assumed effect size (AES) critically determines the duration of an experiment, and hence its accuracy and efficiency. Traditionally, experimenters determine the AES based on domain knowledge. However, this method becomes impractical for online experimentation services managing numerous experiments, and a more automated approach is hence in great demand. We initiate the study of data-driven AES selection for online experimentation services by introducing two solutions. The first employs a three-layer Gaussian Mixture Model that accounts for the heteroskedasticity across experiments, and it seeks to estimate the true expected effect size among positive experiments. The second method, grounded in utility theory, aims to determine the optimal effect size by striking a balance between the experiment's cost and the precision of decision-making. Through comparisons with baseline methods using both simulated and real data, we showcase the superior performance of the proposed approaches. \ No newline at end of file diff --git a/data/2024/aaai/Effective Causal Discovery under Identifiable Heteroscedastic Noise Model b/data/2024/aaai/Effective Causal Discovery under Identifiable Heteroscedastic Noise Model new file mode 100644 index 0000000000..fbbb5987b1 --- /dev/null +++ b/data/2024/aaai/Effective Causal Discovery under Identifiable Heteroscedastic Noise Model @@ -0,0 +1 @@ +Capturing the underlying structural causal relations represented by Directed Acyclic Graphs (DAGs) has been a fundamental task in various AI disciplines. Causal DAG learning via the continuous optimization framework has recently achieved promising performance in terms of accuracy and efficiency. However, most methods make strong assumptions of homoscedastic noise, i.e., that exogenous noises have equal variances across variables, observations, or even both. The noise in real data usually violates both assumptions due to the biases introduced by different data collection processes. To address the heteroscedastic noise issue, we introduce relaxed, implementable sufficient conditions and prove the identifiability of a general class of SEMs subject to those conditions. Based on the identifiable general SEM, we propose a novel formulation for DAG learning that accounts for the variation in noise variance across variables and observations. We then propose an effective two-phase iterative DAG learning algorithm to address the increased optimization difficulty and to learn a causal DAG from data with heteroscedastic variable noise under varying variance. We show significant empirical gains of the proposed approaches over state-of-the-art methods on both synthetic data and real data.
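For background on the continuous-optimization DAG-learning framework that the causal-discovery abstract above builds on (this is standard machinery, not the paper's heteroscedastic formulation), the usual smooth acyclicity penalty h(W) = tr(exp(W ∘ W)) − d from NOTEARS (Zheng et al., 2018) is zero exactly when the weighted adjacency matrix W is acyclic, so it can serve as the constraint in continuous DAG learning:

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W: np.ndarray) -> float:
    """Smooth acyclicity measure h(W) = tr(exp(W * W)) - d.

    h(W) == 0 iff the weighted adjacency matrix W has no directed cycles, so it
    can be used as an equality constraint (or penalty) in continuous DAG learning.
    """
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)  # W * W is the Hadamard square

# A 3-node chain 0 -> 1 -> 2 is acyclic; adding the edge 2 -> 0 creates a cycle.
W_dag = np.array([[0.0, 1.5, 0.0],
                  [0.0, 0.0, 0.8],
                  [0.0, 0.0, 0.0]])
W_cyc = W_dag.copy()
W_cyc[2, 0] = 0.5
print(notears_acyclicity(W_dag))   # ~0.0
print(notears_acyclicity(W_cyc))   # > 0
```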
\ No newline at end of file diff --git a/data/2024/aaai/Effective Comparative Prototype Hashing for Unsupervised Domain Adaptation b/data/2024/aaai/Effective Comparative Prototype Hashing for Unsupervised Domain Adaptation new file mode 100644 index 0000000000..ace2019365 --- /dev/null +++ b/data/2024/aaai/Effective Comparative Prototype Hashing for Unsupervised Domain Adaptation @@ -0,0 +1 @@ +Unsupervised domain adaptive hashing is a highly promising research direction within the field of retrieval. It aims to transfer valuable insights from the source domain to the target domain while maintaining high storage and retrieval efficiency. Despite its potential, this field remains relatively unexplored. Previous methods usually lead to unsatisfactory retrieval performance, as they often directly apply slightly modified domain adaptation algorithms to the hash learning framework, or pursue domain alignment within the Hamming space, which is characterized by limited semantic information. In this paper, we propose a simple yet effective approach named Comparative Prototype Hashing (CPH) for unsupervised domain adaptive image retrieval. We establish a domain-shared unit hypersphere space through prototype contrastive learning and then obtain the Hamming hypersphere space via mapping from the shared hypersphere. This strategy achieves a cohesive synergy between learning uniformly distributed and category conflict-averse feature representations, eliminating domain discrepancies, and facilitating hash code learning. Moreover, by leveraging dual-domain information to supervise the entire hashing model training process, we can generate hash codes that retain inter-sample similarity relationships within both domains. Experimental results validate that our CPH significantly outperforms state-of-the-art counterparts across multiple cross-domain and single-domain retrieval tasks. Notably, on the Office-Home and Office-31 datasets, CPH achieves average performance improvements of 19.29% and 13.85% on cross-domain retrieval tasks compared to the second-best results, respectively. The source code of our method is available at: https://github.com/christinecui/CPH. \ No newline at end of file diff --git a/data/2024/aaai/Effective Data Distillation for Tabular Datasets (Student Abstract) b/data/2024/aaai/Effective Data Distillation for Tabular Datasets (Student Abstract) new file mode 100644 index 0000000000..da74b1ec2a --- /dev/null +++ b/data/2024/aaai/Effective Data Distillation for Tabular Datasets (Student Abstract) @@ -0,0 +1 @@ +Data distillation is a technique for reducing a large dataset into a smaller dataset. The smaller dataset can then be used to train a model that performs comparably to a model trained on the full dataset. Past works have examined this approach for image datasets, focusing on neural networks as target models. However, tabular datasets pose new challenges not seen in images. A sample in a tabular dataset is a one-dimensional vector, unlike the two- (or three-) dimensional pixel grid of images, and non-NN models such as XGBoost can often outperform neural network (NN) based models. Our contribution in this work is two-fold: 1) we show that data distillation methods from images do not translate directly to tabular data; 2) we propose a new distillation method that consistently outperforms the baseline for multiple different models, including non-NN models such as XGBoost.
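As a generic illustration of the tabular data distillation workflow discussed above, the sketch below condenses a dataset into per-class k-means centroids and trains an XGBoost model on the small set. This is a simple baseline under assumed synthetic data and hyperparameters, not the method proposed in the student abstract (it requires scikit-learn and xgboost).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def distill_by_centroids(X, y, per_class=50, seed=0):
    """Condense a tabular dataset into per-class k-means centroids."""
    Xs, ys = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        Xs.append(km.cluster_centers_)
        ys.append(np.full(k, c))
    return np.vstack(Xs), np.concatenate(ys)

# Synthetic stand-in for a large tabular dataset.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

X_small, y_small = distill_by_centroids(X_tr, y_tr, per_class=50)  # 100 rows total
model = XGBClassifier(n_estimators=200, max_depth=4).fit(X_small, y_small)
print("accuracy on held-out data:", model.score(X_te, y_te))
```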
\ No newline at end of file diff --git a/data/2024/aaai/Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference b/data/2024/aaai/Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference new file mode 100644 index 0000000000..64e23c9ec6 --- /dev/null +++ b/data/2024/aaai/Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference @@ -0,0 +1 @@ +In this paper, we study the effectiveness of using a constant stepsize in statistical inference via linear stochastic approximation (LSA) algorithms with Markovian data. After establishing a Central Limit Theorem (CLT), we outline an inference procedure that uses averaged LSA iterates to construct confidence intervals (CIs). Our procedure leverages the fast mixing property of constant-stepsize LSA for better covariance estimation and employs Richardson-Romberg (RR) extrapolation to reduce the bias induced by constant stepsize and Markovian data. We develop theoretical results for guiding stepsize selection in RR extrapolation, and identify several important settings where the bias provably vanishes even without extrapolation. We conduct extensive numerical experiments and compare against classical inference approaches. Our results show that using a constant stepsize enjoys easy hyperparameter tuning, fast convergence, and consistently better CI coverage, especially when data is limited. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Algorithms for Non-gaussian Single Index Models with Generative Priors b/data/2024/aaai/Efficient Algorithms for Non-gaussian Single Index Models with Generative Priors new file mode 100644 index 0000000000..086c724956 --- /dev/null +++ b/data/2024/aaai/Efficient Algorithms for Non-gaussian Single Index Models with Generative Priors @@ -0,0 +1 @@ +In this work, we focus on high-dimensional single index models with non-Gaussian sensing vectors and generative priors. More specifically, our goal is to estimate the underlying signal from i.i.d. realizations of the semi-parameterized single index model, where the underlying signal is contained in (up to a constant scaling) the range of a Lipschitz continuous generative model with bounded low-dimensional inputs, the sensing vector follows a non-Gaussian distribution, the noise is a random variable that is independent of the sensing vector, and the unknown non-linear link function is differentiable. Using the first- and second-order Stein's identity, we introduce efficient algorithms to obtain estimated vectors that achieve the near-optimal statistical rate. Experimental results on image datasets are provided to support our theory. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Asynchronous Federated Learning with Prospective Momentum Aggregation and Fine-Grained Correction b/data/2024/aaai/Efficient Asynchronous Federated Learning with Prospective Momentum Aggregation and Fine-Grained Correction new file mode 100644 index 0000000000..9ef9620fea --- /dev/null +++ b/data/2024/aaai/Efficient Asynchronous Federated Learning with Prospective Momentum Aggregation and Fine-Grained Correction @@ -0,0 +1 @@ +Asynchronous federated learning (AFL) is a distributed machine learning technique that allows multiple devices to collaboratively train deep learning models without sharing local data. However, AFL suffers from low efficiency due to poor client model training quality and slow server model convergence speed, which are a result of the heterogeneous nature of both data and devices. 
To address these issues, we propose Efficient Asynchronous Federated Learning with Prospective Momentum Aggregation and Fine-Grained Correction (FedAC). Our framework consists of three key components. The first component is client weight evaluation based on the temporal gradient, which scores each client according to the similarity between the client and server update directions. The second component is adaptive server update with prospective weighted momentum, which uses an asynchronous buffered update strategy and a prospective weighted momentum with an adaptive learning rate to update the global model on the server. The last component is client update with fine-grained gradient correction, which introduces a fine-grained gradient correction term to mitigate client drift and correct the client stochastic gradient. We conduct experiments on real and synthetic datasets and compare with existing federated learning methods. Experimental results demonstrate that our framework effectively improves model training efficiency and AFL performance. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Axiomatization of OWL 2 EL Ontologies from Data by Means of Formal Concept Analysis b/data/2024/aaai/Efficient Axiomatization of OWL 2 EL Ontologies from Data by Means of Formal Concept Analysis new file mode 100644 index 0000000000..eb6373e434 --- /dev/null +++ b/data/2024/aaai/Efficient Axiomatization of OWL 2 EL Ontologies from Data by Means of Formal Concept Analysis @@ -0,0 +1 @@ +We present an FCA-based axiomatization method that produces a complete EL TBox (the terminological part of an OWL 2 EL ontology) from a graph dataset in at most exponential time. We describe technical details that allow for efficient implementation, as well as variations that dispense with the computation of extremely large axioms, thereby rendering the approach applicable, albeit with some loss of completeness. Moreover, we evaluate the prototype on real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution b/data/2024/aaai/Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution new file mode 100644 index 0000000000..e99826723a --- /dev/null +++ b/data/2024/aaai/Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution @@ -0,0 +1 @@ +Image super-resolution is a fundamentally ill-posed problem because multiple valid high-resolution images exist for one low-resolution image. Super-resolution methods based on diffusion probabilistic models can deal with this ill-posed nature by learning the distribution of high-resolution images conditioned on low-resolution images, avoiding the problem of blurry images in PSNR-oriented methods. However, existing diffusion-based super-resolution methods suffer from high time consumption due to iterative sampling, while the quality and consistency of generated images are less than ideal due to problems like color shifting. In this paper, we propose Efficient Conditional Diffusion Model with Probability Flow Sampling (ECDP) for image super-resolution. To reduce the time consumption, we design a continuous-time conditional diffusion model for image super-resolution, which enables the use of probability flow sampling for efficient generation.
Additionally, to improve the consistency of generated images, we propose a hybrid parametrization for the denoiser network, which interpolates between the data-predicting parametrization and the noise-predicting parametrization for different noise scales. Moreover, we design an image quality loss as a complement to the score matching loss of diffusion models, further improving the consistency and quality of super-resolution. Extensive experiments on DIV2K, ImageNet, and CelebA demonstrate that our method achieves higher super-resolution quality than existing diffusion-based image super-resolution methods while having lower time consumption. Our code is available at https://github.com/Yuan-Yutao/ECDP. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Constraint Generation for Stochastic Shortest Path Problems b/data/2024/aaai/Efficient Constraint Generation for Stochastic Shortest Path Problems new file mode 100644 index 0000000000..84a4636f8e --- /dev/null +++ b/data/2024/aaai/Efficient Constraint Generation for Stochastic Shortest Path Problems @@ -0,0 +1 @@ +Current methods for solving Stochastic Shortest Path Problems (SSPs) find states’ costs-to-go by applying Bellman backups, where state-of-the-art methods employ heuristics to select states to back up and prune. A fundamental limitation of these algorithms is their need to compute the cost-to-go for every applicable action during each state backup, leading to unnecessary computation for actions identified as sub-optimal. We present new connections between planning and operations research and, using this framework, we address this issue of unnecessary computation by introducing an efficient version of constraint generation for SSPs. This technique allows algorithms to ignore sub-optimal actions and avoid computing their costs-to-go. We also apply our novel technique to iLAO*, resulting in a new algorithm, CG-iLAO*. Our experiments show that CG-iLAO* ignores up to 57% of iLAO*’s actions and solves problems up to 8x and 3x faster than LRTDP and iLAO*, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Deweahter Mixture-of-Experts with Uncertainty-Aware Feature-Wise Linear Modulation b/data/2024/aaai/Efficient Deweahter Mixture-of-Experts with Uncertainty-Aware Feature-Wise Linear Modulation new file mode 100644 index 0000000000..d3a171d556 --- /dev/null +++ b/data/2024/aaai/Efficient Deweahter Mixture-of-Experts with Uncertainty-Aware Feature-Wise Linear Modulation @@ -0,0 +1,6 @@ +The Mixture-of-Experts (MoE) approach has demonstrated outstanding scalability in multi-task learning, including low-level upstream tasks such as concurrent removal of multiple adverse weather effects. +However, the conventional MoE architecture with parallel Feed Forward Network (FFN) experts leads to significant parameter and computational overheads that hinder its efficient deployment. In addition, the naive MoE linear router is suboptimal in assigning task-specific features to multiple experts, which limits its further scalability. +In this work, we propose an efficient MoE architecture with weight sharing across the experts. Inspired by the idea of linear feature modulation (FM), our architecture implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block. +The proposed Feature Modulated Expert (FME) serves as a building block for the novel Mixture-of-Feature-Modulation-Experts (MoFME) architecture, which can scale up the number of experts with low overhead.
+We further propose an Uncertainty-aware Router (UaR) to assign task-specific features to different FM modules with well-calibrated weights. This enables MoFME to effectively learn diverse expert functions for multiple tasks. +Experiments on the multi-deweather task show that our MoFME outperforms the state-of-the-art in image restoration quality by 0.1-0.2 dB while saving more than 74% of parameters and 20% inference time over the conventional MoE counterpart. Experiments on the downstream segmentation and classification tasks further demonstrate the generalizability of MoFME to real open-world applications. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Learning in Polyhedral Games via Best-Response Oracles b/data/2024/aaai/Efficient Learning in Polyhedral Games via Best-Response Oracles new file mode 100644 index 0000000000..1f8d887231 --- /dev/null +++ b/data/2024/aaai/Efficient Learning in Polyhedral Games via Best-Response Oracles @@ -0,0 +1 @@ +We study online learning and equilibrium computation in games with polyhedral decision sets, a property shared by normal-form games (NFGs) and extensive-form games (EFGs), when the learning agent is restricted to utilizing a best-response oracle. We show how to achieve constant regret in zero-sum games and O(T^0.25) regret in general-sum games while using only O(log t) best-response queries at a given iteration t, thus improving over the best prior result, which required O(T) queries per iteration. Moreover, our framework yields the first last-iterate convergence guarantees for self-play with best-response oracles in zero-sum games. This convergence occurs at a linear rate, though with a condition-number dependence. We go on to show an O(T^(-0.5)) best-iterate convergence rate without such a dependence. Our results build on linear-rate convergence results for variants of the Frank-Wolfe (FW) algorithm for strongly convex and smooth minimization problems over polyhedral domains. These FW results depend on a condition number of the polytope, known as the facial distance. In order to enable application to settings such as EFGs, we show two broad new results: 1) the facial distance for polytopes in standard form is at least γ/k, where γ is the minimum value of a nonzero coordinate of a vertex of the polytope and k≤n is the number of tight inequality constraints in the optimal face, and 2) the facial distance for polytopes of the form Ax=b, Cx≤d, x≥0, where x∈R^n, C≥0 is a nonzero integral matrix, and d≥0, is at least 1/(c√n), where c is the infinity norm of C. This yields the first such results for several problems such as sequence-form polytopes, flow polytopes, and matching polytopes.
We propose Reel, which accelerates the learning of PDEs via random projection and has much broader applicability. Reel exploits sparsity by decomposing dense updates into sparse ones in both the value and frequency domains. This decomposition enables efficient learning when the source of the updates consists of gradually changing terms across large areas (sparse in the frequency domain) in addition to a few rapid updates concentrated in a small set of “interfacial” regions (sparse in the value domain). Random projection is then applied to compress the sparse signals for learning. To expand the model applicability, Taylor series expansion is used in Reel to approximate the nonlinear PDE updates with polynomials in the decomposable form. Theoretically, we derive a constant factor approximation between the projected loss function and the original one with a poly-logarithmic number of projected dimensions. Experimentally, we provide empirical evidence that our proposed Reel can lead to faster learning of PDE models (70-98% reduction in training time when the data is compressed to 1% of its original size) with quality comparable to the non-compressed models. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Lightweight Image Denoising with Triple Attention Transformer b/data/2024/aaai/Efficient Lightweight Image Denoising with Triple Attention Transformer new file mode 100644 index 0000000000..cf1acc050d --- /dev/null +++ b/data/2024/aaai/Efficient Lightweight Image Denoising with Triple Attention Transformer @@ -0,0 +1 @@ +Transformers have shown outstanding performance on image denoising, but existing Transformer methods for image denoising have large model sizes and high computational complexity, which is unfriendly to resource-constrained devices. In this paper, we propose a Lightweight Image Denoising Transformer method (LIDFormer) based on Triple Multi-Dconv Head Transposed Attention (TMDTA) to boost computational efficiency. LIDFormer first implements the Discrete Wavelet Transform (DWT), which transforms the input image into a low-frequency space, greatly reducing the computational complexity of image denoising. However, the low-frequency image lacks fine-feature information, which degrades the denoising performance. To handle this problem, we introduce the Complementary Periodic Feature Reusing (CPFR) scheme for aggregating the shallow-layer features and the deep-layer features. Furthermore, TMDTA is proposed to integrate global context along three dimensions, thereby enhancing the ability of global feature representation. Note that our method can be applied as a pipeline for both convolutional neural networks and Transformers. Extensive experiments on several benchmarks demonstrate that the proposed LIDFormer achieves a better trade-off between high performance and low computational complexity on real-world image denoising tasks. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Look-Up Table from Expanded Convolutional Network for Accelerating Image Super-resolution b/data/2024/aaai/Efficient Look-Up Table from Expanded Convolutional Network for Accelerating Image Super-resolution new file mode 100644 index 0000000000..1aa7bcbc74 --- /dev/null +++ b/data/2024/aaai/Efficient Look-Up Table from Expanded Convolutional Network for Accelerating Image Super-resolution @@ -0,0 +1 @@ +The look-up table (LUT) has recently shown its practicability and effectiveness in super-resolution (SR) tasks due to its low computational cost and hardware independence.
However, most existing methods focus on improving the performance of SR, neglecting the demand for high-speed SR on low-computational edge devices. In this paper, we propose an efficient expanded convolution (EC) layer, which expands the output size of regular convolution to enlarge the receptive field (RF) indirectly. It can increase the size of the LUT corresponding to the network linearly with the increase of RF. Additionally, after introducing the EC, multiple LUTs are merged into one LUT, achieving faster running speed while maintaining SR performance. More specifically, we expand the coverage of the convolutional output so that the output at the current position covers the target position and its surroundings, forming an overlapping sliding window at the output end. We sum up the overlapping parts of the sliding window as the output, thereby achieving the effect of enlarging the RF size. Moreover, by expanding the numerical range of the accumulated results and rescaling them to [0,255], the method can mitigate the error caused by quantization output. Experiments indicate that the proposed method performs better than the baseline method and is faster than other LUT-based SR methods. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Nonparametric Tensor Decomposition for Binary and Count Data b/data/2024/aaai/Efficient Nonparametric Tensor Decomposition for Binary and Count Data new file mode 100644 index 0000000000..617fb140e6 --- /dev/null +++ b/data/2024/aaai/Efficient Nonparametric Tensor Decomposition for Binary and Count Data @@ -0,0 +1 @@ +In numerous applications, binary reactions or event counts are observed and stored within high-order tensors. Tensor decompositions (TDs) serve as a powerful tool to handle such high-dimensional and sparse data. However, many traditional TDs are explicitly or implicitly designed based on the Gaussian distribution, which is unsuitable for discrete data. Moreover, most TDs rely on predefined multi-linear structures, such as CP and Tucker formats. Therefore, they may not be effective enough to handle complex real-world datasets. To address these issues, we propose ENTED, an Efficient Nonparametric TEnsor Decomposition for binary and count tensors. Specifically, we first employ a nonparametric Gaussian process (GP) to replace traditional multi-linear structures. Next, we utilize the Pólya-Gamma augmentation which provides a unified framework to establish conjugate models for binary and count distributions. Finally, to address the computational issue of GPs, we enhance the model by incorporating sparse orthogonal variational inference of inducing points, which offers a more effective covariance approximation within GPs and stochastic natural gradient updates for nonparametric models. We evaluate our model on several real-world tensor completion tasks, considering binary and count datasets. The results manifest both better performance and computational advantages of the proposed model. 
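For readers unfamiliar with the Pólya-Gamma augmentation that the ENTED abstract above relies on, the standard identity of Polson, Scott, and Windle (2013) rewrites a logistic-type likelihood as a Gaussian scale mixture, which is what makes conjugate updates possible for binary and count observations. The block below states only the general identity; how ENTED combines it with the GP prior and inducing points is not reproduced here.

```latex
% Polya-Gamma identity (Polson, Scott & Windle, 2013): for b > 0 and kappa = a - b/2,
\frac{(e^{\psi})^{a}}{(1+e^{\psi})^{b}}
  = 2^{-b}\, e^{\kappa \psi}
    \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega \mid b, 0)\, d\omega,
\qquad \omega \sim \mathrm{PG}(b, 0).
% Conditioned on omega, the likelihood is Gaussian in psi, so a Gaussian latent
% model (e.g., a GP) stays conjugate for Bernoulli (b = 1) and count observations.
```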
\ No newline at end of file diff --git a/data/2024/aaai/Efficient Representation Learning of Satellite Image Time Series and Their Fusion for Spatiotemporal Applications b/data/2024/aaai/Efficient Representation Learning of Satellite Image Time Series and Their Fusion for Spatiotemporal Applications new file mode 100644 index 0000000000..44634a5235 --- /dev/null +++ b/data/2024/aaai/Efficient Representation Learning of Satellite Image Time Series and Their Fusion for Spatiotemporal Applications @@ -0,0 +1 @@ +Satellite data bolstered by their increasing accessibility is leading to many endeavors of automated monitoring of the earth's surface for various applications. Such applications demand high spatial resolution images at a temporal resolution of a few days which entails the challenge of processing a huge volume of image time series data. To overcome this computing bottleneck, we present PatchNet, a bespoke adaptation of beam search and attention mechanism. PatchNet is an automated patch selection neural network that requires only a partial spatial traversal of an image time series and yet achieves impressive results. Satellite systems face a trade-off between spatial and temporal resolutions due to budget/technical constraints e.g., Landsat-8/9 or Sentinel-2 have high spatial resolution whereas, MODIS has high temporal resolution. To deal with the limitation of coarse temporal resolution, we propose FuSITSNet, a twofold feature-based generic fusion model with multimodal learning in a contrastive setting. It produces a learned representation after fusion of two satellite image time series leveraging finer spatial resolution of Landsat and finer temporal resolution of MODIS. The patch alignment module of FuSITSNet aligns the PatchNet processed patches of Landsat-8 with the corresponding MODIS regions to incorporate its finer resolution temporal features. The untraversed patches are handled by the cross-modality attention which highlights additional hot spot features from the two modalities. We conduct extensive experiments on more than 2000 counties of US for crop yield, snow cover, and solar energy prediction and show that even one-fourth spatial processing of image time series produces state-of-the-art results. FuSITSNet outperforms the predictions of single modality and data obtained using existing generative fusion models and allows for monitoring of dynamic phenomena using freely accessible images, thereby unlocking new opportunities. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Spiking Neural Networks with Sparse Selective Activation for Continual Learning b/data/2024/aaai/Efficient Spiking Neural Networks with Sparse Selective Activation for Continual Learning new file mode 100644 index 0000000000..ea40de081c --- /dev/null +++ b/data/2024/aaai/Efficient Spiking Neural Networks with Sparse Selective Activation for Continual Learning @@ -0,0 +1,2 @@ +The next generation of machine intelligence requires the capability of continual learning to acquire new knowledge without forgetting the old one while conserving limited computing resources. +Spiking neural networks (SNNs), compared to artificial neural networks (ANNs), have more characteristics that align with biological neurons, which may be helpful as a potential gating function for knowledge maintenance in neural networks. Inspired by the selective sparse activation principle of context gating in biological systems, we present a novel SNN model with selective activation to achieve continual learning. 
The trace-based K-Winner-Take-All (K-WTA) and variable threshold components are designed to induce sparse selective activation in the spatial and temporal dimensions of spiking neurons, which encourages subpopulations of neurons to specialize in specific tasks. As a result, continual learning can be maintained by routing different tasks via different populations of neurons in the network. The experiments are conducted on the MNIST and CIFAR10 datasets under the class-incremental setting. The results show that the proposed SNN model achieves competitive performance comparable to, and even surpassing, other regularization-based methods deployed under traditional ANNs. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Target Propagation by Deriving Analytical Solution b/data/2024/aaai/Efficient Target Propagation by Deriving Analytical Solution new file mode 100644 index 0000000000..87a4b49b96 --- /dev/null +++ b/data/2024/aaai/Efficient Target Propagation by Deriving Analytical Solution @@ -0,0 +1 @@ +Exploring biologically plausible algorithms as alternatives to error backpropagation (BP) is a challenging research topic in artificial intelligence. It also provides insights into the brain's learning methods. Recently, when combined with well-designed feedback loss functions such as Local Difference Reconstruction Loss (LDRL) and through hierarchical training of feedback pathway synaptic weights, Target Propagation (TP) has achieved performance comparable to BP in image classification tasks. However, with an increase in the number of network layers, the tuning and training cost of feedback weights escalates. Drawing inspiration from the work of Ernoult et al., we propose a training method that seeks the optimal solution for feedback weights. This method enhances the efficiency of feedback training by analytically minimizing feedback loss, allowing the feedback layer to skip certain local training iterations. More specifically, we introduce the Jacobian matching loss (JML) for feedback training. We also proactively implement layers designed to derive analytical solutions that minimize JML. Through experiments, we have validated the effectiveness of this approach. Using the CIFAR-10 dataset, our method showcases accuracy levels comparable to state-of-the-art TP methods. Furthermore, we have explored its effectiveness in more intricate network architectures. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models b/data/2024/aaai/Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models new file mode 100644 index 0000000000..1a2ff0fa02 --- /dev/null +++ b/data/2024/aaai/Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models @@ -0,0 +1,3 @@ +Toxic content detection is crucial for online services to remove inappropriate content that violates community standards. To automate the detection process, prior works have proposed a variety of machine learning (ML) approaches to train Language Models (LMs) for toxic content detection. However, both their accuracy and transferability across datasets are limited. Recently, Large Language Models (LLMs) have shown promise in toxic content detection due to their superior zero-shot and few-shot in-context learning ability as well as broad transferability on ML tasks. +However, efficiently designing prompts for LLMs remains challenging.
Moreover, the high run-time cost of LLMs may hinder their deployment in production. To address these challenges, in this work, we propose BD-LLM, a novel and efficient approach to bootstrapping and distilling LLMs for toxic content detection. +Specifically, we design a novel prompting method named Decision-Tree-of-Thought (DToT) to bootstrap LLMs' detection performance and extract high-quality rationales. DToT can automatically select more fine-grained context to re-prompt LLMs when their responses lack confidence. Additionally, we use the rationales extracted via DToT to fine-tune student LMs. Our experimental results on various datasets demonstrate that DToT can improve the accuracy of LLMs by up to 4.6%. Furthermore, student LMs fine-tuned with rationales extracted via DToT outperform baselines on all datasets with up to 16.9% accuracy improvement, while being more than 60x smaller than conventional LLMs. Finally, we observe that student LMs fine-tuned with rationales exhibit better cross-dataset transferability. \ No newline at end of file diff --git a/data/2024/aaai/Electron Microscopy Images as Set of Fragments for Mitochondrial Segmentation b/data/2024/aaai/Electron Microscopy Images as Set of Fragments for Mitochondrial Segmentation new file mode 100644 index 0000000000..0167fce5ce --- /dev/null +++ b/data/2024/aaai/Electron Microscopy Images as Set of Fragments for Mitochondrial Segmentation @@ -0,0 +1 @@ +Automatic mitochondrial segmentation has gained great popularity with the development of deep learning. However, the coarse predictions caused by the presence of regular 3D grids in previous methods, whether based on 3D CNNs or vision transformers, suggest a possibly sub-optimal feature arrangement. To mitigate this limitation, we attempt to interpret the 3D EM image stacks as a set of interrelated 3D fragments for a better solution. However, it is non-trivial to model the 3D fragments without introducing excessive computational overhead. In this paper, we design a coherent fragment vision transformer (FragViT) combined with affinity learning to manipulate features on 3D fragments yet explore mutual relationships to model fragment-wise context, enjoying a locality prior without sacrificing global reception. The proposed FragViT includes a fragment encoder and a hierarchical fragment aggregation module. The fragment encoder is equipped with affinity heads to transform the tokens into fragments with homogeneous semantics, and multi-layer self-attention is used to explicitly learn inter-fragment relations with long-range dependencies. The hierarchical fragment aggregation module is responsible for hierarchically aggregating fragment-wise predictions back to the final voxel-wise prediction in a progressive manner. Extensive experimental results on the challenging MitoEM, Lucchi, and AC3/AC4 benchmarks demonstrate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Elijah: Eliminating Backdoors Injected in Diffusion Models via Distribution Shift b/data/2024/aaai/Elijah: Eliminating Backdoors Injected in Diffusion Models via Distribution Shift new file mode 100644 index 0000000000..1a02435fa7 --- /dev/null +++ b/data/2024/aaai/Elijah: Eliminating Backdoors Injected in Diffusion Models via Distribution Shift @@ -0,0 +1 @@ +Diffusion models (DM) have become state-of-the-art generative models because of their capability of generating high-quality images from noise without adversarial training.
However, they are vulnerable to backdoor attacks as reported by recent studies. When a data input (e.g., some Gaussian noise) is stamped with a trigger (e.g., a white patch), the backdoored model always generates the target image (e.g., an improper photo). However, effective defense strategies to mitigate backdoors from DMs are underexplored. To bridge this gap, we propose the first backdoor detection and removal framework for DMs. We evaluate our framework Elijah on over hundreds of DMs of 3 types including DDPM, NCSN and LDM, with 13 samplers against 3 existing backdoor attacks. Extensive experiments show that our approach can have close to 100% detection accuracy and reduce the backdoor effects to close to zero without significantly sacrificing the model utility. \ No newline at end of file diff --git a/data/2024/aaai/EmFORE: Learning Email Folder Classification Rules by Demonstration b/data/2024/aaai/EmFORE: Learning Email Folder Classification Rules by Demonstration new file mode 100644 index 0000000000..3103dd2794 --- /dev/null +++ b/data/2024/aaai/EmFORE: Learning Email Folder Classification Rules by Demonstration @@ -0,0 +1 @@ +Tools that help with email folder management are limited, as users have to manually write rules to assign emails to folders. We present EMFORE, an iterative learning system that automatically learns and updates such rules from observations. EMFORE is fast enough to suggest and update rules in real time and suppresses mails with low confidence to reduce the number of false positives. EMFORE can use different rule grammars, and thus be adapted to different clients, without changing the user experience. Previous methods do not learn rules, require complete retraining or multiple new examples after making a mistake, and do not distinguish between inbox and other folders. EMFORE learns rules incrementally and can make the neutral decision of leaving emails in the inbox, making it an ideal candidate for integration in email clients. \ No newline at end of file diff --git a/data/2024/aaai/Embedded Feature Selection on Graph-Based Multi-View Clustering b/data/2024/aaai/Embedded Feature Selection on Graph-Based Multi-View Clustering new file mode 100644 index 0000000000..c63242c1f4 --- /dev/null +++ b/data/2024/aaai/Embedded Feature Selection on Graph-Based Multi-View Clustering @@ -0,0 +1 @@ +Recently, anchor graph-based multi-view clustering has been proven to be highly efficient for large-scale data processing. However, most existing anchor graph-based clustering methods necessitate post-processing to obtain clustering labels and are unable to effectively utilize the information within anchor graphs. To solve these problems, we propose an Embedded Feature Selection on Graph-Based Multi-View Clustering (EFSGMC) approach to improve the clustering performance. Our method decomposes anchor graphs, taking advantage of memory efficiency, to obtain clustering labels in a single step without the need for post-processing. Furthermore, we introduce the l2,p-norm for graph-based feature selection, which selects the most relevant data for efficient graph factorization. Lastly, we employ the tensor Schatten p-norm as a tensor rank approximation function to capture the complementary information between different views, ensuring similarity between cluster assignment matrices. Experimental results on five real-world datasets demonstrate that our proposed method outperforms state-of-the-art approaches. 
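For readers unfamiliar with the l2,p-norm used for feature selection in the EFSGMC abstract above, a minimal illustration follows. The norm sums the p-th powers of row-wise l2 norms, so penalizing it drives entire rows (features) of a projection matrix toward zero. This is a generic sketch of the regularizer only, not the authors' full objective; the variable names are hypothetical.

```python
import numpy as np

def l2p_norm(W: np.ndarray, p: float = 0.5) -> float:
    """Compute ||W||_{2,p}^p = sum_i (||w_i||_2)^p over the rows w_i of W.

    With 0 < p <= 1 this promotes row sparsity: penalized rows of a
    feature-selection matrix are pushed to all-zero, which de-selects
    the corresponding features.
    """
    row_norms = np.linalg.norm(W, axis=1)   # ||w_i||_2 for each row
    return float(np.sum(row_norms ** p))

# Toy usage: a projection matrix whose second feature (row) is inactive.
W = np.array([[0.9, 0.1], [0.0, 0.0], [0.4, 0.7]])
print(l2p_norm(W, p=0.5))
```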
\ No newline at end of file diff --git a/data/2024/aaai/Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning b/data/2024/aaai/Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning new file mode 100644 index 0000000000..58bf729aa8 --- /dev/null +++ b/data/2024/aaai/Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning @@ -0,0 +1 @@ +While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery in a few languages like English restricts their applicability in broader communities. To this end, there is an increasing interest in developing multilingual VL models via a joint-learning setup, which, however, could be unrealistic due to expensive costs and data availability. In this work, we propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It solely trains token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF raised by covariate shift and lexical overlap, we further propose a novel approach that ensures the identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and improve various state-of-the-art methods consistently. Our code and data are available at https://github.com/yangbang18/CLFM. \ No newline at end of file diff --git a/data/2024/aaai/Emergent Communication for Numerical Concepts Generalization b/data/2024/aaai/Emergent Communication for Numerical Concepts Generalization new file mode 100644 index 0000000000..cb7ffbd31c --- /dev/null +++ b/data/2024/aaai/Emergent Communication for Numerical Concepts Generalization @@ -0,0 +1 @@ +Research on emergent communication has recently gained significant traction as a promising avenue for the linguistic community to unravel human language's origins and explore artificial intelligence's generalization capabilities. Current research has predominantly concentrated on recognizing qualitative patterns of object attributes(e.g., shape and color) and paid little attention to the quantitative relationship among object quantities which is known as the part of numerical concepts. The ability to generalize numerical concepts, i.e., counting and calculations with unseen quantities, is essential, as it mirrors humans' foundational abstract reasoning abilities. In this work, we introduce the NumGame, leveraging the referential game framework, forcing agents to communicate and generalize the numerical concepts effectively. 
Inspired by the human learning process of numbers, we present a two-stage training approach that sequentially fosters a rudimentary numerical sense and then the ability to perform arithmetic calculations, ultimately aiding agents in generating semantically stable and unambiguous language for numerical concepts. The experimental results indicate impressive generalization capabilities to unseen quantities and the regularity of the language that emerges from communication. \ No newline at end of file diff --git a/data/2024/aaai/Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling b/data/2024/aaai/Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling new file mode 100644 index 0000000000..f047642cb3 --- /dev/null +++ b/data/2024/aaai/Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling @@ -0,0 +1 @@ +Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. While recognizing the significance of the CSS task, prior studies have not thoroughly investigated emotional expressiveness problems due to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling. In this paper, we propose a novel emotional CSS model, termed ECSS, that includes two main components: 1) to enhance emotion understanding, we introduce a heterogeneous graph-based emotional context modeling mechanism, which takes the multi-source dialogue history as input to model the dialogue context and learn the emotion cues from the context; 2) to achieve emotion rendering, we employ a contrastive learning-based emotion renderer module to infer the accurate emotion style for the target utterance. To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity, and annotate additional emotional information on the existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in understanding and rendering emotions. These evaluations also underscore the importance of comprehensive emotional annotations. Code and audio samples can be found at: https://github.com/walker-hyf/ECSS. \ No newline at end of file diff --git a/data/2024/aaai/Empowering CAM-Based Methods with Capability to Generate Fine-Grained and High-Faithfulness Explanations b/data/2024/aaai/Empowering CAM-Based Methods with Capability to Generate Fine-Grained and High-Faithfulness Explanations new file mode 100644 index 0000000000..a1b5b9e258 --- /dev/null +++ b/data/2024/aaai/Empowering CAM-Based Methods with Capability to Generate Fine-Grained and High-Faithfulness Explanations @@ -0,0 +1 @@ +Recently, the explanation of neural network models has garnered considerable research attention. In computer vision, CAM (Class Activation Map)-based methods and the LRP (Layer-wise Relevance Propagation) method are two common explanation methods. However, since most CAM-based methods can only generate global weights, they can only generate coarse-grained explanations at a deep layer. LRP and its variants, on the other hand, can generate fine-grained explanations, but the faithfulness of these explanations is too low.
To address these challenges, in this paper, we propose FG-CAM (Fine-Grained CAM), which extends CAM-based methods to enable the generation of fine-grained and high-faithfulness explanations. FG-CAM uses the relationship between two adjacent layers of feature maps with resolution differences to gradually increase the explanation resolution, while finding the contributing pixels and filtering out the pixels that do not contribute. Our method not only addresses the shortcoming of CAM-based methods without changing their characteristics, but also generates fine-grained explanations that have higher faithfulness than LRP and its variants. We also present FG-CAM with denoising, which is a variant of FG-CAM and is able to generate less noisy explanations with almost no change in explanation faithfulness. Experimental results show that the performance of FG-CAM is almost unaffected by the explanation resolution. FG-CAM outperforms existing CAM-based methods significantly in both shallow and intermediate layers, and outperforms LRP and its variants significantly in the input layer. Our code is available at https://github.com/dongmo-qcq/FG-CAM. \ No newline at end of file diff --git a/data/2024/aaai/EnColor: Improving Visual Accessibility with a Deep Encoder-Decoder Image Corrector for Color Vision Deficient Individuals b/data/2024/aaai/EnColor: Improving Visual Accessibility with a Deep Encoder-Decoder Image Corrector for Color Vision Deficient Individuals new file mode 100644 index 0000000000..652f4542c1 --- /dev/null +++ b/data/2024/aaai/EnColor: Improving Visual Accessibility with a Deep Encoder-Decoder Image Corrector for Color Vision Deficient Individuals @@ -0,0 +1 @@ +Individuals with color vision deficiencies (CVDs) often face significant challenges in accessing vital information for decision-making. In response, we introduce EnColor, a deep Encoder-decoder Color corrector for images, enabling individuals with CVDs to perceive content in its originally intended colors. Our network architecture is designed to effectively capture essential visual features for reconstructing standard images into color-corrected versions. In particular, our training pipeline is integrated with a CVD simulator so as to ensure the fidelity of the output through the lens of individuals with impaired color vision. For evaluation, we focus primarily on tomato images, considering the profound impact of color vision deficiencies on practical domains like agri-food systems. Our quantitative results demonstrate that the EnColor model achieves over 16.8% improvement over previously introduced algorithms in terms of color retention, supporting our design choices. Furthermore, a survey of 43 participants provides subjective assessments, with our method receiving the highest scores. Additionally, specific visual examples are presented to highlight accurately restored colors. We also publicly share all code for EnColor as well as the baseline methods to ensure reproducibility and facilitate more studies in CVD correction.
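To make the idea of training "through the lens" of a CVD simulator concrete, here is a minimal PyTorch-style sketch. It assumes a differentiable simulator (here a stand-in 3x3 linear map; the real simulator and loss used by EnColor may differ) and penalizes the discrepancy between what a CVD viewer would perceive in the corrected image and the original colors. All names and values are hypothetical placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Stand-in for a differentiable CVD simulator: a fixed 3x3 color transform.
# The actual matrix depends on the deficiency type and color space; this one
# is a placeholder, not a calibrated protanopia/deuteranopia model.
CVD_MATRIX = torch.tensor([[0.567, 0.433, 0.0],
                           [0.558, 0.442, 0.0],
                           [0.0,   0.242, 0.758]])

def cvd_simulate(img: torch.Tensor) -> torch.Tensor:
    """Apply the placeholder CVD transform to an (N, 3, H, W) image batch."""
    return torch.einsum("ij,njhw->nihw", CVD_MATRIX, img)

corrector = nn.Sequential(              # toy encoder-decoder stand-in
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid())
opt = torch.optim.Adam(corrector.parameters(), lr=1e-3)

def training_step(img: torch.Tensor) -> torch.Tensor:
    corrected = corrector(img)
    # Perceived-fidelity term: the CVD view of the corrected image should
    # resemble the original; a second term keeps the correction mild.
    loss = nn.functional.l1_loss(cvd_simulate(corrected), img) \
         + 0.1 * nn.functional.l1_loss(corrected, img)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss
```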
\ No newline at end of file diff --git a/data/2024/aaai/EnMatch: Matchmaking for Better Player Engagement via Neural Combinatorial Optimization b/data/2024/aaai/EnMatch: Matchmaking for Better Player Engagement via Neural Combinatorial Optimization new file mode 100644 index 0000000000..affe77182d --- /dev/null +++ b/data/2024/aaai/EnMatch: Matchmaking for Better Player Engagement via Neural Combinatorial Optimization @@ -0,0 +1 @@ +Matchmaking is a core task in e-sports and online games, as it contributes to player engagement and further influences the game's lifecycle. Previous methods focus on creating fair games at all times. They divide players into different tiers based on skill levels and only select players from the same tier for each game. Though this strategy can ensure fair matchmaking, it is not always good for player engagement. In this paper, we propose a novel Engagement-oriented Matchmaking (EnMatch) framework to ensure fair games and simultaneously enhance player engagement. Two main issues need to be addressed. First, it is unclear how to measure the impact of different team compositions and confrontations on player engagement during the game considering the variety of player characteristics. Second, such a detailed consideration on every single player during matchmaking will result in an NP-hard combinatorial optimization problem with non-linear objectives. In light of these challenges, we turn to real-world data analysis to reveal engagement-related factors. The resulting insights guide the development of engagement modeling, enabling the estimation of quantified engagement before a match is completed. To handle the combinatorial optimization problem, we formulate the problem into a reinforcement learning framework, in which a neural combinatorial optimization problem is built and solved. The performance of EnMatch is finally demonstrated through the comparison with other state-of-the-art methods based on several real-world datasets and online deployments on two games. \ No newline at end of file diff --git a/data/2024/aaai/Encoding Constraints as Binary Constraint Networks Satisfying BTP b/data/2024/aaai/Encoding Constraints as Binary Constraint Networks Satisfying BTP new file mode 100644 index 0000000000..b23cff783c --- /dev/null +++ b/data/2024/aaai/Encoding Constraints as Binary Constraint Networks Satisfying BTP @@ -0,0 +1 @@ +Recently, the Binary Constraint Tree (BCT), a tree structured Binary Constraint Network (BCN), has been shown to be more succinct than various ad-hoc constraints. In this paper, we investigate the modelling power of a well-known tractable hybrid class generalizing BCT, i.e. the class of BCNs satisfying Broken Triangle Property (BTP) called BTP Networks (BTPNs). We show that the consistency checker of BTPN can be computed by polysize monotone circuit, thus, some global constraints cannot be encoded as polysize BTPN, such as the AllDifferent and Linear constraints. Then our study reveals that BTPN is strictly more succinct than the DNNF constraint and all 14 ad-hoc constraints discussed in (Wang and Yap 2023), such as the context-free grammar, BCT and smart table constraints. Furthermore, we also show that BTPN is as powerful as DNNF in terms of computing various operations and queries. In addition, we prove that it is NP-hard to determine the minimum sized BTPN encoding a constraint. 
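As background for the BTPN abstract above, the Broken Triangle Property itself has a simple operational reading: under a variable ordering, no pair of compatible values for two earlier variables may see a "broken triangle" at a later variable. The sketch below is a direct, unoptimized check of this standard definition (it is not the paper's polysize-circuit construction), and the data representation is hypothetical.

```python
from itertools import combinations

def compatible(cons, x, a, y, b) -> bool:
    """True if value a for variable x is compatible with value b for y.
    `cons` maps an ordered variable pair to the set of allowed value pairs;
    variable pairs without an explicit constraint are fully compatible."""
    if (x, y) in cons:
        return (a, b) in cons[(x, y)]
    if (y, x) in cons:
        return (b, a) in cons[(y, x)]
    return True

def satisfies_btp(variables, domains, cons) -> bool:
    """Check the Broken Triangle Property w.r.t. the order of `variables`.

    A broken triangle on compatible values (vi, vj) at a later variable xk:
    some vk works with vi but not vj, and some vk2 works with vj but not vi.
    """
    for idx_k, xk in enumerate(variables):
        for xi, xj in combinations(variables[:idx_k], 2):
            for vi in domains[xi]:
                for vj in domains[xj]:
                    if not compatible(cons, xi, vi, xj, vj):
                        continue
                    only_with_vi = any(
                        compatible(cons, xi, vi, xk, vk)
                        and not compatible(cons, xj, vj, xk, vk)
                        for vk in domains[xk])
                    only_with_vj = any(
                        compatible(cons, xj, vj, xk, vk)
                        and not compatible(cons, xi, vi, xk, vk)
                        for vk in domains[xk])
                    if only_with_vi and only_with_vj:
                        return False  # broken triangle found
    return True
```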
\ No newline at end of file diff --git a/data/2024/aaai/EncryIP: A Practical Encryption-Based Framework for Model Intellectual Property Protection b/data/2024/aaai/EncryIP: A Practical Encryption-Based Framework for Model Intellectual Property Protection new file mode 100644 index 0000000000..5c0e2b0f8f --- /dev/null +++ b/data/2024/aaai/EncryIP: A Practical Encryption-Based Framework for Model Intellectual Property Protection @@ -0,0 +1 @@ +In the rapidly growing digital economy, protecting intellectual property (IP) associated with digital products has become increasingly important. Within this context, machine learning (ML) models, being highly valuable digital assets, have gained significant attention for IP protection. This paper introduces a practical encryption-based framework called EncryIP, which seamlessly integrates a public-key encryption scheme into the model learning process. This approach enables the protected model to generate randomized and confused labels, ensuring that only individuals with accurate secret keys, signifying authorized users, can decrypt and reveal authentic labels. Importantly, the proposed framework not only facilitates the protected model to multiple authorized users without requiring repetitive training of the original ML model with IP protection methods but also maintains the model's performance without compromising its accuracy. Compared to existing methods like watermark-based, trigger-based, and passport-based approaches, EncryIP demonstrates superior effectiveness in both training protected models and efficiently detecting the unauthorized spread of ML models. \ No newline at end of file diff --git a/data/2024/aaai/End-to-End Learning of LTLf Formulae by Faithful LTLf Encoding b/data/2024/aaai/End-to-End Learning of LTLf Formulae by Faithful LTLf Encoding new file mode 100644 index 0000000000..d90326e27c --- /dev/null +++ b/data/2024/aaai/End-to-End Learning of LTLf Formulae by Faithful LTLf Encoding @@ -0,0 +1 @@ +It is important to automatically discover the underlying tree-structured formulae from large amounts of data. In this paper, we examine learning linear temporal logic on finite traces (LTLf) formulae, which is a tree structure syntactically and characterizes temporal properties semantically. Its core challenge is to bridge the gap between the concise tree-structured syntax and the complex LTLf semantics. Besides, the learning quality is endangered by explosion of the search space and wrong search bias guided by imperfect data. We tackle these challenges by proposing an LTLf encoding method to parameterize a neural network so that the neural computation is able to simulate the inference of LTLf formulae. We first identify faithful LTLf encoding, a subclass of LTLf encoding, which has a one-to-one correspondence to LTLf formulae. Faithful encoding guarantees that the learned parameter assignment of the neural network can directly be interpreted to an LTLf formula. With such an encoding method, we then propose an end-to-end approach, TLTLf, to learn LTLf formulae through neural networks parameterized by our LTLf encoding method. Experimental results demonstrate that our approach achieves state-of-the-art performance with up to 7% improvement in accuracy, highlighting the benefits of introducing the faithful LTLf encoding. 
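Since the abstract above hinges on the gap between LTLf's tree-structured syntax and its finite-trace semantics, a compact reference evaluator of the standard LTLf semantics may help; it is not the paper's neural encoding, just the textbook recursive definition over a finite trace (formulas as nested tuples are a hypothetical representation).

```python
def holds(phi, trace, i=0):
    """Standard LTLf semantics: does formula `phi` hold on `trace` at position i?

    trace: list of sets of atomic propositions, e.g. [{"p"}, {"p", "q"}, set()]
    phi:   nested tuples, e.g. ("U", ("ap", "p"), ("ap", "q")) for p U q
    """
    op = phi[0]
    if op == "ap":                       # atomic proposition
        return phi[1] in trace[i]
    if op == "not":
        return not holds(phi[1], trace, i)
    if op == "and":
        return holds(phi[1], trace, i) and holds(phi[2], trace, i)
    if op == "X":                        # strong next: needs a successor position
        return i + 1 < len(trace) and holds(phi[1], trace, i + 1)
    if op == "U":                        # until: phi[2] eventually, phi[1] meanwhile
        return any(holds(phi[2], trace, j)
                   and all(holds(phi[1], trace, k) for k in range(i, j))
                   for j in range(i, len(trace)))
    if op == "F":                        # eventually
        return any(holds(phi[1], trace, j) for j in range(i, len(trace)))
    if op == "G":                        # always, up to the end of the finite trace
        return all(holds(phi[1], trace, j) for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")

# Example: "p holds until q" on a 3-step trace.
trace = [{"p"}, {"p"}, {"q"}]
print(holds(("U", ("ap", "p"), ("ap", "q")), trace))   # True
```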
\ No newline at end of file diff --git a/data/2024/aaai/End-to-End Phase Field Model Discovery Combining Experimentation, Crowdsourcing, Simulation and Learning b/data/2024/aaai/End-to-End Phase Field Model Discovery Combining Experimentation, Crowdsourcing, Simulation and Learning new file mode 100644 index 0000000000..a6514c116a --- /dev/null +++ b/data/2024/aaai/End-to-End Phase Field Model Discovery Combining Experimentation, Crowdsourcing, Simulation and Learning @@ -0,0 +1 @@ +The availability of tera-byte scale experiment data calls for AI driven approaches which automatically discover scientific models from data. Nonetheless, significant challenges present in AI-driven scientific discovery: (i) The annotation of large scale datasets requires fundamental re-thinking in developing scalable crowdsourcing tools. (ii) The learning of scientific models from data calls for innovations beyond black-box neural nets. (iii) Novel visualization & diagnosis tools are needed for the collaboration of experimental and theoretical physicists, and computer scientists. We present Phase-Field-Lab platform for end-to-end phase field model discovery, which automatically discovers phase field physics models from experiment data, integrating experimentation, crowdsourcing, simulation and learning. Phase-Field-Lab combines (i) a streamlined annotation tool which reduces the annotation time (by ~50-75%), while increasing annotation accuracy compared to baseline; (ii) an end-to-end neural model which automatically learns phase field models from data by embedding phase field simulation and existing domain knowledge into learning; and (iii) novel interfaces and visualizations to integrate our platform into the scientific discovery cycle of domain scientists. Our platform is deployed in the analysis of nano-structure evolution in materials under extreme conditions (high temperature and irradiation). Our approach reveals new properties of nano-void defects, which otherwise cannot be detected via manual analysis. \ No newline at end of file diff --git a/data/2024/aaai/End-to-End RGB-D Image Compression via Exploiting Channel-Modality Redundancy b/data/2024/aaai/End-to-End RGB-D Image Compression via Exploiting Channel-Modality Redundancy new file mode 100644 index 0000000000..2fd3e5506d --- /dev/null +++ b/data/2024/aaai/End-to-End RGB-D Image Compression via Exploiting Channel-Modality Redundancy @@ -0,0 +1 @@ +As a kind of 3D data, RGB-D images have been extensively used in object tracking, 3D reconstruction, remote sensing mapping, and other tasks. In the realm of computer vision, the significance of RGB-D images is progressively growing. However, the existing learning-based image compression methods usually process RGB images and depth images separately, which cannot entirely exploit the redundant information between the modalities, limiting the further improvement of the Rate-Distortion performance. With the goal of overcoming the defect, in this paper, we propose a learning-based dual-branch RGB-D image compression framework. Compared with traditional RGB domain compression scheme, a YUV domain compression scheme is presented for spatial redundancy removal. In addition, Intra-Modality Attention (IMA) and Cross-Modality Attention (CMA) are introduced for modal redundancy removal. For the sake of benefiting from cross-modal prior information, Context Prediction Module (CPM) and Context Fusion Module (CFM) are raised in the conditional entropy model which makes the context probability prediction more accurate. 
The experimental results demonstrate our method outperforms existing image compression methods in two RGB-D image datasets. Compared with BPG, our proposed framework can achieve up to 15% bit rate saving for RGB images. \ No newline at end of file diff --git a/data/2024/aaai/End-to-End Real-Time Vanishing Point Detection with Transformer b/data/2024/aaai/End-to-End Real-Time Vanishing Point Detection with Transformer new file mode 100644 index 0000000000..abc3348884 --- /dev/null +++ b/data/2024/aaai/End-to-End Real-Time Vanishing Point Detection with Transformer @@ -0,0 +1 @@ +In this paper, we propose a novel transformer-based end-to-end real-time vanishing point detection method, which is named Vanishing Point TRansformer (VPTR). The proposed method can directly regress the locations of vanishing points from given images. To achieve this goal, we pose vanishing point detection as a point object detection task on the Gaussian hemisphere with region division. Considering low-level features always provide more geometric information which can contribute to accurate vanishing point prediction, we propose a clear architecture where vanishing point queries in the decoder can directly gather multi-level features from CNN backbone with deformable attention in VPTR. Our method does not rely on line detection or Manhattan world assumption, which makes it more flexible to use. VPTR runs at an inferring speed of 140 FPS on one NVIDIA 3090 card. Experimental results on synthetic and real-world datasets demonstrate that our method can be used in both natural and structural scenes, and is superior to other state-of-the-art methods on the balance of accuracy and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/End-to-End Verification for Subgraph Solving b/data/2024/aaai/End-to-End Verification for Subgraph Solving new file mode 100644 index 0000000000..b8cd0d45ef --- /dev/null +++ b/data/2024/aaai/End-to-End Verification for Subgraph Solving @@ -0,0 +1,3 @@ +Modern subgraph-finding algorithm implementations consist of thousands of lines of highly optimized code, and this complexity raises questions about their trustworthiness. Recently, some state-of-the-art subgraph solvers have been enhanced to output machine-verifiable proofs that their results are correct. While this significantly improves reliability, it is not a fully satisfactory solution, since end-users have to trust both the proof checking algorithms and the translation of the high-level graph problem into a low-level 0-1 integer linear program (ILP) used for the proofs. + +In this work, we present the first formally verified toolchain capable of full end-to-end verification for subgraph solving, which closes both of these trust gaps. We have built encoder frontends for various graph problems together with a 0-1 ILP (a.k.a. pseudo-Boolean) proof checker, all implemented and formally verified in the CakeML ecosystem. This toolchain is flexible and extensible, and we use it to build verified proof checkers for both decision and optimization graph problems, namely, subgraph isomorphism, maximum clique, and maximum common (connected) induced subgraph. Our experimental evaluation shows that end-to-end formal verification is now feasible for a wide range of hard graph problems. 
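To illustrate the kind of high-level-to-low-level translation whose trustworthiness the verified toolchain above targets, here is the textbook 0-1 ILP (pseudo-Boolean) encoding of maximum clique: one 0/1 variable per vertex, an at-most-one constraint for every non-edge, and the vertex count as the objective. The helper below just emits those constraints; it is a generic sketch, not the verified encoder frontend from the paper.

```python
from itertools import combinations

def max_clique_pb(num_vertices, edges):
    """Encode maximum clique as pseudo-Boolean (0-1 ILP) constraints.

    Variables x_v in {0,1} select the clique; for every non-adjacent pair
    (u, v) at most one endpoint may be selected; the objective maximizes
    the number of selected vertices.
    """
    edge_set = {frozenset(e) for e in edges}
    constraints = [f"x{u} + x{v} <= 1"
                   for u, v in combinations(range(num_vertices), 2)
                   if frozenset((u, v)) not in edge_set]
    objective = "maximize " + " + ".join(f"x{v}" for v in range(num_vertices))
    return objective, constraints

# Toy graph: a triangle 0-1-2 plus a pendant vertex 3 attached to 2.
obj, cons = max_clique_pb(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
print(obj)
for c in cons:
    print(c)   # "x0 + x3 <= 1" and "x1 + x3 <= 1"
```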
\ No newline at end of file diff --git a/data/2024/aaai/Energy Efficient Streaming Time Series Classification with Attentive Power Iteration b/data/2024/aaai/Energy Efficient Streaming Time Series Classification with Attentive Power Iteration new file mode 100644 index 0000000000..57cf9ccf98 --- /dev/null +++ b/data/2024/aaai/Energy Efficient Streaming Time Series Classification with Attentive Power Iteration @@ -0,0 +1 @@ +Efficiently processing time series data streams in real-time on resource-constrained devices offers significant advantages in terms of enhanced computational energy efficiency and reduced time-related risks. We introduce an innovative streaming time series classification network that utilizes attentive power iteration, enabling real-time processing on resource-constrained devices. Our model continuously updates a compact representation of the entire time series, enhancing classification accuracy while conserving energy and processing time. Notably, it excels in streaming scenarios without requiring complete time series access, enabling swift decisions. Experimental results show that our approach excels in classification accuracy and energy efficiency, with over 70% less consumption and threefold faster task completion than benchmarks. This work advances real-time responsiveness, energy conservation, and operational effectiveness for constrained devices, contributing to optimizing various applications. \ No newline at end of file diff --git a/data/2024/aaai/Engineering an Exact Pseudo-Boolean Model Counter b/data/2024/aaai/Engineering an Exact Pseudo-Boolean Model Counter new file mode 100644 index 0000000000..23604070e3 --- /dev/null +++ b/data/2024/aaai/Engineering an Exact Pseudo-Boolean Model Counter @@ -0,0 +1,3 @@ +Model counting, a fundamental task in computer science, involves determining the number of satisfying assignments to a Boolean formula, typically represented in conjunctive normal form (CNF). While model counting for CNF formulas has received extensive attention with a broad range of applications, the study of model counting for Pseudo-Boolean (PB) formulas has been relatively overlooked. Pseudo-Boolean formulas, being more succinct than propositional Boolean formulas, offer greater flexibility in representing real-world problems. Consequently, there is a crucial need to investigate efficient techniques for model counting for PB formulas. + +In this work, we propose the first exact Pseudo-Boolean model counter, PBCount , that relies on knowledge compilation approach via algebraic decision diagrams. Our extensive empirical evaluation shows that PBCount can compute counts for 1513 instances while the current state-of-the-art approach could only handle 1013 instances. Our work opens up several avenues for future work in the context of model counting for PB formulas, such as the development of preprocessing techniques and exploration of approaches other than knowledge compilation. 
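As a concrete reminder of what PB model counting computes, the brute-force sketch below counts the satisfying assignments of a tiny pseudo-Boolean formula by enumeration; an exact counter such as PBCount must agree with this count on small inputs, but reaches it through knowledge compilation rather than enumeration. The constraint format here is a hypothetical illustration.

```python
from itertools import product

def count_models(num_vars, constraints):
    """Count assignments in {0,1}^n satisfying every PB constraint.

    Each constraint is (coeffs, bound) and means  sum_i coeffs[i]*x_i >= bound.
    Exponential in num_vars; only a reference for tiny formulas.
    """
    return sum(
        all(sum(c * x for c, x in zip(coeffs, assignment)) >= bound
            for coeffs, bound in constraints)
        for assignment in product((0, 1), repeat=num_vars))

# Example: 2*x1 + x2 + x3 >= 2  and  x1 + x2 + x3 <= 2 (written in >= form).
constraints = [([2, 1, 1], 2), ([-1, -1, -1], -2)]
print(count_models(3, constraints))  # 4 models: 100, 101, 110, 011
```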
\ No newline at end of file diff --git a/data/2024/aaai/Enhance Diversified Top-k MaxSAT Solving by Incorporating New Strategy for Generating Diversified Initial Assignments (Student Abstract) b/data/2024/aaai/Enhance Diversified Top-k MaxSAT Solving by Incorporating New Strategy for Generating Diversified Initial Assignments (Student Abstract) new file mode 100644 index 0000000000..2906899f0f --- /dev/null +++ b/data/2024/aaai/Enhance Diversified Top-k MaxSAT Solving by Incorporating New Strategy for Generating Diversified Initial Assignments (Student Abstract) @@ -0,0 +1 @@ +The Diversified Top-k MaxSAT (DTKMS) problem is an extension of MaxSAT. The objective of DTKMS is to find k feasible assignments of a given formula, such that each assignment satisfies all hard clauses and the k assignments together satisfy the maximum number of soft clauses. This paper presents a local search algorithm, DTKMS-DIA, which incorporates a new approach to generating initial assignments. Experimental results indicate that DTKMS-DIA can achieve attractive performance on 826 instances compared with state-of-the-art solvers. \ No newline at end of file diff --git a/data/2024/aaai/Enhance Sketch Recognition's Explainability via Semantic Component-Level Parsing b/data/2024/aaai/Enhance Sketch Recognition's Explainability via Semantic Component-Level Parsing new file mode 100644 index 0000000000..96a05bcb00 --- /dev/null +++ b/data/2024/aaai/Enhance Sketch Recognition's Explainability via Semantic Component-Level Parsing @@ -0,0 +1 @@ +Free-hand sketches are appealing for humans as a universal tool to depict the visual world. Humans can recognize varied sketches of a category easily by identifying the concurrence and layout of the intrinsic semantic components of the category, since humans draw free-hand sketches based a common consensus that which types of semantic components constitute each sketch category. For example, an airplane should at least have a fuselage and wings. Based on this analysis, a semantic component-level memory module is constructed and embedded in the proposed structured sketch recognition network in this paper. The memory keys representing semantic components of each sketch category can be self-learned and enhance the recognition network's explainability. Our proposed networks can deal with different situations of sketch recognition, i.e., with or without semantic components labels of strokes. Experiments on the SPG and SketchIME datasets demonstrate the memory module's flexibility and the recognition network's explainability. The code and data are available at https://github.com/GuangmingZhu/SketchESC. \ No newline at end of file diff --git a/data/2024/aaai/Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis b/data/2024/aaai/Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis new file mode 100644 index 0000000000..cd8251994d --- /dev/null +++ b/data/2024/aaai/Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis @@ -0,0 +1 @@ +The emergence of text-driven motion synthesis technique provides animators with great potential to create efficiently. However, in most cases, textual expressions only contain general and qualitative motion descriptions, while lack fine depiction and sufficient intensity, leading to the synthesized motions that either (a) semantically compliant but uncontrollable over specific pose details, or (b) even deviates from the provided descriptions, bringing animators with undesired cases. 
In this paper, we propose DiffKFC, a conditional diffusion model for text-driven motion synthesis with KeyFrames Collaborated, enabling realistic generation with collaborative and efficient dual-level control: coarse guidance at the semantic level, with only a few keyframes for direct and fine-grained depiction down to the body-posture level. Unlike existing inference-editing diffusion models that incorporate conditions without training, our conditional diffusion model is explicitly trained and can fully exploit correlations among texts, keyframes and the diffused target frames. To preserve the control capability of discrete and sparse keyframes, we customize dilated mask attention modules where only partial valid tokens participate in local-to-global attention, indicated by the dilated keyframe mask. Additionally, we develop a simple yet effective smoothness prior, which steers the generated frames towards seamless keyframe transitions at inference. Extensive experiments show that our model not only achieves state-of-the-art performance in terms of semantic fidelity, but more importantly, is able to satisfy animator requirements through fine-grained guidance without tedious labor. \ No newline at end of file diff --git a/data/2024/aaai/Enhanced Optical Character Recognition by Optical Sensor Combined with BERT and Cosine Similarity Scoring (Student Abstract) b/data/2024/aaai/Enhanced Optical Character Recognition by Optical Sensor Combined with BERT and Cosine Similarity Scoring (Student Abstract) new file mode 100644 index 0000000000..ac89d528cf --- /dev/null +++ b/data/2024/aaai/Enhanced Optical Character Recognition by Optical Sensor Combined with BERT and Cosine Similarity Scoring (Student Abstract) @@ -0,0 +1 @@ +Optical character recognition (OCR) is the technology for identifying text characters embedded within images. Conventional OCR models exhibit performance degradation when processing noisy images. To solve this problem, we propose a novel model, which combines computer vision using an optical sensor with natural language processing based on bidirectional encoder representations from transformers (BERT) and cosine similarity scoring. The proposed model uses a confidence rate to determine whether to utilize the optical sensor alone or BERT/cosine similarity scoring combined with the optical sensor. Experimental results show that the proposed model performs approximately 4.34 times better than conventional OCR. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Bilingual Lexicon Induction via Bi-directional Translation Pair Retrieving b/data/2024/aaai/Enhancing Bilingual Lexicon Induction via Bi-directional Translation Pair Retrieving new file mode 100644 index 0000000000..40c4a30781 --- /dev/null +++ b/data/2024/aaai/Enhancing Bilingual Lexicon Induction via Bi-directional Translation Pair Retrieving @@ -0,0 +1 @@ +Most Bilingual Lexicon Induction (BLI) methods retrieve word translation pairs by finding the closest target word for a given source word based on cross-lingual word embeddings (WEs). However, we find that solely retrieving translations from the source-to-target perspective leads to some false positive translation pairs, which significantly harm the precision of BLI. To address this problem, we propose a novel and effective method to improve translation pair retrieval in cross-lingual WEs.
Specifically, we consider both source-side and target-side perspectives throughout the retrieval process to alleviate false positive word pairings that emanate from a single perspective. On a benchmark dataset of BLI, our proposed method achieves competitive performance compared to existing state-of-the-art (SOTA) methods. It demonstrates effectiveness and robustness across six experimental languages, including similar language pairs and distant language pairs, under both supervised and unsupervised settings. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Cognitive Diagnosis Using Un-interacted Exercises: A Collaboration-Aware Mixed Sampling Approach b/data/2024/aaai/Enhancing Cognitive Diagnosis Using Un-interacted Exercises: A Collaboration-Aware Mixed Sampling Approach new file mode 100644 index 0000000000..dba2c0e763 --- /dev/null +++ b/data/2024/aaai/Enhancing Cognitive Diagnosis Using Un-interacted Exercises: A Collaboration-Aware Mixed Sampling Approach @@ -0,0 +1 @@ +Cognitive diagnosis is a crucial task in computer-aided education, aimed at evaluating students' proficiency levels across various knowledge concepts through exercises. Current models, however, primarily rely on students' answered exercises, neglecting the complex and rich information contained in un-interacted exercises. While recent research has attempted to leverage the data within un-interacted exercises linked to interacted knowledge concepts, aiming to address the long-tail issue, these studies fail to fully explore the informative, un-interacted exercises related to broader knowledge concepts. This oversight results in diminished performance when these models are applied to comprehensive datasets. In response to this gap, we present the Collaborative-aware Mixed Exercise Sampling (CMES) framework, which can effectively exploit the information present in un-interacted exercises linked to un-interacted knowledge concepts. Specifically, we introduce a novel universal sampling module where the training samples comprise not merely raw data slices, but enhanced samples generated by combining weight-enhanced attention mixture techniques. Given the necessity of real response labels in cognitive diagnosis, we also propose a ranking-based pseudo feedback module to regulate students' responses on generated exercises. The versatility of the CMES framework bolsters existing models and improves their adaptability. Finally, we demonstrate the effectiveness and interpretability of our framework through comprehensive experiments on real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Ensemble Clustering with Adaptive High-Order Topological Weights b/data/2024/aaai/Enhancing Ensemble Clustering with Adaptive High-Order Topological Weights new file mode 100644 index 0000000000..140b109769 --- /dev/null +++ b/data/2024/aaai/Enhancing Ensemble Clustering with Adaptive High-Order Topological Weights @@ -0,0 +1 @@ +Ensemble clustering learns more accurate consensus results from a set of weak base clustering results. This technique is more challenging than other clustering algorithms due to the base clustering result set's randomness and the inaccessibility of data features. Existing ensemble clustering methods rely on the Co-association (CA) matrix quality but lack the capability to handle missing connections in base clustering. 
Inspired by the neighborhood high-order and topological similarity theories, this paper proposes a topological ensemble model based on high-order information. Specifically, this paper compensates for missing connections by mining neighborhood high-order connection information in the CA matrix and learning optimal connections with adaptive weights. Afterward, the learned high-quality connections are embedded into topology learning to capture the topology of the base clustering. Finally, we incorporate adaptive high-order connection representation and topology learning into a unified learning framework. To our knowledge, this is the first ensemble clustering work based on topological similarity and high-order connectivity relations. Extensive experiments on multiple datasets demonstrate the effectiveness of the proposed method. The source code of the proposed approach is available at https://github.com/ltyong/awec. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Healthcare Predictions with Deep Learning Models b/data/2024/aaai/Enhancing Healthcare Predictions with Deep Learning Models new file mode 100644 index 0000000000..e2a848aa3b --- /dev/null +++ b/data/2024/aaai/Enhancing Healthcare Predictions with Deep Learning Models @@ -0,0 +1 @@ +This study leverages Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to enhance diagnostics and predictions in healthcare. By training on extensive healthcare datasets, this project aims to improve early disease detection and health risk assessments. Evaluation emphasizes accuracy, reliability, and ethical considerations, including bias mitigation. This research promises to bridge AI advancements and clinical applications, offering significant improvements in diagnostic capabilities and healthcare accessibility. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks b/data/2024/aaai/Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks new file mode 100644 index 0000000000..dc5692d896 --- /dev/null +++ b/data/2024/aaai/Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks @@ -0,0 +1,3 @@ +Recommending suitable jobs to users is a critical task in online recruitment platforms. However, existing job recommendation methods encounter challenges such as the low quality of users' resumes, which hampers their accuracy and practical effectiveness. With the rapid development of large language models (LLMs), utilizing the rich external knowledge encapsulated within them, as well as their powerful reasoning capabilities, is a promising way to complete users' resumes for more accurate recommendations. However, directly leveraging LLMs to enhance recommendation results is not a one-size-fits-all solution, as LLMs may suffer from fabricated generation and few-shot problems, which degrade the quality of resume completion. + +In this paper, we propose a novel LLM-based approach for job recommendation. To alleviate the limitation of fabricated generation for LLMs, we extract accurate and valuable information beyond users' self-description, which helps the LLMs better profile users for resume completion. Specifically, we not only extract users' explicit properties (e.g., skills, interests) from their self-description but also infer users' implicit characteristics from their behaviors for more accurate and meaningful resume completion.
Nevertheless, some users still suffer from few-shot problems, which arise due to scarce interaction records, leading to limited guidance for high-quality resume generation. To address this issue, we propose aligning unpaired low-quality resumes with high-quality generated resumes using Generative Adversarial Networks (GANs), which can refine the resume representations for better recommendation results. Extensive experiments on three large real-world recruitment datasets demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Low-Resource Relation Representations through Multi-View Decoupling b/data/2024/aaai/Enhancing Low-Resource Relation Representations through Multi-View Decoupling new file mode 100644 index 0000000000..1340771ff7 --- /dev/null +++ b/data/2024/aaai/Enhancing Low-Resource Relation Representations through Multi-View Decoupling @@ -0,0 +1,6 @@ +Recently, prompt-tuning with pre-trained language models (PLMs) has demonstrated a significant ability to enhance relation extraction (RE) tasks. +However, in low-resource scenarios, where the available training data is scarce, previous prompt-based methods may still perform poorly for prompt-based representation learning due to a superficial understanding of the relation. +To this end, we highlight the importance of learning high-quality relation representation in low-resource scenarios for RE, and propose a novel prompt-based relation representation method, named MVRE (Multi-View Relation Extraction), to better leverage the capacity of PLMs to improve the performance of RE within the low-resource prompt-tuning paradigm. Specifically, MVRE decouples each relation into different perspectives to encompass multi-view relation representations for maximizing the likelihood during relation inference. +Furthermore, we also design a Global-Local loss and a Dynamic-Initialization method for better alignment of the multi-view relation-representing virtual words, containing the semantics of relation labels during the optimization learning process and initialization. Extensive experiments on +three benchmark datasets show that our method can achieve +state-of-the-art performance in low-resource settings. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Machine Translation Experiences with Multilingual Knowledge Graphs b/data/2024/aaai/Enhancing Machine Translation Experiences with Multilingual Knowledge Graphs new file mode 100644 index 0000000000..f1ea2f7b3b --- /dev/null +++ b/data/2024/aaai/Enhancing Machine Translation Experiences with Multilingual Knowledge Graphs @@ -0,0 +1 @@ +Translating entity names, especially when a literal translation is not correct, poses a significant challenge. Although Machine Translation (MT) systems have achieved impressive results, they still struggle to translate cultural nuances and language-specific context. In this work, we show that the integration of multilingual knowledge graphs into MT systems can address this problem and bring two significant benefits: i) improving the translation of utterances that contain entities by leveraging their human-curated aliases from a multilingual knowledge graph, and, ii) increasing the interpretability of the translation process by providing the user with information from the knowledge graph.
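To make benefit (i) above concrete, the following is a minimal, illustrative sketch (not the authors' system) of how curated target-language aliases from a multilingual knowledge graph could be attached as hints to an MT input; the alias table, entity names, and hint format are hypothetical.

```python
# Hypothetical alias table: entity mention -> {target language: human-curated alias}.
KG_ALIASES = {
    "The Hague": {"it": "L'Aia", "de": "Den Haag"},
}

def annotate_with_aliases(source: str, target_lang: str) -> str:
    """Append knowledge-graph aliases as inline hints so the MT system can copy
    the curated entity rendering instead of translating it literally."""
    hints = [
        f"{mention} = {aliases[target_lang]}"
        for mention, aliases in KG_ALIASES.items()
        if mention in source and target_lang in aliases
    ]
    return source if not hints else source + " | entities: " + "; ".join(hints)

print(annotate_with_aliases("She moved to The Hague in 2019.", "it"))
# She moved to The Hague in 2019. | entities: The Hague = L'Aia
```

The same hint could also be surfaced to the user, which is one way the interpretability benefit (ii) might be exposed.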
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing Multi-Label Classification via Dynamic Label-Order Learning b/data/2024/aaai/Enhancing Multi-Label Classification via Dynamic Label-Order Learning new file mode 100644 index 0000000000..d6dc70b6ee --- /dev/null +++ b/data/2024/aaai/Enhancing Multi-Label Classification via Dynamic Label-Order Learning @@ -0,0 +1 @@ +Generative methods tackle Multi-Label Classification (MLC) by autoregressively generating label sequences. These methods excel at modeling label correlations and have achieved outstanding performance. However, a key challenge is determining the order of labels, as empirical findings indicate the significant impact of different orders on model learning and inference. Previous works adopt static label-ordering methods, assigning a unified label order for all samples based on label frequencies or co-occurrences. Nonetheless, such static methods neglect the unique semantics of each sample. More critically, these methods can cause the model to rigidly memorize training order, resulting in missing labels during inference. In light of these limitations, this paper proposes a dynamic label-order learning approach that adaptively learns a label order for each sample. Specifically, our approach adopts a difficulty-prioritized principle and iteratively constructs the label sequence based on the sample's semantics. To reduce the additional cost incurred by label-order learning, we use the same SEQ2SEQ model for label-order learning and MLC learning and introduce a unified loss function for joint optimization. Extensive experiments on public datasets reveal that our approach greatly outperforms previous methods. We will release our code at https://github.com/KagamiBaka/DLOL. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Multi-Scale Diffusion Prediction via Sequential Hypergraphs and Adversarial Learning b/data/2024/aaai/Enhancing Multi-Scale Diffusion Prediction via Sequential Hypergraphs and Adversarial Learning new file mode 100644 index 0000000000..ecf66e3aae --- /dev/null +++ b/data/2024/aaai/Enhancing Multi-Scale Diffusion Prediction via Sequential Hypergraphs and Adversarial Learning @@ -0,0 +1 @@ +Information diffusion prediction plays a crucial role in understanding the propagation of information in social networks, encompassing both macroscopic and microscopic prediction tasks. Macroscopic prediction estimates the overall impact of information diffusion, while microscopic prediction focuses on identifying the next user to be influenced. While prior research often concentrates on one of these aspects, only a few tackle both concurrently. These two tasks provide complementary insights into the diffusion process at different levels, revealing common traits and unique attributes. The exploration of leveraging common features across these tasks to enhance information prediction remains an underexplored avenue. In this paper, we propose an intuitive and effective model that addresses both macroscopic and microscopic prediction tasks. Our approach considers the interactions and dynamics among cascades at the macro level and incorporates the social homophily of users in social networks at the micro level. Additionally, we introduce adversarial training and orthogonality constraints to ensure the integrity of shared features. Experimental results on four datasets demonstrate that our model significantly outperforms state-of-the-art methods.
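The orthogonality constraint on shared features mentioned above is commonly implemented as a penalty on the correlation between shared and task-specific representations; the snippet below is a generic sketch of one such penalty (an assumption about the form, not the authors' exact loss).

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of the cross-correlation between a batch of shared
    features and task-specific features; minimizing it pushes the two subspaces apart."""
    s = F.normalize(shared, dim=1)    # (batch, dim)
    p = F.normalize(specific, dim=1)  # (batch, dim)
    return (s.t() @ p).pow(2).sum()

# Typical (assumed) use:
# total_loss = task_loss + adversarial_loss + lambda_ortho * orthogonality_penalty(h_shared, h_private)
```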
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing Neural Radiance Fields with Adaptive Multi-Exposure Fusion: A Bilevel Optimization Approach for Novel View Synthesis b/data/2024/aaai/Enhancing Neural Radiance Fields with Adaptive Multi-Exposure Fusion: A Bilevel Optimization Approach for Novel View Synthesis new file mode 100644 index 0000000000..5b25790318 --- /dev/null +++ b/data/2024/aaai/Enhancing Neural Radiance Fields with Adaptive Multi-Exposure Fusion: A Bilevel Optimization Approach for Novel View Synthesis @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have made significant strides in the modeling and rendering of 3D scenes. However, due to the complexity of luminance information, existing NeRF methods often struggle to produce satisfactory renderings when dealing with high and low exposure images. To address this issue, we propose an innovative approach capable of effectively modeling and rendering images under multiple exposure conditions. Our method adaptively learns the characteristics of images under different exposure conditions through an unsupervised evaluator-simulator structure for HDR (High Dynamic Range) fusion. This approach enhances NeRF's comprehension and handling of light variations, leading to the generation of images with appropriate brightness. Simultaneously, we present a bilevel optimization method tailored for novel view synthesis, aiming to harmonize the luminance information of input images while preserving their structural and content consistency. This approach facilitates the concurrent optimization of multi-exposure correction and novel view synthesis, in an unsupervised manner. Through comprehensive experiments conducted on the LOM and LOL datasets, our approach surpasses existing methods, markedly enhancing the task of novel view synthesis for multi-exposure environments and attaining state-of-the-art results. The source code can be found at https://github.com/Archer-204/AME-NeRF. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Off-Policy Constrained Reinforcement Learning through Adaptive Ensemble C Estimation b/data/2024/aaai/Enhancing Off-Policy Constrained Reinforcement Learning through Adaptive Ensemble C Estimation new file mode 100644 index 0000000000..6c91ed68bd --- /dev/null +++ b/data/2024/aaai/Enhancing Off-Policy Constrained Reinforcement Learning through Adaptive Ensemble C Estimation @@ -0,0 +1 @@ +In the domain of real-world agents, the application of Reinforcement Learning (RL) remains challenging due to the necessity for safety constraints. Previously, Constrained Reinforcement Learning (CRL) has predominantly focused on on-policy algorithms. Although these algorithms exhibit a degree of efficacy, their interactivity efficiency in real-world settings is sub-optimal, highlighting the demand for more efficient off-policy methods. However, off-policy CRL algorithms grapple with challenges in precise estimation of the C-function, particularly due to the fluctuations in the constrained Lagrange multiplier. Addressing this gap, our study focuses on the nuances of C-value estimation in off-policy CRL and introduces the Adaptive Ensemble C-learning (AEC) approach to reduce these inaccuracies. Building on state-of-the-art off-policy algorithms, we propose AEC-based CRL algorithms designed for enhanced task optimization. Extensive experiments on nine constrained robotics tasks reveal the superior interaction efficiency and performance of our algorithms in comparison to preceding methods. 
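As a rough illustration of ensemble C-value estimation, the sketch below keeps several C-function (cost-value) heads and combines their mean with a spread-dependent term; the architecture, sizes, and the specific aggregation rule are illustrative assumptions, not the AEC algorithm itself.

```python
import torch
import torch.nn as nn

class EnsembleC(nn.Module):
    """Toy ensemble of C-function heads for constrained RL."""
    def __init__(self, obs_dim: int, act_dim: int, n_heads: int = 5, hidden: int = 64):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(n_heads)
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
        x = torch.cat([obs, act], dim=-1)
        cs = torch.stack([h(x) for h in self.heads], dim=0)  # (n_heads, batch, 1)
        # beta trades accuracy against conservatism; it could itself be adapted
        # (e.g., alongside the Lagrange multiplier) during training.
        return cs.mean(dim=0) + beta * cs.std(dim=0)
```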
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing RAW-to-sRGB with Decoupled Style Structure in Fourier Domain b/data/2024/aaai/Enhancing RAW-to-sRGB with Decoupled Style Structure in Fourier Domain new file mode 100644 index 0000000000..2ddfc6cf9e --- /dev/null +++ b/data/2024/aaai/Enhancing RAW-to-sRGB with Decoupled Style Structure in Fourier Domain @@ -0,0 +1 @@ +RAW to sRGB mapping, which aims to convert RAW images from smartphones into RGB form equivalent to that of Digital Single-Lens Reflex (DSLR) cameras, has become an important area of research. However, current methods often ignore the difference between cell phone RAW images and DSLR camera RGB images, a difference that goes beyond the color matrix and extends to spatial structure due to resolution variations. Recent methods directly rebuild color mapping and spatial structure via shared deep representation, limiting optimal performance. Inspired by the Image Signal Processing (ISP) pipeline, which distinguishes between image restoration and enhancement, we present a novel Neural ISP framework, named FourierISP. This approach breaks the image down into style and structure within the frequency domain, allowing for independent optimization. FourierISP comprises three subnetworks: Phase Enhance Subnet for structural refinement, Amplitude Refine Subnet for color learning, and Color Adaptation Subnet for blending them in a smooth manner. This approach sharpens both color and structure, and extensive evaluations across varied datasets confirm that our approach realizes state-of-the-art results. Code will be available at https://github.com/alexhe101/FourierISP. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Representation of Spiking Neural Networks via Similarity-Sensitive Contrastive Learning b/data/2024/aaai/Enhancing Representation of Spiking Neural Networks via Similarity-Sensitive Contrastive Learning new file mode 100644 index 0000000000..c76536f093 --- /dev/null +++ b/data/2024/aaai/Enhancing Representation of Spiking Neural Networks via Similarity-Sensitive Contrastive Learning @@ -0,0 +1 @@ +Spiking neural networks (SNNs) have recently attracted intensive attention as a promising energy-efficient alternative to conventional artificial neural networks (ANNs): they transmit information in the form of binary spikes rather than continuous activations, so the multiplication of activation and weight can be replaced by addition to save energy. However, the binary spike representation sacrifices the expressive power of SNNs and leads to accuracy degradation compared with ANNs. Considering that improving feature representation is beneficial to training an accurate SNN model, this paper focuses on enhancing the feature representation of the SNN. To this end, we establish a similarity-sensitive contrastive learning framework, where the SNN can capture significantly more information from its ANN counterpart to improve representation by Mutual Information (MI) maximization with layer-wise sensitivity to similarity. Specifically, it enriches the SNN's feature representation by pulling the positive pairs of SNN's and ANN's feature representation of each layer from the same input samples closer together while pushing the negative pairs from different samples further apart. Experimental results show that our method consistently outperforms the current state-of-the-art algorithms on both popular non-spiking static and neuromorphic datasets.
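The pull-together/push-apart objective described above can be illustrated with a standard InfoNCE-style loss over paired SNN and ANN features of the same layer; this is a generic sketch under assumed tensor shapes, not the paper's exact similarity-sensitive objective.

```python
import torch
import torch.nn.functional as F

def layerwise_contrastive_loss(snn_feat: torch.Tensor,
                               ann_feat: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """Positives: SNN/ANN features of the same sample. Negatives: other samples in the batch."""
    z_s = F.normalize(snn_feat.flatten(1), dim=1)   # (batch, dim)
    z_a = F.normalize(ann_feat.flatten(1), dim=1)   # (batch, dim)
    logits = z_s @ z_a.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(z_s.size(0), device=z_s.device)
    return F.cross_entropy(logits, targets)
```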
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing Robotics with Cognitive Capabilities b/data/2024/aaai/Enhancing Robotics with Cognitive Capabilities new file mode 100644 index 0000000000..3b5f486653 --- /dev/null +++ b/data/2024/aaai/Enhancing Robotics with Cognitive Capabilities @@ -0,0 +1 @@ +In the pursuit of creating more effective and adaptable robots, the flourishing field of cognitive robotics has arisen to infuse machines with human-like cognitive functions. This paper delves into the significance of cognitive robotics and charts a course for empowering robots with advanced cognitive capabilities. Drawing inspiration from current research in cognitive architectures, the paper underscores the importance of refined perception, language processing, complex decision-making, emotional intelligence, and cognitive synergy. By integrating these cognitive functions into robotic systems, the goal is to equip robots to operate intelligently in dynamic environments, collaborate seamlessly with humans, and adeptly handle diverse tasks. The proposed enhancements mark crucial strides towards the development of more versatile and capable intelligent robots. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Semi-supervised Domain Adaptation via Effective Target Labeling b/data/2024/aaai/Enhancing Semi-supervised Domain Adaptation via Effective Target Labeling new file mode 100644 index 0000000000..cdcfaa8276 --- /dev/null +++ b/data/2024/aaai/Enhancing Semi-supervised Domain Adaptation via Effective Target Labeling @@ -0,0 +1 @@ +Existing semi-supervised domain adaptation (SSDA) models have exhibited impressive performance on the target domain by effectively utilizing few labeled target samples per class (e.g., 3 samples per class). To guarantee an equal number of labeled target samples for each class, however, they require domain experts to manually recognize a considerable amount of the unlabeled target data. Moreover, as the target samples are not equally informative for shaping the decision boundaries of the learning models, it is crucial to select the most informative target samples for labeling, which is, however, impossible for human selectors. As a remedy, we propose an EFfective Target Labeling (EFTL) framework that harnesses active learning and pseudo-labeling strategies to automatically select some informative target samples to annotate. Concretely, we introduce a novel sample query strategy, called non-maximal degree node suppression (NDNS), that iteratively performs maximal degree node query and non-maximal degree node removal to select representative and diverse target samples for labeling. To learn target-specific characteristics, we propose a novel pseudo-labeling strategy that attempts to label low-confidence target samples accurately via clustering consistency (CC), and then inject information of the model uncertainty into our query process. CC enhances the utilization of the annotation budget and increases the number of “labeled” target samples while requiring no additional manual effort. 
Our proposed EFTL framework can be easily coupled with existing SSDA models, showing significant improvements on three benchmarks \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Student Performance Prediction on Learnersourced Questions with SGNN-LLM Synergy b/data/2024/aaai/Enhancing Student Performance Prediction on Learnersourced Questions with SGNN-LLM Synergy new file mode 100644 index 0000000000..1698f00758 --- /dev/null +++ b/data/2024/aaai/Enhancing Student Performance Prediction on Learnersourced Questions with SGNN-LLM Synergy @@ -0,0 +1,2 @@ +Learnersourcing offers great potential for scalable education through student content creation. However, predicting student performance on learnersourced questions, which is essential for personalizing the learning experience, is challenging due to the inherent noise in student-generated data. Moreover, while conventional graph-based methods can capture the complex network of student and question interactions, they often fall short under cold start conditions where limited student engagement with questions yields sparse data. To address both challenges, we introduce an innovative strategy that synergizes the potential of integrating Signed Graph Neural Networks (SGNNs) and Large Language Model (LLM) embeddings. Our methodology employs a signed bipartite graph to comprehensively model student answers, complemented by a contrastive learning framework that enhances noise resilience. Furthermore, LLM's contribution lies in generating foundational question embeddings, proving especially advantageous in addressing cold start scenarios characterized by limited graph data. +Validation across five real-world datasets sourced from the PeerWise platform underscores our approach's effectiveness. Our method outperforms baselines, showcasing enhanced predictive accuracy and robustness. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Training of Spiking Neural Network with Stochastic Latency b/data/2024/aaai/Enhancing Training of Spiking Neural Network with Stochastic Latency new file mode 100644 index 0000000000..26264f44d3 --- /dev/null +++ b/data/2024/aaai/Enhancing Training of Spiking Neural Network with Stochastic Latency @@ -0,0 +1 @@ +Spiking neural networks (SNNs) have garnered significant attention for their low power consumption when deployed on neuromorphic hardware that operates in orders of magnitude lower power than general-purpose hardware. Direct training methods for SNNs come with an inherent latency for which the SNNs are optimized, and in general, the higher the latency, the better the predictive powers of the models, but at the same time, the higher the energy consumption during training and inference. Furthermore, an SNN model optimized for one particular latency does not necessarily perform well in lower latencies, which becomes relevant in scenarios where it is necessary to switch to a lower latency because of the depletion of onboard energy or other operational requirements. In this work, we propose Stochastic Latency Training (SLT), a direct training method for SNNs that optimizes the model for the given latency but simultaneously offers a minimum reduction of predictive accuracy when shifted to lower inference latencies. We provide heuristics for our approach with partial theoretical justification and experimental evidence showing the state-of-the-art performance of our models on datasets such as CIFAR-10, DVS-CIFAR-10, CIFAR-100, and DVS-Gesture. 
Our code is available at https://github.com/srinuvaasu/SLT \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Transcription Factor Prediction through Multi-Task Learning (Student Abstract) b/data/2024/aaai/Enhancing Transcription Factor Prediction through Multi-Task Learning (Student Abstract) new file mode 100644 index 0000000000..18be3cb3fd --- /dev/null +++ b/data/2024/aaai/Enhancing Transcription Factor Prediction through Multi-Task Learning (Student Abstract) @@ -0,0 +1 @@ +Transcription factors (TFs) play a fundamental role in gene regulation by selectively binding to specific DNA sequences. Understanding the nature and behavior of these TFs is essential for insights into gene regulation dynamics. In this study, we introduce a robust multi-task learning framework specifically tailored to harness both TF-specific annotations and TF-related domain annotations, thereby enhancing the accuracy of TF predictions. Notably, we incorporate cutting-edge language models that have recently garnered attention for their outstanding performance across various fields, particularly in biological computations like protein sequence modeling. Comparative experimental analysis with existing models, DeepTFactor and TFpredict, reveals that our multi-task learning framework achieves an accuracy exceeding 92% across four evaluation metrics on the TF prediction task, surpassing both competitors. Our work marks a significant leap in the domain of TF prediction, enriching our comprehension of gene regulatory mechanisms and paving the way for the discovery of novel regulatory motifs. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations b/data/2024/aaai/Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations new file mode 100644 index 0000000000..f468604a79 --- /dev/null +++ b/data/2024/aaai/Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations @@ -0,0 +1 @@ +Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations at adapting to new speakers of out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose an innovative negation feature learning paradigm that models decoupled speaker attributes as deviations from the complete audio representation by utilizing the subtraction operation. By eliminating superfluous content information from the speaker representation, our negation scheme not only mitigates content leakage, thereby enhancing synthesis robustness, but also improves speaker fidelity. In addition, to facilitate the learning of diverse speaker attributes, we leverage multi-stream Transformers, which retain multiple hypotheses and instigate a training paradigm akin to ensemble learning. To unify these hypotheses and realize the final speaker representation, we employ attention pooling. Finally, in light of the imperative to generate target text utterances in the desired voice, we adopt adaptive layer normalizations to effectively fuse the previously generated speaker representation with the target text representations, as opposed to mere concatenation of the text and audio modalities. Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-à-vis alternative baseline models. 
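A toy sketch of the negation idea follows: the speaker embedding is modeled as the full utterance representation minus a content-focused representation, so residual content is subtracted out. The encoders, feature sizes, and mean pooling are illustrative assumptions, not the proposed architecture (which uses multi-stream Transformers and attention pooling).

```python
import torch
import torch.nn as nn

class NegationSpeakerEncoder(nn.Module):
    """Speaker representation as (full audio representation) - (content representation)."""
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.audio_enc = nn.GRU(n_mels, dim, batch_first=True)    # whole-utterance encoder
        self.content_enc = nn.GRU(n_mels, dim, batch_first=True)  # content-focused encoder

    def forward(self, mel: torch.Tensor) -> torch.Tensor:         # mel: (batch, frames, n_mels)
        full, _ = self.audio_enc(mel)
        content, _ = self.content_enc(mel)
        return full.mean(dim=1) - content.mean(dim=1)              # (batch, dim)
```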
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing the Efficiency of Altruism and Taxes in Affine Congestion Games through Signalling b/data/2024/aaai/Enhancing the Efficiency of Altruism and Taxes in Affine Congestion Games through Signalling new file mode 100644 index 0000000000..a554adbebc --- /dev/null +++ b/data/2024/aaai/Enhancing the Efficiency of Altruism and Taxes in Affine Congestion Games through Signalling @@ -0,0 +1 @@ +We address the problem of improving the worst-case efficiency of pure Nash equilibria (aka, the price of anarchy) in affine congestion games, through a novel use of signalling. We assume that, for each player in the game, a most preferred strategy is publicly signalled. This can be done either distributedly by the players themselves, or be the outcome of some centralized algorithm. We apply this signalling scheme to two well-studied scenarios: games with partially altruistic players and games with resource taxation. We show a significant improvement in the price of anarchy of these games, whenever the aggregate signalled strategy profile is a good approximation of the game social optimum. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing the Robustness of Spiking Neural Networks with Stochastic Gating Mechanisms b/data/2024/aaai/Enhancing the Robustness of Spiking Neural Networks with Stochastic Gating Mechanisms new file mode 100644 index 0000000000..ab499e4d01 --- /dev/null +++ b/data/2024/aaai/Enhancing the Robustness of Spiking Neural Networks with Stochastic Gating Mechanisms @@ -0,0 +1 @@ +Spiking neural networks (SNNs) exploit neural spikes to provide solutions for low-power intelligent applications on neuromorphic hardware. Although SNNs have high computational efficiency due to spiking communication, they still lack resistance to adversarial attacks and noise perturbations. In the brain, neuronal responses generally possess stochasticity induced by ion channels and synapses, while the role of stochasticity in computing tasks is poorly understood. Inspired by this, we elaborate a stochastic gating spiking neural model for layer-by-layer spike communication, introducing stochasticity to SNNs. Through theoretical analysis, our gating model can be viewed as a regularizer that prevents error amplification under attacks. Meanwhile, our work can explain the robustness of Poisson coding. Experimental results prove that our method can be used alone or with existing robust enhancement algorithms to improve SNN robustness and reduce SNN energy consumption. We hope our work will shed new light on the role of stochasticity in the computation of SNNs. Our code is available at https://github.com/DingJianhao/StoG-meets-SNN/. \ No newline at end of file diff --git a/data/2024/aaai/Entropic Open-Set Active Learning b/data/2024/aaai/Entropic Open-Set Active Learning new file mode 100644 index 0000000000..08a367aeab --- /dev/null +++ b/data/2024/aaai/Entropic Open-Set Active Learning @@ -0,0 +1 @@ +Active Learning (AL) aims to enhance the performance of deep models by selecting the most informative samples for annotation from a pool of unlabeled data. Despite impressive performance in closed-set settings, most AL methods fail in real-world scenarios where the unlabeled data contains unknown categories. Recently, a few studies have attempted to tackle the AL problem for the open-set setting. However, these methods focus more on selecting known samples and do not efficiently utilize unknown samples obtained during AL rounds. 
In this work, we propose an Entropic Open-set AL (EOAL) framework which leverages both known and unknown distributions effectively to select informative samples during AL rounds. Specifically, our approach employs two different entropy scores. One measures the uncertainty of a sample with respect to the known-class distributions. The other measures the uncertainty of the sample with respect to the unknown-class distributions. By utilizing these two entropy scores we effectively separate the known and unknown samples from the unlabeled data resulting in better sampling. Through extensive experiments, we show that the proposed method outperforms existing state-of-the-art methods on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Code is available at https://github.com/bardisafa/EOAL. \ No newline at end of file diff --git a/data/2024/aaai/Entropy Induced Pruning Framework for Convolutional Neural Networks b/data/2024/aaai/Entropy Induced Pruning Framework for Convolutional Neural Networks new file mode 100644 index 0000000000..110e541d94 --- /dev/null +++ b/data/2024/aaai/Entropy Induced Pruning Framework for Convolutional Neural Networks @@ -0,0 +1 @@ +Structured pruning techniques have achieved great compression performance on convolutional neural networks for image classification tasks. However, the majority of existing methods are sensitive with respect to the model parameters, and their pruning results may be unsatisfactory when the original model is trained poorly. That is, they need the original model to be fully trained, to obtain useful weight information. This is time-consuming, and makes the effectiveness of the pruning results dependent on the degree of model optimization. To address the above issue, we propose a novel metric named Average Filter Information Entropy (AFIE). It decomposes the weight matrix of each layer into a low-rank space, and quantifies the filter importance based on the distribution of the normalized eigenvalues. Intuitively, the eigenvalues capture the covariance among filters, and therefore could be a good guide for pruning. Since the distribution of eigenvalues is robust to the updating of parameters, AFIE can yield a stable evaluation for the importance of each filter no matter whether the original model is trained fully. We implement our AFIE-based pruning method for three popular CNN models of AlexNet, VGG-16, and ResNet-50, and test them on three widely-used image datasets MNIST, CIFAR-10, and ImageNet, respectively. The experimental results are encouraging. We surprisingly observe that for our methods, even when the original model is trained with only one epoch, the AFIE score of each filter keeps identical to the results when the model is fully-trained. This fully indicates the effectiveness of the proposed pruning method. \ No newline at end of file diff --git a/data/2024/aaai/Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees b/data/2024/aaai/Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees new file mode 100644 index 0000000000..b39007cf36 --- /dev/null +++ b/data/2024/aaai/Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees @@ -0,0 +1 @@ +Identifying safe areas is a key point to guarantee trust for systems that are based on Deep Neural Networks (DNNs). 
To this end, we introduce the AllDNN-Verification problem: given a safety property and a DNN, enumerate the set of all the regions of the property input domain which are safe, i.e., where the property does hold. Due to the #P-hardness of the problem, we propose an efficient approximation method called ε-ProVe. Our approach exploits a controllable underestimation of the output reachable sets obtained via statistical prediction of tolerance limits, and can provide a tight —with provable probabilistic guarantees— lower estimate of the safe areas. Our empirical evaluation on different standard benchmarks shows the scalability and effectiveness of our method, offering valuable insights for this new type of verification of DNNs. \ No newline at end of file diff --git a/data/2024/aaai/Envy-Free House Allocation under Uncertain Preferences b/data/2024/aaai/Envy-Free House Allocation under Uncertain Preferences new file mode 100644 index 0000000000..fa8e8ca215 --- /dev/null +++ b/data/2024/aaai/Envy-Free House Allocation under Uncertain Preferences @@ -0,0 +1 @@ +Envy-freeness is one of the most important fairness concerns when allocating items. We study envy-free house allocation when agents have uncertain preferences over items and consider several well-studied preference uncertainty models. The central problem that we focus on is computing an allocation that has the highest probability of being envy-free. We show that each model leads to a distinct set of algorithmic and complexity results, including detailed results on (in-)approximability. En route, we consider two related problems of checking whether there exists an allocation that is possibly or necessarily envy-free. We give a complete picture of the computational complexity of these two problems for all the uncertainty models we consider. \ No newline at end of file diff --git a/data/2024/aaai/Episodic Return Decomposition by Difference of Implicitly Assigned Sub-trajectory Reward b/data/2024/aaai/Episodic Return Decomposition by Difference of Implicitly Assigned Sub-trajectory Reward new file mode 100644 index 0000000000..b007ad2326 --- /dev/null +++ b/data/2024/aaai/Episodic Return Decomposition by Difference of Implicitly Assigned Sub-trajectory Reward @@ -0,0 +1 @@ +Real-world decision-making problems are usually accompanied by delayed rewards, which affects the sample efficiency of Reinforcement Learning, especially in the extremely delayed case where the only feedback is the episodic reward obtained at the end of an episode. Episodic return decomposition is a promising way to deal with the episodic-reward setting. Several corresponding algorithms have shown remarkable effectiveness of the learned step-wise proxy rewards from return decomposition. However, these existing methods lack either attribution or representation capacity, leading to inefficient decomposition in the case of long-term episodes. In this paper, we propose a novel episodic return decomposition method called Diaster (Difference of implicitly assigned sub-trajectory reward). Diaster decomposes any episodic reward into credits of two divided sub-trajectories at any cut point, and the step-wise proxy rewards come from differences in expectation. We theoretically and empirically verify that the decomposed proxy reward function can guide the policy to be nearly optimal. Experimental results show that our method outperforms previous state-of-the-art methods in terms of both sample efficiency and performance. The code is available at https://github.com/HxLyn3/Diaster. 
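The cut-point decomposition can be pictured with a small credit network that scores sub-trajectories: the episodic return is regressed onto credit(prefix) + credit(suffix) at a random cut, and the step-wise proxy reward is the difference of consecutive prefix credits. The sketch below makes assumptions about the encoder and shapes and is not the authors' implementation.

```python
import torch
import torch.nn as nn

class SubTrajCredit(nn.Module):
    """Toy credit model for return decomposition in the spirit of Diaster."""
    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def credit(self, states: torch.Tensor) -> torch.Tensor:   # states: (batch, steps, obs_dim)
        h, _ = self.encoder(states)
        return self.head(h[:, -1]).squeeze(-1)                # (batch,)

    def decomposition_loss(self, states, episodic_return, cut: int) -> torch.Tensor:
        """Fit credit(prefix) + credit(suffix) to the episodic return at a given cut point."""
        pred = self.credit(states[:, :cut]) + self.credit(states[:, cut:])
        return (pred - episodic_return).pow(2).mean()

    def proxy_reward(self, states: torch.Tensor, t: int) -> torch.Tensor:
        """r_t approximated by credit(s_0..s_{t+1}) - credit(s_0..s_t)."""
        return self.credit(states[:, : t + 2]) - self.credit(states[:, : t + 1])
```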
\ No newline at end of file diff --git a/data/2024/aaai/Equity-Transformer: Solving NP-Hard Min-Max Routing Problems as Sequential Generation with Equity Context b/data/2024/aaai/Equity-Transformer: Solving NP-Hard Min-Max Routing Problems as Sequential Generation with Equity Context new file mode 100644 index 0000000000..f7af5e5dd0 --- /dev/null +++ b/data/2024/aaai/Equity-Transformer: Solving NP-Hard Min-Max Routing Problems as Sequential Generation with Equity Context @@ -0,0 +1 @@ +Min-max routing problems aim to minimize the maximum tour length among multiple agents as they collaboratively visit all cities, i.e., the completion time. These problems include impactful real-world applications but are known as NP-hard. Existing methods are facing challenges, particularly in large-scale problems that require the coordination of numerous agents to cover thousands of cities. This paper proposes Equity-Transformer to solve large-scale min-max routing problems. First, we model min-max routing problems into sequential planning, reducing the complexity and enabling the use of a powerful Transformer architecture. Second, we propose key inductive biases that ensure equitable workload distribution among agents. The effectiveness of Equity-Transformer is demonstrated through its superior performance in two representative min-max routing tasks: the min-max multi-agent traveling salesman problem (min-max mTSP) and the min-max multi-agent pick-up and delivery problem (min-max mPDP). Notably, our method achieves significant reductions of runtime, approximately 335 times, and cost values of about 53% compared to a competitive heuristic (LKH3) in the case of 100 vehicles with 1,000 cities of mTSP. We provide reproducible source code: https://github.com/kaist-silab/equity-transformer. \ No newline at end of file diff --git a/data/2024/aaai/Equivalence between Graph Spectral Clustering and Column Subset Selection (Student Abstract) b/data/2024/aaai/Equivalence between Graph Spectral Clustering and Column Subset Selection (Student Abstract) new file mode 100644 index 0000000000..db53c5f2a6 --- /dev/null +++ b/data/2024/aaai/Equivalence between Graph Spectral Clustering and Column Subset Selection (Student Abstract) @@ -0,0 +1 @@ +The common criteria for evaluating spectral clustering are NCut and RatioCut. The seemingly unrelated column subset selection (CSS) problem aims to compute a column subset that linearly approximates the entire matrix. A common criterion is the approximation error in the Frobenius norm (ApproxErr). We show that any algorithm for CSS can be viewed as a clustering algorithm that minimizes NCut by applying it to a matrix formed from graph edges. Conversely, any clustering algorithm can be seen as identifying a column subset from that matrix. In both cases, ApproxErr and NCut have the same value. Analogous results hold for RatioCut with a slightly different matrix. Therefore, established results for CSS can be mapped to spectral clustering. We use this to obtain new clustering algorithms, including an optimal one that is similar to A*. This is the first nontrivial clustering algorithm with such an optimality guarantee. A variant of the weighted A* runs much faster and provides bounds on the accuracy. Finally, we use the results from spectral clustering to prove the NP-hardness of CSS from sparse matrices. 
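For reference, the NCut criterion mentioned above is, for a clustering A_1, ..., A_k of a weighted graph, the sum over clusters of cut(A_i, V \ A_i) / vol(A_i); a small helper computing it on a toy graph follows (this is the standard definition, independent of the paper's particular matrix construction).

```python
import numpy as np

def ncut(W: np.ndarray, labels: np.ndarray) -> float:
    """Normalized cut of a clustering given a symmetric weighted adjacency matrix W."""
    value = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        cut = W[np.ix_(in_c, ~in_c)].sum()   # weight leaving the cluster
        vol = W[in_c].sum()                  # total degree inside the cluster
        value += cut / vol if vol > 0 else 0.0
    return value

W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
print(ncut(W, np.array([0, 0, 1, 1])))   # two natural clusters -> small NCut
```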
\ No newline at end of file diff --git a/data/2024/aaai/Estimating On-Road Transportation Carbon Emissions from Open Data of Road Network and Origin-Destination Flow Data b/data/2024/aaai/Estimating On-Road Transportation Carbon Emissions from Open Data of Road Network and Origin-Destination Flow Data new file mode 100644 index 0000000000..2676c2f58c --- /dev/null +++ b/data/2024/aaai/Estimating On-Road Transportation Carbon Emissions from Open Data of Road Network and Origin-Destination Flow Data @@ -0,0 +1 @@ +Accounting for over 20% of the total carbon emissions, the precise estimation of on-road transportation carbon emissions is crucial for carbon emission monitoring and efficient mitigation policy formulation. However, existing estimation methods typically depend on hard-to-collect individual statistics of vehicle miles traveled to calculate emissions, thereby suffering from high data collection difficulty. To relieve this issue by utilizing the strong pattern recognition of artificial intelligence, we incorporate two sources of open data representative of the transportation demand and capacity factors, the origin-destination (OD) flow data and the road network data, to build a hierarchical heterogeneous graph learning method for on-road carbon emission estimation (HENCE). Specifically, a hierarchical graph consisting of the road network level, community level, and region level is constructed to model the multi-scale road network-based connectivity and travel connection between spatial areas. Heterogeneous graphs consisting of OD links and spatial links are further built at both the community level and region level to capture the intrinsic interactions between travel demand and road network accessibility. Extensive experiments on two large-scale real-world datasets demonstrate HENCE's effectiveness and superiority with R-squared exceeding 0.75 and outperforming baselines by 9.60% on average, validating its success in pioneering the use of artificial intelligence to empower carbon emission management and sustainability development. The implementation codes are available at this link: https://github.com/tsinghua-fib-lab/HENCE. \ No newline at end of file diff --git a/data/2024/aaai/EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer b/data/2024/aaai/EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer new file mode 100644 index 0000000000..b6b5411234 --- /dev/null +++ b/data/2024/aaai/EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer @@ -0,0 +1 @@ +Video Motion Magnification (VMM) aims to break the resolution limit of human visual perception capability and reveal the imperceptible minor motion that contains valuable information in the macroscopic domain. However, challenges arise in this task due to photon noise inevitably introduced by photographic devices and spatial inconsistency in amplification, leading to flickering artifacts in static fields and motion blur and distortion in dynamic fields in the video. Existing methods focus on explicit motion modeling without emphasizing prioritized denoising during the motion magnification process. This paper proposes a novel dynamic filtering strategy to achieve static-dynamic field adaptive denoising. Specifically, based on Eulerian theory, we separate texture and shape to extract motion representation through inter-frame shape differences, expecting to leverage these subdivided features to solve this task finely. 
Then, we introduce a novel dynamic filter that eliminates noise cues and preserves critical features in the motion magnification and amplification generation phases. Overall, our unified framework, EulerMormer, is a pioneering effort that is the first to equip learning-based VMM with a Transformer. The core of the dynamic filter lies in a global dynamic sparse cross-covariance attention mechanism that explicitly removes noise while preserving vital information, coupled with a multi-scale dual-path gating mechanism that selectively regulates the dependence on different frequency features to reduce spatial attenuation and complement motion boundaries. We demonstrate through extensive experiments that EulerMormer achieves more robust video motion magnification from the Eulerian perspective, significantly outperforming state-of-the-art methods. The source code is available at https://github.com/VUT-HFUT/EulerMormer. \ No newline at end of file diff --git a/data/2024/aaai/Evaluate Geometry of Radiance Fields with Low-Frequency Color Prior b/data/2024/aaai/Evaluate Geometry of Radiance Fields with Low-Frequency Color Prior new file mode 100644 index 0000000000..835f85e013 --- /dev/null +++ b/data/2024/aaai/Evaluate Geometry of Radiance Fields with Low-Frequency Color Prior @@ -0,0 +1,2 @@ +A radiance field is an effective representation of 3D scenes, which has been widely adopted in novel-view synthesis and 3D reconstruction. It is still an open and challenging problem to evaluate the geometry, i.e., the density field, as the ground-truth is almost impossible to obtain. One alternative indirect solution is to transform the density field into a point-cloud and compute its Chamfer Distance with the scanned ground-truth. However, many widely-used datasets have no point-cloud ground-truth since the scanning process along with the equipment is expensive and complicated. +To this end, we propose a novel metric, named Inverse Mean Residual Color (IMRC), which can evaluate the geometry only with the observation images. Our key insight is that the better the geometry, the lower-frequency the computed color field. From this insight, given a reconstructed density field and observation images, we design a closed-form method to approximate the color field with low-frequency spherical harmonics, and compute the inverse mean residual color. Then the higher the IMRC, the better the geometry. Qualitative and quantitative experimental results verify the effectiveness of our proposed IMRC metric. We also benchmark several state-of-the-art methods using IMRC to promote future related research. Our code is available at https://github.com/qihangGH/IMRC. \ No newline at end of file diff --git a/data/2024/aaai/Evaluating AI Red Teaming's Readiness to Address Environmental Harms: A Thematic Analysis of LLM Discourse b/data/2024/aaai/Evaluating AI Red Teaming's Readiness to Address Environmental Harms: A Thematic Analysis of LLM Discourse new file mode 100644 index 0000000000..8f095983b4 --- /dev/null +++ b/data/2024/aaai/Evaluating AI Red Teaming's Readiness to Address Environmental Harms: A Thematic Analysis of LLM Discourse @@ -0,0 +1 @@ +This research explores the discourse surrounding red teaming and aims to identify any themes in the online discussion of potential environmental harms stemming from Large Language Models (LLMs).
Focusing on the AI Red Teaming event at DEFCON 31, this study employs reflexive thematic analysis on diverse social networking site sources to extract insights into public discussion of LLM red teaming and its environmental implications. The findings intend to inform future research, highlighting the need for responsible AI development that addresses environmental concerns. \ No newline at end of file diff --git a/data/2024/aaai/Evaluating Pre-trial Programs Using Interpretable Machine Learning Matching Algorithms for Causal Inference b/data/2024/aaai/Evaluating Pre-trial Programs Using Interpretable Machine Learning Matching Algorithms for Causal Inference new file mode 100644 index 0000000000..d4c7c2dcc4 --- /dev/null +++ b/data/2024/aaai/Evaluating Pre-trial Programs Using Interpretable Machine Learning Matching Algorithms for Causal Inference @@ -0,0 +1 @@ +After a person is arrested and charged with a crime, they may be released on bail and required to participate in a community supervision program while awaiting trial. These 'pre-trial programs' are common throughout the United States, but very little research has demonstrated their effectiveness. Researchers have emphasized the need for more rigorous program evaluation methods, which we introduce in this article. We describe a program evaluation pipeline that uses recent interpretable machine learning techniques for observational causal inference, and demonstrate these techniques in a study of a pre-trial program in Durham, North Carolina. Our findings show no evidence that the program either significantly increased or decreased the probability of new criminal charges. If these findings replicate, the criminal-legal system needs to either improve pre-trial programs or consider alternatives to them. The simplest option is to release low-risk individuals back into the community without subjecting them to any restrictions or conditions. Another option is to assign individuals to pre-trial programs that incentivize pro-social behavior. We believe that the techniques introduced here can provide researchers the rigorous tools they need to evaluate these programs. \ No newline at end of file diff --git a/data/2024/aaai/Evaluating the Effectiveness of Explainable Artificial Intelligence Approaches (Student Abstract) b/data/2024/aaai/Evaluating the Effectiveness of Explainable Artificial Intelligence Approaches (Student Abstract) new file mode 100644 index 0000000000..b3e46e49ac --- /dev/null +++ b/data/2024/aaai/Evaluating the Effectiveness of Explainable Artificial Intelligence Approaches (Student Abstract) @@ -0,0 +1 @@ +Explainable Artificial Intelligence (XAI), a promising future technology in the field of healthcare, has attracted significant interest. Despite ongoing efforts in the development of XAI approaches, there has been inadequate evaluation of explanation effectiveness and no standardized framework for the evaluation has been established. This study aims to examine the relationship between subjective interpretability and perceived plausibility for various XAI explanations and to determine the factors affecting users' acceptance of the XAI explanation. 
\ No newline at end of file diff --git a/data/2024/aaai/Evaluating the Efficacy of Prompting Techniques for Debiasing Language Model Outputs (Student Abstract) b/data/2024/aaai/Evaluating the Efficacy of Prompting Techniques for Debiasing Language Model Outputs (Student Abstract) new file mode 100644 index 0000000000..3211073722 --- /dev/null +++ b/data/2024/aaai/Evaluating the Efficacy of Prompting Techniques for Debiasing Language Model Outputs (Student Abstract) @@ -0,0 +1 @@ +Achieving fairness in Large Language Models (LLMs) continues to pose a persistent challenge, as these models are prone to inheriting biases from their training data, which can subsequently impact their performance in various applications. There is a need to systematically explore whether structured prompting techniques can offer opportunities for debiased text generation by LLMs. In this work, we designed an evaluative framework to test the efficacy of different prompting techniques for debiasing text along different dimensions. We aim to devise a general structured prompting approach to achieve fairness that generalizes well to different texts and LLMs. \ No newline at end of file diff --git a/data/2024/aaai/Evaluation of Large Language Models on Code Obfuscation (Student Abstract) b/data/2024/aaai/Evaluation of Large Language Models on Code Obfuscation (Student Abstract) new file mode 100644 index 0000000000..45cfe94f62 --- /dev/null +++ b/data/2024/aaai/Evaluation of Large Language Models on Code Obfuscation (Student Abstract) @@ -0,0 +1 @@ +Obfuscation intends to decrease interpretability of code and identification of code behavior. Large Language Models(LLMs) have been proposed for code synthesis and code analysis. This paper attempts to understand how well LLMs can analyse code and identify code behavior. Specifically, this paper systematically evaluates several LLMs’ capabilities to detect obfuscated code and identify behavior across a variety of obfuscation techniques with varying levels of complexity. LLMs proved to be better at detecting obfuscations that changed identifiers, even to misleading ones, compared to obfuscations involving code insertions (unused variables, as well as variables that replace constants with expressions that evaluate to those constants). Hardest to detect were obfuscations that layered multiple simple transformations. For these, only 20-40% of the LLMs’ responses were correct. Adding misleading documentation was also successful in misleading LLMs. We provide all our code to replicate results at https://github.com/SwindleA/LLMCodeObfuscation. Overall, our results suggest a gap in LLMs’ ability to understand code. \ No newline at end of file diff --git a/data/2024/aaai/Every Node Is Different: Dynamically Fusing Self-Supervised Tasks for Attributed Graph Clustering b/data/2024/aaai/Every Node Is Different: Dynamically Fusing Self-Supervised Tasks for Attributed Graph Clustering new file mode 100644 index 0000000000..11f0f51032 --- /dev/null +++ b/data/2024/aaai/Every Node Is Different: Dynamically Fusing Self-Supervised Tasks for Attributed Graph Clustering @@ -0,0 +1 @@ +Attributed graph clustering is an unsupervised task that partitions nodes into different groups. Self-supervised learning (SSL) shows great potential in handling this task, and some recent studies simultaneously learn multiple SSL tasks to further boost performance. Currently, different SSL tasks are assigned the same set of weights for all graph nodes. 
However, we observe that some graph nodes whose neighbors are in different groups require significantly different emphases on SSL tasks. In this paper, we propose to dynamically learn the weights of SSL tasks for different nodes and fuse the embeddings learned from different SSL tasks to boost performance. We design an innovative graph clustering approach, namely Dynamically Fusing Self-Supervised Learning (DyFSS). Specifically, DyFSS fuses features extracted from diverse SSL tasks using distinct weights derived from a gating network. To effectively learn the gating network, we design a dual-level self-supervised strategy that incorporates pseudo labels and the graph structure. Extensive experiments on five datasets show that DyFSS outperforms the state-of-the-art multi-task SSL methods by up to 8.66% on the accuracy metric. The code of DyFSS is available at: https://github.com/q086/DyFSS. \ No newline at end of file diff --git a/data/2024/aaai/Everything2Motion: Synchronizing Diverse Inputs via a Unified Framework for Human Motion Synthesis b/data/2024/aaai/Everything2Motion: Synchronizing Diverse Inputs via a Unified Framework for Human Motion Synthesis new file mode 100644 index 0000000000..33c6eb8400 --- /dev/null +++ b/data/2024/aaai/Everything2Motion: Synchronizing Diverse Inputs via a Unified Framework for Human Motion Synthesis @@ -0,0 +1 @@ +In the dynamic field of film and game development, the emergence of human motion synthesis methods has revolutionized avatar animation. Traditional methodologies, typically reliant on single modality inputs like text or audio, employ modality-specific model frameworks, posing challenges for unified model deployment and application. To address this, we propose Everything2Motion, a unified model framework. Everything2Motion consists of three key modules. The Input-Output Modality Modulation module tailors structures for specific multimodal inputs, eliminating the need for modality-specific frameworks. The Query-aware Autoencoder, based on the transformer encoder-decoder architecture, enables efficient latent motion generation. Lastly, the Prior Motion Distillation Decoder, a pretrained module, enhances the final skeleton sequence's naturalness and fluidity. Comprehensive experiments on several public datasets demonstrate the effectiveness of Everything2Motion, highlighting its potential for practical applications and setting a new benchmark in human motion synthesis. \ No newline at end of file diff --git a/data/2024/aaai/Evidential Uncertainty-Guided Mitochondria Segmentation for 3D EM Images b/data/2024/aaai/Evidential Uncertainty-Guided Mitochondria Segmentation for 3D EM Images new file mode 100644 index 0000000000..2590598812 --- /dev/null +++ b/data/2024/aaai/Evidential Uncertainty-Guided Mitochondria Segmentation for 3D EM Images @@ -0,0 +1 @@ +Recent advances in deep learning have greatly improved the segmentation of mitochondria from Electron Microscopy (EM) images. However, suffering from variations in mitochondrial morphology, imaging conditions, and image noise, existing methods still exhibit high uncertainty in their predictions. Moreover, in view of our findings, predictions with high levels of uncertainty are often accompanied by inaccuracies such as ambiguous boundaries and amount of false positive segments. 
To deal with the above problems, we propose a novel approach for mitochondria segmentation in 3D EM images that leverages evidential uncertainty estimation, which for the first time integrates evidential uncertainty to enhance the performance of segmentation. To be more specific, our proposed method not only provides accurate segmentation results, but also estimates associated uncertainty. Then, the estimated uncertainty is used to help improve the segmentation performance by an uncertainty rectification module, which leverages uncertainty maps and multi-scale information to refine the segmentation. Extensive experiments conducted on four challenging benchmarks demonstrate the superiority of our proposed method over existing approaches. \ No newline at end of file diff --git a/data/2024/aaai/Evolving Parameterized Prompt Memory for Continual Learning b/data/2024/aaai/Evolving Parameterized Prompt Memory for Continual Learning new file mode 100644 index 0000000000..cabf6c3625 --- /dev/null +++ b/data/2024/aaai/Evolving Parameterized Prompt Memory for Continual Learning @@ -0,0 +1 @@ +Recent studies have demonstrated the potency of leveraging prompts in Transformers for continual learning (CL). Nevertheless, employing a discrete key-prompt bottleneck can lead to selection mismatches and inappropriate prompt associations during testing. Furthermore, this approach hinders adaptive prompting due to the lack of shareability among nearly identical instances at more granular level. To address these challenges, we introduce the Evolving Parameterized Prompt Memory (EvoPrompt), a novel method involving adaptive and continuous prompting attached to pre-trained Vision Transformer (ViT), conditioned on specific instance. We formulate a continuous prompt function as a neural bottleneck and encode the collection of prompts on network weights. We establish a paired prompt memory system consisting of a stable reference and a flexible working prompt memory. Inspired by linear mode connectivity, we progressively fuse the working prompt memory and reference prompt memory during inter-task periods, resulting in continually evolved prompt memory. This fusion involves aligning functionally equivalent prompts using optimal transport and aggregating them in parameter space with an adjustable bias based on prompt node attribution. Additionally, to enhance backward compatibility, we propose compositional classifier initialization, which leverages prior prototypes from pre-trained models to guide the initialization of new classifiers in a subspace-aware manner. Comprehensive experiments validate that our approach achieves state-of-the-art performance in both class and domain incremental learning scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Exact ASP Counting with Compact Encodings b/data/2024/aaai/Exact ASP Counting with Compact Encodings new file mode 100644 index 0000000000..4e9c5cc070 --- /dev/null +++ b/data/2024/aaai/Exact ASP Counting with Compact Encodings @@ -0,0 +1,26 @@ +Answer Set Programming (ASP) has emerged as a promising +paradigm in knowledge representation and automated reason- +ing owing to its ability to model hard combinatorial problems +from diverse domains in a natural way. Building on advances +in propositional SAT solving, the past two decades have wit- +nessed the emergence of well-engineered systems for solv- +ing the answer set satisfiability problem, i.e., finding mod- +els or answer sets for a given answer set program. 
In recent
+years, there has been growing interest in problems beyond
+satisfiability, such as model counting, in the context of
+ASP. Akin to the early days of propositional model counting,
+state-of-the-art exact answer set counters do not scale
+well beyond small instances. Exact ASP counters struggle
+with handling large input formulas. The primary contribution
+of this paper is a new ASP counting framework, called
+sharpASP, which counts answer sets while avoiding large input
+formulas. This relies on an alternative way of defining answer
+sets that allows lifting of key techniques developed in the context
+of propositional model counting. Our extensive empirical
+analysis over 1470 benchmarks demonstrates a significant performance
+gain over current state-of-the-art exact answer set
+counters. Specifically, by using sharpASP, we were able to
+solve 1062 benchmarks with a PAR2 score of 3082 whereas,
+using the prior state of the art, we could only solve 895 benchmarks
+with a PAR2 score of 4205, all other experimental conditions
+being the same. \ No newline at end of file diff --git a/data/2024/aaai/Exact Algorithms and Lowerbounds for Multiagent Path Finding: Power of Treelike Topology b/data/2024/aaai/Exact Algorithms and Lowerbounds for Multiagent Path Finding: Power of Treelike Topology new file mode 100644 index 0000000000..62aea62d4b --- /dev/null +++ b/data/2024/aaai/Exact Algorithms and Lowerbounds for Multiagent Path Finding: Power of Treelike Topology @@ -0,0 +1,11 @@ +In the Multiagent Path Finding (MAPF for short) problem, we focus on efficiently finding non-colliding paths for a set of k agents on a given graph G, where each agent seeks a path from its source vertex to a target.
+An important measure of the quality of the solution is the length of the proposed schedule l, that is, the length of a longest path (including the waiting time).
+In this work, we propose a systematic study under the parameterized complexity framework. The hardness results we provide align with many heuristics used for this problem, whose running time could potentially be improved based on our Fixed-Parameter Tractability (FPT) results.
+
+We show that MAPF is W[1]-hard with respect to k (even if k is combined with the maximum degree of the input graph).
+The problem remains NP-hard in planar graphs even if the maximum degree and the makespan l are fixed constants.
+On the positive side, we show an FPT algorithm for k+l.
+
+As we continue, the structure of G comes into play.
+We give an FPT algorithm for parameter k plus the diameter of the graph G.
+The MAPF problem is W[1]-hard for cliquewidth of G plus l while it is FPT for treewidth of G plus l. \ No newline at end of file diff --git a/data/2024/aaai/Exact Inference for Continuous-Time Gaussian Process Dynamics b/data/2024/aaai/Exact Inference for Continuous-Time Gaussian Process Dynamics new file mode 100644 index 0000000000..d8290004b4 --- /dev/null +++ b/data/2024/aaai/Exact Inference for Continuous-Time Gaussian Process Dynamics @@ -0,0 +1 @@ +Many physical systems can be described as continuous-time dynamical systems. In practice, the true system is often unknown and has to be learned from measurement data. Since data is typically collected in discrete time, e.g. by sensors, most methods in Gaussian process (GP) dynamics model learning are trained on one-step ahead predictions. While this scheme is mathematically tempting, it can become problematic in several scenarios, e.g.
if measurements are provided at irregularly-sampled time steps or physical system properties have to be conserved. Thus, we aim for a GP model of the true continuous-time dynamics. We tackle this task by leveraging higher-order numerical integrators. These integrators provide the necessary tools to discretize dynamical systems with arbitrary accuracy. However, most higher-order integrators require dynamics evaluations at intermediate time steps, making exact GP inference intractable. In previous work, this problem is often addressed by approximate inference techniques. However, exact GP inference is preferable in many scenarios, e.g. due to its mathematical guarantees. In order to enable direct inference, we propose to leverage multistep and Taylor integrators. We demonstrate how exact inference schemes can be derived for these types of integrators. Further, we derive tailored sampling schemes that allow one to draw consistent dynamics functions from the posterior. The learned model can thus be integrated with arbitrary integrators, just like a standard dynamical system. We show empirically and theoretically that our approach yields an accurate representation of the continuous-time system. \ No newline at end of file diff --git a/data/2024/aaai/Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption b/data/2024/aaai/Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption new file mode 100644 index 0000000000..bc29556ddd --- /dev/null +++ b/data/2024/aaai/Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption @@ -0,0 +1 @@ +We study offline reinforcement learning (RL) with heavy-tailed reward distribution and data corruption: (i) Moving beyond subGaussian reward distribution, we allow the rewards to have infinite variances; (ii) We allow corruptions where an attacker can arbitrarily modify a small fraction of the rewards and transitions in the dataset. We first derive a sufficient optimality condition for generalized Pessimistic Value Iteration (PEVI), which allows various estimators with proper confidence bounds and can be applied to multiple learning settings. In order to handle the data corruption and heavy-tailed reward setting, we prove that the trimmed-mean estimation achieves the minimax optimal error rate for robust mean estimation under heavy-tailed distributions. In the PEVI algorithm, we plug in the trimmed mean estimation and the confidence bound to solve the robust offline RL problem. Standard analysis reveals that data corruption induces a bias term in the suboptimality gap, which gives the false impression that any data corruption prevents optimal policy learning. By using the optimality condition for the generalized PEVI, we show that as long as the bias term is less than the ``action gap'', the policy returned by PEVI achieves the optimal value given sufficient data. \ No newline at end of file diff --git a/data/2024/aaai/Exact, Fast and Expressive Poisson Point Processes via Squared Neural Families b/data/2024/aaai/Exact, Fast and Expressive Poisson Point Processes via Squared Neural Families new file mode 100644 index 0000000000..186396c1d6 --- /dev/null +++ b/data/2024/aaai/Exact, Fast and Expressive Poisson Point Processes via Squared Neural Families @@ -0,0 +1,7 @@ +We introduce squared neural Poisson point processes (SNEPPPs) by parameterising the intensity function by the squared norm of a two layer neural network. 
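As a rough, self-contained illustration of the parameterisation in the preceding sentence (not code from the paper; the ReLU feature map, layer sizes, and function name are editorial assumptions):

```python
import numpy as np

def snepp_intensity(x, W1, b1, W2):
    """Illustrative sketch: lambda(x) = || W2 @ phi(W1 @ x + b1) ||^2,
    i.e. the squared Euclidean norm of a two-layer network's output.
    The ReLU feature map and the layer shapes are assumptions."""
    hidden = np.maximum(0.0, W1 @ x + b1)  # hidden layer, phi = ReLU (assumed)
    out = W2 @ hidden                      # second (linear) layer
    return float(out @ out)                # squared norm -> non-negative intensity

# Tiny usage example with random parameters (purely illustrative).
rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(16, 2)), rng.normal(size=16), rng.normal(size=(4, 16))
print(snepp_intensity(np.array([0.3, -1.2]), W1, b1, W2))  # always >= 0
```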
+
+When the hidden layer is fixed and the second layer has a single neuron, our approach resembles previous uses of squared Gaussian process or kernel methods, but allowing the hidden layer to be learnt allows for additional flexibility.
+In many cases of interest, the integrated intensity function admits a closed form and can be computed in quadratic time in the number of hidden neurons.
+We enumerate a far more extensive number of such cases than has previously been discussed.
+Our approach is more memory and time efficient than naive implementations of squared or exponentiated kernel methods or Gaussian processes.
+Maximum likelihood and maximum a posteriori estimates in a reparameterisation of the final layer of the intensity function can be obtained by solving a (strongly) convex optimisation problem using projected gradient descent.
+We demonstrate SNEPPPs on real and synthetic benchmarks, and provide a software implementation. \ No newline at end of file diff --git a/data/2024/aaai/Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration b/data/2024/aaai/Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration new file mode 100644 index 0000000000..52525821cb --- /dev/null +++ b/data/2024/aaai/Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration @@ -0,0 +1,2 @@ +Human motion prediction consists of forecasting future body poses from historically observed sequences. It is a longstanding challenge due to motion's complex dynamics and uncertainty. Existing methods focus on building up complicated neural networks to model the motion dynamics. The predicted results are required to be strictly similar to the training samples with an L2 loss in the current training pipeline. However, little attention has been paid to the uncertainty property, which is crucial to the prediction task. We argue that the recorded motion in training data could be an observation of a possible future, rather than a predetermined result. In addition, existing works calculate the prediction error on each future frame equally during training, while recent work indicated that different frames could play different roles. In this work, a novel computationally efficient encoder-decoder model with uncertainty consideration is proposed, which could learn proper characteristics for future frames by a dynamic function. Experimental results on benchmark datasets demonstrate that our uncertainty consideration approach has obvious advantages both quantitatively and qualitatively. Moreover, the proposed method could produce motion sequences with much better quality that avoid the intractable shaking artefacts. We believe our work could provide a novel perspective for considering uncertainty in the general motion prediction task and encourage further studies in this field. The code will be available at
+https://github.com/Motionpre/Adaptive-Salient-Loss-SAGGB. \ No newline at end of file diff --git a/data/2024/aaai/ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment b/data/2024/aaai/ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment new file mode 100644 index 0000000000..41b515ff9c --- /dev/null +++ b/data/2024/aaai/ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment @@ -0,0 +1,5 @@ +The objective of stylized speech-driven facial animation is to create animations that encapsulate specific emotional expressions.
Existing methods often depend on pre-established emotional labels or facial expression templates, which may limit the necessary flexibility for accurately conveying user intent.
+In this research, we introduce a technique that enables the control of arbitrary styles by leveraging natural language as emotion prompts. This technique presents benefits in terms of both flexibility and user-friendliness.
+To realize this objective, we initially construct a Text-Expression Alignment Dataset (TEAD), wherein each facial expression is paired with several prompt-like descriptions. We propose an innovative automatic annotation method, supported by ChatGPT, to expedite the dataset construction, thereby eliminating the substantial expense of manual annotation.
+Following this, we utilize TEAD to train a CLIP-based model, termed ExpCLIP, which encodes text and facial expressions into semantically aligned style embeddings. The embeddings are subsequently integrated into the facial animation generator to yield expressive and controllable facial animations. Given the limited diversity of facial emotions in existing speech-driven facial animation training data, we further introduce an effective Expression Prompt Augmentation (EPA) mechanism to enable the animation generator to support unprecedented richness in style control.
+Comprehensive experiments illustrate that our method accomplishes expressive facial animation generation and offers enhanced flexibility in effectively conveying the desired style. \ No newline at end of file diff --git a/data/2024/aaai/Expand-and-Quantize: Unsupervised Semantic Segmentation Using High-Dimensional Space and Product Quantization b/data/2024/aaai/Expand-and-Quantize: Unsupervised Semantic Segmentation Using High-Dimensional Space and Product Quantization new file mode 100644 index 0000000000..17dc5bcb95 --- /dev/null +++ b/data/2024/aaai/Expand-and-Quantize: Unsupervised Semantic Segmentation Using High-Dimensional Space and Product Quantization @@ -0,0 +1,6 @@ +Unsupervised semantic segmentation (USS) aims to discover and recognize meaningful categories without any labels.
+For a successful USS, two key abilities are required: 1) information compression and 2) clustering capability.
+Previous methods have relied on feature dimension reduction for information compression; however, this approach may hinder the process of clustering.
+In this paper, we propose a novel USS framework called Expand-and-Quantize Unsupervised Semantic Segmentation (EQUSS), which combines the benefits of high-dimensional spaces for better clustering and product quantization for effective information compression.
+Our extensive experiments demonstrate that EQUSS achieves state-of-the-art results on three standard benchmarks.
+In addition, we analyze the entropy of USS features, which is the first step towards understanding USS from the perspective of information theory. \ No newline at end of file diff --git a/data/2024/aaai/ExpeL: LLM Agents Are Experiential Learners b/data/2024/aaai/ExpeL: LLM Agents Are Experiential Learners new file mode 100644 index 0000000000..cfe6b19f5d --- /dev/null +++ b/data/2024/aaai/ExpeL: LLM Agents Are Experiential Learners @@ -0,0 +1 @@ +The recent surge in research interest in applying large language models (LLMs) to decision-making tasks has flourished by leveraging the extensive world knowledge embedded in LLMs.
While there is a growing demand to tailor LLMs for custom decision-making tasks, finetuning them for specific tasks is resource-intensive and may diminish the model's generalization capabilities. Moreover, state-of-the-art language models like GPT-4 and Claude are primarily accessible through API calls, with their parametric weights remaining proprietary and unavailable to the public. This scenario emphasizes the growing need for new methodologies that allow learning from agent experiences without requiring parametric updates. To address these problems, we introduce the Experiential Learning (ExpeL) agent. Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks. At inference, the agent recalls its extracted insights and past experiences to make informed decisions. Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments. \ No newline at end of file diff --git a/data/2024/aaai/Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders b/data/2024/aaai/Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders new file mode 100644 index 0000000000..a9ddc2ed99 --- /dev/null +++ b/data/2024/aaai/Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders @@ -0,0 +1 @@ +Recent advances in vision language pretraining (VLP) have been largely attributed to the large-scale data collected from the web. However, uncurated datasets contain weakly correlated image-text pairs, causing data inefficiency. To address the issue, knowledge distillation has been explored at the expense of extra image and text momentum encoders to generate teaching signals for misaligned image-text pairs. In this paper, our goal is to resolve the misalignment problem with an efficient distillation framework. To this end, we propose ECLIPSE: Expediting Contrastive Language-Image Pretraining with Self-distilled Encoders. ECLIPSE features a distinctive distillation architecture wherein a shared text encoder is utilized between an online image encoder and a momentum image encoder. This strategic design choice enables the distillation to operate within a unified projected space of text embedding, resulting in better performance. Based on the unified text embedding space, ECLIPSE compensates for the additional computational cost of the momentum image encoder by expediting the online image encoder. Through our extensive experiments, we validate that there is a sweet spot between expedition and distillation where the partial view from the expedited online image encoder interacts complementarily with the momentum teacher. As a result, ECLIPSE outperforms its counterparts while achieving substantial acceleration in inference speed. \ No newline at end of file diff --git a/data/2024/aaai/Explainable Earnings Call Representation Learning (Student Abstract) b/data/2024/aaai/Explainable Earnings Call Representation Learning (Student Abstract) new file mode 100644 index 0000000000..a769ea61de --- /dev/null +++ b/data/2024/aaai/Explainable Earnings Call Representation Learning (Student Abstract) @@ -0,0 +1 @@ +Earnings call transcripts hold valuable insights that are vital for investors and analysts when making informed decisions.
However, extracting these insights from lengthy and complex transcripts can be a challenging task. The traditional manual examination is not only time-consuming but also prone to errors and biases. Deep learning-based representation learning methods have emerged as promising and automated approaches to tackle this problem. Nevertheless, they may encounter significant challenges, such as the unreliability of the representation encoding process and certain domain-specific requirements in the context of finance. To address these issues, we propose a novel transcript representation learning model. Our model leverages the structural information of transcripts to effectively extract key insights, while endowing the model with explainability via a variational information bottleneck. Extensive experiments on two downstream financial tasks demonstrate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Explainable Origin-Destination Crowd Flow Interpolation via Variational Multi-Modal Recurrent Graph Auto-Encoder b/data/2024/aaai/Explainable Origin-Destination Crowd Flow Interpolation via Variational Multi-Modal Recurrent Graph Auto-Encoder new file mode 100644 index 0000000000..6492a23ee2 --- /dev/null +++ b/data/2024/aaai/Explainable Origin-Destination Crowd Flow Interpolation via Variational Multi-Modal Recurrent Graph Auto-Encoder @@ -0,0 +1 @@ +Origin-destination (OD) crowd flow, if more accurately inferred at a fine-grained level, has the potential to enhance the efficacy of various urban applications. In practice, however, mining OD crowd flow effectively requires spatially interpolating it, owing to the inevitable missing values. This problem is further complicated by the inherently scarce and noisy nature of OD crowd flow data. In this paper, we propose an uncertainty-aware interpolative and explainable framework, namely UApex, for realizing reliable and trustworthy OD crowd flow interpolation. Specifically, we first design a Variational Multi-modal Recurrent Graph Auto-Encoder (VMR-GAE) for uncertainty-aware OD crowd flow interpolation. A key idea here is to formulate the problem as semi-supervised learning on directed graphs. Next, to mitigate the data scarcity, we incorporate a distribution alignment mechanism that can introduce supplementary modalities into variational inference. Then, a dedicated decoder with a Poisson prior is proposed for OD crowd flow interpolation. Moreover, to make VMR-GAE more trustworthy, we develop an efficient and uncertainty-aware explainer that can provide explanations from the spatiotemporal topology perspective via the Shapley value. Extensive experiments on two real-world datasets validate that VMR-GAE outperforms the state-of-the-art baselines. Also, an exploratory empirical study shows that the proposed explainer can generate meaningful spatiotemporal explanations. \ No newline at end of file diff --git a/data/2024/aaai/Explaining Generalization Power of a DNN Using Interactive Concepts b/data/2024/aaai/Explaining Generalization Power of a DNN Using Interactive Concepts new file mode 100644 index 0000000000..e0d6128774 --- /dev/null +++ b/data/2024/aaai/Explaining Generalization Power of a DNN Using Interactive Concepts @@ -0,0 +1 @@ +This paper explains the generalization power of a deep neural network (DNN) from the perspective of interactions.
Although there is no universally accepted definition of the concepts encoded by a DNN, the sparsity of interactions in a DNN has been proved, i.e., the output score of a DNN can be well explained by a small number of interactions between input variables. In this way, to some extent, we can consider such interactions as interactive concepts encoded by the DNN. Therefore, in this paper, we derive an analytic explanation of the inconsistency of concepts of different complexities. This may shed new light on using the generalization power of concepts to explain the generalization power of the entire DNN. Besides, we discover that a DNN with stronger generalization power usually learns simple concepts more quickly and encodes fewer complex concepts. We also discover the detouring dynamics of learning complex concepts, which explains both the high learning difficulty and the low generalization power of complex concepts. The code will be released when the paper is accepted. \ No newline at end of file diff --git a/data/2024/aaai/Explicit Visual Prompts for Visual Object Tracking b/data/2024/aaai/Explicit Visual Prompts for Visual Object Tracking new file mode 100644 index 0000000000..407604d441 --- /dev/null +++ b/data/2024/aaai/Explicit Visual Prompts for Visual Object Tracking @@ -0,0 +1,3 @@ +How to effectively exploit spatio-temporal information is crucial to capturing target appearance changes in visual tracking. However, most deep learning-based trackers mainly focus on designing a complicated appearance model or template updating strategy, while lacking the exploitation of context between consecutive frames and thus entailing the when-and-how-to-update dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed EVPTrack. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without focusing on updating templates. As a result, we can not only alleviate the challenge of when-to-update, but also avoid the hyper-parameters associated with updating strategies. Then, we utilize the spatio-temporal tokens to generate explicit visual prompts that facilitate inference in the current frame. The prompts are fed into a transformer encoder together with the image tokens without additional processing.
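A minimal sketch of the token handling just described, i.e. prompt tokens concatenated with image tokens and passed jointly to a transformer encoder; the dimensions, module choices, and names are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

embed_dim, n_img_tokens, n_prompt_tokens = 256, 196, 4  # assumed sizes
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)

image_tokens = torch.randn(1, n_img_tokens, embed_dim)      # stand-in for patch embeddings
prompt_tokens = torch.randn(1, n_prompt_tokens, embed_dim)  # stand-in for spatio-temporal prompts

# Prompts are simply concatenated with the image tokens and encoded jointly,
# with no extra processing applied to the prompt tokens.
fused = encoder(torch.cat([prompt_tokens, image_tokens], dim=1))
print(fused.shape)  # torch.Size([1, 200, 256])
```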
\ No newline at end of file diff --git a/data/2024/aaai/Explicitly Perceiving and Preserving the Local Geometric Structures for 3D Point Cloud Attack b/data/2024/aaai/Explicitly Perceiving and Preserving the Local Geometric Structures for 3D Point Cloud Attack new file mode 100644 index 0000000000..acd175df1e --- /dev/null +++ b/data/2024/aaai/Explicitly Perceiving and Preserving the Local Geometric Structures for 3D Point Cloud Attack @@ -0,0 +1 @@ +Deep learning models for point clouds have shown to be vulnerable to adversarial attacks, which have received increasing attention in various safety-critical applications such as autonomous driving, robotics, and surveillance. Existing 3D attack methods generally employ global distance losses to implicitly constrain the point-wise perturbations for optimization. However, these simple losses are quite difficult to accurately measure and restrict the proper 3D geometry as point clouds are highly structured. Although few recent works try to exploit additional shape-aware surface knowledge to globally constrain the point position, they still fail to preserve the detailed point-to-point geometric dependency in different local regions. To this end, in this paper, we propose a novel Multi-grained Geometry-aware Attack (MGA), which explicitly captures the local topology characteristics in different 3D regions for adversarial constraint. Specifically, we first develop multi-scale spectral local filter banks adapting to different 3D object shapes to explore potential geometric structures in local regions. Considering that objects may contain complex geometries, we then extend each filter bank into multi-layer ones to gradually capture the topology contexts of the same region in a coarse-to-fine manner. Hence, the focused local geometric structures will be highlighted in the coefficients calculated by the filtering process. At last, by restricting these coefficients between benign and adversarial samples, our MGA is able to properly measure and preserve the detailed geometry contexts in the whole 3D object with trivial perturbations. Extensive experiments demonstrate that our attack can achieve superior performance on various 3D classification models, with satisfying adversarial imperceptibility and strong resistance to different defense methods. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning (Abstract Reprint) b/data/2024/aaai/Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning (Abstract Reprint) new file mode 100644 index 0000000000..20c392d90a --- /dev/null +++ b/data/2024/aaai/Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning (Abstract Reprint) @@ -0,0 +1 @@ +Offline reinforcement learning—learning a policy from a batch of data—is known to be hard for general MDPs. These results motivate the need to look at specific classes of MDPs where offline reinforcement learning might be feasible. In this work, we explore a restricted class of MDPs to obtain guarantees for offline reinforcement learning. The key property, which we call Action Impact Regularity (AIR), is that actions primarily impact a part of the state (an endogenous component) and have limited impact on the remaining part of the state (an exogenous component). AIR is a strong assumption, but it nonetheless holds in a number of real-world domains including financial markets. 
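To make the AIR property above concrete, here is a toy sketch of a state split into an endogenous component (moved by the action) and an exogenous component (evolving on its own); the dynamics and reward below are invented purely for illustration and are not from the paper:

```python
import numpy as np

class ToyAIREnv:
    """Toy sketch: the action changes only the endogenous component,
    while the exogenous component (e.g. a market price) evolves independently."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.endo = 0.0    # e.g. the agent's inventory (endogenous)
        self.exo = 100.0   # e.g. an exogenous price process

    def step(self, action):
        self.endo += action                            # action impacts endogenous state only
        self.exo *= np.exp(0.01 * self.rng.normal())   # exogenous state ignores the action
        reward = -action * self.exo                    # cost of buying `action` units at the current price
        return (self.endo, self.exo), reward
```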
We discuss algorithms that exploit the AIR property, and provide a theoretical analysis for an algorithm based on Fitted-Q Iteration. Finally, we demonstrate that the algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in simulated and real world environments where the regularity holds. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Auxiliary Caption for Video Grounding b/data/2024/aaai/Exploiting Auxiliary Caption for Video Grounding new file mode 100644 index 0000000000..032a4fab30 --- /dev/null +++ b/data/2024/aaai/Exploiting Auxiliary Caption for Video Grounding @@ -0,0 +1 @@ +Video grounding aims to locate a moment of interest matching the given query sentence from an untrimmed video. Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset. In this paper, we contend that exploiting easily available captions which describe general actions, i.e., auxiliary captions defined in our paper, will significantly boost the performance. To this end, we propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS). To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and query sentences into temporal space and fuse them into visual representations. Considering the gap between auxiliary captions and ground truth, we propose Asymmetric Cross-modal Contrastive Learning (ACCL) for constructing more negative pairs to maximize cross-modal mutual information. Extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Data Geometry in Machine Learning b/data/2024/aaai/Exploiting Data Geometry in Machine Learning new file mode 100644 index 0000000000..a66ee454a1 --- /dev/null +++ b/data/2024/aaai/Exploiting Data Geometry in Machine Learning @@ -0,0 +1 @@ +A key challenge in Machine Learning (ML) is the identification of geometric structure in high-dimensional data. Most algorithms assume that data lives in a high-dimensional vector space; however, many applications involve non-Euclidean data, such as graphs, strings and matrices, or data whose structure is determined by symmetries in the underlying system. Here, we discuss methods for identifying geometric structure in data and how leveraging data geometry can give rise to efficient ML algorithms with provable guarantees. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Discrepancy in Feature Statistic for Out-of-Distribution Detection b/data/2024/aaai/Exploiting Discrepancy in Feature Statistic for Out-of-Distribution Detection new file mode 100644 index 0000000000..643f80ac90 --- /dev/null +++ b/data/2024/aaai/Exploiting Discrepancy in Feature Statistic for Out-of-Distribution Detection @@ -0,0 +1,2 @@ +Recent studies on out-of-distribution (OOD) detection focus on designing models or scoring functions that can effectively distinguish between unseen OOD data and in-distribution (ID) data.
In this paper, we propose a simple yet novel approach
+to OOD detection by leveraging the phenomenon that the average of feature vector elements from a convolutional neural network (CNN) is typically larger for ID data than for OOD data. Specifically, the average of feature vector elements is used as part of the scoring function to further separate OOD data from ID data. We also provide mathematical analysis to explain this phenomenon. Experimental evaluations demonstrate that, when combined with a strong baseline, our method can achieve state-of-the-art performance on several OOD detection benchmarks. Furthermore, our method can be easily integrated into various CNN architectures and requires less computation. Source code address: https://github.com/SYSU-MIA-GROUP/statistical_discrepancy_ood. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Geometry for Treatment Effect Estimation via Optimal Transport b/data/2024/aaai/Exploiting Geometry for Treatment Effect Estimation via Optimal Transport new file mode 100644 index 0000000000..761cd9be18 --- /dev/null +++ b/data/2024/aaai/Exploiting Geometry for Treatment Effect Estimation via Optimal Transport @@ -0,0 +1 @@ +Estimating treatment effects from observational data suffers from the issue of confounding bias, which is induced by the imbalanced confounder distributions between the treated and control groups. As an effective approach, re-weighting learns a group of sample weights to balance the confounder distributions. Existing methods of re-weighting highly rely on a propensity score model or moment alignment. However, for complex real-world data, it is difficult to obtain an accurate propensity score prediction. Although moment alignment is free of learning a propensity score model, accurate estimation for high-order moments is computationally difficult and still remains an open challenge, and first- and second-order moments are insufficient to align the distributions and are easily misled by outliers. In this paper, we exploit geometry to capture the intrinsic structure involved in data for balancing the confounder distributions, so that confounding bias can be reduced even with outliers. To achieve this, we construct a connection between treatment effect estimation and optimal transport, a powerful tool to capture geometric information. After that, we propose an optimal transport model to learn sample weights by extracting geometry from confounders, in which geometric information between groups and within groups is leveraged for better confounder balancing. A projected mirror descent algorithm is employed to solve the derived optimization problem. Experimental studies on both synthetic and real-world datasets demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Label Skews in Federated Learning with Model Concatenation b/data/2024/aaai/Exploiting Label Skews in Federated Learning with Model Concatenation new file mode 100644 index 0000000000..59f918d5b3 --- /dev/null +++ b/data/2024/aaai/Exploiting Label Skews in Federated Learning with Model Concatenation @@ -0,0 +1 @@ +Federated Learning (FL) has emerged as a promising solution to perform deep learning on different data owners without exchanging raw data. However, non-IID data has been a key challenge in FL, which could significantly degrade the accuracy of the final model. Among different non-IID types, label skews have been challenging and common in image classification and other tasks.
Instead of averaging the local models in most previous studies, we propose FedConcat, a simple and effective approach that concatenates these local models as the base of the global model to effectively aggregate the local knowledge. To reduce the size of the global model, we adopt the clustering technique to group the clients by their label distributions and collaboratively train a model inside each cluster. We theoretically analyze the advantage of concatenation over averaging by analyzing the information bottleneck of deep neural networks. Experimental results demonstrate that FedConcat achieves significantly higher accuracy than previous state-of-the-art FL methods in various heterogeneous label skew distribution settings and meanwhile has lower communication costs. Our code is publicly available at https://github.com/sjtudyq/FedConcat. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Polarized Material Cues for Robust Car Detection b/data/2024/aaai/Exploiting Polarized Material Cues for Robust Car Detection new file mode 100644 index 0000000000..de5d9145fc --- /dev/null +++ b/data/2024/aaai/Exploiting Polarized Material Cues for Robust Car Detection @@ -0,0 +1 @@ +Car detection is an important task that serves as a crucial prerequisite for many automated driving functions. The large variations in lighting/weather conditions and vehicle densities of the scenes pose significant challenges to existing car detection algorithms to meet the highly accurate perception demand for safety, due to the unstable/limited color information, which impedes the extraction of meaningful/discriminative features of cars. In this work, we present a novel learning-based car detection method that leverages trichromatic linear polarization as an additional cue to disambiguate such challenging cases. A key observation is that polarization, characteristic of the light wave, can robustly describe intrinsic physical properties of the scene objects in various imaging conditions and is strongly linked to the nature of materials for cars (e.g., metal and glass) and their surrounding environment (e.g., soil and trees), thereby providing reliable and discriminative features for robust car detection in challenging scenes. To exploit polarization cues, we first construct a pixel-aligned RGB-Polarization car detection dataset, which we subsequently employ to train a novel multimodal fusion network. Our car detection network dynamically integrates RGB and polarization features in a request-and-complement manner and can explore the intrinsic material properties of cars across all learning samples. We extensively validate our method and demonstrate that it outperforms state-of-the-art detection methods. Experimental results show that polarization is a powerful cue for car detection. Our code is available at https://github.com/wind1117/AAAI24-PCDNet. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting the Social-Like Prior in Transformer for Visual Reasoning b/data/2024/aaai/Exploiting the Social-Like Prior in Transformer for Visual Reasoning new file mode 100644 index 0000000000..9d8d45edc0 --- /dev/null +++ b/data/2024/aaai/Exploiting the Social-Like Prior in Transformer for Visual Reasoning @@ -0,0 +1 @@ +Benefiting from instrumental global dependency modeling of self-attention (SA), transformer-based approaches have become the pivotal choices for numerous downstream visual reasoning tasks, such as visual question answering (VQA) and referring expression comprehension (REC). 
However, some studies have recently suggested that SA tends to suffer from rank collapse, thereby inevitably leading to representation degradation as the transformer layer goes deeper. Inspired by social network theory, we attempt to make an analogy between social behavior and regional information interaction in SA, and harness two crucial notions of structural hole and degree centrality in social networks to explore possible optimizations of SA learning, which naturally yields two plug-and-play social-like modules. Based on the structural hole notion, the former module makes information interaction in SA more structured, which effectively avoids redundant information aggregation and global feature homogenization for better rank remedy, followed by the latter module to comprehensively characterize and refine the representation discrimination via considering the degree centrality of regions and the transitivity of relations. Without bells and whistles, our model outperforms a range of baselines by a noticeable margin when considering our social-like prior on five benchmarks in VQA and REC tasks, and a series of explanatory results are showcased to sufficiently reveal the social-like behaviors in SA. \ No newline at end of file diff --git a/data/2024/aaai/Explore 3D Dance Generation via Reward Model from Automatically-Ranked Demonstrations b/data/2024/aaai/Explore 3D Dance Generation via Reward Model from Automatically-Ranked Demonstrations new file mode 100644 index 0000000000..00f12ceb32 --- /dev/null +++ b/data/2024/aaai/Explore 3D Dance Generation via Reward Model from Automatically-Ranked Demonstrations @@ -0,0 +1 @@ +This paper presents an Exploratory 3D Dance generation framework, E3D2, designed to address the exploration capability deficiency in existing music-conditioned 3D dance generation models. Current models often generate monotonous and simplistic dance sequences that misalign with human preferences because they lack exploration capabilities. The E3D2 framework involves a reward model trained from automatically-ranked dance demonstrations, which then guides the reinforcement learning process. This approach encourages the agent to explore and generate high-quality and diverse dance movement sequences. The soundness of the reward model is both theoretically and experimentally validated. Empirical experiments demonstrate the effectiveness of E3D2 on the AIST++ dataset. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Base-Class Suppression with Prior Guidance for Bias-Free One-Shot Object Detection b/data/2024/aaai/Exploring Base-Class Suppression with Prior Guidance for Bias-Free One-Shot Object Detection new file mode 100644 index 0000000000..e8de13f12c --- /dev/null +++ b/data/2024/aaai/Exploring Base-Class Suppression with Prior Guidance for Bias-Free One-Shot Object Detection @@ -0,0 +1 @@ +One-shot object detection (OSOD) aims to detect all object instances towards the given category specified by a query image. Most existing studies in OSOD endeavor to establish effective cross-image correlation with limited query information, while ignoring the problems of the model bias towards the base classes and the generalization degradation on the novel classes. Observing this, we propose a novel algorithm, namely the Base-class Suppression with Prior Guidance (BSPG) network, to achieve bias-free OSOD. Specifically, the objects of base categories can be detected by a base-class predictor and eliminated by a base-class suppression module (BcS).
Moreover, a prior guidance module (PG) is designed to calculate the correlation of high-level features in a non-parametric manner, producing a class-agnostic prior map with unbiased semantic information to guide the subsequent detection process. Equipped with the proposed two modules, we endow the model with a strong discriminative ability to distinguish the target objects from distractors belonging to the base classes. Extensive experiments show that our method outperforms the previous techniques by a large margin and achieves new state-of-the-art performance under various evaluation settings. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Channel-Aware Typical Features for Out-of-Distribution Detection b/data/2024/aaai/Exploring Channel-Aware Typical Features for Out-of-Distribution Detection new file mode 100644 index 0000000000..41a76cb809 --- /dev/null +++ b/data/2024/aaai/Exploring Channel-Aware Typical Features for Out-of-Distribution Detection @@ -0,0 +1 @@ +Detecting out-of-distribution (OOD) data is essential to ensure the reliability of machine learning models when deployed in real-world scenarios. Different from most previous test-time OOD detection methods that focus on designing OOD scores, we delve into the challenges in OOD detection from the perspective of typicality and regard the feature’s high-probability region as the feature’s typical set. However, the existing typical-feature-based OOD detection method implies an assumption: the proportion of typical feature sets for each channel is fixed. According to our experimental analysis, each channel contributes differently to OOD detection. Adopting a fixed proportion for all channels results in several channels losing too many typical features or incorporating too many abnormal features, resulting in low performance. Therefore, exploring the channel-aware typical features is crucial to better-separating ID and OOD data. Driven by this insight, we propose expLoring channel-Aware tyPical featureS (LAPS). Firstly, LAPS obtains the channel-aware typical set by calibrating the channel-level typical set with the global typical set from the mean and standard deviation. Then, LAPS rectifies the features into channel-aware typical sets to obtain channel-aware typical features. Finally, LAPS leverages the channel-aware typical features to calculate the energy score for OOD detection. Theoretical and visual analyses verify that LAPS achieves a better bias-variance trade-off. Experiments verify the effectiveness and generalization of LAPS under different architectures and OOD scores. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Domain Incremental Video Highlights Detection with the LiveFood Benchmark b/data/2024/aaai/Exploring Domain Incremental Video Highlights Detection with the LiveFood Benchmark new file mode 100644 index 0000000000..1f65916a72 --- /dev/null +++ b/data/2024/aaai/Exploring Domain Incremental Video Highlights Detection with the LiveFood Benchmark @@ -0,0 +1 @@ +Video highlights detection (VHD) is an active research field in computer vision, aiming to locate the most user-appealing clips given raw video inputs. However, most VHD methods are based on the closed world assumption, i.e., a fixed number of highlight categories is defined in advance and all training data are available beforehand. Consequently, existing methods have poor scalability with respect to increasing highlight domains and training data. 
To address the above issues, we propose a novel video highlights detection method named Global Prototype Encoding (GPE) to learn incrementally for adapting to new domains via parameterized prototypes. To facilitate this new research direction, we collect a finely annotated dataset termed LiveFood, including over 5,100 live gourmet videos that consist of four domains: ingredients, cooking, presentation, and eating. To the best of our knowledge, this is the first work to explore video highlights detection in the incremental learning setting, opening up new ground for applying VHD in practical scenarios where both the concerned highlight domains and training data increase over time. We demonstrate the effectiveness of GPE through extensive experiments. Notably, GPE surpasses popular domain incremental learning methods on LiveFood, achieving significant mAP improvements on all domains. Concerning the classic datasets, GPE also yields performance comparable to prior art. The code is available at: https://github.com/ForeverPs/IncrementalVHD_GPE. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning of Large Language Models b/data/2024/aaai/Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning of Large Language Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Exploring Gradient Explosion in Generative Adversarial Imitation Learning: A Probabilistic Perspective b/data/2024/aaai/Exploring Gradient Explosion in Generative Adversarial Imitation Learning: A Probabilistic Perspective new file mode 100644 index 0000000000..8a09c4ec3a --- /dev/null +++ b/data/2024/aaai/Exploring Gradient Explosion in Generative Adversarial Imitation Learning: A Probabilistic Perspective @@ -0,0 +1 @@ +Generative Adversarial Imitation Learning (GAIL) stands as a cornerstone approach in imitation learning. This paper investigates the gradient explosion in two types of GAIL: GAIL with deterministic policy (DE-GAIL) and GAIL with stochastic policy (ST-GAIL). We begin with the observation that the training can be highly unstable for DE-GAIL at the beginning of the training phase and end up diverging. Conversely, the ST-GAIL training trajectory remains consistent, reliably converging. To shed light on these disparities, we provide an explanation from a theoretical perspective. By establishing a probabilistic lower bound for GAIL, we demonstrate that gradient explosion is an inevitable outcome for DE-GAIL due to occasionally large expert-imitator policy disparity, whereas ST-GAIL does not suffer from this issue. To substantiate our assertion, we illustrate how modifications in the reward function can mitigate the gradient explosion challenge. Finally, we propose CREDO, a simple yet effective strategy that clips the reward function during the training phase, allowing GAIL to enjoy high data efficiency and stable trainability.
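CREDO is only characterized above as clipping the reward during training; as a loose, generic illustration of that idea (the discriminator-derived reward form and the clipping bounds are assumptions, not the paper's recipe):

```python
import numpy as np

def clipped_gail_reward(d_prob, r_min=-5.0, r_max=5.0, eps=1e-8):
    """Generic sketch: turn a discriminator probability into an imitation reward
    and clip it to a bounded range so occasional large expert-imitator
    disparities cannot blow up the gradient. Bounds are illustrative assumptions."""
    raw = np.log(d_prob + eps) - np.log(1.0 - d_prob + eps)  # one common GAIL-style reward
    return float(np.clip(raw, r_min, r_max))

print(clipped_gail_reward(0.999999))  # very confident discriminator, reward clipped to 5.0
```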
\ No newline at end of file diff --git a/data/2024/aaai/Exploring One-Shot Semi-supervised Federated Learning with Pre-trained Diffusion Models b/data/2024/aaai/Exploring One-Shot Semi-supervised Federated Learning with Pre-trained Diffusion Models new file mode 100644 index 0000000000..2a81a6ba7d --- /dev/null +++ b/data/2024/aaai/Exploring One-Shot Semi-supervised Federated Learning with Pre-trained Diffusion Models @@ -0,0 +1 @@ +Recently, semi-supervised federated learning (semi-FL) has been proposed to handle the commonly seen real-world scenarios with labeled data on the server and unlabeled data on the clients. However, existing methods face several challenges such as communication costs, data heterogeneity, and training pressure on client devices. To address these challenges, we introduce the powerful diffusion models (DM) into semi-FL and propose FedDISC, a Federated Diffusion-Inspired Semi-supervised Co-training method. Specifically, we first extract prototypes of the labeled server data and use these prototypes to predict pseudo-labels of the client data. For each category, we compute the cluster centroids and domain-specific representations to signify the semantic and stylistic information of their distributions. After adding noise, these representations are sent back to the server, which uses the pre-trained DM to generate synthetic datasets complying with the client distributions and trains a global model on them. With the assistance of the vast knowledge within the DM, the synthetic datasets have comparable quality and diversity to the client images, subsequently enabling the training of global models that achieve performance equivalent to or even surpassing the ceiling of supervised centralized training. FedDISC works within one communication round, does not require any local training, and involves very minimal information uploading, greatly enhancing its practicality. Extensive experiments on three large-scale datasets demonstrate that FedDISC effectively addresses the semi-FL problem on non-IID clients and outperforms the compared SOTA methods. Sufficient visualization experiments also illustrate that the synthetic dataset generated by FedDISC exhibits comparable diversity and quality to the original client dataset, with a negligible possibility of leaking privacy-sensitive information of the clients. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation b/data/2024/aaai/Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation new file mode 100644 index 0000000000..7683f30f15 --- /dev/null +++ b/data/2024/aaai/Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation @@ -0,0 +1 @@ +Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs). However, a systematic examination of various quantization schemes, model families, and quantization bit precision has been absent from the literature. In this paper, we conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization using diverse methods such as round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. We apply these methods to two distinct model families with parameters ranging from 125M to 176B.
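Round-to-nearest (RTN), listed above as the simplest of these baselines, can be sketched in a few lines; the per-output-channel symmetric INT4 scheme below is a generic editorial illustration, not the paper's code, and the channel granularity is an assumption:

```python
import numpy as np

def rtn_quantize(weight, n_bits=4):
    """Generic round-to-nearest (RTN) sketch: symmetric per-output-channel
    weight quantization. Returns the dequantized weights and the scales."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for INT4
    scale = np.abs(weight).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard against all-zero rows
    q = np.clip(np.round(weight / scale), -qmax - 1, qmax)
    return q * scale, scale

w = np.random.randn(8, 16).astype(np.float32)
w_hat, _ = rtn_quantize(w, n_bits=4)
print(np.abs(w - w_hat).max())  # quantization error introduced by RTN
```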
Our contributions include: (1) a sensitivity analysis revealing that activation quantization is generally more susceptible to weight quantization, with smaller models often outperforming larger models in terms of activation quantization; (2) an evaluation and comparison of existing PTQ methods to optimize model size reduction while minimizing the impact on accuracy, revealing that none of the current methods can achieve the original model quality for quantization with either INT4-weight or INT4-weight-and-INT8-activation; (3) based on these insights, we propose an optimized method called Low-Rank Compensation (LoRC), which employs low-rank matrices to enhance model quality recovery with a minimal increase in model size. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection b/data/2024/aaai/Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection new file mode 100644 index 0000000000..1ede98348d --- /dev/null +++ b/data/2024/aaai/Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection @@ -0,0 +1 @@ +Human-Object Interaction (HOI) detection plays a vital role in scene understanding, which aims to predict the HOI triplet in the form of ⟨human, action, object⟩. Existing methods mainly extract multi-modal features (e.g., appearance, object semantics, human pose) and then fuse them together to directly predict HOI triplets. However, most of these methods focus on self-triplet aggregation, but ignore the potential cross-triplet dependencies, resulting in ambiguity of action prediction. In this work, we propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI detection. Specifically, we regard each triplet proposal as a graph where Human and Object represent nodes and Action indicates the edge, to aggregate self-triplet correlation. Also, we try to explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations. Besides, we leverage the CLIP model to help our SCTC obtain interaction-aware features via knowledge distillation, which provides useful action clues for HOI detection. Extensive experiments on HICO-DET and V-COCO datasets verify the effectiveness of our proposed SCTC. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction b/data/2024/aaai/Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction new file mode 100644 index 0000000000..82df9b01e1 --- /dev/null +++ b/data/2024/aaai/Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction @@ -0,0 +1 @@ +Visual prompts have provided an efficient means of addressing visual cross-domain problems. Previous works introduce domain prompts to tackle the classification Test-Time Adaptation (TTA) problem by placing image-level prompts on the input and fine-tuning prompts for each target domain. However, since the image-level prompts mask out continuous spatial details in the prompt-allocated region, they suffer from inaccurate contextual information and limited domain knowledge extraction, particularly when dealing with dense prediction TTA problems. To overcome these challenges, we propose a novel Sparse Visual Domain Prompts (SVDP) approach, which applies minimal trainable parameters (e.g., 0.1%) to pixels across the entire image and reserves more spatial information of the input.
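A minimal sketch of the sparse pixel-level prompting just described; only the roughly 0.1% sparsity is taken from the sentence above, while the additive formulation, the random mask, and the names are assumptions for illustration:

```python
import torch

def apply_sparse_prompt(image, prompt, mask):
    """Add trainable prompt values only at the sparsely selected pixel locations."""
    return image + prompt * mask

C, H, W = 3, 64, 64
image = torch.rand(1, C, H, W)
prompt = torch.zeros(1, C, H, W, requires_grad=True)  # trainable sparse prompt
keep = torch.rand(1, 1, H, W) < 0.001                 # ~0.1% of pixels carry a prompt
prompted = apply_sparse_prompt(image, prompt, keep.float())
print(int(keep.sum()), prompted.shape)
```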
To better apply SVDP in extracting domain-specific knowledge, we introduce the Domain Prompt Placement (DPP) method to adaptively allocate the trainable parameters of SVDP to the pixels with large distribution shifts. Furthermore, recognizing that each target domain sample exhibits a unique domain shift, we design the Domain Prompt Updating (DPU) strategy to optimize prompt parameters differently for each sample, facilitating efficient adaptation to the target domain. Extensive experiments were conducted on widely-used TTA and continual TTA benchmarks, and our proposed method achieves state-of-the-art performance in both semantic segmentation and depth estimation tasks. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation b/data/2024/aaai/Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation new file mode 100644 index 0000000000..d9c6a5b37c --- /dev/null +++ b/data/2024/aaai/Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation @@ -0,0 +1 @@ +This paper tackles the problem of efficient and stable video semantic segmentation. While stability has been under-explored, prevalent work in efficient video semantic segmentation uses the keyframe paradigm. These methods efficiently process videos by only recomputing the low-level features and reusing high-level features computed at selected keyframes. In addition, the reused features stabilize the predictions across frames, thereby improving video consistency. However, dynamic scenes in the video can easily lead to misalignments between reused and recomputed features, which hampers performance. Moreover, relying on feature reuse to improve prediction consistency is brittle; an erroneous alignment of the features can easily lead to unstable predictions. Therefore, the keyframe paradigm exhibits a dilemma between stability and performance. We address this efficiency and stability challenge using a novel yet simple Temporal Feature Correlation (TFC) module. It uses the cosine similarity between two frames’ low-level features to inform the semantic label’s consistency across frames. Specifically, we selectively reuse label-consistent features across frames through linear interpolation and update others through sparse multi-scale deformable attention. As a result, we no longer directly reuse features to improve stability and thus effectively solve feature misalignment. This work provides a significant step towards efficient and stable video semantic segmentation. On the VSPW dataset, our method significantly improves the prediction consistency of image-based methods while being as fast and accurate. \ No newline at end of file diff --git a/data/2024/aaai/Exponent Relaxation of Polynomial Zonotopes and Its Applications in Formal Neural Network Verification b/data/2024/aaai/Exponent Relaxation of Polynomial Zonotopes and Its Applications in Formal Neural Network Verification new file mode 100644 index 0000000000..748e2b42c3 --- /dev/null +++ b/data/2024/aaai/Exponent Relaxation of Polynomial Zonotopes and Its Applications in Formal Neural Network Verification @@ -0,0 +1,9 @@ +Formal verification of neural networks is a challenging problem due to the complexity and nonlinearity of neural networks.
+ It has been shown that polynomial zonotopes can tightly enclose the output set of a neural network.
+ Unfortunately, the tight enclosure comes with additional complexity in the set representation, + thus, rendering subsequent operations expensive to compute, such as computing interval bounds and intersection checking. + To address this issue, we present a novel approach to restructure a polynomial zonotope to tightly enclose the original polynomial zonotope + while drastically reducing its complexity. + The restructuring is achieved by relaxing the exponents of the dependent factors of polynomial zonotopes and finding an appropriate approximation error. + We demonstrate the applicability of our approach on output sets of neural networks, + where we obtain tighter results in various subsequent operations, such as order reduction, zonotope enclosure, and range bounding. \ No newline at end of file diff --git a/data/2024/aaai/Exponential Hardness of Optimization from the Locality in Quantum Neural Networks b/data/2024/aaai/Exponential Hardness of Optimization from the Locality in Quantum Neural Networks new file mode 100644 index 0000000000..b35b2b66d0 --- /dev/null +++ b/data/2024/aaai/Exponential Hardness of Optimization from the Locality in Quantum Neural Networks @@ -0,0 +1 @@ +Quantum neural networks (QNNs) have become a leading paradigm for establishing near-term quantum applications in recent years. The trainability issue of QNNs has garnered extensive attention, spurring demand for a comprehensive analysis of QNNs in order to identify viable solutions. In this work, we propose a perspective that characterizes the trainability of QNNs based on their locality. We prove that the entire variation range of the loss function via adjusting any local quantum gate vanishes exponentially in the number of qubits with a high probability for a broad class of QNNs. This result reveals extra harsh constraints independent of gradients and unifies the restrictions on gradient-based and gradient-free optimizations naturally. We showcase the validity of our results with numerical simulations of representative models and examples. Our findings, as a fundamental property of random quantum circuits, deepen the understanding of the role of locality in QNNs and serve as a guideline for assessing the effectiveness of diverse training strategies for quantum neural networks. \ No newline at end of file diff --git a/data/2024/aaai/Exposing the Deception: Uncovering More Forgery Clues for Deepfake Detection b/data/2024/aaai/Exposing the Deception: Uncovering More Forgery Clues for Deepfake Detection new file mode 100644 index 0000000000..a3db4777ef --- /dev/null +++ b/data/2024/aaai/Exposing the Deception: Uncovering More Forgery Clues for Deepfake Detection @@ -0,0 +1,3 @@ +Deepfake technology has given rise to a spectrum of novel and compelling applications. Unfortunately, the widespread proliferation of high-fidelity fake videos has led to pervasive confusion and deception, shattering our faith that seeing is believing. One aspect that has been overlooked so far is that current deepfake detection approaches may easily fall into the trap of overfitting, focusing only on forgery clues within one or a few local regions. Moreover, existing works heavily rely on neural networks to extract forgery features, lacking theoretical constraints guaranteeing that sufficient forgery clues are extracted and superfluous features are eliminated. These deficiencies culminate in unsatisfactory accuracy and limited generalizability in real-life scenarios. 
+ +In this paper, we try to tackle these challenges through three designs: (1) We present a novel framework to capture broader forgery clues by extracting multiple non-overlapping local representations and fusing them into a global semantic-rich feature. (2) Based on the information bottleneck theory, we derive Local Information Loss to guarantee the orthogonality of local representations while preserving comprehensive task-relevant information. (3) Further, to fuse the local representations and remove task-irrelevant information, we arrive at a Global Information Loss through the theoretical analysis of mutual information. Empirically, our method achieves state-of-the-art performance on five benchmark datasets. Our code is available at https://github.com/QingyuLiu/Exposing-the-Deception; we hope it inspires further research. \ No newline at end of file diff --git a/data/2024/aaai/Expressive Forecasting of 3D Whole-Body Human Motions b/data/2024/aaai/Expressive Forecasting of 3D Whole-Body Human Motions new file mode 100644 index 0000000000..d8e4a93ff2 --- /dev/null +++ b/data/2024/aaai/Expressive Forecasting of 3D Whole-Body Human Motions @@ -0,0 +1,2 @@ +Human motion forecasting, with the goal of estimating future human behavior over a period of time, is a fundamental task in many real-world applications. However, existing works typically concentrate on foretelling the major joints of the human body without considering the delicate movements of the human hands. +In practical applications, hand gestures play an important role in human communication with the real world and express the primary intentions of human beings. In this work, we are the first to formulate the whole-body human pose forecasting task, which jointly predicts future body and gesture activities. Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) framework that aims to predict both coarse (body joints) and fine-grained (gestures) activities collaboratively, enabling expressive and cross-facilitated forecasting of 3D whole-body human motions. Specifically, our model involves two key constituents: cross-context alignment (XCA) and cross-context interaction (XCI). Considering the heterogeneous information within the whole body, XCA aims to align the latent features of various human components, while XCI focuses on effectively capturing the context interaction among the human components. We conduct extensive experiments on a newly-introduced large-scale benchmark and achieve state-of-the-art performance. The code is public for research purposes at https://github.com/Dingpx/EAI. \ No newline at end of file diff --git a/data/2024/aaai/Expressive Multi-Agent Communication via Identity-Aware Learning b/data/2024/aaai/Expressive Multi-Agent Communication via Identity-Aware Learning new file mode 100644 index 0000000000..2bb88d1bc7 --- /dev/null +++ b/data/2024/aaai/Expressive Multi-Agent Communication via Identity-Aware Learning @@ -0,0 +1 @@ +Information sharing through communication is essential for tackling complex multi-agent reinforcement learning tasks. Many existing multi-agent communication protocols can be viewed as instances of message passing graph neural networks (GNNs). However, due to the significantly limited expressive ability of the standard GNN method, the agent feature representations remain similar and indistinguishable even though the agents have different neighborhood structures.
This further results in the homogenization of agent behaviors and reduces the capability to solve tasks effectively. In this paper, we propose a multi-agent communication protocol via identity-aware learning (IDEAL), which explicitly enhances the distinguishability of agent feature representations to break the diversity bottleneck. Specifically, IDEAL extends existing multi-agent communication protocols by inductively considering the agents' identities during the message passing process. To obtain expressive feature representations for a given agent, IDEAL first extracts the ego network centered around that agent and then performs multiple rounds of heterogeneous message passing, where different parameter sets are applied to the central agent and the other surrounding agents within the ego network. IDEAL fosters expressive communication between agents and generates distinguishable feature representations, which promotes action diversity and individuality emergence. Experimental results on various benchmarks demonstrate IDEAL can be flexibly integrated into various multi-agent communication methods and enhances the corresponding performance. \ No newline at end of file diff --git a/data/2024/aaai/Expressive and Flexible Simulation of Information Spread Strategies in Social Networks Using Planning b/data/2024/aaai/Expressive and Flexible Simulation of Information Spread Strategies in Social Networks Using Planning new file mode 100644 index 0000000000..4b5e707817 --- /dev/null +++ b/data/2024/aaai/Expressive and Flexible Simulation of Information Spread Strategies in Social Networks Using Planning @@ -0,0 +1 @@ +In the digital age, understanding the dynamics of information spread and opinion formation within networks is paramount. This research introduces an innovative framework that combines the principles of opinion dynamics with the strategic capabilities of Automated Planning. We have developed, to the best of our knowledge, the first-ever numeric PDDL tailored for opinion dynamics. Our tool empowers users to visualize intricate networks, simulate the evolution of opinions, and strategically influence that evolution to achieve specific outcomes. By harnessing Automated Planning techniques, our framework offers a nuanced approach to devise sequences of actions tailored to transition a network from its current opinion landscape to a desired state. This holistic approach provides insights into the intricate interplay of individual nodes within a network and paves the way for targeted interventions. Furthermore, the tool facilitates human-AI collaboration, enabling users to not only understand information spread but also devise practical strategies to mitigate potential harmful outcomes arising from it. Demo Video link - https://tinyurl.com/3k7bp99h \ No newline at end of file diff --git a/data/2024/aaai/FACL-Attack: Frequency-Aware Contrastive Learning for Transferable Adversarial Attacks b/data/2024/aaai/FACL-Attack: Frequency-Aware Contrastive Learning for Transferable Adversarial Attacks new file mode 100644 index 0000000000..fe7705e570 --- /dev/null +++ b/data/2024/aaai/FACL-Attack: Frequency-Aware Contrastive Learning for Transferable Adversarial Attacks @@ -0,0 +1 @@ +Deep neural networks are known to be vulnerable to security risks due to the inherent transferable nature of adversarial examples. 
Despite the success of recent generative model-based attacks demonstrating strong transferability, it still remains a challenge to design an efficient attack strategy in a real-world strict black-box setting, where both the target domain and model architectures are unknown. In this paper, we seek to explore a feature contrastive approach in the frequency domain to generate adversarial examples that are robust in both cross-domain and cross-model settings. With that goal in mind, we propose two modules that are only employed during the training phase: a Frequency-Aware Domain Randomization (FADR) module to randomize domain-variant low- and high-range frequency components and a Frequency-Augmented Contrastive Learning (FACL) module to effectively separate domain-invariant mid-frequency features of clean and perturbed images. We demonstrate strong transferability of our generated adversarial perturbations through extensive cross-domain and cross-model experiments, while keeping the inference-time complexity unchanged. \ No newline at end of file diff --git a/data/2024/aaai/FAIR-FER: A Latent Alignment Approach for Mitigating Bias in Facial Expression Recognition (Student Abstract) b/data/2024/aaai/FAIR-FER: A Latent Alignment Approach for Mitigating Bias in Facial Expression Recognition (Student Abstract) new file mode 100644 index 0000000000..1cbdeaaa29 --- /dev/null +++ b/data/2024/aaai/FAIR-FER: A Latent Alignment Approach for Mitigating Bias in Facial Expression Recognition (Student Abstract) @@ -0,0 +1 @@ +Facial Expression Recognition (FER) is an extensively explored research problem in the domain of computer vision and artificial intelligence. FER, a supervised learning problem, requires significant training data representative of multiple socio-cultural demographic attributes. However, most FER datasets consist of images annotated by humans, which propagates individual and demographic biases. This work attempts to mitigate this bias using representation learning based on latent spaces, thereby increasing a deep learning model's fairness and overall accuracy. \ No newline at end of file diff --git a/data/2024/aaai/FAVOR: Full-Body AR-Driven Virtual Object Rearrangement Guided by Instruction Text b/data/2024/aaai/FAVOR: Full-Body AR-Driven Virtual Object Rearrangement Guided by Instruction Text new file mode 100644 index 0000000000..348ce964ca --- /dev/null +++ b/data/2024/aaai/FAVOR: Full-Body AR-Driven Virtual Object Rearrangement Guided by Instruction Text @@ -0,0 +1 @@ +Rearrangement operations form the crux of interactions between humans and their environment. The ability to generate natural, fluid sequences of these operations is of essential value in AR/VR and CG. Bridging a gap in the field, our study introduces FAVOR: a novel dataset for Full-body AR-driven Virtual Object Rearrangement that uniquely employs motion capture systems and AR eyeglasses. Comprising 3k diverse motion rearrangement sequences and 7.17 million interaction data frames, this dataset breaks new ground in research data. We also present a pipeline, FAVORITE, for producing digital human rearrangement motion sequences guided by instructions. Experimental results, both qualitative and quantitative, suggest that this dataset and pipeline deliver high-quality motion sequences. Our dataset, code, and appendix are available at https://kailinli.github.io/FAVOR.
\ No newline at end of file diff --git a/data/2024/aaai/FD3D: Exploiting Foreground Depth Map for Feature-Supervised Monocular 3D Object Detection b/data/2024/aaai/FD3D: Exploiting Foreground Depth Map for Feature-Supervised Monocular 3D Object Detection new file mode 100644 index 0000000000..ebca5ff0f3 --- /dev/null +++ b/data/2024/aaai/FD3D: Exploiting Foreground Depth Map for Feature-Supervised Monocular 3D Object Detection @@ -0,0 +1 @@ +Monocular 3D object detection usually adopts direct or hierarchical label supervision. Recently, distillation supervision has transferred spatial knowledge from LiDAR- or stereo-based teacher networks to monocular detectors, but the domain gap remains. To mitigate this issue and pursue adequate label manipulation, we exploit the foreground depth map for feature-supervised monocular 3D object detection, named FD3D, which develops high-quality instructive intermediate features to conduct desirable auxiliary feature supervision with only the original image and the annotation foreground object-wise depth map (AFOD) as input. Furthermore, we build up our instructive feature generation network to create instructive spatial features based on the sufficient correlation between image features and the pre-processed AFOD, where AFOD provides the attention focus only on foreground objects to achieve clearer guidance in the detection task. Moreover, we apply the auxiliary feature supervision at the pixel and distribution levels to achieve comprehensive spatial knowledge guidance. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both the KITTI and nuScenes datasets, with no external data and no extra inference computational cost. We also conduct quantitative and qualitative studies to reveal the effectiveness of our designs. \ No newline at end of file diff --git a/data/2024/aaai/FFT-Based Dynamic Token Mixer for Vision b/data/2024/aaai/FFT-Based Dynamic Token Mixer for Vision new file mode 100644 index 0000000000..47684a74f4 --- /dev/null +++ b/data/2024/aaai/FFT-Based Dynamic Token Mixer for Vision @@ -0,0 +1 @@ +Multi-head-self-attention (MHSA)-equipped models have achieved notable performance in computer vision. Their computational complexity is quadratic in the number of pixels in the input feature maps, resulting in slow processing, especially when dealing with high-resolution images. New types of token-mixer have been proposed as alternatives to MHSA to circumvent this problem: an FFT-based token-mixer involves global operations similar to MHSA but with lower computational complexity. However, despite its attractive properties, the FFT-based token-mixer has not been carefully examined in terms of its compatibility with the rapidly evolving MetaFormer architecture. Here, we propose a novel token-mixer called Dynamic Filter and novel image recognition models, DFFormer and CDFFormer, to close the gaps above. The results of image classification and downstream tasks, analysis, and visualization show that our models are effective. Notably, their throughput and memory efficiency when dealing with high-resolution image recognition are remarkable. Our results indicate that Dynamic Filter is one of the token-mixer options that should be seriously considered.
The code is available at https://github.com/okojoalg/dfformer \ No newline at end of file diff --git a/data/2024/aaai/FG-EmoTalk: Talking Head Video Generation with Fine-Grained Controllable Facial Expressions b/data/2024/aaai/FG-EmoTalk: Talking Head Video Generation with Fine-Grained Controllable Facial Expressions new file mode 100644 index 0000000000..e2c8099b32 --- /dev/null +++ b/data/2024/aaai/FG-EmoTalk: Talking Head Video Generation with Fine-Grained Controllable Facial Expressions @@ -0,0 +1 @@ +Although deep generative models have greatly improved one-shot video-driven talking head generation, few studies address fine-grained controllable facial expression editing, which is crucial for practical applications. Existing methods rely on a fixed set of predefined discrete emotion labels or simply copy expressions from input videos. This is limiting as expressions are complex, and methods using only emotion labels cannot generate fine-grained, accurate or mixed expressions. Generating talking head videos with precise expressions is also difficult using 3D model-based approaches, as 3DMM only models facial movements and tends to produce deviations. In this paper, we propose a novel framework enabling fine-grained facial expression editing in talking face generation. Our goal is to achieve expression control by manipulating the intensities of individual facial Action Units (AUs) or groups. First, compared with existing methods which decouple the face into pose and expression, we propose a disentanglement scheme to isolate three components from the human face, namely, appearance, pose, and expression. Second, we propose to use input AUs to control muscle group intensities in the generated face, and integrate the AU features with the disentangled expression latent code. Finally, we present a self-supervised training strategy with well-designed constraints. Experiments show our method achieves fine-grained expression control, produces high-quality talking head videos and outperforms baseline methods. \ No newline at end of file diff --git a/data/2024/aaai/FLAME: A Small Language Model for Spreadsheet Formulas b/data/2024/aaai/FLAME: A Small Language Model for Spreadsheet Formulas new file mode 100644 index 0000000000..96bb87d9ac --- /dev/null +++ b/data/2024/aaai/FLAME: A Small Language Model for Spreadsheet Formulas @@ -0,0 +1,15 @@ +Spreadsheets are a vital tool for end-user data management. Using large language +models for formula authoring assistance in these environments can be difficult, +as these models are expensive to train and challenging to deploy due to their +size (up to billions of parameters). We present FLAME, a transformer-based model +trained exclusively on Excel formulas that leverages domain insights to achieve +competitive performance while being substantially smaller (60M parameters) and +training on two orders of magnitude less data. We curate a training dataset +using sketch deduplication, introduce an Excel-specific formula tokenizer, and +use domain-specific versions of masked span prediction and noisy auto-encoding +as pre-training objectives. We evaluate FLAME on formula repair, formula +completion, and similarity-based formula retrieval. FLAME can outperform much +larger models, such as the Davinci (175B) and Cushman (12B) variants of Codex +and CodeT5 (220M), in 10 of 14 evaluation settings for the repair and completion +tasks. For formula retrieval, FLAME outperforms CodeT5, CodeBERT, and
\ No newline at end of file diff --git a/data/2024/aaai/FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection b/data/2024/aaai/FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection new file mode 100644 index 0000000000..c117fee597 --- /dev/null +++ b/data/2024/aaai/FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection @@ -0,0 +1 @@ +The superior performances of pre-trained foundation models in various visual tasks underscore their potential to enhance the 2D models' open-vocabulary ability. Existing methods explore analogous applications in the 3D space. However, most of them only center around knowledge extraction from singular foundation models, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from various foundation models can improve knowledge transfer from 2D pre-trained visual language models to the 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary without facing constraints from original 3D datasets. Specifically, to learn the open-vocabulary 3D localization ability, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition ability, We leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP. The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git. \ No newline at end of file diff --git a/data/2024/aaai/FMRNet: Image Deraining via Frequency Mutual Revision b/data/2024/aaai/FMRNet: Image Deraining via Frequency Mutual Revision new file mode 100644 index 0000000000..07d3af0111 --- /dev/null +++ b/data/2024/aaai/FMRNet: Image Deraining via Frequency Mutual Revision @@ -0,0 +1 @@ +The wavelet transform has emerged as a powerful tool in deciphering structural information within images. And now, the latest research suggests that combining the prowess of wavelet transform with neural networks can lead to unparalleled image deraining results. By harnessing the strengths of both the spatial domain and frequency space, this innovative approach is poised to revolutionize the field of image processing. The fascinating challenge of developing a comprehensive framework that takes into account the intrinsic frequency property and the correlation between rain residue and background is yet to be fully explored. In this work, we propose to investigate the potential relationships among rain-free and residue components at the frequency domain, forming a frequency mutual revision network (FMRNet) for image deraining. Specifically, we explore the mutual representation of rain residue and background components at frequency domain, so as to better separate the rain layer from clean background while preserving structural textures of the degraded images. 
Meanwhile, the rain distribution prediction from the low-frequency coefficient, which can be seen as a degradation prior, is used to refine the separation of rain residue and background components. Inversely, the updated rain residue is used to benefit the low-frequency rain distribution prediction, forming a multi-layer mutual learning scheme. Extensive experiments demonstrate that our proposed FMRNet delivers significant performance gains on the image deraining task across seven datasets, surpassing the state-of-the-art method ELFormer by 1.14 dB in PSNR on the Rain100L dataset, with a similar computation cost. Code and retrained models are available at https://github.com/kuijiang94/FMRNet. \ No newline at end of file diff --git a/data/2024/aaai/FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields b/data/2024/aaai/FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields new file mode 100644 index 0000000000..ee063642ac --- /dev/null +++ b/data/2024/aaai/FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields @@ -0,0 +1 @@ +We present FPRF, a feed-forward photorealistic style transfer method for large-scale 3D neural radiance fields. FPRF stylizes large-scale 3D scenes with arbitrary, multiple style reference images without additional optimization while preserving multi-view appearance consistency. Prior arts required tedious per-style/-scene optimization and were limited to small-scale 3D scenes. FPRF efficiently stylizes large-scale 3D scenes by introducing a style-decomposed 3D neural radiance field, which inherits AdaIN’s feed-forward stylization machinery, supporting arbitrary style reference images. Furthermore, FPRF supports multi-reference stylization with semantic correspondence matching and local AdaIN, which adds diverse user control over 3D scene styles. FPRF also preserves multi-view consistency by applying semantic matching and style transfer processes directly onto queried features in 3D space. In experiments, we demonstrate that FPRF achieves favorable photorealistic quality 3D scene stylization for large-scale scenes with diverse reference images. \ No newline at end of file diff --git a/data/2024/aaai/FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection b/data/2024/aaai/FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection new file mode 100644 index 0000000000..3884b2e9ff --- /dev/null +++ b/data/2024/aaai/FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection @@ -0,0 +1 @@ +Rotation-equivariance is an essential yet challenging property in oriented object detection. While general object detectors naturally leverage robustness to spatial shifts due to the translation-equivariance of conventional CNNs, achieving rotation-equivariance remains an elusive goal. Current detectors deploy various alignment techniques to derive rotation-invariant features, but still rely on high-capacity models and heavy data augmentation with all possible rotations. In this paper, we introduce a Fully Rotation-Equivariant Oriented Object Detector (FRED), whose entire process from the image to the bounding box prediction is strictly equivariant. Specifically, we decouple the invariant task (object classification) and the equivariant task (object localization) to achieve end-to-end equivariance. We represent the bounding box as a set of rotation-equivariant vectors to implement rotation-equivariant localization.
Moreover, we utilize these rotation-equivariant vectors as offsets in the deformable convolution, thereby enhancing the existing advantages of spatial adaptation. Leveraging full rotation-equivariance, our FRED demonstrates higher robustness to image-level rotation compared to existing methods. Furthermore, our experiments show that FRED is one step closer to non-axis-aligned learning. Compared to state-of-the-art methods, our proposed method delivers comparable performance on DOTA-v1.0 and outperforms them by 1.5 mAP on DOTA-v1.5, all while significantly reducing the model parameters to 16%. \ No newline at end of file diff --git a/data/2024/aaai/FRIH: Fine-Grained Region-Aware Image Harmonization b/data/2024/aaai/FRIH: Fine-Grained Region-Aware Image Harmonization new file mode 100644 index 0000000000..60bf1ce790 --- /dev/null +++ b/data/2024/aaai/FRIH: Fine-Grained Region-Aware Image Harmonization @@ -0,0 +1 @@ +Image harmonization aims to generate a more realistic appearance of foreground and background for a composite image. All the existing methods perform the same harmonization process for the whole foreground. However, the implanted foreground always contains different appearance patterns. Existing solutions ignore the differences among color blocks and lose specific details. Therefore, we propose a novel two-stage global-local framework for Fine-grained Region-aware Image Harmonization (FRIH). In the first stage, the whole input foreground mask is used to make a global coarse-grained harmonization. In the second stage, we adaptively cluster the input foreground mask into several submasks. Each submask and the coarsely adjusted image are concatenated respectively and fed into a lightweight cascaded module, refining the global harmonization result. Moreover, we further design a fusion prediction module to generate the final result, comprehensively utilizing the harmonization results of different degrees. Without bells and whistles, our FRIH achieves a competitive performance on the iHarmony4 dataset with a lightweight model. \ No newline at end of file diff --git a/data/2024/aaai/FT-GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis b/data/2024/aaai/FT-GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis new file mode 100644 index 0000000000..5cd27c27f7 --- /dev/null +++ b/data/2024/aaai/FT-GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis @@ -0,0 +1 @@ +Although singing voice synthesis (SVS) has made significant progress recently, Chinese opera synthesis, with its unique styles and various genres, deserves greater attention but is rarely studied due to the lack of training data and its high expressiveness. In this work, we build a high-quality Gezi Opera (a type of Chinese opera popular in Fujian and Taiwan) audio-text alignment dataset and formulate specific data annotation methods applicable to Chinese operas. We propose FT-GAN, an acoustic model for fine-grained tune modeling in Chinese opera synthesis based on the empirical analysis of the differences between Chinese operas and pop songs. To further improve the quality of the synthesized opera, we propose a speech pre-training strategy for additional knowledge injection. The experimental results show that FT-GAN outperforms the strong baselines in SVS on the Gezi Opera synthesis task. Extensive experiments further verify that FT-GAN performs well on synthesis tasks of other operas such as Peking Opera.
Audio samples, the dataset, and the code are available at https://zhengmidon.github.io/FTGAN.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/FaceCoresetNet: Differentiable Coresets for Face Set Recognition b/data/2024/aaai/FaceCoresetNet: Differentiable Coresets for Face Set Recognition new file mode 100644 index 0000000000..c1e4bbe3c8 --- /dev/null +++ b/data/2024/aaai/FaceCoresetNet: Differentiable Coresets for Face Set Recognition @@ -0,0 +1,3 @@ +In set-based face recognition, we aim to compute the most discriminative descriptor from an unbounded set of images and videos showing a single person. A discriminative descriptor balances two policies when aggregating information from a given set. The first is a quality-based policy: emphasizing high-quality and down-weighting low-quality images. The second is a diversity-based policy: emphasizing unique images in the set and down-weighting multiple occurrences of similar images as found in video clips, which can overwhelm the set representation. +This work frames face-set representation as a differentiable coreset selection problem. Our model learns how to select a small coreset of the input set that balances quality and diversity policies using a learned metric parameterized by the face quality, optimized end-to-end. The selection process is a differentiable farthest-point sampling (FPS) realized by approximating the non-differentiable Argmax operation with differentiable sampling from the Gumbel-Softmax distribution of distances. The small coreset is later used as queries in a self- and cross-attention architecture to enrich the descriptor with information from the whole set. Our model is order-invariant and linear in the input set size. +We set a new SOTA for set-based face verification on the IJB-B and IJB-C datasets. Our code is publicly available at https://github.com/ligaripash/FaceCoresetNet. \ No newline at end of file diff --git a/data/2024/aaai/FaceRSA: RSA-Aware Facial Identity Cryptography Framework b/data/2024/aaai/FaceRSA: RSA-Aware Facial Identity Cryptography Framework new file mode 100644 index 0000000000..c74e71bab2 --- /dev/null +++ b/data/2024/aaai/FaceRSA: RSA-Aware Facial Identity Cryptography Framework @@ -0,0 +1 @@ +With the flourishing of the Internet, sharing one's photos or automated processing of faces using computer vision technology has become an everyday occurrence. While people enjoy this convenience, concerns about identity privacy are also emerging. Therefore, some efforts introduced the concept of ``password'' from traditional cryptography such as RSA into the face anonymization and deanonymization task to protect the facial identity without compromising the usability of the face image. However, these methods either suffer from the poor visual quality of the synthesis results or do not possess the full cryptographic properties, resulting in compromised security. In this paper, we present the first facial identity cryptography framework with full properties analogous to RSA. Our framework leverages the powerful generative capabilities of StyleGAN to achieve megapixel-level facial identity anonymization and deanonymization. Thanks to the great semantic decoupling of StyleGAN's latent space, the identity encryption and decryption processes are performed in latent space by a well-designed password mapper in the manner of editing the latent code. Meanwhile, the password-related information is imperceptibly hidden in the edited latent code owing to the redundant nature of the latent space.
To make our cryptographic framework possess all the properties analogous to RSA, we propose three types of loss functions: single anonymization loss, sequential anonymization loss, and associated anonymization loss. Extensive experiments and ablation analyses demonstrate the superiority of our method in terms of the quality of synthesis results, identity-irrelevant attribute preservation, deanonymization accuracy, and completeness of properties analogous to RSA. \ No newline at end of file diff --git a/data/2024/aaai/FacetCRS: Multi-Faceted Preference Learning for Pricking Filter Bubbles in Conversational Recommender System b/data/2024/aaai/FacetCRS: Multi-Faceted Preference Learning for Pricking Filter Bubbles in Conversational Recommender System new file mode 100644 index 0000000000..79256c8b25 --- /dev/null +++ b/data/2024/aaai/FacetCRS: Multi-Faceted Preference Learning for Pricking Filter Bubbles in Conversational Recommender System @@ -0,0 +1 @@ +The filter bubble is a notorious issue in Recommender Systems (RSs), which describes the phenomenon whereby users are exposed to a limited and narrow range of information or content that reinforces their existing dominant preferences and beliefs. This results in a lack of exposure to diverse and varied content. Many existing works have predominantly examined filter bubbles in static or relatively-static recommendation settings. However, filter bubbles will be continuously intensified over time due to the feedback loop between the user and the system in real-world online recommendation. To address these issues, we propose a novel paradigm, Multi-Facet Preference Learning for Pricking Filter Bubbles in Conversational Recommender System (FacetCRS), which aims to burst filter bubbles in the conversational recommender system (CRS) through timely user-item interactions via natural language conversations. By considering diverse user preferences and intentions, FacetCRS automatically models user preferences along multiple facets, including entity-, word-, context-, and review-level facets, to capture diverse and dynamic user preferences and prick filter bubbles in the CRS. It is an end-to-end CRS framework that adaptively learns representations of various levels of preference facets and diverse types of external knowledge. Extensive experiments on two publicly available benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance in mitigating filter bubbles and enhancing recommendation quality in CRS. \ No newline at end of file diff --git a/data/2024/aaai/Fact-Driven Logical Reasoning for Machine Reading Comprehension b/data/2024/aaai/Fact-Driven Logical Reasoning for Machine Reading Comprehension new file mode 100644 index 0000000000..20b1ff6129 --- /dev/null +++ b/data/2024/aaai/Fact-Driven Logical Reasoning for Machine Reading Comprehension @@ -0,0 +1 @@ +Recent years have witnessed an increasing interest in training machines with reasoning ability, which deeply relies on accurately and clearly presented clue forms. The clues are usually modeled as entity-aware knowledge in existing studies. However, those entity-aware clues are primarily focused on commonsense, making them insufficient for tasks that require knowledge of temporary facts or events, particularly in logical reasoning for reading comprehension. To address this challenge, we are motivated to cover both commonsense and temporary knowledge clues hierarchically.
Specifically, we propose a general formalism of knowledge units by extracting backbone constituents of the sentence, such as the subject-verb-object formed ``facts''. We then construct a supergraph on top of the fact units, allowing for the benefit of sentence-level (relations among fact groups) and entity-level interactions (concepts or actions inside a fact). Experimental results on logical reasoning benchmarks and dialogue modeling datasets show that our approach improves the baselines substantially, and it is general across backbone models. Code is available at https://github.com/ozyyshr/FocalReasoner. \ No newline at end of file diff --git a/data/2024/aaai/Factored Online Planning in Many-Agent POMDPs b/data/2024/aaai/Factored Online Planning in Many-Agent POMDPs new file mode 100644 index 0000000000..010577b2f4 --- /dev/null +++ b/data/2024/aaai/Factored Online Planning in Many-Agent POMDPs @@ -0,0 +1 @@ +In centralized multi-agent systems, often modeled as multi-agent partially observable Markov decision processes (MPOMDPs), the action and observation spaces grow exponentially with the number of agents, making the value and belief estimation of single-agent online planning ineffective. Prior work partially tackles value estimation by exploiting the inherent structure of multi-agent settings via so-called coordination graphs. Additionally, belief estimation methods have been improved by incorporating the likelihood of observations into the approximation. However, the challenges of value estimation and belief estimation have only been tackled individually, which prevents existing methods from scaling to settings with many agents. Therefore, we address these challenges simultaneously. First, we introduce weighted particle filtering to a sample-based online planner for MPOMDPs. Second, we present a scalable approximation of the belief. Third, we bring an approach that exploits the typical locality of agent interactions to novel online planning algorithms for MPOMDPs operating on a so-called sparse particle filter tree. Our experimental evaluation against several state-of-the-art baselines shows that our methods (1) are competitive in settings with only a few agents and (2) improve over the baselines in the presence of many agents. \ No newline at end of file diff --git a/data/2024/aaai/Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning b/data/2024/aaai/Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning new file mode 100644 index 0000000000..195c7c4e88 --- /dev/null +++ b/data/2024/aaai/Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning @@ -0,0 +1 @@ +Unsupervised disentangled representation learning aims to recover semantically meaningful factors from real-world data without supervision, which is significant for model generalization and interpretability. Current methods mainly rely on assumptions of independence or informativeness of factors, regardless of interpretability. Intuitively, visually interpretable concepts better align with human-defined factors. However, exploiting visual interpretability as inductive bias is still under-explored. Inspired by the observation that most explanatory image factors can be represented by ``content + mask'', we propose a content-mask factorization network (CMFNet) to decompose an image into different groups of content codes and masks, which are further combined as content masks to represent different visual concepts. 
To ensure informativeness of the representations, the CMFNet is jointly learned with a generator conditioned on the content masks for reconstructing the input image. The conditional generator employs a diffusion model to leverage its robust distribution modeling capability. Our model is called the Factorized Diffusion Autoencoder (FDAE). To enhance disentanglement of visual concepts, we propose a content decorrelation loss and a mask entropy loss to decorrelate content masks in latent space and spatial space, respectively. Experiments on Shapes3d, MPI3D and Cars3d show that our method achieves advanced performance and can generate visually interpretable concept-specific masks. Source code and supplementary materials are available at https://github.com/wuancong/FDAE. \ No newline at end of file diff --git a/data/2024/aaai/Fair Allocation of Items in Multiple Regions b/data/2024/aaai/Fair Allocation of Items in Multiple Regions new file mode 100644 index 0000000000..906844f23e --- /dev/null +++ b/data/2024/aaai/Fair Allocation of Items in Multiple Regions @@ -0,0 +1 @@ +We initiate the study of fair allocation with the set of divisible or indivisible items distributed in multiple regions. The key requirement is that each agent can only obtain items from one region. In this work, we consider two kinds of fairness concepts: envy-based notions including envy-freeness (EF) and envy-freeness up to one/any item (EF1/EFX), and share-based notions including proportionality (PROP) and proportionality up to one/any item (PROP1/PROPX). On the negative side, we show NP-hardness and inapproximability results about the aforementioned fairness notions. On the positive side, we propose several algorithms to compute the partial allocations that satisfy envy-based notions and allocations that approximate the above fairness notions. \ No newline at end of file diff --git a/data/2024/aaai/Fair Graph Learning Using Constraint-Aware Priority Adjustment and Graph Masking in River Networks b/data/2024/aaai/Fair Graph Learning Using Constraint-Aware Priority Adjustment and Graph Masking in River Networks new file mode 100644 index 0000000000..e6f7b1b216 --- /dev/null +++ b/data/2024/aaai/Fair Graph Learning Using Constraint-Aware Priority Adjustment and Graph Masking in River Networks @@ -0,0 +1 @@ +Accurate prediction of water quality and quantity is crucial for sustainable development and human well-being. However, existing data-driven methods often suffer from spatial biases in model performance due to heterogeneous data, limited observations, and noisy sensor data. To overcome these challenges, we propose Fair-Graph, a novel graph-based recurrent neural network that leverages interrelated knowledge from multiple rivers to predict water flow and temperature within large-scale stream networks. Additionally, we introduce node-specific graph masks for information aggregation and adaptation to enhance prediction over heterogeneous river segments. To reduce performance disparities across river segments, we introduce a centralized coordination strategy that adjusts training priorities for segments. We evaluate the prediction of water temperature within the Delaware River Basin, and the prediction of streamflow using simulated data from U.S. National Water Model in the Houston River network. The results showcase improvements in predictive performance and highlight the proposed model's ability to maintain spatial fairness over different river segments. 
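A minimal sketch of the node-specific masked aggregation idea described above, written in PyTorch; the mask parameterization, the mean aggregation over a dense adjacency matrix, and the layer layout are illustrative assumptions for exposition, not details taken from the Fair-Graph paper.

import torch
import torch.nn as nn

class MaskedMeanAggregation(nn.Module):
    # Each node (river segment) owns a learnable mask that gates its aggregated neighbor message.
    def __init__(self, num_nodes: int, in_dim: int, out_dim: int):
        super().__init__()
        self.node_masks = nn.Parameter(torch.zeros(num_nodes, in_dim))
        self.update = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, in_dim) node features; adj: (num_nodes, num_nodes) binary adjacency.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        neigh_mean = adj @ h / deg                            # mean message from neighboring segments
        gated = torch.sigmoid(self.node_masks) * neigh_mean   # node-specific gating of the message
        return torch.relu(self.update(torch.cat([h, gated], dim=-1)))

# Toy usage on a 4-segment network
layer = MaskedMeanAggregation(num_nodes=4, in_dim=8, out_dim=16)
h = torch.randn(4, 8)
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
out = layer(h, adj)  # (4, 16) updated segment representations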
\ No newline at end of file diff --git a/data/2024/aaai/Fair Lotteries for Participatory Budgeting b/data/2024/aaai/Fair Lotteries for Participatory Budgeting new file mode 100644 index 0000000000..9c14a8d88b --- /dev/null +++ b/data/2024/aaai/Fair Lotteries for Participatory Budgeting @@ -0,0 +1 @@ +In pursuit of participatory budgeting (PB) outcomes with broader fairness guarantees, we initiate the study of lotteries over discrete PB outcomes. As the projects have heterogeneous costs, the amount spent may not be equal ex ante and ex post. To address this, we develop a technique to bound the amount by which the ex-post spend differs from the ex-ante spend---the property is termed budget balanced up to one project (BB1). With respect to fairness, we take a best-of-both-worlds perspective, seeking outcomes that are both ex-ante and ex-post fair. Towards this goal, we initiate a study of ex-ante fairness properties in PB, including Individual Fair Share (IFS), Unanimous Fair Share (UFS) and their stronger variants, as well as Group Fair Share (GFS). We show several incompatibility results between these ex-ante fairness notions and existing ex-post concepts based on justified representation. One of our main contributions is a randomized algorithm which simultaneously satisfies ex-ante Strong UFS, ex-post full justified representation (FJR) and ex-post BB1 for PB with binary utilities. \ No newline at end of file diff --git a/data/2024/aaai/Fair Multivariate Adaptive Regression Splines for Ensuring Equity and Transparency b/data/2024/aaai/Fair Multivariate Adaptive Regression Splines for Ensuring Equity and Transparency new file mode 100644 index 0000000000..d1dcc2d329 --- /dev/null +++ b/data/2024/aaai/Fair Multivariate Adaptive Regression Splines for Ensuring Equity and Transparency @@ -0,0 +1 @@ +Predictive analytics has been widely used in various domains, including education, to inform decision-making and improve outcomes. However, many predictive models are proprietary and inaccessible for evaluation or modification by researchers and practitioners, limiting their accountability and ethical design. Moreover, predictive models are often opaque and incomprehensible to the officials who use them, reducing their trust and utility. Furthermore, predictive models may introduce or exacerbate bias and inequity, as they have done in many sectors of society. Therefore, there is a need for transparent, interpretable, and fair predictive models that can be easily adopted and adapted by different stakeholders. In this paper, we propose a fair predictive model based on multivariate adaptive regression splines (MARS) that incorporates fairness measures in the learning process. MARS is a non-parametric regression model that performs feature selection, handles non-linear relationships, generates interpretable decision rules, and derives optimal splitting criteria on the variables. Specifically, we integrate fairness into the knot optimization algorithm and provide theoretical and empirical evidence of how it results in a fair knot placement. We apply our fairMARS model to real-world data and demonstrate its effectiveness in terms of accuracy and equity. Our paper contributes to the advancement of responsible and ethical predictive analytics for social good. 
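As a rough illustration of folding a fairness measure into MARS-style knot selection, the sketch below scores each candidate knot of a hinge basis max(0, x - t) by its residual sum of squares plus a penalty on the gap between group-wise mean residuals; the squared-error criterion, the disparity measure, and the trade-off weight lam are assumptions made for this sketch, not the exact fairMARS criterion.

import numpy as np

def knot_score(x, y, group, knot, lam=1.0):
    # Fit intercept + hinge basis max(0, x - knot) by least squares.
    basis = np.maximum(0.0, x - knot)
    X = np.column_stack([np.ones_like(x), basis])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    rss = float(resid @ resid)
    # Fairness term: spread of the absolute mean residual across protected groups.
    gaps = [abs(resid[group == g].mean()) for g in np.unique(group)]
    disparity = max(gaps) - min(gaps)
    return rss + lam * disparity

def best_knot(x, y, group, candidates, lam=1.0):
    # Pick the candidate knot with the lowest fairness-penalized score.
    return min(candidates, key=lambda t: knot_score(x, y, group, t, lam))

# Toy usage
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
group = rng.integers(0, 2, 200)
y = 2.0 * np.maximum(0.0, x - 4.0) + 0.5 * group + rng.normal(0, 0.3, 200)
print(best_knot(x, y, group, candidates=np.linspace(1, 9, 17)))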
\ No newline at end of file diff --git a/data/2024/aaai/Fair Participation via Sequential Policies b/data/2024/aaai/Fair Participation via Sequential Policies new file mode 100644 index 0000000000..39b44e0e16 --- /dev/null +++ b/data/2024/aaai/Fair Participation via Sequential Policies @@ -0,0 +1 @@ +Leading approaches to algorithmic fairness and policy-induced distribution shift are often misaligned with long-term objectives in sequential settings. We aim to correct these shortcomings by ensuring that both the objective and fairness constraints account for policy-induced distribution shift. First, we motivate this problem using an example in which individuals subject to algorithmic predictions modulate their willingness to participate with the policy maker. Fairness in this example is measured by the variance of group participation rates. Next, we develop a method for solving the resulting constrained, non-linear optimization problem and prove that this method converges to a fair, locally optimal policy given first-order information. Finally, we experimentally validate our claims in a semi-synthetic setting. \ No newline at end of file diff --git a/data/2024/aaai/Fair Representation Learning with Maximum Mean Discrepancy Distance Constraint (Student Abstract) b/data/2024/aaai/Fair Representation Learning with Maximum Mean Discrepancy Distance Constraint (Student Abstract) new file mode 100644 index 0000000000..6063b6f3e3 --- /dev/null +++ b/data/2024/aaai/Fair Representation Learning with Maximum Mean Discrepancy Distance Constraint (Student Abstract) @@ -0,0 +1 @@ +Unsupervised learning methods such as principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and autoencoding are regularly used for dimensionality reduction within statistical learning. However, despite a pivot toward fairness and explainability in machine learning over the past few years, there have been few rigorous attempts toward a generalized framework of fair and explainable representation learning. Our paper explores the possibility of such a framework that leverages maximum mean discrepancy to remove information derived from a protected class from generated representations. For the optimization, we introduce a binary search component to optimize the Lagrangian coefficients. We present rigorous mathematical analysis and experimental results of our framework applied to t-SNE. \ No newline at end of file diff --git a/data/2024/aaai/Fair Sampling in Diffusion Models through Switching Mechanism b/data/2024/aaai/Fair Sampling in Diffusion Models through Switching Mechanism new file mode 100644 index 0000000000..152e11effb --- /dev/null +++ b/data/2024/aaai/Fair Sampling in Diffusion Models through Switching Mechanism @@ -0,0 +1,3 @@ +Diffusion models have shown their effectiveness in generation tasks by well-approximating the underlying probability distribution. However, diffusion models are known to suffer from an amplified inherent bias from the training data in terms of fairness. While the sampling process of diffusion models can be controlled by conditional guidance, previous works have attempted to find empirical guidance to achieve quantitative fairness. +To address this limitation, we propose a fairness-aware sampling method, called the attribute switching mechanism, for diffusion models. Without additional training, the proposed sampling can obfuscate sensitive attributes in generated data without relying on classifiers.
+We mathematically prove and experimentally demonstrate the effectiveness of the proposed method on two key aspects: (i) the generation of fair data and (ii) the preservation of the utility of the generated data. \ No newline at end of file diff --git a/data/2024/aaai/Fair and Optimal Prediction via Post-Processing b/data/2024/aaai/Fair and Optimal Prediction via Post-Processing new file mode 100644 index 0000000000..034d456616 --- /dev/null +++ b/data/2024/aaai/Fair and Optimal Prediction via Post-Processing @@ -0,0 +1 @@ +In this talk I will discuss our recent work on characterizing the inherent tradeoff between fairness and accuracy in both classification and regression problems. I will also present a post-processing algorithm that derives optimal fair predictors from Bayes score functions. \ No newline at end of file diff --git a/data/2024/aaai/FairPlay: A Multi-Sided Fair Dynamic Pricing Policy for Hotels b/data/2024/aaai/FairPlay: A Multi-Sided Fair Dynamic Pricing Policy for Hotels new file mode 100644 index 0000000000..75da8ab129 --- /dev/null +++ b/data/2024/aaai/FairPlay: A Multi-Sided Fair Dynamic Pricing Policy for Hotels @@ -0,0 +1 @@ +In recent years, popular tourist destinations have faced overtourism. Local communities suffer from its consequences in several ways. Among other effects, overpricing and profiteering deeply harm local societies and economies. In this paper, we focus on the problem of determining fair hotel room prices. Specifically, we put forward a dynamic pricing policy where the price of a room depends not only on the demand of the hotel it belongs to but also on the demand of: (i) similar rooms in the area and (ii) their hotels. To this end, we model our setting as a cooperative game and exploit an appropriate game-theoretic solution concept that promotes fairness on both the customers' and the providers' side. Our simulation results, involving price adjustments across real-world hotel datasets, confirm that ours is a fair dynamic pricing policy, avoiding both over- and under-pricing hotel rooms. \ No newline at end of file diff --git a/data/2024/aaai/FairSIN: Achieving Fairness in Graph Neural Networks through Sensitive Information Neutralization b/data/2024/aaai/FairSIN: Achieving Fairness in Graph Neural Networks through Sensitive Information Neutralization new file mode 100644 index 0000000000..db64f0d307 --- /dev/null +++ b/data/2024/aaai/FairSIN: Achieving Fairness in Graph Neural Networks through Sensitive Information Neutralization @@ -0,0 +1 @@ +Despite the remarkable success of graph neural networks (GNNs) in modeling graph-structured data, like other machine learning models, GNNs are also susceptible to making biased predictions based on sensitive attributes, such as race and gender. For fairness consideration, recent state-of-the-art (SOTA) methods propose to filter out sensitive information from inputs or representations, e.g., edge dropping or feature masking. However, we argue that such filtering-based strategies may also filter out some non-sensitive feature information, leading to a sub-optimal trade-off between predictive performance and fairness. To address this issue, we unveil an innovative neutralization-based paradigm, where additional Fairness-facilitating Features (F3) are incorporated into node features or representations before message passing. The F3 are expected to statistically neutralize the sensitive bias in node representations and provide additional nonsensitive information.
We also provide theoretical explanations for our rationale, concluding that F3 can be realized by emphasizing the features of each node’s heterogeneous neighbors (neighbors with different sensitive attributes). We name our method FairSIN, and present three implementation variants from both data-centric and model-centric perspectives. Experimental results on five benchmark datasets with three different GNN backbones show that FairSIN significantly improves fairness metrics while maintaining high prediction accuracies. Code and the appendix can be found at https://github.com/BUPT-GAMMA/FariSIN. \ No newline at end of file diff --git a/data/2024/aaai/FairTrade: Achieving Pareto-Optimal Trade-Offs between Balanced Accuracy and Fairness in Federated Learning b/data/2024/aaai/FairTrade: Achieving Pareto-Optimal Trade-Offs between Balanced Accuracy and Fairness in Federated Learning new file mode 100644 index 0000000000..58080fa570 --- /dev/null +++ b/data/2024/aaai/FairTrade: Achieving Pareto-Optimal Trade-Offs between Balanced Accuracy and Fairness in Federated Learning @@ -0,0 +1 @@ +As Federated Learning (FL) gains prominence in distributed machine learning applications, achieving fairness without compromising predictive performance becomes paramount. The data being gathered from distributed clients in an FL environment often leads to class imbalance. In such scenarios, balanced accuracy rather than accuracy is the true representation of model performance. However, most state-of-the-art fair FL methods report accuracy as the measure of performance, which can lead to misguided interpretations of the model's effectiveness in mitigating discrimination. To the best of our knowledge, this work presents the first attempt towards achieving Pareto-optimal trade-offs between balanced accuracy and fairness in a federated environment (FairTrade). By utilizing multi-objective optimization, the framework negotiates the intricate balance between the model's balanced accuracy and fairness. The framework's agnostic design adeptly accommodates both statistical and causal fairness notions, ensuring its adaptability across diverse FL contexts. We provide empirical evidence of our framework's efficacy through extensive experiments on five real-world datasets and comparisons with six baselines. The empirical results underscore the potential of our framework in improving the trade-off between fairness and balanced accuracy in FL applications. \ No newline at end of file diff --git a/data/2024/aaai/Fairness under Covariate Shift: Improving Fairness-Accuracy Tradeoff with Few Unlabeled Test Samples b/data/2024/aaai/Fairness under Covariate Shift: Improving Fairness-Accuracy Tradeoff with Few Unlabeled Test Samples new file mode 100644 index 0000000000..61c88607a1 --- /dev/null +++ b/data/2024/aaai/Fairness under Covariate Shift: Improving Fairness-Accuracy Tradeoff with Few Unlabeled Test Samples @@ -0,0 +1 @@ +Covariate shift in the test data is a common practical phenomenon that can significantly downgrade both the accuracy and the fairness performance of the model. Ensuring fairness across different sensitive groups under covariate shift is of paramount importance due to societal implications like criminal justice. We operate in the unsupervised regime where only a small set of unlabeled test samples along with a labeled training set is available. Towards improving fairness under this highly challenging yet realistic scenario, we make three contributions.
The first is a novel composite weighted-entropy-based objective for prediction accuracy, which is optimized along with a representation matching loss for fairness. We experimentally verify that optimizing with our loss formulation outperforms a number of state-of-the-art baselines in the Pareto sense with respect to the fairness-accuracy tradeoff on several standard datasets. Our second contribution is a new setting we term Asymmetric Covariate Shift that, to the best of our knowledge, has not been studied before. Asymmetric covariate shift occurs when the distribution of covariates of one group shifts significantly compared to the other groups, which happens when a dominant group is over-represented. While this setting is extremely challenging for current baselines, we show that our proposed method significantly outperforms them. Our third contribution is theoretical: we show that our weighted entropy term, together with the prediction loss on the training set, approximates the test loss under covariate shift. Empirically and through formal sample complexity bounds, we show that this approximation to the unseen test loss does not depend on the importance sampling variance that affects many other baselines. \ No newline at end of file diff --git a/data/2024/aaai/Fairness with Censorship: Bridging the Gap between Fairness Research and Real-World Deployment b/data/2024/aaai/Fairness with Censorship: Bridging the Gap between Fairness Research and Real-World Deployment new file mode 100644 index 0000000000..68ae3e3008 --- /dev/null +++ b/data/2024/aaai/Fairness with Censorship: Bridging the Gap between Fairness Research and Real-World Deployment @@ -0,0 +1 @@ +Recent works in artificial intelligence fairness attempt to mitigate discrimination by proposing constrained optimization programs that achieve parity for some fairness statistics. Most assume the availability of class labels, which is impractical in many real-world applications such as precision medicine, actuarial analysis and recidivism prediction. To this end, this talk revisits fairness and reveals an idiosyncrasy of the existing fairness literature: the assumption that class labels are available, which limits its real-world utility. The primary contributions are a formulation of fairness with censorship to account for scenarios where the class label is not guaranteed, and a suite of corresponding new fairness notions, algorithms, and theoretical constructs to bridge the gap between the design of a "fair" model in the lab and its deployment in the real world. \ No newline at end of file diff --git a/data/2024/aaai/Fairness without Demographics through Shared Latent Space-Based Debiasing b/data/2024/aaai/Fairness without Demographics through Shared Latent Space-Based Debiasing new file mode 100644 index 0000000000..88eefa4122 --- /dev/null +++ b/data/2024/aaai/Fairness without Demographics through Shared Latent Space-Based Debiasing @@ -0,0 +1 @@ +Ensuring fairness in machine learning (ML) is crucial, particularly in applications that impact diverse populations. The majority of existing works heavily rely on the availability of protected features like race and gender. However, practical challenges such as privacy concerns and regulatory restrictions often prohibit the use of this data, limiting the scope of traditional fairness research.
To address this, we introduce a Shared Latent Space-based Debiasing (SLSD) method that transforms data from both the target domain, which lacks protected features, and a separate source domain, which contains these features, into correlated latent representations. This allows for joint training of a cross-domain protected group estimator on the representations. We then debias the downstream ML model with an adversarial learning technique that leverages the group estimator. We also present a relaxed variant of SLSD, the R-SLSD, that occasionally accesses a small subset of protected features from the target domain during its training phase. Our extensive experiments on benchmark datasets demonstrate that our methods consistently outperform existing state-of-the-art models in standard group fairness metrics. \ No newline at end of file diff --git a/data/2024/aaai/Fairness-Aware Structured Pruning in Transformers b/data/2024/aaai/Fairness-Aware Structured Pruning in Transformers new file mode 100644 index 0000000000..7174b6f3ed --- /dev/null +++ b/data/2024/aaai/Fairness-Aware Structured Pruning in Transformers @@ -0,0 +1 @@ +The increasing size of large language models (LLMs) has introduced challenges in their training and inference. Removing model components is perceived as a solution to tackle the large model sizes; however, existing pruning methods solely focus on performance, without considering an essential aspect for the responsible use of LLMs: model fairness. It is crucial to address the fairness of LLMs towards diverse groups, such as women, Black people, LGBTQ+ people, and Jewish communities, among others, as these models are deployed and made available to a wide audience. In this work, first, we investigate how attention heads impact fairness and performance in pre-trained transformer-based language models. We then propose a novel method to prune the attention heads that negatively impact fairness while retaining the heads critical for performance, i.e., language modeling capabilities. Our approach is practical in terms of time and resources, as it does not require fine-tuning the final pruned, and fairer, model. Our findings demonstrate a reduction in gender bias by 19%, 19.5%, 39.5%, 34.7%, 23%, and 8% for DistilGPT-2, GPT-2, GPT-Neo of two different sizes, GPT-J, and Llama 2 models, respectively, in comparison to the biased model, with only a slight decrease in performance. WARNING: This work uses language that is offensive in nature. \ No newline at end of file diff --git a/data/2024/aaai/Faithful Trip Recommender Using Diffusion Guidance (Student Abstract) b/data/2024/aaai/Faithful Trip Recommender Using Diffusion Guidance (Student Abstract) new file mode 100644 index 0000000000..2547818965 --- /dev/null +++ b/data/2024/aaai/Faithful Trip Recommender Using Diffusion Guidance (Student Abstract) @@ -0,0 +1 @@ +Trip recommendation aims to plan a user’s travel based on their specified preferences. Traditional heuristic and statistical approaches often fail to capture the intricate nuances of user intentions, leading to subpar performance. Recent deep-learning methods show attractive accuracy but struggle to generate faithful trajectories that match user intentions. In this work, we propose a DDPM-based incremental knowledge injection module to ensure the faithfulness of the generated trajectories. Experiments on two datasets verify the effectiveness of our approach.
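As an illustration of the head-pruning idea in "Fairness-Aware Structured Pruning in Transformers" above, the sketch below greedily masks attention heads whose removal reduces a bias score without degrading language modeling quality too much. The callables bias_score and ppl_score, the parameters k and max_ppl_rise, and the greedy rule itself are assumptions for illustration only, not the paper's actual selection procedure.

def select_heads_to_prune(heads, bias_score, ppl_score, model, k=4, max_ppl_rise=0.05):
    """Greedily pick up to k heads whose masking reduces bias while keeping perplexity near the baseline."""
    base_bias = bias_score(model, masked=set())
    base_ppl = ppl_score(model, masked=set())
    pruned, candidates = set(), list(heads)
    for _ in range(k):
        best = None
        for h in candidates:
            trial = pruned | {h}
            gain = base_bias - bias_score(model, masked=trial)   # bias reduction from masking this head
            cost = ppl_score(model, masked=trial) - base_ppl     # language-modeling degradation
            if gain > 0 and cost <= max_ppl_rise and (best is None or gain > best[1]):
                best = (h, gain)
        if best is None:
            break
        pruned.add(best[0])
        candidates.remove(best[0])
    return pruned

Since no fine-tuning is involved, a procedure of this shape only requires forward evaluations of the masked model, which is what makes this style of pruning cheap in time and resources.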
\ No newline at end of file diff --git a/data/2024/aaai/FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval b/data/2024/aaai/FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval new file mode 100644 index 0000000000..bacdcd2b24 --- /dev/null +++ b/data/2024/aaai/FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval @@ -0,0 +1 @@ +The goal of composed fashion image retrieval is to locate a target image based on a reference image and modified text. Recent methods utilize symmetric encoders (e.g., CLIP) pre-trained on large-scale non-fashion datasets. However, the input for this task exhibits an asymmetric nature, where the reference image contains rich content while the modified text is often brief. Therefore, methods employing symmetric encoders encounter a severe phenomenon: retrieval results dominated by reference images, leading to the oversight of modified text. We propose a Fashion Enhance-and-Refine Network (FashionERN) centered around two aspects: enhancing the text encoder and refining visual semantics. We introduce a Triple-branch Modifier Enhancement model, which injects relevant information from the reference image and aligns the modified text modality with the target image modality. Furthermore, we propose a Dual-guided Vision Refinement model that retains critical visual information through text-guided refinement and self-guided refinement processes. The combination of these two models significantly mitigates the reference dominance phenomenon, ensuring accurate fulfillment of modifier requirements. Comprehensive experiments demonstrate our approach's state-of-the-art performance on four commonly used datasets. \ No newline at end of file diff --git a/data/2024/aaai/Fast & Fair: A Collaborative Platform for Fair Division Applications b/data/2024/aaai/Fast & Fair: A Collaborative Platform for Fair Division Applications new file mode 100644 index 0000000000..f97583e3e6 --- /dev/null +++ b/data/2024/aaai/Fast & Fair: A Collaborative Platform for Fair Division Applications @@ -0,0 +1 @@ +Fair division, the study of how to fairly allocate resources among agents, has received substantial interest in the areas of artificial intelligence and multiagent systems. While there is an extensive theoretical literature on fair division by now, the developed algorithms are still mostly confined to research papers and inaccessible to the public. We attempt to bridge this gap by developing Fast & Fair, an open-source web application that hosts a number of fair allocation algorithms with user-friendly interfaces and explainable outcomes. In contrast to existing implementations, Fast & Fair is a collaborative platform that is open to community contributions and thereby facilitates the deployment of additional algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Fast Inter-frame Motion Prediction for Compressed Dynamic Point Cloud Attribute Enhancement b/data/2024/aaai/Fast Inter-frame Motion Prediction for Compressed Dynamic Point Cloud Attribute Enhancement new file mode 100644 index 0000000000..30e736ef74 --- /dev/null +++ b/data/2024/aaai/Fast Inter-frame Motion Prediction for Compressed Dynamic Point Cloud Attribute Enhancement @@ -0,0 +1 @@ +Recent years have witnessed the success of deep learning methods in quality enhancement of compressed point cloud. However, existing methods focus on geometry and attribute enhancement of single-frame point cloud. 
This paper proposes a novel compressed quality enhancement method for dynamic point clouds (DAE-MP). Specifically, we propose a fast inter-frame motion prediction module (IFMP) to explicitly estimate motion displacement and achieve inter-frame feature alignment. To maintain motion continuity between consecutive frames, we propose a motion consistency loss for supervised learning. Furthermore, a frequency component separation and fusion module is designed to extract rich frequency features adaptively. To the best of our knowledge, the proposed method is the first deep learning-based work to enhance the quality of compressed dynamic point clouds. Experimental results show that the proposed method can greatly improve the quality of compressed dynamic point clouds and provide a fast and efficient motion prediction plug-in for large-scale point clouds. For dynamic point cloud attributes with severe compression artifacts, our proposed DAE-MP method achieves a performance gain of up to 0.52 dB in PSNR. Moreover, the proposed IFMP module offers a degree of real-time processing capability for calculating the motion offset between dynamic point cloud frames. \ No newline at end of file diff --git a/data/2024/aaai/Fast Machine Unlearning without Retraining through Selective Synaptic Dampening b/data/2024/aaai/Fast Machine Unlearning without Retraining through Selective Synaptic Dampening new file mode 100644 index 0000000000..2deebcab0a --- /dev/null +++ b/data/2024/aaai/Fast Machine Unlearning without Retraining through Selective Synaptic Dampening @@ -0,0 +1 @@ +Machine unlearning, the ability for a machine learning model to forget, is becoming increasingly important to comply with data privacy regulations, as well as to remove harmful, manipulated, or outdated information. The key challenge lies in forgetting specific information while protecting model performance on the remaining data. While current state-of-the-art methods perform well, they typically require some level of retraining over the retained data in order to protect or restore model performance. This adds computational overhead and mandates that the training data remain available and accessible, which may not be feasible. In contrast, other methods employ a retrain-free paradigm; however, these approaches are prohibitively computationally expensive and do not perform on par with their retrain-based counterparts. We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data. First, SSD uses the Fisher information matrix of the training and forgetting data to select parameters that are disproportionately important to the forget set. Second, SSD induces forgetting by dampening these parameters proportional to their relative importance to the forget set with respect to the wider training data. We evaluate our method against several existing unlearning methods in a range of experiments using ResNet18 and Vision Transformer. Results show that the performance of SSD is competitive with retrain-based post hoc methods, demonstrating the viability of retrain-free post hoc unlearning approaches.
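The two steps described for Selective Synaptic Dampening (SSD) translate naturally into a short sketch: estimate diagonal Fisher information on the forget set and on the retained training data, then shrink the parameters whose forget-set importance dominates. The threshold thresh, the dampening constant lam, and the diagonal (squared-gradient) Fisher approximation below are assumptions for illustration, not the exact rules used in the paper.

import torch

def diag_fisher(model, loader, loss_fn):
    """Diagonal Fisher approximation: average squared gradient per parameter over a data loader."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def selective_dampening(model, fisher_forget, fisher_retain, thresh=10.0, lam=1.0, eps=1e-12):
    """Shrink parameters that are disproportionately important to the forget set."""
    with torch.no_grad():
        for n, p in model.named_parameters():
            ratio = fisher_forget[n] / (fisher_retain[n] + eps)
            mask = ratio > thresh                          # parameters to dampen
            scale = (lam / (ratio + eps)).clamp(max=1.0)   # stronger dampening for larger ratios
            p[mask] = p[mask] * scale[mask]

Because both steps are post hoc and gradient evaluations replace any retraining, the cost amounts to a small number of passes over the forget data and a sample of the training data.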
\ No newline at end of file diff --git a/data/2024/aaai/Fast and Controllable Post-training Sparsity: Learning Optimal Sparsity Allocation with Global Constraint in Minutes b/data/2024/aaai/Fast and Controllable Post-training Sparsity: Learning Optimal Sparsity Allocation with Global Constraint in Minutes new file mode 100644 index 0000000000..c90d5aea6a --- /dev/null +++ b/data/2024/aaai/Fast and Controllable Post-training Sparsity: Learning Optimal Sparsity Allocation with Global Constraint in Minutes @@ -0,0 +1 @@ +Neural network sparsity has attracted much research interest due to its similarity to biological schemes and its high energy efficiency. However, existing methods depend on lengthy training or fine-tuning, which prevents large-scale applications. Recently, some works focusing on post-training sparsity (PTS) have emerged. They avoid the high training cost but usually suffer from noticeable accuracy degradation because they neglect to choose a reasonable sparsity rate for each layer. Previous methods for finding sparsity rates mainly focus on the training-aware scenario and usually fail to converge stably under the PTS setting with limited data and a much lower training cost. In this paper, we propose a fast and controllable post-training sparsity (FCPTS) framework. By incorporating a differentiable bridge function and a controllable optimization objective, our method allows for rapid and accurate sparsity allocation learning in minutes, with the added assurance of convergence to a predetermined global sparsity rate. Equipped with these techniques, we can surpass the state-of-the-art methods by a large margin, e.g., over 30% improvement for ResNet-50 on ImageNet under a sparsity rate of 80%. Our plug-and-play code and supplementary materials are open-sourced at https://github.com/ModelTC/FCPTS. \ No newline at end of file diff --git a/data/2024/aaai/Fast and Knowledge-Free Deep Learning for General Game Playing (Student Abstract) b/data/2024/aaai/Fast and Knowledge-Free Deep Learning for General Game Playing (Student Abstract) new file mode 100644 index 0000000000..65a897de65 --- /dev/null +++ b/data/2024/aaai/Fast and Knowledge-Free Deep Learning for General Game Playing (Student Abstract) @@ -0,0 +1 @@ +We develop a method of adapting the AlphaZero model to General Game Playing (GGP) that focuses on faster model generation and requires less knowledge to be extracted from the game rules. The dataset generation uses MCTS playing instead of self-play; only the value network is used, and attention layers replace the convolutional ones. This allows us to abandon any assumptions about the action space and board topology. We implement the method within the Regular Boardgames GGP system and show that we can efficiently build models outperforming the UCT baseline for most games. \ No newline at end of file diff --git a/data/2024/aaai/Faster Stochastic Variance Reduction Methods for Compositional MiniMax Optimization b/data/2024/aaai/Faster Stochastic Variance Reduction Methods for Compositional MiniMax Optimization new file mode 100644 index 0000000000..b178a5ed76 --- /dev/null +++ b/data/2024/aaai/Faster Stochastic Variance Reduction Methods for Compositional MiniMax Optimization @@ -0,0 +1 @@ +This paper delves into the realm of stochastic optimization for compositional minimax optimization—a pivotal challenge across various machine learning domains, including deep AUC maximization and reinforcement learning policy evaluation.
Despite its significance, the problem of compositional minimax optimization is still under-explored. Adding to the complexity, current methods of compositional minimax optimization are plagued by sub-optimal complexities or heavy reliance on sizable batch sizes. To respond to these constraints, this paper introduces a novel method, called Nested STOchastic Recursive Momentum (NSTORM), which achieves the optimal sample complexity for obtaining a solution of the desired accuracy, matching existing minimax methods. We also demonstrate that NSTORM can achieve the same sample complexity under the Polyak-Lojasiewicz (PL) condition—an insightful extension of its capabilities. Yet, NSTORM encounters an issue with its requirement for low learning rates, potentially constraining its real-world applicability in machine learning. To overcome this hurdle, we present ADAptive NSTORM (ADA-NSTORM) with adaptive learning rates. We demonstrate that ADA-NSTORM achieves the same sample complexity, while the experimental results show that it is more effective in practice. All the derived complexities indicate that our proposed methods match the lower bounds of existing minimax optimization methods without requiring a large batch size at each iteration. Extensive experiments support the efficiency of our proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/FeatWalk: Enhancing Few-Shot Classification through Local View Leveraging b/data/2024/aaai/FeatWalk: Enhancing Few-Shot Classification through Local View Leveraging new file mode 100644 index 0000000000..25a1c11cb5 --- /dev/null +++ b/data/2024/aaai/FeatWalk: Enhancing Few-Shot Classification through Local View Leveraging @@ -0,0 +1 @@ +Few-shot learning is a challenging task due to the limited availability of training samples. Recent few-shot learning studies with meta-learning and simple transfer learning methods have achieved promising performance. However, the feature extractor pre-trained on the upstream dataset may neglect the extraction of certain features that could be crucial for downstream tasks. In this study, inspired by the process of human learning in few-shot tasks, where humans not only observe the whole image ('global view') but also attend to various local image regions ('local view') for a comprehensive understanding of detailed features, we propose a simple yet effective few-shot learning method called FeatWalk, which utilizes the complementary nature of global and local views, therefore providing an intuitive and effective solution to the problem of insufficient local information extraction from the pre-trained feature extractor. Our method can be easily and flexibly combined with various existing methods, further enhancing few-shot learning performance. Extensive experiments on multiple benchmark datasets consistently demonstrate the effectiveness and versatility of our method. The source code is available at https://github.com/exceefind/FeatWalk.
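To make the global-view/local-view intuition behind FeatWalk concrete, the sketch below scores a query image by averaging its prototype similarities over the full image and a few random local crops. The cropping scheme, the resize to a fixed input size, and the plain cosine-similarity fusion are illustrative assumptions, not the FeatWalk algorithm itself.

import torch
import torch.nn.functional as F

def local_views(img, n=5, frac=0.5):
    """Sample n random crops covering roughly `frac` of each spatial side of a CHW tensor."""
    _, h, w = img.shape
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    crops = []
    for _ in range(n):
        top = torch.randint(0, h - ch + 1, (1,)).item()
        left = torch.randint(0, w - cw + 1, (1,)).item()
        crops.append(img[:, top:top + ch, left:left + cw])
    return crops

def classify(img, extractor, prototypes, input_size=224):
    """Average cosine similarity to class prototypes over the global view and several local views."""
    views = [img] + local_views(img)
    views = [F.interpolate(v.unsqueeze(0), size=(input_size, input_size),
                           mode='bilinear', align_corners=False) for v in views]
    feats = torch.stack([F.normalize(extractor(v).squeeze(0), dim=-1) for v in views])
    sims = feats @ F.normalize(prototypes, dim=-1).T      # (n_views, n_classes)
    return sims.mean(dim=0).argmax().item()

Because the extractor stays frozen, view-level fusion of this kind can be layered on top of most pre-trained few-shot pipelines, which is consistent with the plug-in claim in the abstract.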
\ No newline at end of file diff --git a/data/2024/aaai/Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection b/data/2024/aaai/Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection new file mode 100644 index 0000000000..71c8646664 --- /dev/null +++ b/data/2024/aaai/Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection @@ -0,0 +1 @@ +Training neural networks that generalize well incurs large computational costs in many deep learning methods due to large-scale datasets and over-parameterized models. Despite the emergence of a number of coreset selection methods to reduce the computational costs, the problem of coreset distribution bias, i.e., the skewed distribution between the coreset and the entire dataset, has not been well studied. In this paper, we find that the closer the feature distribution of the coreset is to that of the entire dataset, the better the generalization performance of the coreset, particularly under extreme pruning. This motivates us to propose a simple yet effective method for coreset selection that alleviates the distribution bias between the coreset and the entire dataset, called feature distribution matching (FDMat). Unlike gradient-based methods, which select samples with larger gradient values or approximate the gradient values of the entire dataset, FDMat aims to select the coreset whose feature distribution is closest to that of the entire dataset. Specifically, FDMat casts coreset selection as an optimal transport problem from the coreset to the entire dataset in feature embedding spaces. Moreover, our method shows strong robustness due to the removal of samples far from the distribution, especially when the entire dataset contains noisy and class-imbalanced samples. Extensive experiments on multiple benchmarks show that FDMat improves the performance of coreset selection over existing coreset methods. The code is available at https://github.com/successhaha/FDMat. \ No newline at end of file diff --git a/data/2024/aaai/Feature Fusion from Head to Tail for Long-Tailed Visual Recognition b/data/2024/aaai/Feature Fusion from Head to Tail for Long-Tailed Visual Recognition new file mode 100644 index 0000000000..042c276e01 --- /dev/null +++ b/data/2024/aaai/Feature Fusion from Head to Tail for Long-Tailed Visual Recognition @@ -0,0 +1 @@ +The imbalanced distribution of long-tailed data presents a considerable challenge for deep learning models, as it causes them to prioritize the accurate classification of head classes but largely disregard tail classes. The biased decision boundary caused by inadequate semantic information in tail classes is one of the key factors contributing to their low recognition accuracy. To rectify this issue, we propose to augment tail classes by grafting the diverse semantic information from head classes, referred to as head-to-tail fusion (H2T). We replace a portion of the feature maps from tail classes with those belonging to head classes. These fused features substantially enhance the diversity of tail classes. Both theoretical analysis and practical experimentation demonstrate that H2T can contribute to a more optimized solution for the decision boundary. We seamlessly integrate H2T into the classifier adjustment stage, making it a plug-and-play module. Its simplicity and ease of implementation allow for smooth integration with existing long-tailed recognition methods, facilitating a further performance boost.
Extensive experiments on various long-tailed benchmarks demonstrate the effectiveness of the proposed H2T. The source code is available at https://github.com/Keke921/H2T. \ No newline at end of file diff --git a/data/2024/aaai/Feature Transportation Improves Graph Neural Networks b/data/2024/aaai/Feature Transportation Improves Graph Neural Networks new file mode 100644 index 0000000000..217836e8ee --- /dev/null +++ b/data/2024/aaai/Feature Transportation Improves Graph Neural Networks @@ -0,0 +1,3 @@ +Graph neural networks (GNNs) have shown remarkable success in learning representations for graph-structured data. However, GNNs still face challenges in modeling complex phenomena that involve feature transportation. In this paper, we propose a novel GNN architecture inspired by Advection-Diffusion-Reaction systems, called ADR-GNN. +Advection models feature transportation, while diffusion captures the local smoothing of features, and reaction represents the non-linear transformation between feature channels. We provide an analysis of the qualitative behavior of ADR-GNN, which shows the benefit of combining advection, diffusion, and reaction. +To demonstrate its efficacy, we evaluate ADR-GNN on real-world node classification and spatio-temporal datasets, and show that it improves or offers competitive performance compared to state-of-the-art networks. \ No newline at end of file diff --git a/data/2024/aaai/Feature Unlearning for Pre-trained GANs and VAEs b/data/2024/aaai/Feature Unlearning for Pre-trained GANs and VAEs new file mode 100644 index 0000000000..3a48c7d0ab --- /dev/null +++ b/data/2024/aaai/Feature Unlearning for Pre-trained GANs and VAEs @@ -0,0 +1 @@ +We tackle the problem of feature unlearning from pre-trained image generative models: GANs and VAEs. Unlike a common unlearning task where the unlearning target is a subset of the training set, we aim to unlearn a specific feature, such as hairstyle from facial images, from the pre-trained generative models. As the target feature is only present in a local region of an image, unlearning the entire image from the pre-trained model may result in losing other details in the remaining region of the image. To specify which features to unlearn, we collect randomly generated images that contain the target features. We then identify a latent representation corresponding to the target feature and use the representation to fine-tune the pre-trained model. Through experiments on the MNIST, CelebA, and FFHQ datasets, we show that target features are successfully removed while keeping the fidelity of the original models. Further experiments with an adversarial attack show that the unlearned model is more robust in the presence of malicious parties. \ No newline at end of file diff --git a/data/2024/aaai/FedCD: Federated Semi-Supervised Learning with Class Awareness Balance via Dual Teachers b/data/2024/aaai/FedCD: Federated Semi-Supervised Learning with Class Awareness Balance via Dual Teachers new file mode 100644 index 0000000000..5576f00f4b --- /dev/null +++ b/data/2024/aaai/FedCD: Federated Semi-Supervised Learning with Class Awareness Balance via Dual Teachers @@ -0,0 +1 @@ +Recent advancements in deep learning have greatly improved the efficiency of auxiliary medical diagnostics. However, concerns over patient privacy and data annotation costs restrict the viability of centralized training models. In response, federated semi-supervised learning has garnered substantial attention from medical institutions.
However, it faces challenges arising from knowledge discrepancies among local clients and class imbalance in non-independent and identically distributed data. Existing methods like class balance adaptation for addressing class imbalance often overlook low-confidence yet valuable rare samples in unlabeled data and may compromise client privacy. To address these issues, we propose a novel framework with class awareness balance and dual teacher distillation called FedCD. FedCD introduces a global-local framework to balance and purify global and local knowledge. Additionally, we introduce a novel class awareness balance module to effectively explore potential rare classes and encourage balanced learning in unlabeled clients. Importantly, our approach prioritizes privacy protection by only exchanging network parameters during communication. Experimental results on two medical datasets under various settings demonstrate the effectiveness of FedCD. The code is available at https://github.com/YunzZ-Liu/FedCD. \ No newline at end of file diff --git a/data/2024/aaai/FedCSL: A Scalable and Accurate Approach to Federated Causal Structure Learning b/data/2024/aaai/FedCSL: A Scalable and Accurate Approach to Federated Causal Structure Learning new file mode 100644 index 0000000000..49274b3dc4 --- /dev/null +++ b/data/2024/aaai/FedCSL: A Scalable and Accurate Approach to Federated Causal Structure Learning @@ -0,0 +1 @@ +As an emerging research direction, federated causal structure learning (CSL) aims at learning causal relationships from decentralized data across multiple clients while preserving data privacy. Existing federated CSL algorithms suffer from scalability and accuracy issues, since they require computationally expensive CSL algorithms to be executed at each client. Furthermore, in real-world scenarios, the number of samples held by each client varies significantly, and existing methods still assign equal weights to the learned structural information from each client, which severely harms the learning accuracy of those methods. To address these two limitations, we propose FedCSL, a scalable and accurate method for federated CSL. Specifically, FedCSL consists of two novel strategies: (1) a federated local-to-global learning strategy that enables FedCSL to scale to high-dimensional data for tackling the scalability issue, and (2) a novel weighted aggregation strategy that does not rely on any complex encryption techniques while preserving data privacy for tackling the accuracy issue. Extensive experiments on benchmark datasets, high-dimensional synthetic datasets and a real-world dataset verify the efficacy of the proposed FedCSL method. The source code is available at https://github.com/Xianjie-Guo/FedCSL. \ No newline at end of file diff --git a/data/2024/aaai/FedCompetitors: Harmonious Collaboration in Federated Learning with Competing Participants b/data/2024/aaai/FedCompetitors: Harmonious Collaboration in Federated Learning with Competing Participants new file mode 100644 index 0000000000..567090c9da --- /dev/null +++ b/data/2024/aaai/FedCompetitors: Harmonious Collaboration in Federated Learning with Competing Participants @@ -0,0 +1 @@ +Federated learning (FL) provides a privacy-preserving approach for collaborative training of machine learning models. Given the potential data heterogeneity, it is crucial to select appropriate collaborators for each FL participant (FL-PT) based on data complementarity. Recent studies have addressed this challenge. 
Similarly, it is imperative to consider the inter-individual relationships among FL-PTs, where some FL-PTs engage in competition. Although the FL literature has acknowledged the significance of this scenario, practical methods for establishing FL ecosystems remain largely unexplored. In this paper, we extend a principle from balance theory, namely “the friend of my enemy is my enemy”, to ensure the absence of conflicting interests within an FL ecosystem. The extended principle and the resulting problem are formulated via graph theory and integer linear programming. A polynomial-time algorithm is proposed to determine the collaborators of each FL-PT. The solution guarantees high scalability, allowing even competing FL-PTs to smoothly join the ecosystem without conflict of interest. The proposed framework jointly considers competition and data heterogeneity. Extensive experiments on real-world and synthetic data demonstrate its efficacy compared to five alternative approaches, and its ability to establish efficient collaboration networks among FL-PTs. \ No newline at end of file diff --git a/data/2024/aaai/FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning b/data/2024/aaai/FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning new file mode 100644 index 0000000000..6a0f3285e7 --- /dev/null +++ b/data/2024/aaai/FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning @@ -0,0 +1 @@ +Recently, foundation models have exhibited remarkable advancements in multi-modal learning. These models, equipped with millions (or billions) of parameters, typically require a substantial amount of data for finetuning. However, collecting and centralizing training data from diverse sectors becomes challenging due to distinct privacy regulations. Federated Learning (FL) emerges as a promising solution, enabling multiple clients to collaboratively train neural networks without centralizing their local data. To alleviate client computation burdens and communication overheads, previous works have adapted Parameter-efficient Finetuning (PEFT) methods for FL. In this way, only a small fraction of the model parameters are optimized and communicated during federated communications. Nevertheless, most previous works have focused on a single modality and neglected one common phenomenon, i.e., the presence of data heterogeneity across the clients. Therefore, in this work, we propose a finetuning framework tailored to heterogeneous multi-modal FL, called Federated Dual-Adapter Teacher (FedDAT). Specifically, our approach leverages a Dual-Adapter Teacher (DAT) to address data heterogeneity by regularizing the client local updates and applying Mutual Knowledge Distillation (MKD) for efficient knowledge transfer. FedDAT is the first approach that enables efficient distributed finetuning of foundation models for a variety of heterogeneous Vision-Language tasks. To demonstrate its effectiveness, we conduct extensive experiments on four multi-modality FL benchmarks with different types of data heterogeneity, where FedDAT substantially outperforms the existing centralized PEFT methods adapted for FL.
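The conflict-of-interest constraint behind FedCompetitors can be checked with a few lines of plain Python: close the competition relation under "the friend of my enemy is my enemy" and then verify that no collaboration group contains two (extended) competitors. The data structures and the fixpoint loop below are only an illustrative reading of that principle; the paper itself formulates and solves the problem with graph theory and integer linear programming.

from itertools import combinations

def extended_competitors(compete, collaborate):
    """Close the competition relation under 'the friend of my enemy is my enemy'."""
    comp = {frozenset(p) for p in compete}
    friends = {}
    for c, d in collaborate:
        friends.setdefault(c, set()).add(d)
        friends.setdefault(d, set()).add(c)
    changed = True
    while changed:
        changed = False
        for pair in list(comp):
            a, b = tuple(pair)
            for me, enemy in ((a, b), (b, a)):
                for friend in friends.get(enemy, ()):   # a friend of my enemy ...
                    new = frozenset((me, friend))       # ... is my enemy
                    if len(new) == 2 and new not in comp:
                        comp.add(new)
                        changed = True
    return comp

def conflict_free(groups, comp):
    """A proposed grouping of FL-PTs is valid if no group contains two (extended) competitors."""
    return all(frozenset((u, v)) not in comp
               for g in groups for u, v in combinations(g, 2))

A coordinator can run such a check on any candidate grouping; it is cheap compared with the integer program that actually searches for the collaborator sets.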
\ No newline at end of file diff --git a/data/2024/aaai/FedDiv: Collaborative Noise Filtering for Federated Learning with Noisy Labels b/data/2024/aaai/FedDiv: Collaborative Noise Filtering for Federated Learning with Noisy Labels new file mode 100644 index 0000000000..92b0a45444 --- /dev/null +++ b/data/2024/aaai/FedDiv: Collaborative Noise Filtering for Federated Learning with Noisy Labels @@ -0,0 +1 @@ +Federated Learning with Noisy Labels (F-LNL) aims at seeking an optimal server model via collaborative distributed learning by aggregating multiple client models trained with local noisy or clean samples. On the basis of a federated learning framework, recent advances primarily adopt label noise filtering to separate clean samples from noisy ones on each client, thereby mitigating the negative impact of label noise. However, these prior methods do not learn noise filters by exploiting knowledge across all clients, leading to sub-optimal and inferior noise filtering performance and thus damaging training stability. In this paper, we present FedDiv to tackle the challenges of F-LNL. Specifically, we propose a global noise filter called Federated Noise Filter for effectively identifying samples with noisy labels on every client, thereby raising stability during local training sessions. Without sacrificing data privacy, this is achieved by modeling the global distribution of label noise across all clients. Then, in an effort to make the global model achieve higher performance, we introduce a Predictive Consistency based Sampler to identify more credible local data for local model training, thus preventing noise memorization and further boosting the training stability. Extensive experiments on CIFAR-10, CIFAR-100, and Clothing1M demonstrate that FedDiv achieves superior performance over state-of-the-art F-LNL methods under different label noise settings for both IID and non-IID data partitions. Source code is publicly available at https://github.com/lijichang/FLNL-FedDiv. \ No newline at end of file diff --git a/data/2024/aaai/FedFixer: Mitigating Heterogeneous Label Noise in Federated Learning b/data/2024/aaai/FedFixer: Mitigating Heterogeneous Label Noise in Federated Learning new file mode 100644 index 0000000000..55e822ecd7 --- /dev/null +++ b/data/2024/aaai/FedFixer: Mitigating Heterogeneous Label Noise in Federated Learning @@ -0,0 +1 @@ +Federated Learning (FL) heavily depends on label quality for its performance. However, the label distribution among individual clients is always both noisy and heterogeneous. The high loss incurred by client-specific samples in heterogeneous label noise poses challenges for distinguishing between client-specific and noisy label samples, impacting the effectiveness of existing label noise learning approaches. To tackle this issue, we propose FedFixer, where the personalized model is introduced to cooperate with the global model to effectively select clean client-specific samples. In the dual models, updating the personalized model solely at a local level can lead to overfitting on noisy data due to limited samples, consequently affecting both the local and global models’ performance. To mitigate overfitting, we address this concern from two perspectives. Firstly, we employ a confidence regularizer to alleviate the impact of unconfident predictions caused by label noise. Secondly, a distance regularizer is implemented to constrain the disparity between the personalized and global models. 
We validate the effectiveness of FedFixer through extensive experiments on benchmark datasets. The results demonstrate that FedFixer can perform well in filtering noisy label samples on different clients, especially in highly heterogeneous label noise scenarios. \ No newline at end of file diff --git a/data/2024/aaai/FedGCR: Achieving Performance and Fairness for Federated Learning with Distinct Client Types via Group Customization and Reweighting b/data/2024/aaai/FedGCR: Achieving Performance and Fairness for Federated Learning with Distinct Client Types via Group Customization and Reweighting new file mode 100644 index 0000000000..e53dd95c44 --- /dev/null +++ b/data/2024/aaai/FedGCR: Achieving Performance and Fairness for Federated Learning with Distinct Client Types via Group Customization and Reweighting @@ -0,0 +1 @@ +To achieve better performance and greater fairness in Federated Learning (FL), much of the existing research has centered on individual clients, using domain adaptation techniques and redesigned aggregation schemes to counteract client data heterogeneity. However, an overlooked scenario exists where clients belong to distinctive groups, or, client types, in which groups of clients share similar characteristics such as device specifications or data patterns. Despite being common in group collaborations, this scenario has been overlooked in previous research, potentially leading to performance degradation and systemic biases against certain client types. To bridge this gap, we introduce Federated learning with Group Customization and Reweighting (FedGCR). FedGCR enhances both performance and fairness for FL with Distinct Client Types, consisting of a Federated Group Customization (FedGC) model to provide customization via a novel prompt tuning technique to mitigate the data disparity across different client-types, and a Federated Group Reweighting (FedGR) aggregation scheme to ensure uniform and unbiased performances between clients and between client types by a novel reweighting approach. Extensive experiment comparisons with prior FL methods in domain adaptation and fairness demonstrate the superiority of FedGCR in all metrics, including the overall accuracy and performance uniformity in both the group and the individual level. FedGCR achieves 82.74% accuracy and 12.26(↓) in performance uniformity on the Digit-Five dataset and 81.88% and 14.88%(↓) on DomainNet with a domain imbalance factor of 10, which significantly outperforms the state-of-the-art. Code is available at https://github.com/celinezheng/fedgcr. \ No newline at end of file diff --git a/data/2024/aaai/FedLF: Layer-Wise Fair Federated Learning b/data/2024/aaai/FedLF: Layer-Wise Fair Federated Learning new file mode 100644 index 0000000000..a51b520588 --- /dev/null +++ b/data/2024/aaai/FedLF: Layer-Wise Fair Federated Learning @@ -0,0 +1 @@ +Fairness has become an important concern in Federated Learning (FL). An unfair model that performs well for some clients while performing poorly for others can reduce the willingness of clients to participate. In this work, we identify a direct cause of unfairness in FL - the use of an unfair direction to update the global model, which favors some clients while conflicting with other clients’ gradients at the model and layer levels. To address these issues, we propose a layer-wise fair Federated Learning algorithm (FedLF). Firstly, we formulate a multi-objective optimization problem with an effective fair-driven objective for FL. 
A layer-wise fair direction is then calculated to mitigate the model- and layer-level gradient conflicts and reduce the improvement bias. We further provide a theoretical analysis of how FedLF can improve fairness and guarantee convergence. Extensive experiments on different learning tasks and models demonstrate that FedLF outperforms the SOTA FL algorithms in terms of accuracy and fairness. The source code is available at https://github.com/zibinpan/FedLF. \ No newline at end of file diff --git a/data/2024/aaai/FedLPS: Heterogeneous Federated Learning for Multiple Tasks with Local Parameter Sharing b/data/2024/aaai/FedLPS: Heterogeneous Federated Learning for Multiple Tasks with Local Parameter Sharing new file mode 100644 index 0000000000..3d4da5ba67 --- /dev/null +++ b/data/2024/aaai/FedLPS: Heterogeneous Federated Learning for Multiple Tasks with Local Parameter Sharing @@ -0,0 +1 @@ +Federated Learning (FL) has emerged as a promising solution in Edge Computing (EC) environments to process the proliferation of data generated by edge devices. By collaboratively optimizing the global machine learning models on distributed edge devices, FL circumvents the need for transmitting raw data and enhances user privacy. Despite practical successes, FL still confronts significant challenges including constrained edge device resources, the deployment of multiple tasks, and data heterogeneity. However, existing studies focus on mitigating the FL training costs of each single task while neglecting the resource consumption across multiple tasks in heterogeneous FL scenarios. In this paper, we propose Heterogeneous Federated Learning with Local Parameter Sharing (FedLPS) to fill this gap. FedLPS leverages principles from transfer learning to facilitate the deployment of multiple tasks on a single device by dividing the local model into a shareable encoder and task-specific encoders. To further reduce resource consumption, a channel-wise model pruning algorithm that shrinks the footprint of local models while accounting for both data and system heterogeneity is employed in FedLPS. Additionally, a novel heterogeneous model aggregation algorithm is proposed to aggregate the heterogeneous predictors in FedLPS. We implemented the proposed FedLPS on a real FL platform and compared it with state-of-the-art (SOTA) FL frameworks. The experimental results on five popular datasets and two modern DNN models illustrate that the proposed FedLPS significantly outperforms the SOTA FL frameworks by up to 4.88% and reduces the computational resource consumption by 21.3%. Our code is available at: https://github.com/jyzgh/FedLPS. \ No newline at end of file diff --git a/data/2024/aaai/FedMut: Generalized Federated Learning via Stochastic Mutation b/data/2024/aaai/FedMut: Generalized Federated Learning via Stochastic Mutation new file mode 100644 index 0000000000..7c03e3a976 --- /dev/null +++ b/data/2024/aaai/FedMut: Generalized Federated Learning via Stochastic Mutation @@ -0,0 +1 @@ +Although Federated Learning (FL) enables collaborative model training without sharing the raw data of clients, it encounters low-performance problems caused by various heterogeneous scenarios. Due to the limitation of dispatching the same global model to clients for local training, traditional Federated Average (FedAvg)-based FL models face the problem of easily getting stuck in a sharp solution, which results in training a low-performance global model.
To address this problem, this paper presents a novel FL approach named FedMut, which mutates the global model according to the gradient change to generate several intermediate models for the next round of training. Each intermediate model will be dispatched to a client for local training. Eventually, the global model converges to a flat region within the range of the mutated models and generalizes better than the global model trained by FedAvg. Experimental results on well-known datasets demonstrate the effectiveness of our FedMut approach in various data heterogeneity scenarios. \ No newline at end of file diff --git a/data/2024/aaai/FedNS: A Fast Sketching Newton-Type Algorithm for Federated Learning b/data/2024/aaai/FedNS: A Fast Sketching Newton-Type Algorithm for Federated Learning new file mode 100644 index 0000000000..54f48c3bde --- /dev/null +++ b/data/2024/aaai/FedNS: A Fast Sketching Newton-Type Algorithm for Federated Learning @@ -0,0 +1 @@ +Recent Newton-type federated learning algorithms have demonstrated linear convergence with respect to the communication rounds. However, communicating Hessian matrices is often infeasible due to their quadratic communication complexity. In this paper, we introduce a novel approach to tackle this issue while still achieving fast convergence rates. Our proposed method, named Federated Newton Sketch (FedNS), approximates the centralized Newton's method by communicating the sketched square-root Hessian instead of the exact Hessian. To enhance communication efficiency, we reduce the sketch size to match the effective dimension of the Hessian matrix. We provide a convergence analysis based on statistical learning for the federated Newton sketch approaches. Specifically, our approaches reach super-linear convergence rates w.r.t. the communication rounds for the first time. We validate the effectiveness of our algorithms through various experiments, which coincide with our theoretical findings. \ No newline at end of file diff --git a/data/2024/aaai/FedST: Federated Style Transfer Learning for Non-IID Image Segmentation b/data/2024/aaai/FedST: Federated Style Transfer Learning for Non-IID Image Segmentation new file mode 100644 index 0000000000..62fdabed77 --- /dev/null +++ b/data/2024/aaai/FedST: Federated Style Transfer Learning for Non-IID Image Segmentation @@ -0,0 +1 @@ +Federated learning collaboratively trains machine learning models among different clients while preserving data privacy, and it has become the mainstream approach to breaking data silos. However, the non-independent and identically distributed (i.e., Non-IID) characteristic of different image domains among different clients reduces the benefits of federated learning and has become a bottleneck problem restricting the accuracy and generalization of federated models. In this work, we propose a novel federated image segmentation method based on style transfer, FedST, which uses a denoising diffusion probabilistic model to achieve feature disentanglement and image synthesis of cross-domain image data between multiple clients. Thus it can share style features among clients while protecting the structure features of image data, which effectively alleviates the influence of the Non-IID phenomenon. Experiments show that our method achieves superior segmentation performance compared to state-of-the-art methods on four different Non-IID datasets in both objective and subjective assessments. The code is available at https://github.com/YoferChen/FedST.
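The mutation step in FedMut can be pictured with a short sketch: perturb the aggregated global model along the direction of the most recent round's update so that each client trains a slightly different intermediate model. The sign-alternating perturbation and the scale alpha below are simplifying assumptions for illustration; the paper derives its mutations from the gradient change itself.

import copy
import torch

def mutate_global_model(global_model, prev_global_model, num_clients, alpha=0.5):
    """Return one perturbed copy of the global model per client for the next training round."""
    delta = {n: p.detach() - q.detach()
             for (n, p), (_, q) in zip(global_model.named_parameters(),
                                       prev_global_model.named_parameters())}
    mutated = []
    for k in range(num_clients):
        sign = 1.0 if k % 2 == 0 else -1.0      # alternate directions around the global model
        m = copy.deepcopy(global_model)
        with torch.no_grad():
            for n, p in m.named_parameters():
                p.add_(sign * alpha * delta[n])
        mutated.append(m)
    return mutated

Averaging the locally trained mutants then pulls the next global model toward a region that is flat in every perturbed direction, which is the intuition behind the improved generalization reported above.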
\ No newline at end of file diff --git a/data/2024/aaai/FedTGP: Trainable Global Prototypes with Adaptive-Margin-Enhanced Contrastive Learning for Data and Model Heterogeneity in Federated Learning b/data/2024/aaai/FedTGP: Trainable Global Prototypes with Adaptive-Margin-Enhanced Contrastive Learning for Data and Model Heterogeneity in Federated Learning new file mode 100644 index 0000000000..962db0623f --- /dev/null +++ b/data/2024/aaai/FedTGP: Trainable Global Prototypes with Adaptive-Margin-Enhanced Contrastive Learning for Data and Model Heterogeneity in Federated Learning @@ -0,0 +1 @@ +Recently, Heterogeneous Federated Learning (HtFL) has attracted attention due to its ability to support heterogeneous models and data. To reduce the high communication cost of transmitting model parameters, a major challenge in HtFL, prototype-based HtFL methods are proposed to solely share class representatives, a.k.a. prototypes, among heterogeneous clients while maintaining the privacy of clients’ models. However, these prototypes are naively aggregated into global prototypes on the server using weighted averaging, resulting in suboptimal global knowledge that negatively impacts the performance of clients. To overcome this challenge, we introduce a novel HtFL approach called FedTGP, which leverages our Adaptive-margin-enhanced Contrastive Learning (ACL) to learn Trainable Global Prototypes (TGP) on the server. By incorporating ACL, our approach enhances prototype separability while preserving semantic meaning. Extensive experiments with twelve heterogeneous models demonstrate that our FedTGP surpasses state-of-the-art methods by up to 9.08% in accuracy while maintaining the communication and privacy advantages of prototype-based HtFL. Our code is available at https://github.com/TsingZ0/FedTGP. \ No newline at end of file diff --git a/data/2024/aaai/Federated Adaptive Prompt Tuning for Multi-Domain Collaborative Learning b/data/2024/aaai/Federated Adaptive Prompt Tuning for Multi-Domain Collaborative Learning new file mode 100644 index 0000000000..f59ed572a3 --- /dev/null +++ b/data/2024/aaai/Federated Adaptive Prompt Tuning for Multi-Domain Collaborative Learning @@ -0,0 +1 @@ +Federated learning (FL) enables multiple clients to collaboratively train a global model without disclosing their data. Previous research often requires training the complete set of model parameters. However, the emergence of powerful pre-trained models makes it possible to achieve higher performance with fewer learnable parameters in FL. In this paper, we propose a federated adaptive prompt tuning algorithm, FedAPT, for multi-domain collaborative image classification with powerful foundation models, like CLIP. Compared with direct federated prompt tuning, our core idea is to adaptively unlock specific domain knowledge for each test sample in order to provide it with a personalized prompt. To implement this idea, we design an adaptive prompt tuning module, which consists of a meta prompt, an adaptive network, and some keys. The server randomly generates a set of keys and assigns a unique key to each client. Then all clients cooperatively train the global adaptive network and meta prompt with the local datasets and the frozen keys. Ultimately, the global aggregation model can assign a personalized prompt to CLIP based on the domain features of each test sample. We perform extensive experiments on two multi-domain image classification datasets across two different settings -- supervised and unsupervised.
The results show that FedAPT can achieve better performance with less than 10% of the number of parameters of the fully trained model, and the global model can perform well in diverse client domains simultaneously. \ No newline at end of file diff --git a/data/2024/aaai/Federated Causality Learning with Explainable Adaptive Optimization b/data/2024/aaai/Federated Causality Learning with Explainable Adaptive Optimization new file mode 100644 index 0000000000..2392e42ead --- /dev/null +++ b/data/2024/aaai/Federated Causality Learning with Explainable Adaptive Optimization @@ -0,0 +1 @@ +Discovering the causality from observational data is a crucial task in various scientific domains. With increasing awareness of privacy, data are not allowed to be exposed, and it is very hard to learn causal graphs from dispersed data, since these data may have different distributions. In this paper, we propose a federated causal discovery strategy (FedCausal) to learn the unified global causal graph from decentralized heterogeneous data. We design a global optimization formula to naturally aggregate the causal graphs from client data and constrain the acyclicity of the global graph without exposing local data. Unlike other federated causal learning algorithms, FedCausal unifies the local and global optimizations into a complete directed acyclic graph (DAG) learning process with a flexible optimization objective. We prove that this optimization objective has a high interpretability and can adaptively handle homogeneous and heterogeneous data. Experimental results on synthetic and real datasets show that FedCausal can effectively deal with non-independently and identically distributed (non-iid) data and has a superior performance. \ No newline at end of file diff --git a/data/2024/aaai/Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users b/data/2024/aaai/Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users new file mode 100644 index 0000000000..94154169e1 --- /dev/null +++ b/data/2024/aaai/Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users @@ -0,0 +1 @@ +We study the problem of federated contextual combinatorial cascading bandits, where agents collaborate under the coordination of a central server to provide tailored recommendations to users. Existing works consider either a synchronous framework, necessitating full agent participation and global synchronization, or assume user homogeneity with identical behaviors. We overcome these limitations by considering (1) federated agents operating in an asynchronous communication paradigm, where no mandatory synchronization is required and all agents communicate independently with the server, (2) heterogeneous user behaviors, where users can be stratified into latent user clusters, each exhibiting distinct preferences. For this setting, we propose a UCB-type algorithm with delicate communication protocols. Through theoretical analysis, we give sub-linear regret bounds on par with those achieved in the synchronous framework, while incurring only logarithmic communication costs. Empirical evaluation on synthetic and real-world datasets validates our algorithm's superior performance in terms of regrets and communication costs. 
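To give a concrete sense of what a "UCB-type" recommendation score looks like in this setting, the sketch below ranks items by a LinUCB-style optimistic estimate and returns the top-K cascade. The statistics A and b, the exploration weight alpha, and the plain ridge-regression update are generic assumptions; the paper's algorithm additionally handles asynchronous communication and latent user clusters, which this sketch omits.

import numpy as np

def recommend_cascade(item_features, A, b, alpha=1.0, K=5):
    """Rank items by theta^T x + alpha * sqrt(x^T A^{-1} x) and return the indices of the top-K."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    scores = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in item_features]
    return list(np.argsort(scores)[::-1][:K])

def update(A, b, x, reward):
    """Ridge-regression style update after observing feedback on the examined item with features x."""
    return A + np.outer(x, x), b + reward * x

In a federated variant, each agent would maintain such statistics locally and only occasionally synchronize them through the server, in line with the logarithmic communication cost mentioned above.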
\ No newline at end of file diff --git a/data/2024/aaai/Federated Graph Learning under Domain Shift with Generalizable Prototypes b/data/2024/aaai/Federated Graph Learning under Domain Shift with Generalizable Prototypes new file mode 100644 index 0000000000..0b7dd5ab2d --- /dev/null +++ b/data/2024/aaai/Federated Graph Learning under Domain Shift with Generalizable Prototypes @@ -0,0 +1 @@ +Federated Graph Learning is a privacy-preserving collaborative approach for training a shared model on graph-structured data in a distributed environment. However, in real-world scenarios, the client graph data usually originate from diverse domains, which unavoidably hinders the generalization performance of the final global model. To address this challenge, we make the first attempt to investigate this scenario by learning a well-generalizable model. In order to improve the performance of the global model from different perspectives, we propose a novel framework called Federated Graph Learning with Generalizable Prototypes (FGGP). It decouples the global model into two levels and bridges them via prototypes. These prototypes, which are semantic centers derived from the feature extractor, can provide valuable classification information. At the classification model level, we eschew traditional classifiers and instead leverage clustered prototypes to capture rich domain information and enhance the discriminative capability of the classes, improving the performance of multi-domain predictions. Furthermore, at the feature extractor level, we go beyond traditional approaches by implicitly injecting distinct global knowledge and employing contrastive learning to obtain more powerful prototypes while enhancing the generalization ability of the feature extractor. Experimental results on various datasets are presented to validate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Federated Label-Noise Learning with Local Diversity Product Regularization b/data/2024/aaai/Federated Label-Noise Learning with Local Diversity Product Regularization new file mode 100644 index 0000000000..2804ed5526 --- /dev/null +++ b/data/2024/aaai/Federated Label-Noise Learning with Local Diversity Product Regularization @@ -0,0 +1,10 @@ +Training data in federated learning (FL) frameworks can have label noise, since they must be stored and annotated on clients' devices. +If trained over such corrupted data, the models learn incorrect knowledge from the label noise, which severely degrades their performance. +Although several FL schemes are designed to combat label noise, they suffer performance degradation when the clients' devices only have limited local training samples. +To this end, a new scheme called federated label-noise learning (FedLNL) is developed in this paper. +The key problem of FedLNL is how to estimate a noise transition matrix (NTM) accurately in the case of limited local training samples. +If a gradient-based update method is used to update the local NTM on each client's device, it can generate excessively large gradients for the local NTM, causing a high estimation error of the local NTM. +To tackle this issue, an alternating update method for the local NTM and the local classifier is designed in FedLNL, where the local NTM is updated by a Bayesian inference-based update method. +Such an alternating update method makes the loss function of existing NTM-based schemes not applicable to FedLNL.
+To enable federated optimization of FedLNL, a new regularizer on the parameters of the classifier called local diversity product regularizer is designed for the loss function of FedLNL. +The results show that FedLNL improves the test accuracy of a trained model by up to 25.98%, compared with the state-of-the-art FL schemes that tackle label-noise issues. \ No newline at end of file diff --git a/data/2024/aaai/Federated Modality-Specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation b/data/2024/aaai/Federated Modality-Specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation new file mode 100644 index 0000000000..257d402014 --- /dev/null +++ b/data/2024/aaai/Federated Modality-Specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation @@ -0,0 +1 @@ +Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, it is not uncommon that some FL participants only possess a subset of the complete imaging modalities, posing inter-modal heterogeneity as a challenge to effectively training a global model on all participants’ data. In addition, each participant would expect to obtain a personalized model tailored for its local data characteristics from the FL in such a scenario. In this work, we propose a new FL framework with federated modality-specific encoders and multimodal anchors (FedMEMA) to simultaneously address the two concurrent issues. Above all, FedMEMA employs an exclusive encoder for each modality to account for the inter-modal heterogeneity in the first place. In the meantime, while the encoders are shared by the participants, the decoders are personalized to meet individual needs. Specifically, a server with full-modal data employs a fusion decoder to aggregate and fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation reversely. Meanwhile, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the encoder parameters. On the other end, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up the information loss due to absent modalities while adapting the representations of present ones. FedMEMA is validated on the BraTS 2020 benchmark for multimodal brain tumor segmentation. Results show that it outperforms various up-to-date methods for multimodal and personalized FL and that its novel designs are effective. Our code is available. \ No newline at end of file diff --git a/data/2024/aaai/Federated Partial Label Learning with Local-Adaptive Augmentation and Regularization b/data/2024/aaai/Federated Partial Label Learning with Local-Adaptive Augmentation and Regularization new file mode 100644 index 0000000000..3df96f653d --- /dev/null +++ b/data/2024/aaai/Federated Partial Label Learning with Local-Adaptive Augmentation and Regularization @@ -0,0 +1 @@ +Partial label learning (PLL) expands the applicability of supervised machine learning models by enabling effective learning from weakly annotated overcomplete labels. Existing PLL methods however focus on the standard centralized learning scenarios. 
In this paper, we expand PLL into the distributed computation setting by formalizing a new learning scenario named as federated partial label learning (FedPLL), where the training data with partial labels are distributed across multiple local clients with privacy constraints. To address this challenging problem, we propose a novel Federated PLL method with Local-Adaptive Augmentation and Regularization (FedPLL-LAAR). In addition to alleviating the partial label noise with moving-average label disambiguation, the proposed method performs MixUp-based local-adaptive data augmentation to mitigate the challenge posed by insufficient and imprecisely annotated local data, and dynamically incorporates the guidance of global model to minimize client drift through adaptive gradient alignment regularization between the global and local models. Extensive experiments conducted on multiple datasets under the FedPLL setting demonstrate the effectiveness of the proposed FedPLL-LAAR method for federated partial label learning. \ No newline at end of file diff --git a/data/2024/aaai/Federated X-armed Bandit b/data/2024/aaai/Federated X-armed Bandit new file mode 100644 index 0000000000..cbaf168c0e --- /dev/null +++ b/data/2024/aaai/Federated X-armed Bandit @@ -0,0 +1 @@ +This work establishes the first framework of federated X-armed bandit, where different clients face heterogeneous local objective functions defined on the same domain and are required to collaboratively figure out the global optimum. We propose the first federated algorithm for such problems, named Fed-PNE. By utilizing the topological structure of the global objective inside the hierarchical partitioning and the weak smoothness property, our algorithm achieves sublinear cumulative regret with respect to both the number of clients and the evaluation budget. Meanwhile, it only requires logarithmic communications between the central server and clients, protecting the client privacy. Experimental results on synthetic functions and real datasets validate the advantages of Fed-PNE over various centralized and federated baseline algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Few Shot Part Segmentation Reveals Compositional Logic for Industrial Anomaly Detection b/data/2024/aaai/Few Shot Part Segmentation Reveals Compositional Logic for Industrial Anomaly Detection new file mode 100644 index 0000000000..9220babf05 --- /dev/null +++ b/data/2024/aaai/Few Shot Part Segmentation Reveals Compositional Logic for Industrial Anomaly Detection @@ -0,0 +1 @@ +Logical anomalies (LA) refer to data violating underlying logical constraints e.g., the quantity, arrangement, or composition of components within an image. Detecting accurately such anomalies requires models to reason about various component types through segmentation. However, curation of pixel-level annotations for semantic segmentation is both time-consuming and expensive. Although there are some prior few-shot or unsupervised co-part segmentation algorithms, they often fail on images with industrial object. These images have components with similar textures and shapes, and a precise differentiation proves challenging. In this study, we introduce a novel component segmentation model for LA detection that leverages a few labeled samples and unlabeled images sharing logical constraints. To ensure consistent segmentation across unlabeled images, we employ a histogram matching loss in conjunction with an entropy loss. 
As segmentation predictions play a crucial role, we propose to enhance both local and global sample validity detection by capturing key aspects from visual semantics via three memory banks: class histograms, component composition embeddings and patch-level representations. For effective LA detection, we propose an adaptive scaling strategy to standardize anomaly scores from different memory banks in inference. Extensive experiments on the public benchmark MVTec LOCO AD reveal our method achieves 98.1% AUROC in LA detection vs. 89.6% from competing methods. \ No newline at end of file diff --git a/data/2024/aaai/Few-Shot Learning from Augmented Label-Uncertain Queries in Bongard-HOI b/data/2024/aaai/Few-Shot Learning from Augmented Label-Uncertain Queries in Bongard-HOI new file mode 100644 index 0000000000..e462607fbc --- /dev/null +++ b/data/2024/aaai/Few-Shot Learning from Augmented Label-Uncertain Queries in Bongard-HOI @@ -0,0 +1 @@ +Detecting human-object interactions (HOI) in a few-shot setting remains a challenge. Existing meta-learning methods struggle to extract representative features for classification due to the limited data, while existing few-shot HOI models rely on HOI text labels for classification. Moreover, some query images may display visual similarity to those outside their class, such as similar backgrounds between different HOI classes. This makes learning more challenging, especially with limited samples. Bongard-HOI epitomizes this HOI few-shot problem, making it the benchmark we focus on in this paper. In our proposed method, we introduce novel label-uncertain query augmentation techniques to enhance the diversity of the query inputs, aiming to distinguish the positive HOI class from the negative ones. As these augmented inputs may or may not have the same class label as the original inputs, their class label is unknown. Those belonging to a different class become hard samples due to their visual similarity to the original ones. Additionally, we introduce a novel pseudo-label generation technique that enables a mean teacher model to learn from the augmented label-uncertain inputs. We propose to augment the negative support set for the student model to enrich the semantic information, fostering diversity that challenges and enhances the student’s learning. Experimental results demonstrate that our method sets a new state-of-the-art (SOTA) performance by achieving 68.74% accuracy on the Bongard-HOI benchmark, a significant improvement over the existing SOTA of 66.59%. In our evaluation on HICO-FS, a more general few-shot recognition dataset, our method achieves 73.27% accuracy, outperforming the previous SOTA of 71.20% in the 5- way 5-shot task. \ No newline at end of file diff --git a/data/2024/aaai/Few-Shot Learning via Repurposing Ensemble of Black-Box Models b/data/2024/aaai/Few-Shot Learning via Repurposing Ensemble of Black-Box Models new file mode 100644 index 0000000000..d7bd6e9e9a --- /dev/null +++ b/data/2024/aaai/Few-Shot Learning via Repurposing Ensemble of Black-Box Models @@ -0,0 +1 @@ +This paper investigates the problem of exploiting existing solution models of previous tasks to address a related target task with limited training data. Existing approaches addressing this problem often require access to the internal parameterization of the existing solution models and possibly their training data, which is not possible in many practical settings. 
To relax this requirement, we approach this problem from a new perspective of black-box re-purposing, which augments the target inputs and leverages their corresponding outputs generated by existing black-box APIs into a feature ensemble. We hypothesize that such a feature ensemble can be learned to incorporate and encode relevant black-box knowledge into the feature representation of target data, which will compensate for their scarcity. This hypothesis is confirmed via the reported successes of our proposed black-box ensemble in solving multiple few-shot learning tasks derived from various benchmark datasets. All reported results show consistently that the set of heterogeneous black-box solutions of previous tasks can indeed be reused and combined effectively to solve a reasonably related target task without requiring access to a large training dataset. This is the first step towards enabling new possibilities to further supplement existing techniques in transfer or meta learning with black-box knowledge. \ No newline at end of file diff --git a/data/2024/aaai/Few-Shot Neural Radiance Fields under Unconstrained Illumination b/data/2024/aaai/Few-Shot Neural Radiance Fields under Unconstrained Illumination new file mode 100644 index 0000000000..4fb1f2ffca --- /dev/null +++ b/data/2024/aaai/Few-Shot Neural Radiance Fields under Unconstrained Illumination @@ -0,0 +1 @@ +In this paper, we introduce a new challenge for synthesizing novel view images in practical environments with limited input multi-view images and varying lighting conditions. Neural radiance fields (NeRF), one of the pioneering works for this task, demand an extensive set of multi-view images taken under constrained illumination, which is often unattainable in real-world settings. While some previous works have managed to synthesize novel views given images with different illumination, their performance still relies on a substantial number of input multi-view images. To address this problem, we suggest ExtremeNeRF, which utilizes multi-view albedo consistency, supported by geometric alignment. Specifically, we extract intrinsic image components that should be illumination-invariant across different views, enabling direct appearance comparison between the input and novel view under unconstrained illumination. We offer thorough experimental results for task evaluation, employing the newly created NeRF Extreme benchmark, the first in-the-wild benchmark for novel view synthesis under multiple viewing directions and varying illuminations. \ No newline at end of file diff --git a/data/2024/aaai/Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language b/data/2024/aaai/Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language new file mode 100644 index 0000000000..1d9fcfc03e --- /dev/null +++ b/data/2024/aaai/Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language @@ -0,0 +1,2 @@ +Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since the untrimmed video is overlong, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions with the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours.
Since the video is down-sampled into fixed-length clips, some query-related frames may be filtered out, which blurs the specific boundary of the target moment and takes adjacent irrelevant frames as new boundaries, easily leading to cross-modal misalignment and introducing both boundary bias and reasoning bias. To this end, in this paper, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. Besides, our proposed SpotVMR can serve as a plug-and-play module, which achieves efficiency for state-of-the-art VMR methods while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search, conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search for the query-relevant moment. Also, a distillation loss is utilized to address the optimization issues arising from end-to-end joint training of the clip selector and VMR model. +Extensive experiments on three challenging datasets demonstrate its effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Find the Lady: Permutation and Re-synchronization of Deep Neural Networks b/data/2024/aaai/Find the Lady: Permutation and Re-synchronization of Deep Neural Networks new file mode 100644 index 0000000000..2cbd6bc175 --- /dev/null +++ b/data/2024/aaai/Find the Lady: Permutation and Re-synchronization of Deep Neural Networks @@ -0,0 +1,2 @@ +Deep neural networks are characterized by multiple symmetrical, equi-loss solutions that are redundant. Thus, the order of neurons in a layer and feature maps can be given arbitrary permutations, without affecting (or minimally affecting) their output. If we shuffle these neurons, or apply some perturbations to them (like fine-tuning), can we put them back in the original order, i.e., re-synchronize? Is there a possible corruption threat? Answering these questions is important for applications like neural network white-box watermarking for ownership tracking and integrity verification. +We advance a method to re-synchronize the order of permuted neurons. Our method is also effective if neurons are further altered by parameter pruning, quantization, and fine-tuning, showing robustness to integrity attacks. Additionally, we provide theoretical and practical evidence for the usual means to corrupt the integrity of the model, resulting in a solution to counter it. We test our approach on popular computer vision datasets and models, and we illustrate the threat and our countermeasure on a popular white-box watermarking method. \ No newline at end of file diff --git a/data/2024/aaai/Finding Visual Saliency in Continuous Spike Stream b/data/2024/aaai/Finding Visual Saliency in Continuous Spike Stream new file mode 100644 index 0000000000..7533d335a3 --- /dev/null +++ b/data/2024/aaai/Finding Visual Saliency in Continuous Spike Stream @@ -0,0 +1 @@ +As a bio-inspired vision sensor, the spike camera emulates the operational principles of the fovea, a compact retinal region, by employing spike discharges to encode the accumulation of per-pixel luminance intensity. Leveraging its high temporal resolution and bio-inspired neuromorphic design, the spike camera holds significant promise for advancing computer vision applications. Saliency detection mimics the behavior of human beings and captures the most salient region of a scene.
In this paper, we investigate visual saliency in the continuous spike stream for the first time. To effectively process the binary spike stream, we propose a Recurrent Spiking Transformer (RST) framework, which is based on a full spiking neural network. Our framework enables the extraction of spatio-temporal features from the continuous spatio-temporal spike stream while maintaining low power consumption. To facilitate the training and validation of our proposed model, we build a comprehensive real-world spike-based visual saliency dataset, enriched with numerous lighting conditions. Extensive experiments demonstrate the superior performance of our Recurrent Spiking Transformer framework in comparison to other spiking neural network-based methods. Our framework exhibits a substantial margin of improvement in capturing and highlighting visual saliency in the spike stream, which not only provides a new perspective for spike-based saliency segmentation but also shows a new paradigm for full SNN-based transformer models. The code and dataset are available at https://github.com/BIT-Vision/SVS. \ No newline at end of file diff --git "a/data/2024/aaai/Finding \316\265 and \316\264 of Traditional Disclosure Control Systems" "b/data/2024/aaai/Finding \316\265 and \316\264 of Traditional Disclosure Control Systems" new file mode 100644 index 0000000000..83a62a489e --- /dev/null +++ "b/data/2024/aaai/Finding \316\265 and \316\264 of Traditional Disclosure Control Systems" @@ -0,0 +1 @@ +This paper analyzes the privacy of traditional Statistical Disclosure Control (SDC) systems under a differential privacy interpretation. SDCs, such as cell suppression and swapping, promise to safeguard the confidentiality of data and are routinely adopted in data analyses with profound societal and economic impacts. Through a formal analysis and empirical evaluation of demographic data from real households in the U.S., the paper shows that widely adopted SDC systems not only induce vastly larger privacy losses than classical differential privacy mechanisms, but may also come at a greater cost to accuracy and fairness. \ No newline at end of file diff --git a/data/2024/aaai/Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction b/data/2024/aaai/Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction new file mode 100644 index 0000000000..0bd09e34c6 --- /dev/null +++ b/data/2024/aaai/Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction @@ -0,0 +1,2 @@ +Pixel-aligned implicit models, such as PIFu, PIFuHD, and ICON, are used for single-view clothed human reconstruction. These models need to be trained using a sampling training scheme. Existing sampling training schemes either fail to capture thin surfaces (e.g. ears, fingers) or cause noisy artefacts in reconstructed meshes. To address these problems, we introduce Fine Structure-Aware Sampling (FSS), a new sampling training scheme to train pixel-aligned implicit models for single-view human reconstruction. FSS resolves the aforementioned problems by proactively adapting to the thickness and complexity of surfaces. In addition, unlike existing sampling training schemes, FSS shows how normals of sample points can be capitalized on in the training process to improve results.
+Lastly, to further improve the training process, FSS proposes a mesh thickness loss signal for pixel-aligned implicit models. It becomes computationally feasible to introduce this loss once a slight reworking of the pixel-aligned implicit function framework is carried out. Our results show that our methods significantly outperform SOTA methods qualitatively and quantitatively. Our code is publicly available at https://github.com/kcyt/FSS. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Grained Distillation for Long Document Retrieval b/data/2024/aaai/Fine-Grained Distillation for Long Document Retrieval new file mode 100644 index 0000000000..c839a79a91 --- /dev/null +++ b/data/2024/aaai/Fine-Grained Distillation for Long Document Retrieval @@ -0,0 +1 @@ +Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become the de facto approach to improving a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the scope hypothesis that a long document may cover multiple topics. This maximizes their structural heterogeneity and poses a granular-mismatch issue, leading to inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces globally consistent representations across different fine granularities and then applies multi-granular aligned distillation merely during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, which show state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Grained Knowledge Selection and Restoration for Non-exemplar Class Incremental Learning b/data/2024/aaai/Fine-Grained Knowledge Selection and Restoration for Non-exemplar Class Incremental Learning new file mode 100644 index 0000000000..c076d8a74c --- /dev/null +++ b/data/2024/aaai/Fine-Grained Knowledge Selection and Restoration for Non-exemplar Class Incremental Learning @@ -0,0 +1,2 @@ +Non-exemplar class incremental learning aims to learn both the new and old tasks without accessing any training data from the past. This strict restriction increases the difficulty of alleviating catastrophic forgetting since all techniques can only be applied to current task data. Considering this challenge, we propose a novel framework of fine-grained knowledge selection and restoration. The conventional knowledge distillation-based methods place too strict constraints on the network parameters and features to prevent forgetting, which limits the training of new tasks. To loosen this constraint, we propose a novel fine-grained selective patch-level distillation to adaptively balance plasticity and stability. Some task-agnostic patches can be used to preserve the decision boundary of the old task, while patches containing the important foreground are favorable for learning the new task. + Moreover, we employ a task-agnostic mechanism to generate more realistic prototypes of old tasks with current task samples to reduce classifier bias for fine-grained knowledge restoration. Extensive experiments on CIFAR100, TinyImageNet and ImageNet-Subset demonstrate the effectiveness of our method. Code is available at https://github.com/scok30/vit-cil.
\ No newline at end of file diff --git a/data/2024/aaai/Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering b/data/2024/aaai/Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering new file mode 100644 index 0000000000..28770b309f --- /dev/null +++ b/data/2024/aaai/Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering @@ -0,0 +1 @@ +Reconstructing high-fidelity hand models with intricate textures plays a crucial role in enhancing human-object interaction and advancing real-world applications. Although state-of-the-art methods excel in texture generation and image rendering, they often face challenges in accurately capturing geometric details. Learning-based approaches usually offer better robustness and faster inference, but they tend to produce smoother results and require substantial amounts of training data. To address these issues, we present a novel fine-grained multi-view hand mesh reconstruction method that leverages inverse rendering to restore hand poses and intricate details. Firstly, our approach predicts a parametric hand mesh model through a Graph Convolutional Network (GCN)-based method from multi-view images. We further introduce a novel Hand Albedo and Mesh (HAM) optimization module to refine both the hand mesh and textures, which is capable of preserving the mesh topology. In addition, we suggest an effective mesh-based neural rendering scheme to simultaneously generate photo-realistic images and optimize mesh geometry by fusing the pre-trained rendering network with vertex features. We conduct comprehensive experiments on InterHand2.6M, DeepHandMesh and a dataset collected by ourselves, whose promising results show that our proposed approach outperforms the state-of-the-art methods on both reconstruction accuracy and rendering quality. Code and dataset are publicly available at https://github.com/agnJason/FMHR. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Grained Prototypes Distillation for Few-Shot Object Detection b/data/2024/aaai/Fine-Grained Prototypes Distillation for Few-Shot Object Detection new file mode 100644 index 0000000000..0032b84405 --- /dev/null +++ b/data/2024/aaai/Fine-Grained Prototypes Distillation for Few-Shot Object Detection @@ -0,0 +1 @@ +Few-shot object detection (FSOD) aims at extending a generic detector for novel object detection with only a few training examples. It has attracted great attention recently due to its practical significance. Meta-learning has been demonstrated to be an effective paradigm for this task. In general, methods based on meta-learning employ an additional support branch to encode novel examples (a.k.a. support images) into class prototypes, which are then fused with the query branch to facilitate the model prediction. However, the class-level prototypes are difficult to precisely generate, and they also lack detailed information, leading to instability in performance. New methods are required to capture the distinctive local context for more robust novel object detection. To this end, we propose to distill the most representative support features into fine-grained prototypes. These prototypes are then assigned to query feature maps based on the matching results, modeling the detailed feature relations between two branches. This process is realized by our Fine-Grained Feature Aggregation (FFA) module. Moreover, in terms of high-level feature fusion, we propose a Balanced Class-Agnostic Sampling (B-CAS) strategy and a Non-Linear Fusion (NLF) module from different perspectives.
They are complementary to each other and depict the high-level feature relations more effectively. Extensive experiments on PASCAL VOC and MS COCO benchmarks show that our method sets a new state-of-the-art performance in most settings. Our code is available at https://github.com/wangchen1801/FPD. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Tuning Graph Neural Networks by Preserving Graph Generative Patterns b/data/2024/aaai/Fine-Tuning Graph Neural Networks by Preserving Graph Generative Patterns new file mode 100644 index 0000000000..2286789009 --- /dev/null +++ b/data/2024/aaai/Fine-Tuning Graph Neural Networks by Preserving Graph Generative Patterns @@ -0,0 +1,8 @@ +Recently, the paradigm of pre-training and fine-tuning graph neural networks has been intensively studied and applied in a wide range of graph mining tasks. +Its success is generally attributed to the structural consistency between pre-training and downstream datasets, which, however, does not hold in many real-world scenarios. +Existing works have shown that the structural divergence between pre-training and downstream graphs significantly limits the transferability when using the vanilla fine-tuning strategy. This divergence leads to model overfitting on pre-training graphs and causes difficulties in capturing the structural properties of the downstream graphs. +In this paper, we identify the fundamental cause of structural divergence as the discrepancy of generative patterns between the pre-training and downstream graphs. +Furthermore, we propose G-Tuning to preserve the generative patterns of downstream graphs. +Given a downstream graph G, the core idea is to tune the pre-trained GNN so that it can reconstruct the generative patterns of G, the graphon W. +However, the exact reconstruction of a graphon is known to be computationally expensive. To overcome this challenge, we provide a theoretical analysis that establishes the existence of a set of alternative graphons called graphon bases for any given graphon. By utilizing a linear combination of these graphon bases, we can efficiently approximate W. This theoretical finding forms the basis of our model, as it enables effective learning of the graphon bases and their associated coefficients. +Compared with existing algorithms, G-Tuning demonstrates consistent performance improvement in 7 in-domain and 7 out-of-domain transfer learning experiments. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Tuning Large Language Model Based Explainable Recommendation with Explainable Quality Reward b/data/2024/aaai/Fine-Tuning Large Language Model Based Explainable Recommendation with Explainable Quality Reward new file mode 100644 index 0000000000..218020f739 --- /dev/null +++ b/data/2024/aaai/Fine-Tuning Large Language Model Based Explainable Recommendation with Explainable Quality Reward @@ -0,0 +1 @@ +Large language model-based explainable recommendation (LLM-based ER) systems can provide remarkable human-like explanations and have widely received attention from researchers. However, the original LLM-based ER systems face three low-quality problems in their generated explanations, i.e., lack of personalization, inconsistency, and questionable explanation data. 
To address these problems, we propose a novel LLM-based ER model denoted as LLM2ER to serve as a backbone and devise two innovative explainable quality reward models for fine-tuning such a backbone in a reinforcement learning paradigm, ultimately yielding a fine-tuned model denoted as LLM2ER-EQR, which can provide high-quality explanations. LLM2ER-EQR can generate personalized, informative, and consistent high-quality explanations learned from questionable-quality explanation datasets. Extensive experiments conducted on three real-world datasets demonstrate that our model can generate fluent, diverse, informative, and highly personalized explanations. \ No newline at end of file diff --git a/data/2024/aaai/Finetuning LLMs for Automatic Concept to TTI Prompt Generation (Student Abstract) b/data/2024/aaai/Finetuning LLMs for Automatic Concept to TTI Prompt Generation (Student Abstract) new file mode 100644 index 0000000000..ba4b2c7553 --- /dev/null +++ b/data/2024/aaai/Finetuning LLMs for Automatic Concept to TTI Prompt Generation (Student Abstract) @@ -0,0 +1 @@ +Our work explores bridging the gap between large language models and text-to-image models to create a tool for quickly and easily generating high-quality images from a given concept. In our experiments we successfully improved image quality with only a preliminary utilization of the available resources for finetuning. \ No newline at end of file diff --git a/data/2024/aaai/Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs b/data/2024/aaai/Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs new file mode 100644 index 0000000000..d6f0371ad2 --- /dev/null +++ b/data/2024/aaai/Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs @@ -0,0 +1 @@ +We study the multi-agent multi-armed bandit (MAMAB) problem, where agents are factored into overlapping groups. Each group represents a hyperedge, forming a hypergraph over the agents. At each round of interaction, the learner pulls a joint arm (composed of individual arms for each agent) and receives a reward according to the hypergraph structure. Specifically, we assume there is a local reward for each hyperedge, and the reward of the joint arm is the sum of these local rewards. Previous work introduced the multi-agent Thompson sampling (MATS) algorithm and derived a Bayesian regret bound. However, it remains an open problem how to derive a frequentist regret bound for Thompson sampling in this multi-agent setting. To address these issues, we propose an efficient variant of MATS, the epsilon-exploring Multi-Agent Thompson Sampling (eps-MATS) algorithm, which performs MATS exploration with probability epsilon and adopts a greedy policy otherwise. We prove that eps-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size. We also derive a lower bound for this setting, which implies our frequentist regret upper bound is optimal up to constant and logarithmic terms when the hypergraph is sufficiently sparse. Thorough experiments on standard MAMAB problems demonstrate the superior performance and the improved computational efficiency of eps-MATS compared with existing algorithms in the same setting.
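As a rough illustration of the epsilon-exploring idea described above, the sketch below draws local-reward parameters from their posteriors with probability epsilon (the Thompson-sampling step) and otherwise uses posterior means (the greedy step), then maximizes the factored reward over joint arms by brute force. Gaussian posteriors, the dictionary layout, and the exhaustive maximization are assumptions made for illustration; they are not claimed to match the paper's implementation.

```python
import itertools
import numpy as np

def eps_mats_step(posteriors, hyperedges, arm_sizes, eps, rng):
    """One decision step of an epsilon-exploring Thompson-sampling scheme
    on a factored (hypergraph) bandit.

    posteriors[e]: maps a tuple of local arms (one per agent in hyperedge e)
                   to a (mean, std) posterior for that hyperedge's local reward.
    hyperedges[e]: tuple of agent indices covered by hyperedge e.
    arm_sizes[i]:  number of arms available to agent i.
    """
    explore = rng.random() < eps
    sampled = {
        e: {a: (rng.normal(m, s) if explore else m) for a, (m, s) in table.items()}
        for e, table in posteriors.items()
    }

    best_joint, best_val = None, -np.inf
    for joint in itertools.product(*[range(n) for n in arm_sizes]):
        # Joint-arm value = sum of (sampled or greedy) local rewards over hyperedges.
        val = sum(sampled[e][tuple(joint[i] for i in agents)]
                  for e, agents in hyperedges.items())
        if val > best_val:
            best_joint, best_val = joint, val
    return best_joint
```

After the joint arm is played, the posterior of each hyperedge's pulled local arm would be updated with the observed local reward, exactly as in standard Thompson sampling.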
\ No newline at end of file diff --git a/data/2024/aaai/FlexKBQA: A Flexible LLM-Powered Framework for Few-Shot Knowledge Base Question Answering b/data/2024/aaai/FlexKBQA: A Flexible LLM-Powered Framework for Few-Shot Knowledge Base Question Answering new file mode 100644 index 0000000000..a5891111f5 --- /dev/null +++ b/data/2024/aaai/FlexKBQA: A Flexible LLM-Powered Framework for Few-Shot Knowledge Base Question Answering @@ -0,0 +1 @@ +Knowledge base question answering (KBQA) is a critical yet challenging task due to the vast number of entities within knowledge bases and the diversity of natural language questions posed by users. Unfortunately, the performance of most KBQA models tends to decline significantly in real-world scenarios where high-quality annotated data is insufficient. To mitigate the burden associated with manual annotation, we introduce FlexKBQA by utilizing Large Language Models (LLMs) as program translators for addressing the challenges inherent in the few-shot KBQA task. Specifically, FlexKBQA leverages automated algorithms to sample diverse programs, such as SPARQL queries, from the knowledge base, which are subsequently converted into natural language questions via LLMs. This synthetic dataset facilitates training a specialized lightweight model for the KB. Additionally, to reduce the barrier posed by the distribution shift between synthetic data and real user questions, FlexKBQA introduces an execution-guided self-training method to iteratively leverage unlabeled user questions. Furthermore, we explore harnessing the inherent reasoning capability of LLMs to enhance the entire framework. Consequently, FlexKBQA delivers substantial flexibility, encompassing data annotation and deployment, while being domain agnostic. Through extensive experiments on GrailQA, WebQSP, and KQA Pro, we observe that under few-shot and even the more challenging zero-shot scenarios, FlexKBQA achieves impressive results with a few annotations, surpassing all previous baselines and even approaching the performance of supervised models, achieving a remarkable 93% performance relative to the fully supervised models. We posit that FlexKBQA represents a significant advancement towards exploring better integration of large and lightweight models. Code is available at https://github.com/leezythu/FlexKBQA. \ No newline at end of file diff --git a/data/2024/aaai/FlexiBO: A Decoupled Cost-Aware Multi-objective Optimization Approach for Deep Neural Networks (Abstract Reprint) b/data/2024/aaai/FlexiBO: A Decoupled Cost-Aware Multi-objective Optimization Approach for Deep Neural Networks (Abstract Reprint) new file mode 100644 index 0000000000..68866dfcad --- /dev/null +++ b/data/2024/aaai/FlexiBO: A Decoupled Cost-Aware Multi-objective Optimization Approach for Deep Neural Networks (Abstract Reprint) @@ -0,0 +1 @@ +The design of machine learning systems often requires trading off different objectives, for example, prediction error and energy consumption for deep neural networks (DNNs). Typically, no single design performs well in all objectives; therefore, finding Pareto-optimal designs is of interest. The search for Pareto-optimal designs involves evaluating designs in an iterative process, and the measurements are used to evaluate an acquisition function that guides the search process. However, measuring different objectives incurs different costs.
For example, the cost of measuring the prediction error of DNNs is orders of magnitude higher than that of measuring the energy consumption of a pre-trained DNN as it requires re-training the DNN. Current state-of-the-art methods do not consider this difference in objective evaluation cost, potentially incurring expensive evaluations of objective functions in the optimization process. In this paper, we develop a novel decoupled and cost-aware multi-objective optimization algorithm, which we call Flexible Multi-Objective Bayesian Optimization (FlexiBO) to address this issue. For evaluating each design, FlexiBO selects the objective with higher relative gain by weighting the improvement of the hypervolume of the Pareto region with the measurement cost of each objective. This strategy, therefore, balances the expense of collecting new information with the knowledge gained through objective evaluations, preventing FlexiBO from performing expensive measurements for little to no gain. We evaluate FlexiBO on seven state-of-the-art DNNs for image recognition, natural language processing (NLP), and speech-to-text translation. Our results indicate that, given the same total experimental budget, FlexiBO discovers designs with 4.8% to 12.4% lower hypervolume error than the best method in state-of-the-art multi-objective optimization. \ No newline at end of file diff --git a/data/2024/aaai/Flood Insights: Integrating Remote and Social Sensing Data for Flood Exposure, Damage, and Urgent Needs Mapping b/data/2024/aaai/Flood Insights: Integrating Remote and Social Sensing Data for Flood Exposure, Damage, and Urgent Needs Mapping new file mode 100644 index 0000000000..cdbfb42161 --- /dev/null +++ b/data/2024/aaai/Flood Insights: Integrating Remote and Social Sensing Data for Flood Exposure, Damage, and Urgent Needs Mapping @@ -0,0 +1 @@ +The absence of comprehensive situational awareness information poses a significant challenge for humanitarian organizations during their response efforts. We present Flood Insights, an end-to-end system that ingests data from multiple non-traditional data sources such as remote sensing, social sensing, and geospatial data. We employ state-of-the-art natural language processing and computer vision models to identify flood exposure, ground-level damage and flood reports, and most importantly, urgent needs of affected people. We deploy and test the system during a recent real-world catastrophe, the 2022 Pakistan floods, to surface critical situational and damage information at the district level. We validated the system's effectiveness through geographic regression analysis using official ground-truth data, showcasing its strong performance and explanatory power. Moreover, the system was commended by the United Nations Development Programme stationed in Pakistan, as well as local authorities, for pinpointing hard-hit districts and enhancing disaster response. 
\ No newline at end of file diff --git a/data/2024/aaai/Flow-Event Autoencoder: Event Stream Object Recognition Dataset Generation with Arbitrary High Temporal Resolution b/data/2024/aaai/Flow-Event Autoencoder: Event Stream Object Recognition Dataset Generation with Arbitrary High Temporal Resolution new file mode 100644 index 0000000000..8dba40ea99 --- /dev/null +++ b/data/2024/aaai/Flow-Event Autoencoder: Event Stream Object Recognition Dataset Generation with Arbitrary High Temporal Resolution @@ -0,0 +1 @@ +Event camera has unique advantages in high temporal resolution and dynamic range and has shown potentials in several computer vision tasks. However, due to the novelty of this hardware, there’s a lack of large benchmark DVS event-stream datasets, including datasets for object recognition. In this work, we proposed an encoder-decoder method to augment event stream dataset from image and optical flow with arbitrary temporal resolution for object recognition task. We believe this proposed method can be generalized well in augmenting event stream vision data for object recognition and will help advance the development of event vision paradigm. \ No newline at end of file diff --git a/data/2024/aaai/Fluctuation-Based Adaptive Structured Pruning for Large Language Models b/data/2024/aaai/Fluctuation-Based Adaptive Structured Pruning for Large Language Models new file mode 100644 index 0000000000..42459d83fd --- /dev/null +++ b/data/2024/aaai/Fluctuation-Based Adaptive Structured Pruning for Large Language Models @@ -0,0 +1,2 @@ +Network Pruning is a promising way to address the huge computing resource demands of the deployment and inference of Large Language Models (LLMs). Retraining-free is important for LLMs' pruning methods. However, almost all of the existing retraining-free pruning approaches for LLMs focus on unstructured pruning, which requires specific hardware support for acceleration. In this paper, we propose a novel retraining-free structured pruning framework for LLMs, named FLAP (FLuctuation-based Adaptive +Structured Pruning). It is hardware-friendly by effectively reducing storage and enhancing inference speed. For effective structured pruning of LLMs, we highlight three critical elements that demand the utmost attention: formulating structured importance metrics, adaptively searching the global compressed model, and implementing compensation mechanisms to mitigate performance loss. First, FLAP determines whether the output feature map is easily recoverable when a column of weight is removed, based on the fluctuation pruning metric. Then it standardizes the importance scores to adaptively determine the global compressed model structure. At last, FLAP adds additional bias terms to recover the output feature maps using the baseline values. We thoroughly evaluate our approach on a variety of language benchmarks. Without any retraining, our method significantly outperforms the state-of-the-art methods, including LLM-Pruner and the extension of Wanda in structured pruning. The code is released at https://github.com/CASIA-IVA-Lab/FLAP. 
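A minimal sketch of the two ingredients named in the abstract, a fluctuation-style importance metric and bias compensation from baseline activations, applied to a single torch.nn.Linear layer is given below. The exact metric, the global standardization of scores, and FLAP's adaptive search for the compressed structure are not reproduced; the function name and the keep_ratio parameter are illustrative.

```python
import torch

@torch.no_grad()
def fluctuation_prune_linear(linear, calib_inputs, keep_ratio=0.5):
    """Prune input channels of a Linear layer by an activation-fluctuation score,
    folding the pruned channels' mean contribution into the bias (illustrative).

    calib_inputs: (N, in_features) calibration activations feeding this layer.
    """
    mean = calib_inputs.mean(dim=0)                  # baseline value per input channel
    fluct = calib_inputs.var(dim=0, unbiased=False)  # fluctuation per input channel
    # Channels whose activations barely fluctuate are well recovered by their mean,
    # so importance = fluctuation weighted by the column's squared weight mass.
    importance = fluct * linear.weight.pow(2).sum(dim=0)

    n_keep = max(1, int(keep_ratio * importance.numel()))
    keep = torch.topk(importance, n_keep).indices.sort().values
    drop = torch.tensor([j for j in range(importance.numel()) if j not in set(keep.tolist())],
                        dtype=torch.long)

    # Compensation: replace pruned channels by their mean contribution via the bias.
    bias = linear.bias.clone() if linear.bias is not None else torch.zeros(linear.out_features)
    if drop.numel() > 0:
        bias += linear.weight[:, drop] @ mean[drop]

    pruned = torch.nn.Linear(n_keep, linear.out_features, bias=True)
    pruned.weight.copy_(linear.weight[:, keep])
    pruned.bias.copy_(bias)
    return pruned, keep
```

A caller would run a small calibration batch through the model to collect calib_inputs for each layer, prune layer by layer, and remember the kept indices so upstream layers produce matching channels.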
\ No newline at end of file diff --git a/data/2024/aaai/FoSp: Focus and Separation Network for Early Smoke Segmentation b/data/2024/aaai/FoSp: Focus and Separation Network for Early Smoke Segmentation new file mode 100644 index 0000000000..e85b086054 --- /dev/null +++ b/data/2024/aaai/FoSp: Focus and Separation Network for Early Smoke Segmentation @@ -0,0 +1 @@ +Early smoke segmentation (ESS) enables the accurate identification of smoke sources, facilitating the prompt extinguishing of fires and preventing large-scale gas leaks. However, ESS poses greater challenges than conventional object segmentation and regular smoke segmentation due to the small scale and transparent appearance of early smoke, which can result in a high miss detection rate and low precision. To address these issues, a Focus and Separation Network (FoSp) is proposed. We first introduce a Focus module employing a bidirectional cascade, which guides low-resolution and high-resolution features towards mid-resolution to locate and determine the scope of smoke, reducing the miss detection rate. Next, we propose a Separation module that separates smoke images into a pure smoke foreground and a smoke-free background, fundamentally enhancing the contrast between smoke and background and improving segmentation precision. Finally, a Domain Fusion module is developed to integrate the distinctive features of the two modules, balancing recall and precision to achieve a high F_beta. Furthermore, to promote the development of ESS, we introduce a high-quality real-world dataset called SmokeSeg, which contains more small and transparent smoke than the existing datasets. Experimental results show that our model achieves the best performance on three available smoke segmentation datasets: SYN70K (mIoU: 83.00%), SMOKE5K (F_beta: 81.6%) and SmokeSeg (F_beta: 72.05%). The code can be found at https://github.com/LujianYao/FoSp. \ No newline at end of file diff --git a/data/2024/aaai/FoX: Formation-Aware Exploration in Multi-Agent Reinforcement Learning b/data/2024/aaai/FoX: Formation-Aware Exploration in Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..8678b8d5ee --- /dev/null +++ b/data/2024/aaai/FoX: Formation-Aware Exploration in Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Recently, deep multi-agent reinforcement learning (MARL) has gained significant popularity due to its success in various cooperative multi-agent tasks. However, exploration still remains a challenging problem in MARL due to the partial observability of the agents and the exploration space that can grow exponentially as the number of agents increases. First, in order to address the scalability issue of the exploration space, we define a formation-based equivalence relation on the exploration space and aim to reduce the search space by exploring only meaningful states in different formations. Then, we propose a novel formation-aware exploration (FoX) framework that encourages partially observable agents to visit states in diverse formations by guiding them to be well aware of their current formation solely based on their own observations. Numerical results show that the proposed FoX framework significantly outperforms state-of-the-art MARL algorithms on Google Research Football (GRF) and sparse StarCraft II multi-agent challenge (SMAC) tasks.
\ No newline at end of file diff --git a/data/2024/aaai/FocalDreamer: Text-Driven 3D Editing via Focal-Fusion Assembly b/data/2024/aaai/FocalDreamer: Text-Driven 3D Editing via Focal-Fusion Assembly new file mode 100644 index 0000000000..ef75a99cab --- /dev/null +++ b/data/2024/aaai/FocalDreamer: Text-Driven 3D Editing via Focal-Fusion Assembly @@ -0,0 +1 @@ +While text-3D editing has made significant strides in leveraging score distillation sampling, emerging approaches still fall short in delivering separable, precise and consistent outcomes that are vital to content creation. In response, we introduce FocalDreamer, a framework that merges base shape with editable parts according to text prompts for fine-grained editing within desired regions. Specifically, equipped with geometry union and dual-path rendering, FocalDreamer assembles independent 3D parts into a complete object, tailored for convenient instance reuse and part-wise control. We propose geometric focal loss and style consistency regularization, which encourage focal fusion and congruent overall appearance. Furthermore, FocalDreamer generates high-fidelity geometry and PBR textures which are compatible with widely-used graphics engines. Extensive experiments have highlighted the superior editing capabilities of FocalDreamer in both quantitative and qualitative evaluations. \ No newline at end of file diff --git a/data/2024/aaai/Focus Stacking with High Fidelity and Superior Visual Effects b/data/2024/aaai/Focus Stacking with High Fidelity and Superior Visual Effects new file mode 100644 index 0000000000..f821fdd5af --- /dev/null +++ b/data/2024/aaai/Focus Stacking with High Fidelity and Superior Visual Effects @@ -0,0 +1 @@ +Focus stacking is a technique in computational photography, and it synthesizes a single all-in-focus image from different focal plane images. It is difficult for previous works to produce a high-quality all-in-focus image that meets two goals: high-fidelity to its source images and good visual effects without defects or abnormalities. This paper proposes a novel method based on optical imaging process analysis and modeling. Based on a foreground segmentation - diffusion elimination architecture, the foreground segmentation makes most of the areas in full-focus images heritage information from the source images to achieve high fidelity; diffusion elimination models the physical imaging process and is specially used to solve the transition region (TR) problem that is a long-term neglected issue and degrades visual effects of synthesized images. Based on extensive experiments on simulated dataset, existing realistic dataset and our proposed BetaFusion dataset, the results show that our proposed method can generate high-quality all-in-focus images by achieving two goals simultaneously, especially can successfully solve the TR problem and eliminate the visual effect degradation of synthesized images caused by the TR problem. \ No newline at end of file diff --git a/data/2024/aaai/Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning b/data/2024/aaai/Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning new file mode 100644 index 0000000000..676552a8fa --- /dev/null +++ b/data/2024/aaai/Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning @@ -0,0 +1,3 @@ +Visual Reinforcement Learning (RL) is a promising approach to achieve human-like intelligence. However, it currently faces challenges in learning efficiently within noisy environments. 
In contrast, humans can quickly identify task-relevant objects in distraction-filled surroundings by applying previously acquired common knowledge. Recently, foundational models in natural language processing and computer vision have achieved remarkable successes, and the common knowledge within these models can significantly benefit downstream task training. Inspired by these achievements, we aim to incorporate common knowledge from foundational models into visual RL. We propose a novel Focus-Then-Decide (FTD) framework, allowing the agent to make decisions based solely on task-relevant objects. To achieve this, we introduce an attention mechanism to select task-relevant objects from the object set returned by a foundational segmentation model, and only use the task-relevant objects for the subsequent training of the decision module. Additionally, we specifically employ two generic self-supervised objectives to facilitate the rapid learning of this attention mechanism. Experimental results on challenging tasks based on the DeepMind Control Suite and Franka Emika Robotics demonstrate that our method can quickly and accurately pinpoint objects of interest in noisy environments. Consequently, it achieves a significant performance improvement over current state-of-the-art algorithms. +Project Page: https://www.lamda.nju.edu.cn/chenc/FTD.html +Code: https://github.com/LAMDA-RL/FTD \ No newline at end of file diff --git a/data/2024/aaai/Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos b/data/2024/aaai/Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos new file mode 100644 index 0000000000..0f576d4b37 --- /dev/null +++ b/data/2024/aaai/Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos @@ -0,0 +1 @@ +Generating text-editable and pose-controllable character videos is in high demand for creating various digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and the pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only the keypoint-image pairs are used for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition ability of the pre-trained T2I model. The code and models are available at https://follow-your-pose.github.io/.
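The zero-initialized convolutional encoder mentioned above is the key to leaving the pre-trained T2I backbone untouched at the start of stage one: if the pose branch's final projection starts at zero, the branch is initially a no-op and only gradually learns to inject pose information. The toy PyTorch sketch below shows this pattern; the layer sizes and module layout are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ZeroInitPoseEncoder(nn.Module):
    """Toy pose-conditioning branch whose final projection is zero-initialized,
    so at initialization it adds nothing to the frozen T2I features."""

    def __init__(self, in_ch=3, feat_ch=320):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.SiLU(),
        )
        self.out = nn.Conv2d(128, feat_ch, 1)
        nn.init.zeros_(self.out.weight)  # zero-init: the branch starts as a no-op
        nn.init.zeros_(self.out.bias)

    def forward(self, pose_map, backbone_feat):
        # Residual injection: backbone_feat is assumed to share the spatial size
        # of pose_map and to have feat_ch channels.
        return backbone_feat + self.out(self.body(pose_map))
```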
\ No newline at end of file diff --git a/data/2024/aaai/Forecasting Bimanual Object Manipulation Sequences from Unimanual Observations b/data/2024/aaai/Forecasting Bimanual Object Manipulation Sequences from Unimanual Observations new file mode 100644 index 0000000000..87d8c7de51 --- /dev/null +++ b/data/2024/aaai/Forecasting Bimanual Object Manipulation Sequences from Unimanual Observations @@ -0,0 +1 @@ +Learning to forecast bimanual object manipulation sequences from unimanual observations has broad applications in assistive robots and augmented reality. This challenging task requires us to first infer motion from the missing arm and the object it would have been manipulating were the person bimanual, then forecast the human and object motion while maintaining hand-object contact during manipulation. Previous attempts model the hand-object interactions only implicitly, and thus tend to produce unrealistic motion where the objects float in air. We address this with a novel neural network that (i) identifies and forecasts the pose for only the objects undergoing motion through an object motion module and (ii) refines human pose predictions by encouraging hand-object contact during manipulation through an ensemble of human pose predictors. The components are also designed to be generic enough for use in both unimanual and bimanual contexts. Our approach outperforms the state-of-the-art pose forecasting methods on bimanual manipulation datasets. \ No newline at end of file diff --git a/data/2024/aaai/Formal Logic Enabled Personalized Federated Learning through Property Inference b/data/2024/aaai/Formal Logic Enabled Personalized Federated Learning through Property Inference new file mode 100644 index 0000000000..3d47c4f738 --- /dev/null +++ b/data/2024/aaai/Formal Logic Enabled Personalized Federated Learning through Property Inference @@ -0,0 +1 @@ +Recent advancements in federated learning (FL) have greatly facilitated the development of decentralized collaborative applications, particularly in the domain of Artificial Intelligence of Things (AIoT). However, a critical aspect missing from the current research landscape is the ability to enable data-driven client models with symbolic reasoning capabilities. Specifically, the inherent heterogeneity of participating client devices poses a significant challenge, as each client exhibits unique logic reasoning properties. Failing to consider these device-specific specifications can result in critical properties being missed in the client predictions, leading to suboptimal performance. In this work, we propose a new training paradigm that leverages temporal logic reasoning to address this issue. Our approach involves enhancing the training process by incorporating mechanically generated logic expressions for each FL client. Additionally, we introduce the concept of aggregation clusters and develop a partitioning algorithm to effectively group clients based on the alignment of their temporal reasoning properties. We evaluate the proposed method on two tasks: a real-world traffic volume prediction task consisting of sensory data from fifteen states and a smart city multi-task prediction utilizing synthetic data. The evaluation results exhibit clear improvements, with performance accuracy improved by up to 54% across all sequential prediction models. 
\ No newline at end of file diff --git a/data/2024/aaai/Fostering Trustworthiness in Machine Learning Algorithms b/data/2024/aaai/Fostering Trustworthiness in Machine Learning Algorithms new file mode 100644 index 0000000000..566b4cd8eb --- /dev/null +++ b/data/2024/aaai/Fostering Trustworthiness in Machine Learning Algorithms @@ -0,0 +1 @@ +Recent years have seen a surge in research that develops and applies machine learning algorithms to create intelligent learning systems. However, traditional machine learning algorithms have primarily focused on optimizing accuracy and efficiency, and they often fail to consider how to foster trustworthiness in their design. As a result, machine learning models usually face a trust crisis in real-world applications. Driven by these urgent concerns about trustworthiness, in this talk, I will introduce my research efforts towards the goal of making machine learning trustworthy. Specifically, I will delve into the following key research topics: security vulnerabilities and robustness, model explanations, and privacy-preserving mechanisms. \ No newline at end of file diff --git a/data/2024/aaai/Foundations of Autonomous Vehicles: A Curriculum Model for Developing Competencies in Artificial Intelligence and the Internet of Things for Grades 7-10 b/data/2024/aaai/Foundations of Autonomous Vehicles: A Curriculum Model for Developing Competencies in Artificial Intelligence and the Internet of Things for Grades 7-10 new file mode 100644 index 0000000000..e80b98685d --- /dev/null +++ b/data/2024/aaai/Foundations of Autonomous Vehicles: A Curriculum Model for Developing Competencies in Artificial Intelligence and the Internet of Things for Grades 7-10 @@ -0,0 +1,2 @@ +A few states (e.g., Maryland, Georgia, and Florida) have initiated efforts to incorporate artificial intelligence outcomes in K-12 education but others are still relying on informal spaces for learning and literacy in this area. In this manuscript, we share the curriculum and content of an informal effort focused on students in grades 7-10. We combined artificial intelligence competencies with Internet of Things skills to enable meaningful learning covering all Five Big Ideas in AI. In our one-week summer camp, students experimented with perceptions by working with vision, infrared, and ultrasonic sensors. They learned about representation through work with neural network playgrounds. Students engaged in supervised learning of an image processing model and used the model to control the actions of a robot car. Natural interactions and societal impacts were assessed as students observed the robot car's behavior. +Results demonstrate that our curriculum was successful in achieving its objectives. Excluding the robot car kit, the curriculum was created using free platforms and tools. This program could be replicated in informal settings by any educator or collaborator with a computer science background. This paper describes our summer camp curriculum, its components and their implementation, the lessons learned, and potential future enhancements. 
\ No newline at end of file diff --git a/data/2024/aaai/Foundations of Reactive Synthesis for Declarative Process Specifications b/data/2024/aaai/Foundations of Reactive Synthesis for Declarative Process Specifications new file mode 100644 index 0000000000..b659bddddb --- /dev/null +++ b/data/2024/aaai/Foundations of Reactive Synthesis for Declarative Process Specifications @@ -0,0 +1 @@ +Given a specification of Linear-time Temporal Logic interpreted over finite traces (LTLf), the reactive synthesis problem asks to find a finitely-representable, terminating controller that reacts to the uncontrollable actions of an environment in order to enforce a desired system specification. In this paper we study, for the first time, the foundations of reactive synthesis for DECLARE, a well-established declarative, pattern-based business process modelling language grounded in LTLf. We provide a threefold contribution. First, we define a reactive synthesis problem for DECLARE. Second, we show how an arbitrary DECLARE specification can be polynomially encoded into an equivalent pure-past one in LTLf, and exploit this to define an EXPTIME algorithm for DECLARE synthesis. Third, we derive a symbolic version of this algorithm, by introducing a novel translation of pure-past temporal formulas into symbolic deterministic finite automata. \ No newline at end of file diff --git a/data/2024/aaai/Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing b/data/2024/aaai/Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing new file mode 100644 index 0000000000..1bcacd7c24 --- /dev/null +++ b/data/2024/aaai/Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing @@ -0,0 +1 @@ +Mobile edge computing (MEC) is a promising paradigm for real-time applications with intensive computational needs (e.g., autonomous driving), as it can reduce the processing delay. In this work, we focus on the timeliness of computational-intensive updates, measured by Age-of-Information (AoI), and study how to jointly optimize the task updating and offloading policies for AoI with fractional form. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time-average AoI. The uncertain edge load dynamics, the nature of the fractional objective, and hybrid continuous-discrete action space (due to the joint optimization) make this problem challenging and existing approaches not directly applicable. To this end, we propose a fractional reinforcement learning (RL) framework and prove its convergence. We further design a model-free fractional deep RL (DRL) algorithm, where each device makes scheduling decisions with the hybrid action space without knowing the system dynamics and decisions of other devices. Experimental results show that our proposed algorithms reduce the average AoI by up to 57.6% compared with several non-fractional benchmarks. 
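The abstract above characterizes the objective only as fractional (a ratio of expectations) and does not spell out the update rule. For orientation, one classical way to reduce such a ratio objective to a sequence of ordinary, non-fractional problems is a Dinkelbach-style iteration; the sketch below is that generic technique, not the paper's algorithm, and the policy-evaluation and subproblem-solving routines are placeholders:

def dinkelbach_style(evaluate_policy, solve_subproblem, lam0=0.0, iters=20, tol=1e-6):
    """Generic sketch for minimizing N(pi)/D(pi) (e.g., a time-average AoI ratio).
    evaluate_policy(pi) -> (N, D): expected numerator / denominator under policy pi.
    solve_subproblem(lam) -> pi:   policy (approximately) minimizing N(pi) - lam * D(pi)."""
    lam = lam0
    pi = None
    for _ in range(iters):
        pi = solve_subproblem(lam)      # an ordinary, non-fractional problem
        N, D = evaluate_policy(pi)
        new_lam = N / D                 # update the ratio estimate
        if abs(new_lam - lam) < tol:
            break
        lam = new_lam
    return pi, lam

Each subproblem is an ordinary optimization over policies and could in principle be handed to a standard (deep) RL solver.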
\ No newline at end of file diff --git a/data/2024/aaai/Frame Semantic Role Labeling Using Arbitrary-Order Conditional Random Fields b/data/2024/aaai/Frame Semantic Role Labeling Using Arbitrary-Order Conditional Random Fields new file mode 100644 index 0000000000..11f77601bd --- /dev/null +++ b/data/2024/aaai/Frame Semantic Role Labeling Using Arbitrary-Order Conditional Random Fields @@ -0,0 +1 @@ +This paper presents an approach to frame semantic role labeling (FSRL), a task in natural language processing that identifies semantic roles within a text following the theory of frame semantics. Unlike previous approaches which do not adequately model correlations and interactions amongst arguments, we propose arbitrary-order conditional random fields (CRFs) that are capable of modeling full interaction amongst an arbitrary number of arguments of a given predicate. To achieve tractable representation and inference, we apply canonical polyadic decomposition to the arbitrary-order factor in our proposed CRF and utilize mean-field variational inference for approximate inference. We further unfold our iterative inference procedure into a recurrent neural network that is connected to our neural encoder and scorer, enabling end-to-end training and inference. Finally, we also improve our model with several techniques such as span-based scoring and decoding. Our experiments show that our approach achieves state-of-the-art performance in FSRL. \ No newline at end of file diff --git a/data/2024/aaai/Frequency Oracle for Sensitive Data Monitoring (Student Abstract) b/data/2024/aaai/Frequency Oracle for Sensitive Data Monitoring (Student Abstract) new file mode 100644 index 0000000000..3736444c2b --- /dev/null +++ b/data/2024/aaai/Frequency Oracle for Sensitive Data Monitoring (Student Abstract) @@ -0,0 +1 @@ +As data privacy issues grow, finding the best privacy preservation algorithm for each situation is increasingly essential. This research has focused on understanding the frequency oracles (FO) privacy preservation algorithms. FO conduct the frequency estimation of any value in the domain. The aim is to explore how each can be best used and recommend which one to use with which data type. We experimented with different data scenarios and federated learning settings. Results showed clear guidance on when to use a specific algorithm. \ No newline at end of file diff --git a/data/2024/aaai/Frequency Shuffling and Enhancement for Open Set Recognition b/data/2024/aaai/Frequency Shuffling and Enhancement for Open Set Recognition new file mode 100644 index 0000000000..f15fb225de --- /dev/null +++ b/data/2024/aaai/Frequency Shuffling and Enhancement for Open Set Recognition @@ -0,0 +1 @@ +Open-Set Recognition (OSR) aims to accurately identify known classes while effectively rejecting unknown classes to guarantee reliability. Most existing OSR methods focus on learning in the spatial domain, where subtle texture and global structure are potentially intertwined. Empirical studies have shown that DNNs trained in the original spatial domain are inclined to over-perceive subtle texture. The biased semantic perception could lead to catastrophic over-confidence when predicting both known and unknown classes. To this end, we propose an innovative approach by decomposing the spatial domain to the frequency domain to separately consider global (low-frequency) and subtle (high-frequency) information, named Frequency Shuffling and Enhancement (FreSH). 
To alleviate the overfitting of subtle texture, we introduce the High-Frequency Shuffling (HFS) strategy, which generates diverse high-frequency information and promotes the capture of low-frequency invariance. Moreover, to enhance the perception of global structure, we propose the Low-Frequency Residual (LFR) learning procedure, which constructs a composite feature space integrating low-frequency and original spatial features. Experiments on various benchmarks demonstrate that the proposed FreSH consistently outperforms the state of the art by a considerable margin. \ No newline at end of file diff --git a/data/2024/aaai/Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector b/data/2024/aaai/Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector new file mode 100644 index 0000000000..8c5b2213ee --- /dev/null +++ b/data/2024/aaai/Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector @@ -0,0 +1 @@ +Multimodal content, such as text mixed with images, presents significant challenges to rumor detection in social media. Existing multimodal rumor detection has focused on mixing tokens among spatial and sequential locations for unimodal representation or fusing clues of rumor veracity across modalities. However, these methods suffer from less discriminative unimodal representations and are vulnerable to intricate location dependencies in the time-consuming fusion of spatial and sequential tokens. This work makes the first attempt at multimodal rumor detection in the frequency domain, which efficiently transforms spatial features into the frequency spectrum and obtains highly discriminative spectrum features for multimodal representation and fusion. A novel Frequency Spectrum Representation and fUsion network (FSRU) with dual contrastive learning reveals that the frequency spectrum is more effective for multimodal representation and fusion, extracting the informative components for rumor detection. FSRU involves three novel mechanisms: utilizing the Fourier transform to convert features in the spatial domain to the frequency domain, the unimodal spectrum compression module, and the cross-modal spectrum co-selection module in the frequency domain. Substantial experiments show that FSRU achieves satisfactory multimodal rumor detection performance. \ No newline at end of file diff --git a/data/2024/aaai/Frequency-Adaptive Pan-Sharpening with Mixture of Experts b/data/2024/aaai/Frequency-Adaptive Pan-Sharpening with Mixture of Experts new file mode 100644 index 0000000000..772a8f0601 --- /dev/null +++ b/data/2024/aaai/Frequency-Adaptive Pan-Sharpening with Mixture of Experts @@ -0,0 +1 @@ +Pan-sharpening involves reconstructing missing high-frequency information in multi-spectral images with low spatial resolution, using a higher-resolution panchromatic image as guidance. Despite this inborn connection with the frequency domain, existing pan-sharpening research has scarcely investigated potential solutions in the frequency domain. To this end, we propose a novel Frequency Adaptive Mixture of Experts (FAME) learning framework for pan-sharpening, which consists of three key components: the Adaptive Frequency Separation Prediction Module, the Sub-Frequency Learning Expert Module, and the Expert Mixture Module. In detail, the first leverages the discrete cosine transform to perform frequency separation by predicting the frequency mask.
On the basis of the generated mask, the second component uses a low-frequency MOE and a high-frequency MOE to enable effective reconstruction of the low-frequency and high-frequency information, respectively. Finally, the fusion module dynamically weights the high-frequency and low-frequency MOE knowledge to adapt to remote sensing images with significant content variations. Quantitative and qualitative experiments over multiple datasets demonstrate that our method performs best against other state-of-the-art methods and exhibits strong generalization ability for real-world scenes. Code will be made publicly available at https://github.com/alexhe101/FAME-Net. \ No newline at end of file diff --git a/data/2024/aaai/Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Domain Learning b/data/2024/aaai/Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Domain Learning new file mode 100644 index 0000000000..c65336128e --- /dev/null +++ b/data/2024/aaai/Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Domain Learning @@ -0,0 +1 @@ +This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images despite limited training data. Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries. However, the rapid advancements in synthesis technology have led to specific artifacts for each generation model. Consequently, these detectors have exhibited a lack of proficiency in learning the frequency domain and tend to overfit to the artifacts present in the training data, leading to suboptimal performance on unseen sources. To address this issue, we introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors. Our method forces the detector to continuously focus on high-frequency information, exploiting high-frequency representations of features across spatial and channel dimensions. Additionally, we incorporate a straightforward frequency domain learning module to learn source-agnostic features. It involves convolutional layers applied to both the phase spectrum and the amplitude spectrum between the Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (iFFT). Extensive experimentation involving 17 GANs demonstrates the effectiveness of our proposed method, showcasing state-of-the-art performance (+9.8\%) while requiring fewer parameters. The code is available at https://github.com/chuangchuangtan/FreqNet-DeepfakeDetection. \ No newline at end of file diff --git a/data/2024/aaai/Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation b/data/2024/aaai/Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation new file mode 100644 index 0000000000..075aae6aa4 --- /dev/null +++ b/data/2024/aaai/Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation @@ -0,0 +1 @@ +Recently, text-to-image diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing flexible image translation via user-provided text prompts. This paper proposes the frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework contributing a novel solution to text-guided I2I from a frequency-domain perspective.
At the heart of our framework is a feature-space frequency-domain filtering module based on Discrete Cosine Transform, which extracts image features carrying different DCT spectral bands to control the text-to-image generation process of the Latent Diffusion Model, realizing versatile I2I applications including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related methods, FCDiffusion establishes a unified text-driven I2I framework suiting diverse I2I application scenarios simply by switching among different frequency control branches. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. Our project is publicly available at: https://xianggao1102.github.io/FCDiffusion/. \ No newline at end of file diff --git a/data/2024/aaai/Friendly Attacks to Improve Channel Coding Reliability b/data/2024/aaai/Friendly Attacks to Improve Channel Coding Reliability new file mode 100644 index 0000000000..fa7b44037f --- /dev/null +++ b/data/2024/aaai/Friendly Attacks to Improve Channel Coding Reliability @@ -0,0 +1 @@ +This paper introduces a novel approach called "friendly attack" aimed at enhancing the performance of error correction channel codes. Inspired by the concept of adversarial attacks, our method leverages the idea of introducing slight perturbations to the neural network input, resulting in a substantial impact on the network's performance. By introducing small perturbations to fixed-point modulated codewords before transmission, we effectively improve the decoder's performance without violating the input power constraint. The perturbation design is accomplished by a modified iterative fast gradient method. This study investigates various decoder architectures suitable for computing gradients to obtain the desired perturbations. Specifically, we consider belief propagation (BP) for LDPC codes; the error correcting code transformer, BP and neural BP (NBP) for polar codes, and neural BCJR for convolutional codes. We demonstrate that the proposed friendly attack method can improve the reliability across different channels, modulations, codes, and decoders. This method allows us to increase the reliability of communication with a legacy receiver by simply modifying the transmitted codeword appropriately. \ No newline at end of file diff --git a/data/2024/aaai/From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery b/data/2024/aaai/From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery new file mode 100644 index 0000000000..e20bc81424 --- /dev/null +++ b/data/2024/aaai/From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery @@ -0,0 +1 @@ +Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments of in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently encounter the issue of data scarcity, hampering their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). 
We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model scale, reduced data size and lower training cost, highlighting its efficiency. Furthermore, our method shows a sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery. \ No newline at end of file diff --git a/data/2024/aaai/From Coarse to Fine: A Distillation Method for Fine-Grained Emotion-Causal Span Pair Extraction in Conversation b/data/2024/aaai/From Coarse to Fine: A Distillation Method for Fine-Grained Emotion-Causal Span Pair Extraction in Conversation new file mode 100644 index 0000000000..cb811b9c86 --- /dev/null +++ b/data/2024/aaai/From Coarse to Fine: A Distillation Method for Fine-Grained Emotion-Causal Span Pair Extraction in Conversation @@ -0,0 +1,7 @@ +We study the problem of extracting emotions and the causes behind these emotions in conversations. +Existing methods either tackle them separately or jointly model them at the coarse-grained level of emotions (fewer emotion categories) and causes (utterance-level causes). +In this work, we aim to jointly extract more fine-grained emotions and causes. +We construct a fine-grained dataset, FG-RECCON, which includes 16 fine-grained emotion categories and span-level causes. +To further improve the fine-grained extraction performance, we propose to utilize causal discourse knowledge via knowledge distillation. +Specifically, the teacher model learns to predict causal connective words between utterances, and then guides the student model in identifying both the fine-grained emotion labels and causal spans. +Experimental results demonstrate that our distillation method achieves state-of-the-art performance on both the RECCON and FG-RECCON datasets. \ No newline at end of file diff --git a/data/2024/aaai/From Consumers to Critical Users: Prompty, an AI Literacy Tool for High School Students b/data/2024/aaai/From Consumers to Critical Users: Prompty, an AI Literacy Tool for High School Students new file mode 100644 index 0000000000..27941a0657 --- /dev/null +++ b/data/2024/aaai/From Consumers to Critical Users: Prompty, an AI Literacy Tool for High School Students @@ -0,0 +1 @@ +In an age where Large Language Models (LLMs) expedite the generation of text, the skills for critically evaluating and creating meaningful text using these models are often lacking. To help classroom teachers address this, we introduce Prompty, a specialized teaching tool co-designed to facilitate both critical and effective use of LLMs. Prompty serves multiple learning goals: it allows students to critically evaluate text generated by LLMs, aids in their writing practice, and provides a deeper understanding of how LLMs function, all within a student-friendly environment secured by essential guardrails. Prompty was co-designed in collaboration with high school teachers as part of CRAFT, an initiative by Stanford University to promote AI literacy. It was pilot-tested in a high school English class to serve as an AI writing assistant, focusing on the critical evaluation of machine-generated text. This trial yielded preliminary evidence that attests to the tool's effectiveness in fulfilling its educational goals.
The findings from the pilot study indicate that easy-to-use tools like Prompty have great potential. These tools can be adapted to fit the goals of individual teachers. They can help in achieving subject-specific learning goals while serving as an effective way to teach AI concepts in high school. \ No newline at end of file diff --git a/data/2024/aaai/From GARCH to Neural Network for Volatility Forecast b/data/2024/aaai/From GARCH to Neural Network for Volatility Forecast new file mode 100644 index 0000000000..d4c1b65cbd --- /dev/null +++ b/data/2024/aaai/From GARCH to Neural Network for Volatility Forecast @@ -0,0 +1 @@ +Volatility, as a measure of uncertainty, plays a crucial role in numerous financial activities such as risk management. The Econometrics and Machine Learning communities have developed two distinct approaches for financial volatility forecasting: the stochastic approach and the neural network (NN) approach. Despite their individual strengths, these methodologies have conventionally evolved in separate research trajectories with little interaction between them. This study endeavors to bridge this gap by establishing an equivalence relationship between models of the GARCH family and their corresponding NN counterparts. With the equivalence relationship established, we introduce an innovative approach, named GARCH-NN, for constructing NN-based volatility models. It obtains the NN counterparts of GARCH models and integrates them as components into an established NN architecture, thereby seamlessly infusing volatility stylized facts (SFs) inherent in the GARCH models into the neural network. We develop the GARCH-LSTM model to showcase the power of GARCH-NN approach. Experiment results validate that amalgamating the NN counterparts of the GARCH family models into established NN models leads to enhanced outcomes compared to employing the stochastic and NN models in isolation. \ No newline at end of file diff --git a/data/2024/aaai/From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space b/data/2024/aaai/From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space new file mode 100644 index 0000000000..73259f3ee4 --- /dev/null +++ b/data/2024/aaai/From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space @@ -0,0 +1 @@ +Deep Neural Networks are prone to learning spurious correlations embedded in the training data, leading to potentially biased predictions. This poses risks when deploying these models for high-stake decision-making, such as in medical applications. Current methods for post-hoc model correction either require input-level annotations which are only possible for spatially localized biases, or augment the latent feature space, thereby hoping to enforce the right reasons. We present a novel method for model correction on the concept level that explicitly reduces model sensitivity towards biases via gradient penalization. When modeling biases via Concept Activation Vectors, we highlight the importance of choosing robust directions, as traditional regression-based approaches such as Support Vector Machines tend to result in diverging directions. We effectively mitigate biases in controlled and real-world settings on the ISIC, Bone Age, ImageNet and CelebA datasets using VGG, ResNet and EfficientNet architectures. Code and Appendix are available on https://github.com/frederikpahde/rrclarc. 
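The abstract above states the core mechanism, penalizing the model's gradient along a Concept Activation Vector (CAV) in latent space, but not the exact loss. A minimal PyTorch-style sketch of one way such a penalty could be wired up is below; the choice of layer, the unit-norm CAV, and the weighting factor lam are illustrative assumptions rather than the authors' formulation:

import torch
import torch.nn.functional as F

def cav_gradient_penalty_loss(model_head, latent, labels, cav, lam=1.0):
    """Sketch: cross-entropy plus a penalty on the sensitivity of the target
    logit along a unit-norm concept activation vector `cav` in latent space.
    `model_head` maps latent features to logits."""
    latent = latent.detach().requires_grad_(True)   # treat latent as the input of the head
    logits = model_head(latent)
    ce = F.cross_entropy(logits, labels)
    target_logit = logits.gather(1, labels.unsqueeze(1)).sum()
    grads = torch.autograd.grad(target_logit, latent, create_graph=True)[0]
    directional = (grads.flatten(1) @ cav.flatten()).pow(2).mean()  # sensitivity along the concept
    return ce + lam * directional

Driving this directional derivative toward zero is what "reducing model sensitivity towards biases" would amount to in this sketch.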
\ No newline at end of file diff --git a/data/2024/aaai/From Raw Video to Pedagogical Insights: A Unified Framework for Student Behavior Analysis b/data/2024/aaai/From Raw Video to Pedagogical Insights: A Unified Framework for Student Behavior Analysis new file mode 100644 index 0000000000..2ae2cec78c --- /dev/null +++ b/data/2024/aaai/From Raw Video to Pedagogical Insights: A Unified Framework for Student Behavior Analysis @@ -0,0 +1 @@ +Understanding student behavior in educational settings is critical in improving both the quality of pedagogy and the level of student engagement. While various AI-based models exist for classroom analysis, they tend to specialize in limited tasks and lack generalizability across diverse educational environments. Additionally, these models often fall short in ensuring student privacy and in providing actionable insights accessible to educators. To bridge this gap, we introduce a unified, end-to-end framework by leveraging temporal action detection techniques and advanced large language models for a more nuanced student behavior analysis. Our proposed framework provides an end-to-end pipeline that starts with raw classroom video footage and culminates in the autonomous generation of pedagogical reports. It offers a comprehensive and scalable solution for student behavior analysis. Experimental validation confirms the capability of our framework to accurately identify student behaviors and to produce pedagogically meaningful insights, thereby setting the stage for future AI-assisted educational assessments. \ No newline at end of file diff --git a/data/2024/aaai/From Retrieval to Generation: A Simple and Unified Generative Model for End-to-End Task-Oriented Dialogue b/data/2024/aaai/From Retrieval to Generation: A Simple and Unified Generative Model for End-to-End Task-Oriented Dialogue new file mode 100644 index 0000000000..b75fbe8baf --- /dev/null +++ b/data/2024/aaai/From Retrieval to Generation: A Simple and Unified Generative Model for End-to-End Task-Oriented Dialogue @@ -0,0 +1 @@ +Retrieving appropriate records from the external knowledge base to generate informative responses is the core capability of end-to-end task-oriented dialogue systems (EToDs). Most of the existing methods additionally train the retrieval model or use the memory network to retrieve the knowledge base, which decouples the knowledge retrieval task from the response generation task, making it difficult to jointly optimize and failing to capture the internal relationship between the two tasks. In this paper, we propose a simple and unified generative model for task-oriented dialogue systems, which recasts the EToDs task as a single sequence generation task and uses maximum likelihood training to train the two tasks in a unified manner. To prevent the generation of non-existent records, we design the prefix trie to constrain the model generation, which ensures consistency between the generated records and the existing records in the knowledge base. Experimental results on three public benchmark datasets demonstrate that our method achieves robust performance on generating system responses and outperforms the baseline systems. To facilitate future research in this area, the code is available at https://github.com/dzy1011/Uni-ToD. 
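The prefix trie used above to keep generated records consistent with the knowledge base is a standard constrained-decoding device; the abstract does not give its construction, so the following is a generic sketch over token ids, with the tokenization and the point at which decoding enters the trie left as assumptions:

class TrieNode:
    def __init__(self):
        self.children = {}   # token_id -> TrieNode
        self.is_end = False

def build_trie(record_token_ids):
    """Build a prefix trie from tokenized knowledge-base records."""
    root = TrieNode()
    for ids in record_token_ids:
        node = root
        for tok in ids:
            node = node.children.setdefault(tok, TrieNode())
        node.is_end = True
    return root

def allowed_next_tokens(root, prefix):
    """Tokens that keep the generated record consistent with the KB."""
    node = root
    for tok in prefix:
        if tok not in node.children:
            return set()     # prefix not in the KB; nothing is allowed
        node = node.children[tok]
    return set(node.children.keys())

# usage sketch: at each decoding step inside a record span, mask the LM logits
# so that only tokens in allowed_next_tokens(trie, generated_so_far) can be sampled.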
\ No newline at end of file diff --git a/data/2024/aaai/From Static to Dynamic: Knowledge Metabolism for Large Language Models b/data/2024/aaai/From Static to Dynamic: Knowledge Metabolism for Large Language Models new file mode 100644 index 0000000000..c403999194 --- /dev/null +++ b/data/2024/aaai/From Static to Dynamic: Knowledge Metabolism for Large Language Models @@ -0,0 +1,3 @@ +The immense parameter space of Large Language Models (LLMs) endows them with superior knowledge retention capabilities, allowing them to excel in a variety of natural language processing tasks. However, it also instigates difficulties in consistently tuning LMs to incorporate the most recent knowledge, which may further lead LMs to produce inaccurate and fabricated content. +To alleviate this issue, we propose a knowledge metabolism framework for LLMs. This framework proactively sustains the credibility of knowledge through an auxiliary external memory component and directly delivers pertinent knowledge for LM inference, thereby suppressing hallucinations caused by obsolete internal knowledge during the LM inference process. +Benchmark experiments demonstrate DynaMind's effectiveness in overcoming this challenge. The code and demo of DynaMind are available at: https://github.com/Elfsong/DynaMind. \ No newline at end of file diff --git a/data/2024/aaai/From Statistical Relational to Neuro-Symbolic Artificial Intelligence b/data/2024/aaai/From Statistical Relational to Neuro-Symbolic Artificial Intelligence new file mode 100644 index 0000000000..cf81c285ed --- /dev/null +++ b/data/2024/aaai/From Statistical Relational to Neuro-Symbolic Artificial Intelligence @@ -0,0 +1 @@ +The integration of learning and reasoning is one of the key challenges in artificial intelligence and machine learning today. The area of Neuro-Symbolic AI (NeSy) tackles this challenge by integrating symbolic reasoning with neural networks. In our recent work, we provided an introduction to NeSy by drawing several parallels to another field that has a rich tradition in integrating learning and reasoning, namely Statistical Relational Artificial Intelligence (StarAI). \ No newline at end of file diff --git a/data/2024/aaai/From Toxic to Trustworthy: Using Self-Distillation and Semi-supervised Methods to Refine Neural Networks b/data/2024/aaai/From Toxic to Trustworthy: Using Self-Distillation and Semi-supervised Methods to Refine Neural Networks new file mode 100644 index 0000000000..a481a83bc3 --- /dev/null +++ b/data/2024/aaai/From Toxic to Trustworthy: Using Self-Distillation and Semi-supervised Methods to Refine Neural Networks @@ -0,0 +1 @@ +Despite the tremendous success of deep neural networks (DNNs) across various fields, their susceptibility to potential backdoor attacks seriously threatens their application security, particularly in safety-critical or security-sensitive ones. Given this growing threat, there is a pressing need for research into purging backdoors from DNNs. However, prior efforts on erasing backdoor triggers not only failed to withstand increasingly powerful attacks but also resulted in reduced model performance. In this paper, we propose From Toxic to Trustworthy (FTT), an innovative approach to eliminate backdoor triggers while simultaneously enhancing model accuracy. Following the stringent and practical assumption of limited availability of clean data, we introduce a self-attention distillation (SAD) method to remove the backdoor by aligning the shallow and deep parts of the network. 
Furthermore, we first devise a semi-supervised learning (SSL) method that leverages ubiquitous and available poisoned data to further purify backdoors and improve accuracy. Extensive experiments on various attacks and models have shown that our FTT can reduce the attack success rate from 97% to 1% and improve the accuracy of 4% on average, demonstrating its effectiveness in mitigating backdoor attacks and improving model performance. Compared to state-of-the-art (SOTA) methods, our FTT can reduce the attack success rate by 2 times and improve the accuracy by 5%, shedding light on backdoor cleansing. \ No newline at end of file diff --git a/data/2024/aaai/Frozen CLIP Transformer Is an Efficient Point Cloud Encoder b/data/2024/aaai/Frozen CLIP Transformer Is an Efficient Point Cloud Encoder new file mode 100644 index 0000000000..248e857ef6 --- /dev/null +++ b/data/2024/aaai/Frozen CLIP Transformer Is an Efficient Point Cloud Encoder @@ -0,0 +1 @@ +The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task token and fed into the frozen CLIP transformer to learn point cloud representation. The intuition is that the proposed point cloud tokenizer projects the input point cloud into a unified token space that is similar to the 2D images. Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the CLIP transformer can serve as an efficient point cloud encoder and our method achieves promising performance on both indoor and outdoor benchmarks. In particular, performance gains brought by our EPCL are 19.7 AP50 on ScanNet V2 detection, 4.4 mIoU on S3DIS segmentation and 1.2 mIoU on SemanticKITTI segmentation compared to contemporary pretrained models. Code is available at \url{https://github.com/XiaoshuiHuang/EPCL}. \ No newline at end of file diff --git a/data/2024/aaai/Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning b/data/2024/aaai/Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning new file mode 100644 index 0000000000..38f469270a --- /dev/null +++ b/data/2024/aaai/Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning @@ -0,0 +1,5 @@ +Large Language Models (LLM) exhibit zero-shot mathematical reasoning capacity as a behavior emergent with scale, commonly manifesting as chain-of-thoughts (CoT) reasoning. However, multiple empirical findings suggest that this prowess is exclusive to LLMs that have exorbitant sizes (beyond 50 billion parameters). 
Meanwhile, educational neuroscientists suggest that symbolic algebraic manipulation be introduced around the same time as arithmetic word problems so as to modularize language-to-formulation, symbolic manipulation of the formulation, and endgame arithmetic. +In this paper, we start with the hypothesis that much smaller LMs, which are weak at multi-step reasoning, can achieve reasonable arithmetic reasoning if arithmetic word problems are posed as a formalize-then-solve task. +In our architecture, which we call SyReLM, the LM serves the role of a translator to map natural language arithmetic questions into a formal language (FL) description. A symbolic solver then evaluates the FL expression to obtain the answer. +A small frozen LM, equipped with an efficient low-rank adapter, is capable of generating FL expressions that incorporate natural language descriptions of the arithmetic problem (e.g., variable names and their purposes, formal expressions combining variables, etc.). +We adopt policy-gradient reinforcement learning to train the adapted LM, informed by the non-differentiable symbolic solver. This marks a sharp departure from the recent development in tool-augmented LLMs, in which the external tools (e.g., calculator, Web search, etc.) are essentially detached from the learning phase of the LM. SyReLM shows massive improvements (e.g., +30.65 absolute point improvement in accuracy on the SVAMP dataset using GPT-J 6B model) over base LMs, while keeping our testbed easy to diagnose and interpret, and within the reach of most researchers. \ No newline at end of file diff --git a/data/2024/aaai/Full Bayesian Significance Testing for Neural Networks b/data/2024/aaai/Full Bayesian Significance Testing for Neural Networks new file mode 100644 index 0000000000..1aa05c17f1 --- /dev/null +++ b/data/2024/aaai/Full Bayesian Significance Testing for Neural Networks @@ -0,0 +1 @@ +Significance testing aims to determine whether a proposition about the population distribution is the truth or not given observations. However, traditional significance testing often needs to derive the distribution of the testing statistic, failing to deal with complex nonlinear relationships. In this paper, we propose to conduct Full Bayesian Significance Testing for neural networks, called nFBST, to overcome the limitation in relationship characterization of traditional approaches. A Bayesian neural network is utilized to fit the nonlinear and multi-dimensional relationships with small errors and avoid hard theoretical derivation by computing the evidence value. Besides, nFBST can test not only global significance but also local and instance-wise significance, which previous testing methods don't focus on. Moreover, nFBST is a general framework that can be extended based on the measures selected, such as Grad-nFBST, LRP-nFBST, DeepLIFT-nFBST, LIME-nFBST. A range of experiments on both simulated and real data are conducted to show the advantages of our method. 
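The abstract above builds on the Full Bayesian Significance Test but does not define the evidence value it computes. As background only, the sketch below estimates a classical FBST-style evidence value from posterior samples of a scalar statistic; how nFBST obtains those posterior samples from a Bayesian neural network, and which statistic it tests, are not reproduced here:

import numpy as np
from scipy.stats import gaussian_kde

def fbst_evidence(posterior_samples, null_value=0.0):
    """Sketch of a classical FBST-style evidence value from posterior samples of a
    scalar statistic (e.g., a feature-importance measure). It is one minus the
    posterior mass of the tangential set, i.e., the region whose posterior density
    exceeds the density at the null value. Small evidence means the null
    (e.g., 'no effect') is poorly supported."""
    kde = gaussian_kde(posterior_samples)
    density_at_null = kde(null_value)[0]
    density_at_samples = kde(posterior_samples)
    tangential_mass = np.mean(density_at_samples > density_at_null)
    return 1.0 - tangential_mass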
\ No newline at end of file diff --git a/data/2024/aaai/Full-Body Motion Reconstruction with Sparse Sensing from Graph Perspective b/data/2024/aaai/Full-Body Motion Reconstruction with Sparse Sensing from Graph Perspective new file mode 100644 index 0000000000..9730293727 --- /dev/null +++ b/data/2024/aaai/Full-Body Motion Reconstruction with Sparse Sensing from Graph Perspective @@ -0,0 +1 @@ +Estimating 3D full-body pose from sparse sensor data is a pivotal technique employed for the reconstruction of realistic human motions in Augmented Reality and Virtual Reality. However, translating sparse sensor signals into comprehensive human motion remains a challenge since the sparsely distributed sensors in common VR systems fail to capture the motion of full human body. In this paper, we use well-designed Body Pose Graph (BPG) to represent the human body and translate the challenge into a prediction problem of graph missing nodes. Then, we propose a novel full-body motion reconstruction framework based on BPG. To establish BPG, nodes are initially endowed with features extracted from sparse sensor signals. Features from identifiable joint nodes across diverse sensors are amalgamated and processed from both temporal and spatial perspectives. Temporal dynamics are captured using the Temporal Pyramid Structure, while spatial relations in joint movements inform the spatial attributes. The resultant features serve as the foundational elements of the BPG nodes. To further refine the BPG, node features are updated through a graph neural network that incorporates edge reflecting varying joint relations. Our method's effectiveness is evidenced by the attained state-of-the-art performance, particularly in lower body motion, outperforming other baseline methods. Additionally, an ablation study validates the efficacy of each module in our proposed framework. \ No newline at end of file diff --git a/data/2024/aaai/Fully Data-Driven Pseudo Label Estimation for Pointly-Supervised Panoptic Segmentation b/data/2024/aaai/Fully Data-Driven Pseudo Label Estimation for Pointly-Supervised Panoptic Segmentation new file mode 100644 index 0000000000..db968a2d61 --- /dev/null +++ b/data/2024/aaai/Fully Data-Driven Pseudo Label Estimation for Pointly-Supervised Panoptic Segmentation @@ -0,0 +1 @@ +The core of pointly-supervised panoptic segmentation is estimating accurate dense pseudo labels from sparse point labels to train the panoptic head. Previous works generate pseudo labels mainly based on hand-crafted rules, such as connecting multiple points into polygon masks, or assigning the label information of labeled pixels to unlabeled pixels based on the artificially defined traversing distance. The accuracy of pseudo labels is limited by the quality of the hand-crafted rules (polygon masks are rough at object contour regions, and the traversing distance error will result in wrong pseudo labels). To overcome the limitation of hand-crafted rules, we estimate pseudo labels with a fully data-driven pseudo label branch, which is optimized by point labels end-to-end and predicts more accurate pseudo labels than previous methods. We also train an auxiliary semantic branch with point labels, it assists the training of the pseudo label branch by transferring semantic segmentation knowledge through shared parameters. Experiments on Pascal VOC and MS COCO demonstrate that our approach is effective and shows state-of-the-art performance compared with related works. Codes are available at https://github.com/BraveGroup/FDD. 
\ No newline at end of file diff --git a/data/2024/aaai/Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data b/data/2024/aaai/Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data new file mode 100644 index 0000000000..b0cff78198 --- /dev/null +++ b/data/2024/aaai/Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data @@ -0,0 +1 @@ +Multivariate Time-Series (MTS) data is crucial in various application fields. With its sequential and multi-source (multiple sensors) properties, MTS data inherently exhibits Spatial-Temporal (ST) dependencies, involving temporal correlations between timestamps and spatial correlations between sensors in each timestamp. To effectively leverage this information, Graph Neural Network-based methods (GNNs) have been widely adopted. However, existing approaches separately capture spatial dependency and temporal dependency and fail to capture the correlations between Different sEnsors at Different Timestamps (DEDT). Overlooking such correlations hinders the comprehensive modelling of ST dependencies within MTS data, thus restricting existing GNNs from learning effective representations. To address this limitation, we propose a novel method called Fully-Connected Spatial-Temporal Graph Neural Network (FC-STGNN), including two key components namely FC graph construction and FC graph convolution. For graph construction, we design a decay graph to connect sensors across all timestamps based on their temporal distances, enabling us to fully model the ST dependencies by considering the correlations between DEDT. Further, we devise FC graph convolution with a moving-pooling GNN layer to effectively capture the ST dependencies for learning effective representations. Extensive experiments show the effectiveness of FC-STGNN on multiple MTS datasets compared to SOTA methods. The code is available at https://github.com/Frank-Wang-oss/FCSTGNN. \ No newline at end of file diff --git a/data/2024/aaai/Fusing Conditional Submodular GAN and Programmatic Weak Supervision b/data/2024/aaai/Fusing Conditional Submodular GAN and Programmatic Weak Supervision new file mode 100644 index 0000000000..d05e6d99d6 --- /dev/null +++ b/data/2024/aaai/Fusing Conditional Submodular GAN and Programmatic Weak Supervision @@ -0,0 +1,3 @@ +Programmatic Weak Supervision (PWS) and generative models serve as crucial tools that enable researchers to maximize the utility of existing datasets without resorting to laborious data gathering and manual annotation processes. PWS uses various weak supervision techniques to estimate the underlying class labels of data, while generative models primarily concentrate on sampling from the underlying distribution of the given dataset. Although these methods have the potential to complement each other, they have mostly been studied independently. + Recently, WSGAN proposed a mechanism to fuse these two models. Their approach utilizes the discrete latent factors of InfoGAN for the training of the label models and leverages the class-dependent information of the label models to generate images of specific classes. However, the disentangled latent factor learned by the InfoGAN may not necessarily be class specific and hence could potentially affect the label model's accuracy. Moreover, the prediction of the label model is often noisy in nature and can have a detrimental impact on the quality of images generated by GAN. 
In our work, we address these challenges by (i) implementing a noise-aware classifier using the pseudo labels generated by the label model, (ii) utilizing the prediction of the noise-aware classifier for training the label model as well as generation of class-conditioned images. Additionally, We also investigate the effect of training the classifier with a subset of the dataset within a defined uncertainty budget on pseudo labels. We accomplish this by formalizing the subset selection problem as submodular maximization with a knapsack constraint on the entropy of pseudo labels. We conduct experiments on multiple datasets and demonstrate the efficacy of our methods on several tasks vis-a-vis the current state-of-the-art methods. Our implementation is +available at https://github.com/kyrs/subpws-gan \ No newline at end of file diff --git a/data/2024/aaai/Fusion-Vital: Video-RF Fusion Transformer for Advanced Remote Physiological Measurement b/data/2024/aaai/Fusion-Vital: Video-RF Fusion Transformer for Advanced Remote Physiological Measurement new file mode 100644 index 0000000000..b135f20fb7 --- /dev/null +++ b/data/2024/aaai/Fusion-Vital: Video-RF Fusion Transformer for Advanced Remote Physiological Measurement @@ -0,0 +1 @@ +Remote physiology, which involves monitoring vital signs without the need for physical contact, has great potential for various applications. Current remote physiology methods rely only on a single camera or radio frequency (RF) sensor to capture the microscopic signatures from vital movements. However, our study shows that fusing deep RGB and RF features from both sensor streams can further improve performance. Because these multimodal features are defined in distinct dimensions and have varying contextual importance, the main challenge in the fusion process lies in the effective alignment of them and adaptive integration of features under dynamic scenarios. To address this challenge, we propose a novel vital sensing model, named Fusion-Vital, that combines the RGB and RF modalities through the new introduction of pairwise input formats and transformer-based fusion strategies. We also perform comprehensive experiments based on a newly collected and released remote vital dataset comprising synchronized video-RF sensors, showing the superiority of the fusion approach over the previous single-sensor baselines in various aspects. \ No newline at end of file diff --git a/data/2024/aaai/FusionFormer: A Concise Unified Feature Fusion Transformer for 3D Pose Estimation b/data/2024/aaai/FusionFormer: A Concise Unified Feature Fusion Transformer for 3D Pose Estimation new file mode 100644 index 0000000000..1e6b03ae57 --- /dev/null +++ b/data/2024/aaai/FusionFormer: A Concise Unified Feature Fusion Transformer for 3D Pose Estimation @@ -0,0 +1 @@ +Depth uncertainty is a core challenge in 3D human pose estimation, especially when the camera parameters are unknown. Previous methods try to reduce the impact of depth uncertainty by multi-view and/or multi-frame feature fusion to utilize more spatial and temporal information. However, they generally lead to marginal improvements and their performance still cannot match the camera-parameter-required methods. The reason is that their handcrafted fusion schemes cannot fuse the features flexibly, e.g., the multi-view and/or multi-frame features are fused separately. 
Moreover, the diverse and complicated fusion schemes make the principle for developing effective fusion schemes unclear and also raise the open problem of whether simpler and more elegant fusion schemes exist. To address these issues, this paper proposes an extremely concise unified feature fusion transformer (FusionFormer) with minimized handcrafted design for 3D pose estimation. FusionFormer fuses both the multi-view and multi-frame features in a unified fusion scheme, in which all the features are accessible to each other and thus can be fused flexibly. Experimental results on several mainstream datasets demonstrate that FusionFormer achieves state-of-the-art performance. To the best of our knowledge, this is the first camera-parameter-free method to outperform the existing camera-parameter-required methods, revealing the tremendous potential of camera-parameter-free models. These impressive experimental results together with our concise feature fusion scheme resolve the above open problem. Another appealing feature of FusionFormer we observe is that, benefiting from its effective fusion scheme, it achieves impressive performance with a smaller model size and fewer FLOPs. \ No newline at end of file diff --git "a/data/2024/aaai/F\302\263-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis" "b/data/2024/aaai/F\302\263-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis" new file mode 100644 index 0000000000..82e4205f3c --- /dev/null +++ "b/data/2024/aaai/F\302\263-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis" @@ -0,0 +1 @@ +Recently, Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inference with such large models incurs huge costs. Previous inference acceleration works either require costly retraining or are model-specific. To address this issue, instead of retraining we explore the inference process of two mainstream T2V models using transformers and diffusion models. The exploration reveals redundancy in the temporal attention modules of both models, which are commonly utilized to establish temporal relations among frames. Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights. Specifically, when aggregate temporal attention values are ranked below a certain ratio, the corresponding weights are pruned. Extensive experiments on three datasets using a classic transformer-based model, CogVideo, and a typical diffusion-based model, Tune-A-Video, verify the effectiveness of F3-Pruning in inference acceleration, quality assurance and broad applicability.
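The pruning rule above (prune temporal attention weights whose aggregate attention values rank below a certain ratio) is stated without further detail. The sketch below is one plausible reading for a single temporal attention layer, where aggregating attention probabilities over the batch and all frame pairs, and pruning at the granularity of heads, are assumptions for illustration:

import torch

def prune_temporal_attention_heads(attn_probs, prune_ratio=0.3):
    """attn_probs: [batch, heads, frames, frames] temporal attention weights.
    Returns a {0,1} mask over heads: heads whose aggregate attention ranks in
    the bottom `prune_ratio` fraction are pruned (masked to zero)."""
    scores = attn_probs.mean(dim=(0, 2, 3))           # aggregate attention per head
    k = int(prune_ratio * scores.numel())
    mask = torch.ones_like(scores)
    if k > 0:
        _, prune_idx = torch.topk(scores, k, largest=False)
        mask[prune_idx] = 0.0
    return mask   # multiply head outputs by mask[None, :, None, None] at inference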
\ No newline at end of file diff --git a/data/2024/aaai/G-LIME: Statistical Learning for Local Interpretations of Deep Neural Networks Using Global Priors (Abstract Reprint) b/data/2024/aaai/G-LIME: Statistical Learning for Local Interpretations of Deep Neural Networks Using Global Priors (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/G-NAS: Generalizable Neural Architecture Search for Single Domain Generalization Object Detection b/data/2024/aaai/G-NAS: Generalizable Neural Architecture Search for Single Domain Generalization Object Detection new file mode 100644 index 0000000000..e76e590220 --- /dev/null +++ b/data/2024/aaai/G-NAS: Generalizable Neural Architecture Search for Single Domain Generalization Object Detection @@ -0,0 +1 @@ +In this paper, we focus on a realistic yet challenging task, Single Domain Generalization Object Detection (S-DGOD), where only one source domain's data can be used for training object detectors, but have to generalize multiple distinct target domains. In S-DGOD, both high-capacity fitting and generalization abilities are needed due to the task's complexity. Differentiable Neural Architecture Search (NAS) is known for its high capacity for complex data fitting and we propose to leverage Differentiable NAS to solve S-DGOD. However, it may confront severe over-fitting issues due to the feature imbalance phenomenon, where parameters optimized by gradient descent are biased to learn from the easy-to-learn features, which are usually non-causal and spuriously correlated to ground truth labels, such as the features of background in object detection data. Consequently, this leads to serious performance degradation, especially in generalizing to unseen target domains with huge domain gaps between the source domain and target domains. To address this issue, we propose the Generalizable loss (G-loss), which is an OoD-aware objective, preventing NAS from over-fitting by using gradient descent to optimize parameters not only on a subset of easy-to-learn features but also the remaining predictive features for generalization, and the overall framework is named G-NAS. Experimental results on the S-DGOD urban-scene datasets demonstrate that the proposed G-NAS achieves SOTA performance compared to baseline methods. Codes are available at https://github.com/wufan-cse/G-NAS. \ No newline at end of file diff --git a/data/2024/aaai/G2L-CariGAN: Caricature Generation from Global Structure to Local Features b/data/2024/aaai/G2L-CariGAN: Caricature Generation from Global Structure to Local Features new file mode 100644 index 0000000000..470bfab4aa --- /dev/null +++ b/data/2024/aaai/G2L-CariGAN: Caricature Generation from Global Structure to Local Features @@ -0,0 +1 @@ +Existing GAN-based approaches to caricature generation mainly focus on exaggerating a character’s global facial structure. This often leads to the failure in highlighting significant facial features such as big eyes and hook nose. To address this limitation, we propose a new approach termed as G2L-CariGAN, which uses feature maps of spatial dimensions instead of latent codes for geometric exaggeration. G2L-CariGAN first exaggerates the global facial structure of the character on a low-dimensional feature map and then exaggerates its local facial features on a high-dimensional feature map. Moreover, we develop a caricature identity loss function based on feature maps, which well retains the character's identity after exaggeration. 
Our experiments have demonstrated that G2L-CariGAN outperforms the state-of-arts in terms of the quality of exaggerating a character and retaining its identity. \ No newline at end of file diff --git a/data/2024/aaai/G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model b/data/2024/aaai/G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model new file mode 100644 index 0000000000..b2416321f4 --- /dev/null +++ b/data/2024/aaai/G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model @@ -0,0 +1 @@ +The Sign Language Production (SLP) project aims to automatically translate spoken languages into sign sequences. Our approach focuses on the transformation of sign gloss sequences into their corresponding sign pose sequences (G2P). In this paper, we present a novel solution for this task by converting the continuous pose space generation problem into a discrete sequence generation problem. We introduce the Pose-VQVAE framework, which combines Variational Autoencoders (VAEs) with vector quantization to produce a discrete latent representation for continuous pose sequences. Additionally, we propose the G2P-DDM model, a discrete denoising diffusion architecture for length-varied discrete sequence data, to model the latent prior. To further enhance the quality of pose sequence generation in the discrete space, we present the CodeUnet model to leverage spatial-temporal information. Lastly, we develop a heuristic sequential clustering method to predict variable lengths of pose sequences for corresponding gloss sequences. Our results show that our model outperforms state-of-the-art G2P models on the public SLP evaluation benchmark. For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm. \ No newline at end of file diff --git a/data/2024/aaai/GAD-PVI: A General Accelerated Dynamic-Weight Particle-Based Variational Inference Framework b/data/2024/aaai/GAD-PVI: A General Accelerated Dynamic-Weight Particle-Based Variational Inference Framework new file mode 100644 index 0000000000..4d9f422d71 --- /dev/null +++ b/data/2024/aaai/GAD-PVI: A General Accelerated Dynamic-Weight Particle-Based Variational Inference Framework @@ -0,0 +1 @@ +Particle-based Variational Inference (ParVI) methods approximate the target distribution by iteratively evolving finite weighted particle systems. Recent advances of ParVI methods reveal the benefits of accelerated position update strategies and dynamic weight adjustment approaches. In this paper, we propose the first ParVI framework that possesses both accelerated position update and dynamical weight adjustment simultaneously, named the General Accelerated Dynamic-Weight Particle-based Variational Inference (GAD-PVI) framework. Generally, GAD-PVI simulates the semi-Hamiltonian gradient flow on a novel Information-Fisher-Rao space, which yields an additional decrease on the local functional dissipation. GAD-PVI is compatible with different dissimilarity functionals and associated smoothing approaches under three information metrics. Experiments on both synthetic and real-world data demonstrate the faster convergence and reduced approximation error of GAD-PVI methods over the state-of-the-art. 
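As background for the class of methods GAD-PVI belongs to, the sketch below shows a generic weighted ParVI step that combines an SVGD-style position update with a multiplicative weight adjustment. It is only a toy illustration of "position update plus dynamic weight adjustment" in spirit; it does not implement the semi-Hamiltonian flow, the Information-Fisher-Rao space, or any other specifics of GAD-PVI, and all function names and step sizes are assumptions.

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """RBF kernel matrix and its gradient w.r.t. the first argument."""
    diff = x[:, None, :] - x[None, :, :]          # (n, n, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h ** 2))
    gradK = -diff / (h ** 2) * K[:, :, None]      # d k(x_i, x_j) / d x_i
    return K, gradK

def weighted_parvi_step(x, w, logp, grad_logp, step_x=0.05, step_w=0.3):
    """One position update (weighted SVGD-style transport) followed by one
    dynamic weight adjustment (multiplicative reweighting toward the target)."""
    K, gradK = rbf_kernel(x)
    G = grad_logp(x)                               # (n, d)
    # Transport: attraction along the target score plus kernel repulsion.
    drift = np.einsum("j,ji,jd->id", w, K, G) + np.einsum("j,jid->id", w, gradK)
    x_new = x + step_x * drift
    # Reweighting: particles in relatively high-density regions gain mass.
    lp = logp(x_new)
    log_w = np.log(w) + step_w * (lp - np.sum(w * lp))
    w_new = np.exp(log_w - log_w.max())
    return x_new, w_new / w_new.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(3.0, 1.0, size=(100, 2))        # particles start off-target
    w = np.full(100, 1.0 / 100)
    logp = lambda z: -0.5 * (z ** 2).sum(-1)       # standard Gaussian target (unnormalized)
    grad_logp = lambda z: -z
    for _ in range(200):
        x, w = weighted_parvi_step(x, w, logp, grad_logp)
    print("weighted mean:", (w[:, None] * x).sum(0))   # should move toward [0, 0]
```

The separation between the transport term and the reweighting term mirrors the two ingredients the abstract highlights: where the particles move, and how much mass each particle carries.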
\ No newline at end of file diff --git a/data/2024/aaai/GAMC: An Unsupervised Method for Fake News Detection Using Graph Autoencoder with Masking b/data/2024/aaai/GAMC: An Unsupervised Method for Fake News Detection Using Graph Autoencoder with Masking new file mode 100644 index 0000000000..cfe339de6f --- /dev/null +++ b/data/2024/aaai/GAMC: An Unsupervised Method for Fake News Detection Using Graph Autoencoder with Masking @@ -0,0 +1 @@ +With the rise of social media, the spread of fake news has become a significant concern, potentially misleading public perceptions and impacting social stability. Deep learning methods such as CNNs, RNNs, and Transformer-based models like BERT have enhanced fake news detection, but they primarily focus on content and do not consider the social context during news propagation. Graph-based techniques have incorporated the social context but are limited by the need for large labeled datasets. To address these challenges, this paper introduces GAMC, an unsupervised fake news detection technique using the Graph Autoencoder with Masking and Contrastive learning. By leveraging both the context and content of news propagation as self-supervised signals, our method reduces the dependency on labeled datasets. Specifically, GAMC begins by applying data augmentation to the original news propagation graphs. These augmented graphs are then encoded using a graph encoder and reconstructed via a graph decoder. Finally, a composite loss function that encompasses both reconstruction error and contrastive loss is designed. First, it ensures the model can effectively capture the latent features by minimizing the discrepancy between reconstructed and original graph representations. Second, it aligns the representations of augmented graphs that originate from the same source. Experiments on a real-world dataset validate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/GCNext: Towards the Unity of Graph Convolutions for Human Motion Prediction b/data/2024/aaai/GCNext: Towards the Unity of Graph Convolutions for Human Motion Prediction new file mode 100644 index 0000000000..9ad37ccc9f --- /dev/null +++ b/data/2024/aaai/GCNext: Towards the Unity of Graph Convolutions for Human Motion Prediction @@ -0,0 +1 @@ +The past few years have witnessed the dominance of Graph Convolutional Networks (GCNs) in human motion prediction. Various styles of graph convolutions have been proposed, each meticulously designed and incorporated into a carefully crafted network architecture. This paper breaks the limits of existing knowledge by proposing Universal Graph Convolution (UniGC), a novel graph convolution concept that re-conceptualizes different graph convolutions as its special cases. Leveraging UniGC at the network level, we propose GCNext, a novel GCN-building paradigm that dynamically determines the best-fitting graph convolutions both sample-wise and layer-wise. GCNext offers multiple use cases, including training a new GCN from scratch or refining a preexisting GCN. Experiments on the Human3.6M, AMASS, and 3DPW datasets show that, by incorporating unique module-to-network designs, GCNext yields up to 9x lower computational cost than existing GCN methods, on top of achieving state-of-the-art performance. Our code is available at https://github.com/BradleyWang0416/GCNext. 
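To illustrate the idea of re-conceptualizing different graph convolutions as special cases of one operator, here is a minimal sketch of a "universal" graph-convolution layer that blends several candidate adjacency patterns with a learned gate. It shows only layer-wise selection (the paper also describes sample-wise selection), and the class name `UniGCLayer`, the candidate set, and the gating scheme are assumptions for illustration rather than the authors' design.

```python
import torch
import torch.nn as nn

class UniGCLayer(nn.Module):
    """Illustrative 'universal' graph convolution: several candidate adjacency
    patterns are blended by a learned gate, so one layer can mimic different
    hand-designed graph-convolution styles."""

    def __init__(self, num_joints, in_dim, out_dim, skeleton_adj):
        super().__init__()
        candidates = torch.stack([
            torch.eye(num_joints),                            # no mixing (per-joint transform)
            skeleton_adj,                                     # fixed skeletal connectivity
            torch.ones(num_joints, num_joints) / num_joints,  # fully-connected averaging
        ])
        self.register_buffer("candidates", candidates)
        self.learned_adj = nn.Parameter(torch.zeros(num_joints, num_joints))
        self.gate = nn.Parameter(torch.zeros(len(candidates) + 1))  # one logit per style
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                     # x: (batch, joints, in_dim)
        adj = torch.cat([self.candidates, self.learned_adj.unsqueeze(0)], dim=0)
        weights = torch.softmax(self.gate, dim=0)             # layer-wise choice of style
        mixed_adj = torch.einsum("k,kij->ij", weights, adj)
        return self.proj(torch.einsum("ij,bjd->bid", mixed_adj, x))

if __name__ == "__main__":
    skeleton = torch.zeros(5, 5)
    for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
        skeleton[a, b] = skeleton[b, a] = 1.0
    layer = UniGCLayer(num_joints=5, in_dim=8, out_dim=16, skeleton_adj=skeleton)
    print(layer(torch.randn(2, 5, 8)).shape)   # torch.Size([2, 5, 16])
```

Setting the gate to a one-hot vector recovers a single conventional graph-convolution style, which is the sense in which such a layer subsumes existing variants.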
\ No newline at end of file diff --git a/data/2024/aaai/GEAR-Up: Generative AI and External Knowledge-Based Retrieval: Upgrading Scholarly Article Searches for Systematic Reviews b/data/2024/aaai/GEAR-Up: Generative AI and External Knowledge-Based Retrieval: Upgrading Scholarly Article Searches for Systematic Reviews new file mode 100644 index 0000000000..5124db575b --- /dev/null +++ b/data/2024/aaai/GEAR-Up: Generative AI and External Knowledge-Based Retrieval: Upgrading Scholarly Article Searches for Systematic Reviews @@ -0,0 +1 @@ +This paper addresses the time-intensive nature of systematic reviews (SRs) and proposes a solution leveraging advancements in Generative AI (e.g., ChatGPT) and external knowledge augmentation (e.g., Retrieval-Augmented Generation). The proposed system, GEAR-Up, automates query development and translation in SRs, enhancing efficiency by enriching user queries with context from language models and knowledge graphs. Collaborating with librarians, qualitative evaluations demonstrate improved reproducibility and search strategy quality. Access the demo at https://youtu.be/zMdP56GJ9mU. \ No newline at end of file diff --git a/data/2024/aaai/GLDL: Graph Label Distribution Learning b/data/2024/aaai/GLDL: Graph Label Distribution Learning new file mode 100644 index 0000000000..39ffbe3520 --- /dev/null +++ b/data/2024/aaai/GLDL: Graph Label Distribution Learning @@ -0,0 +1 @@ +Label Distribution Learning (LDL), as a more general learning setting than generic single-label and multi-label learning, has been commonly used in computer vision and many other applications. To date, existing LDL approaches are designed and applied to data without considering the interdependence between instances. In this paper, we propose a Graph Label Distribution Learning (GLDL) framework, which explicitly models three types of relationships: instance-instance, label-label, and instance-label, to learn the label distribution for networked data. A label-label network is learned to capture label-to-label correlation, through which GLDL can accurately learn label distributions for nodes. Dual graph convolution network (GCN) Co-training with heterogeneous message passing ensures two GCNs, one focusing on instance-instance relationship and the other one targeting label-label correlation, are jointly trained such that instance-instance relationship can help induce label-label correlation and vice versa. Our theoretical study derives the error bound of GLDL. For verification, four benchmark datasets with label distributions for nodes are created using common graph benchmarks. The experiments show that considering dependency helps learn better label distributions for networked data, compared to state-of-the-art LDL baseline. In addition, GLDL not only outperforms simple GCN and graph attention networks (GAT) using distribution loss but is also superior to its variant considering label-label relationship as a static network. GLDL and its benchmarks are the first research endeavors to address LDL for graphs. Code and benchmark data are released for public access. 
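For readers unfamiliar with label distribution learning on graphs, the sketch below shows the simplest possible instance-instance branch: a two-layer GCN whose output rows are label distributions, trained against a KL-divergence objective. It omits the label-label network, the dual-GCN co-training, and the heterogeneous message passing described above; the helper names and shapes are assumptions made for illustration.

```python
import numpy as np

def normalize_adj(A):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def graph_label_distribution_forward(A, X, W1, W2):
    """Two-layer GCN whose output rows are label distributions (sum to 1)."""
    A_norm = normalize_adj(A)
    H = np.maximum(A_norm @ X @ W1, 0.0)       # ReLU hidden layer
    return softmax(A_norm @ H @ W2)

def kl_distribution_loss(P_true, P_pred, eps=1e-12):
    """Mean KL(P_true || P_pred), the usual LDL training objective."""
    return np.mean(np.sum(P_true * (np.log(P_true + eps) - np.log(P_pred + eps)), axis=1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, h, c = 6, 8, 16, 4                    # nodes, features, hidden units, labels
    A = (rng.random((n, n)) < 0.3).astype(float)
    A = np.maximum(A, A.T)                      # undirected instance-instance graph
    X = rng.normal(size=(n, d))
    W1, W2 = 0.1 * rng.normal(size=(d, h)), 0.1 * rng.normal(size=(h, c))
    P_true = softmax(rng.normal(size=(n, c)))   # ground-truth label distributions
    P_pred = graph_label_distribution_forward(A, X, W1, W2)
    print("KL loss:", kl_distribution_loss(P_true, P_pred))
```

The point of the graph propagation step is that a node's predicted label distribution is informed by its neighbors, which is the instance-instance dependency the abstract argues plain LDL ignores.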
\ No newline at end of file diff --git a/data/2024/aaai/GLH-Water: A Large-Scale Dataset for Global Surface Water Detection in Large-Size Very-High-Resolution Satellite Imagery b/data/2024/aaai/GLH-Water: A Large-Scale Dataset for Global Surface Water Detection in Large-Size Very-High-Resolution Satellite Imagery new file mode 100644 index 0000000000..1906dac709 --- /dev/null +++ b/data/2024/aaai/GLH-Water: A Large-Scale Dataset for Global Surface Water Detection in Large-Size Very-High-Resolution Satellite Imagery @@ -0,0 +1 @@ +Global surface water detection in very-high-resolution (VHR) satellite imagery can directly serve major applications such as refined flood mapping and water resource assessment. Although achievements have been made in detecting surface water in small-size satellite images corresponding to local geographic scales, datasets and methods suitable for mapping and analyzing global surface water have yet to be explored. To encourage the development of this task and facilitate the implementation of relevant applications, we propose the GLH-water dataset that consists of 250 satellite images and 40.96 billion pixels labeled surface water annotations that are distributed globally and contain water bodies exhibiting a wide variety of types (e.g. , rivers, lakes, and ponds in forests, irrigated fields, bare areas, and urban areas). Each image is of the size 12,800 × 12,800 pixels at 0.3 meter spatial resolution. To build a benchmark for GLH-water, we perform extensive experiments employing representative surface water detection models, popular semantic segmentation models, and ultra-high resolution segmentation models. Furthermore, we also design a strong baseline with the novel pyramid consistency loss (PCL) to initially explore this challenge, increasing IoU by 2.4% over the next best baseline. Finally, we implement the cross-dataset generalization and pilot area application experiments, and the superior performance illustrates the strong generalization and practical application value of GLH-water dataset. Project page: https://jack-bo1220.github.io/project/GLH-water.html \ No newline at end of file diff --git a/data/2024/aaai/GLOP: Learning Global Partition and Local Construction for Solving Large-Scale Routing Problems in Real-Time b/data/2024/aaai/GLOP: Learning Global Partition and Local Construction for Solving Large-Scale Routing Problems in Real-Time new file mode 100644 index 0000000000..b10c4f7b6b --- /dev/null +++ b/data/2024/aaai/GLOP: Learning Global Partition and Local Construction for Solving Large-Scale Routing Problems in Real-Time @@ -0,0 +1 @@ +The recent end-to-end neural solvers have shown promise for small-scale routing problems but suffered from limited real-time scaling-up performance. This paper proposes GLOP (Global and Local Optimization Policies), a unified hierarchical framework that efficiently scales toward large-scale routing problems. GLOP hierarchically partitions large routing problems into Travelling Salesman Problems (TSPs) and TSPs into Shortest Hamiltonian Path Problems. For the first time, we hybridize non-autoregressive neural heuristics for coarse-grained problem partitions and autoregressive neural heuristics for fine-grained route constructions, leveraging the scalability of the former and the meticulousness of the latter. Experimental results show that GLOP achieves competitive and state-of-the-art real-time performance on large-scale routing problems, including TSP, ATSP, CVRP, and PCTSP. 
Our code is available at: https://github.com/henry-yeh/GLOP. \ No newline at end of file diff --git a/data/2024/aaai/GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval b/data/2024/aaai/GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval new file mode 100644 index 0000000000..d61e0d3a60 --- /dev/null +++ b/data/2024/aaai/GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval @@ -0,0 +1 @@ +Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead. To solve the efficiency problem of PRVR methods, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly. During frame interactions, we incorporate Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames instead of the whole video. Then generated representations will contain multi-scale clip information, achieving implicit clip modeling. In addition, PRVR methods ignore semantic differences between text queries relevant to the same video, leading to a sparse embedding space. We propose a query diverse loss to distinguish these text queries, making the embedding space more intensive and contain more semantic information. Extensive experiments on three large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer. \ No newline at end of file diff --git a/data/2024/aaai/GMP-AR: Granularity Message Passing and Adaptive Reconciliation for Temporal Hierarchy Forecasting b/data/2024/aaai/GMP-AR: Granularity Message Passing and Adaptive Reconciliation for Temporal Hierarchy Forecasting new file mode 100644 index 0000000000..407915aecc --- /dev/null +++ b/data/2024/aaai/GMP-AR: Granularity Message Passing and Adaptive Reconciliation for Temporal Hierarchy Forecasting @@ -0,0 +1 @@ +Time series forecasts of different temporal granularity are widely used in real-world applications, e.g., sales prediction in days and weeks for making different inventory plans. However, these tasks are usually solved separately without ensuring coherence, which is crucial for aligning downstream decisions. Previous works mainly focus on ensuring coherence with some straightforward methods, e.g., aggregation from the forecasts of fine granularity to the coarse ones, and allocation from the coarse granularity to the fine ones. These methods merely take the temporal hierarchical structure to maintain coherence without improving the forecasting accuracy. In this paper, we propose a novel granularity message-passing mechanism (GMP) that leverages temporal hierarchy information to improve forecasting performance and also utilizes an adaptive reconciliation (AR) strategy to maintain coherence without performance loss. Furthermore, we introduce an optimization module to achieve task-based targets while adhering to more real-world constraints. Experiments on real-world datasets demonstrate that our framework (GMP-AR) achieves superior performances on temporal hierarchical forecasting tasks compared to state-of-the-art methods. 
In addition, our framework has been successfully applied to a real-world task of payment traffic management in Alipay by integrating with the task-based optimization module. \ No newline at end of file diff --git a/data/2024/aaai/GO-DICE: Goal-Conditioned Option-Aware Offline Imitation Learning via Stationary Distribution Correction Estimation b/data/2024/aaai/GO-DICE: Goal-Conditioned Option-Aware Offline Imitation Learning via Stationary Distribution Correction Estimation new file mode 100644 index 0000000000..9f1e1e3687 --- /dev/null +++ b/data/2024/aaai/GO-DICE: Goal-Conditioned Option-Aware Offline Imitation Learning via Stationary Distribution Correction Estimation @@ -0,0 +1 @@ +Offline imitation learning (IL) refers to learning expert behavior solely from demonstrations, without any additional interaction with the environment. Despite significant advances in offline IL, existing techniques find it challenging to learn policies for long-horizon tasks and require significant re-training when task specifications change. Towards addressing these limitations, we present GO-DICE, an offline IL technique for goal-conditioned long-horizon sequential tasks. GO-DICE discerns a hierarchy of sub-tasks from demonstrations and uses these to learn separate policies for sub-task transitions and action execution, respectively; this hierarchical policy learning facilitates long-horizon reasoning. Inspired by the expansive DICE family of techniques, policy learning at both levels transpires within the space of stationary distributions. Further, both policies are learnt with goal conditioning to minimize the need for retraining when task goals change. Experimental results substantiate that GO-DICE outperforms recent baselines, as evidenced by a marked improvement in the completion rate of increasingly challenging pick-and-place MuJoCo robotic tasks. GO-DICE is also capable of leveraging imperfect demonstrations and partial task segmentation when available, both of which boost task performance relative to learning from expert demonstrations alone. \ No newline at end of file diff --git a/data/2024/aaai/GOALNET: Interleaving Neural Goal Predicate Inference with Classical Planning for Generalization in Robot Instruction Following b/data/2024/aaai/GOALNET: Interleaving Neural Goal Predicate Inference with Classical Planning for Generalization in Robot Instruction Following new file mode 100644 index 0000000000..5230053147 --- /dev/null +++ b/data/2024/aaai/GOALNET: Interleaving Neural Goal Predicate Inference with Classical Planning for Generalization in Robot Instruction Following @@ -0,0 +1 @@ +Our goal is to enable a robot to learn how to sequence its actions to perform high-level tasks specified as natural language instructions, given successful demonstrations from a human partner. Our novel neuro-symbolic solution GOALNET builds an iterative two-step approach that interleaves (i) inferring the next subgoal predicate implied by the language instruction, for a given world state, and (ii) synthesizing a feasible subgoal-reaching plan from that state. The agent executes the plan, and the two steps are repeated. GOALNET combines (i) learning, where dense representations are acquired for the language instruction and the world state via a neural network prediction model, enabling generalization to novel settings, and (ii) planning, where the cause-effect modeling by a classical planner eschews irrelevant predicates, facilitating multi-stage decision making in large domains. 
GOALNET obtains a 78% improvement in the goal-reaching rate in comparison to several state-of-the-art approaches on benchmark data with multi-stage instructions. Further, GOALNET can generalize to novel instructions for scenes with unseen objects. Source code is available at https://github.com/reail-iitd/goalnet. \ No newline at end of file diff --git a/data/2024/aaai/GOODAT: Towards Test-Time Graph Out-of-Distribution Detection b/data/2024/aaai/GOODAT: Towards Test-Time Graph Out-of-Distribution Detection new file mode 100644 index 0000000000..1e65db39e5 --- /dev/null +++ b/data/2024/aaai/GOODAT: Towards Test-Time Graph Out-of-Distribution Detection @@ -0,0 +1 @@ +Graph neural networks (GNNs) have found widespread application in modeling graph data across diverse domains. While GNNs excel in scenarios where the testing data shares the distribution of their training counterparts (in-distribution, ID), they often exhibit incorrect predictions when confronted with samples from an unfamiliar distribution (out-of-distribution, OOD). To identify and reject OOD samples with GNNs, recent studies have explored graph OOD detection, often focusing on training a specific model or modifying the data on top of a well-trained GNN. Despite their effectiveness, these methods demand heavy training resources and costs, as they need to optimize the GNN-based models on training data. Moreover, their reliance on modifying the original GNNs and accessing training data further restricts their universality. To this end, this paper introduces a method to detect Graph Out-of-Distribution At Test-time (namely GOODAT), a data-centric, unsupervised, and plug-and-play solution that operates independently of training data and modifications of the GNN architecture. With a lightweight graph masker, GOODAT can learn informative subgraphs from test samples, enabling the capture of distinct graph patterns between OOD and ID samples. To optimize the graph masker, we meticulously design three unsupervised objective functions based on the graph information bottleneck principle, motivating the masker to capture compact yet informative subgraphs for OOD detection. Comprehensive evaluations confirm that our GOODAT method outperforms state-of-the-art benchmarks across a variety of real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting b/data/2024/aaai/GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting new file mode 100644 index 0000000000..1d00724890 --- /dev/null +++ b/data/2024/aaai/GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting @@ -0,0 +1 @@ +Time series forecasting is an essential area of machine learning with a wide range of real-world applications. Most previous forecasting models aim to capture dynamic characteristics from uni-modal numerical historical data. Although extra knowledge can boost time series forecasting performance, such information is hard to collect. In addition, how to fuse the multimodal information is non-trivial. In this paper, we first propose a general principle for collecting the corresponding textual information from different data sources with the help of modern large language models (LLMs). Then, we propose a prompt-based LLM framework, named GPT4MTS, that utilizes both the numerical data and the textual information simultaneously. 
In practice, we propose a GDELT-based multimodal time series dataset for news impact forecasting, which provides a concise and well-structured version of time series dataset with textual information for further research in communication. Through extensive experiments, we demonstrate the effectiveness of our proposed method on forecasting tasks with extra-textual information. \ No newline at end of file diff --git a/data/2024/aaai/GSDD: Generative Space Dataset Distillation for Image Super-resolution b/data/2024/aaai/GSDD: Generative Space Dataset Distillation for Image Super-resolution new file mode 100644 index 0000000000..e097c988f6 --- /dev/null +++ b/data/2024/aaai/GSDD: Generative Space Dataset Distillation for Image Super-resolution @@ -0,0 +1 @@ +Single image super-resolution (SISR), especially in the real world, usually builds a large amount of LR-HR image pairs to learn representations that contain rich textural and structural information. However, relying on massive data for model training not only reduces training efficiency, but also causes heavy data storage burdens. In this paper, we attempt a pioneering study on dataset distillation (DD) for SISR problems to explore how data could be slimmed and compressed for the task. Unlike previous coreset selection methods which select a few typical examples directly from the original data, we remove the limitation that the selected data cannot be further edited, and propose to synthesize and optimize samples to preserve more task-useful representations. Concretely, by utilizing pre-trained GANs as a suitable approximation of realistic data distribution, we propose GSDD, which distills data in a latent generative space based on GAN-inversion techniques. By optimizing them to match with the practical data distribution in an informative feature space, the distilled data could then be synthesized. Experimental results demonstrate that when trained with our distilled data, GSDD can achieve comparable performance to the state-of-the-art (SOTA) SISR algorithms, while a nearly ×8 increase in training efficiency and a saving of almost 93.2% data storage space can be realized. Further experiments on challenging real-world data also demonstrate the promising generalization ability of GSDD. \ No newline at end of file diff --git a/data/2024/aaai/GSENet: Global Semantic Enhancement Network for Lane Detection b/data/2024/aaai/GSENet: Global Semantic Enhancement Network for Lane Detection new file mode 100644 index 0000000000..aac547aa78 --- /dev/null +++ b/data/2024/aaai/GSENet: Global Semantic Enhancement Network for Lane Detection @@ -0,0 +1 @@ +Lane detection is the cornerstone of autonomous driving. Although existing methods have achieved promising results, there are still limitations in addressing challenging scenarios such as abnormal weather, occlusion, and curves. These scenarios with low visibility usually require to rely on the broad information of the entire scene provided by global semantics and local texture information to predict the precise position and shape of the lane lines. In this paper, we propose a Global Semantic Enhancement Network for lane detection, which involves a complete set of systems for feature extraction and global features transmission. Traditional methods for global feature extraction usually require deep convolution layer stacks. 
However, this approach of obtaining global features solely through a larger receptive field not only fails to capture precise global features but also leads to an overly deep model, which results in slow inference speed. To address these challenges, we propose a novel operation called the Global feature Extraction Module (GEM). Additionally, we introduce the Top Layer Auxiliary Module (TLAM) as a channel for feature distillation, which facilitates a bottom-up transmission of global features. Furthermore, we introduce two novel loss functions: the Angle Loss, which accounts for the angle between predicted and ground truth lanes, and the Generalized Line IoU Loss, which handles scenarios where the predicted lanes deviate significantly from the ground truth under harsh conditions. The experimental results reveal that the proposed method exhibits remarkable superiority over the current state-of-the-art techniques for lane detection. Our code is available at: https://github.com/crystal250/GSENet. \ No newline at end of file diff --git a/data/2024/aaai/GSN: Generalisable Segmentation in Neural Radiance Field b/data/2024/aaai/GSN: Generalisable Segmentation in Neural Radiance Field new file mode 100644 index 0000000000..c83fa6d29b --- /dev/null +++ b/data/2024/aaai/GSN: Generalisable Segmentation in Neural Radiance Field @@ -0,0 +1 @@ +Traditional Radiance Field (RF) representations capture details of a specific scene and must be trained afresh on each scene. Semantic feature fields have been added to RFs to facilitate several segmentation tasks. Generalised RF representations learn the principles of view interpolation. A generalised RF can render new views of an unknown and untrained scene, given a few views. We present a way to distil feature fields into the generalised GNT representation. Our GSN representation generates new views of unseen scenes on the fly along with consistent, per-pixel semantic features. This enables multi-view segmentation of arbitrary new scenes. We show different semantic features being distilled into generalised RFs. Our multi-view segmentation results are on par with methods that use traditional RFs. GSN closes the gap between standard and generalisable RF methods significantly. Project Page: https://vinayak-vg.github.io/GSN/ \ No newline at end of file diff --git a/data/2024/aaai/GSO-Net: Grid Surface Optimization via Learning Geometric Constraints b/data/2024/aaai/GSO-Net: Grid Surface Optimization via Learning Geometric Constraints new file mode 100644 index 0000000000..8e6a6e7c07 --- /dev/null +++ b/data/2024/aaai/GSO-Net: Grid Surface Optimization via Learning Geometric Constraints @@ -0,0 +1 @@ +In the context of surface representations, we find a natural structural similarity between grid surface data and image data. Motivated by this observation, we propose a novel approach: encoding grid surfaces as geometric images and using image processing methods to address surface optimization-related problems. As a result, we have created the first dataset for grid surface optimization and devised a learning-based grid surface optimization network specifically tailored to geometric images, addressing the surface optimization problem through a data-driven paradigm of learning geometric constraints. We conduct extensive experiments on developable surface optimization, surface flattening, and surface denoising tasks using the designed network and datasets. 
The results demonstrate that our proposed method not only addresses the surface optimization problem better than traditional numerical optimization methods, especially for complex surfaces, but also boosts the optimization speed by multiple orders of magnitude. This pioneering study successfully applies deep learning methods to the field of surface optimization and provides a new solution paradigm for similar tasks, which will provide inspiration and guidance for future developments in the field of discrete surface optimization. The code and dataset are available at https://github.com/chaoyunwang/GSO-Net. \ No newline at end of file diff --git a/data/2024/aaai/G^2SAM: Graph-Based Global Semantic Awareness Method for Multimodal Sarcasm Detection b/data/2024/aaai/G^2SAM: Graph-Based Global Semantic Awareness Method for Multimodal Sarcasm Detection new file mode 100644 index 0000000000..eeb020b18f --- /dev/null +++ b/data/2024/aaai/G^2SAM: Graph-Based Global Semantic Awareness Method for Multimodal Sarcasm Detection @@ -0,0 +1 @@ +Multimodal sarcasm detection, aiming to detect the ironic sentiment within multimodal social data, has gained substantial popularity in both the natural language processing and computer vision communities. Recently, graph-based studies by drawing sentimental relations to detect multimodal sarcasm have made notable advancements. However, they have neglected exploiting graph-based global semantic congruity from existing instances to facilitate the prediction, which ultimately hinders the model's performance. In this paper, we introduce a new inference paradigm that leverages global graph-based semantic awareness to handle this task. Firstly, we construct fine-grained multimodal graphs for each instance and integrate them into semantic space to draw graph-based relations. During inference, we leverage global semantic congruity to retrieve k-nearest neighbor instances in semantic space as references for voting on the final prediction. To enhance the semantic correlation of representation in semantic space, we also introduce label-aware graph contrastive learning to further improve the performance. Experimental results demonstrate that our model achieves state-of-the-art (SOTA) performance in multimodal sarcasm detection. The code will be available at https://github.com/upccpu/G2SAM. \ No newline at end of file diff --git a/data/2024/aaai/GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers b/data/2024/aaai/GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers new file mode 100644 index 0000000000..6c2fa42778 --- /dev/null +++ b/data/2024/aaai/GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers @@ -0,0 +1 @@ +Transformers based on attention mechanisms exhibit vulnerability to adversarial examples, posing a substantial threat to the security of their applications. Aiming to solve this problem, the concept of robustness certification is introduced to formally ascertain the presence of any adversarial example within a specified region surrounding a given sample. However, prior works have neglected the dependencies among inputs of softmax (the most complex function in attention mechanisms) during linear relaxations. This oversight has consequently led to imprecise certification results. In this work, we introduce GaLileo, a general linear relaxation framework designed to certify the robustness of Transformers. 
GaLileo effectively surmounts the trade-off between precision and efficiency in robustness certification through our innovative n-dimensional relaxation approach. Notably, our relaxation technique represents a pioneering effort as the first linear relaxation for n-dimensional functions such as softmax. Our novel approach successfully transcends the challenges posed by the curse of dimensionality inherent in linear relaxations, thereby enhancing linear bounds by incorporating input dependencies. Our evaluations encompassed a thorough analysis utilizing the SST and Yelp datasets along with diverse Transformers of different depths and widths. The experimental results demonstrate that, as compared to the baseline method CROWN-BaF, GaLileo achieves up to 3.24 times larger certified radii while requiring similar running times. Additionally, GaLileo successfully attains certification for Transformers' robustness against multi-word lp perturbations, marking a notable accomplishment in this field. \ No newline at end of file diff --git a/data/2024/aaai/Game-Theoretic Unlearnable Example Generator b/data/2024/aaai/Game-Theoretic Unlearnable Example Generator new file mode 100644 index 0000000000..c28d3f24cc --- /dev/null +++ b/data/2024/aaai/Game-Theoretic Unlearnable Example Generator @@ -0,0 +1 @@ +Unlearnable example attacks are data poisoning attacks aiming to degrade the clean test accuracy of deep learning by adding imperceptible perturbations to the training samples, which can be formulated as a bi-level optimization problem. However, directly solving this optimization problem is intractable for deep neural networks. In this paper, we investigate unlearnable example attacks from a game-theoretic perspective, by formulating the attack as a nonzero sum Stackelberg game. First, the existence of game equilibria is proved under the normal setting and the adversarial training setting. It is shown that the game equilibrium gives the most powerful poison attack in that the victim has the lowest test accuracy among all networks within the same hypothesis space when certain loss functions are used. Second, we propose a novel attack method, called the Game Unlearnable Example (GUE), which has three main gradients. (1) The poisons are obtained by directly solving the equilibrium of the Stackelberg game with a first-order algorithm. (2) We employ an autoencoder-like generative network model as the poison attacker. (3) A novel payoff function is introduced to evaluate the performance of the poison. Comprehensive experiments demonstrate that GUE can effectively poison the model in various scenarios. Furthermore, the GUE still works by using a relatively small percentage of the training data to train the generator, and the poison generator can generalize to unseen data well. Our implementation code can be found at https://github.com/hong-xian/gue. \ No newline at end of file diff --git a/data/2024/aaai/Gated Attention Coding for Training High-Performance and Efficient Spiking Neural Networks b/data/2024/aaai/Gated Attention Coding for Training High-Performance and Efficient Spiking Neural Networks new file mode 100644 index 0000000000..e00b8e4bbc --- /dev/null +++ b/data/2024/aaai/Gated Attention Coding for Training High-Performance and Efficient Spiking Neural Networks @@ -0,0 +1 @@ +Spiking neural networks (SNNs) are emerging as an energy-efficient alternative to traditional artificial neural networks (ANNs) due to their unique spike-based event-driven nature. 
Coding is crucial in SNNs as it converts external input stimuli into spatio-temporal feature sequences. However, most existing deep SNNs rely on direct coding, which generates weak spike representations and lacks the temporal dynamics inherent in human vision. Hence, we introduce Gated Attention Coding (GAC), a plug-and-play module that leverages the multi-dimensional gated attention unit to efficiently encode inputs into powerful representations before feeding them into the SNN architecture. GAC functions as a preprocessing layer that does not disrupt the spike-driven nature of the SNN, making it amenable to efficient neuromorphic hardware implementation with minimal modifications. Through a theoretical analysis based on an observer model, we demonstrate that GAC's attention mechanism improves temporal dynamics and coding efficiency. Experiments on the CIFAR10/100 and ImageNet datasets demonstrate that GAC achieves state-of-the-art accuracy with remarkable efficiency. Notably, we improve top-1 accuracy by 3.10% on CIFAR100 with only 6 time steps and by 1.07% on ImageNet, while reducing energy usage to 66.9% of previous works. To the best of our knowledge, this is the first exploration of an attention-based dynamic coding scheme in deep SNNs, with exceptional effectiveness and efficiency on large-scale datasets. Code is available at https://github.com/bollossom/GAC. \ No newline at end of file diff --git a/data/2024/aaai/Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding b/data/2024/aaai/Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding new file mode 100644 index 0000000000..06c811e30e --- /dev/null +++ b/data/2024/aaai/Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding @@ -0,0 +1 @@ +In the weakly supervised temporal video grounding literature, previous methods use predetermined single Gaussian proposals, which lack the ability to express the diverse events described by the sentence query. To enhance the expression ability of a proposal, we propose a Gaussian mixture proposal (GMP) that can depict arbitrary shapes by learning the importance, centroid, and range of every Gaussian in the mixture. In learning the GMP, each Gaussian is not trained in a feature space but is implemented over a temporal location. Thus, the conventional feature-based learning for Gaussian mixture models is not valid in our case. In this special setting, to learn a moderately coupled Gaussian mixture that captures diverse events, we newly propose a pull-push learning scheme using pulling and pushing losses, each of which plays a role opposite to the other. The effects of the components of our scheme are verified in depth with extensive ablation studies, and the overall scheme achieves state-of-the-art performance. Our code is available at https://github.com/sunoh-kim/pps. \ No newline at end of file diff --git a/data/2024/aaai/Gaussian Process Neural Additive Models b/data/2024/aaai/Gaussian Process Neural Additive Models new file mode 100644 index 0000000000..68db548611 --- /dev/null +++ b/data/2024/aaai/Gaussian Process Neural Additive Models @@ -0,0 +1 @@ +Deep neural networks have revolutionized many fields, but their black-box nature also occasionally prevents their wider adoption in fields such as healthcare and finance, where interpretable and explainable models are required. 
The recent development of Neural Additive Models (NAMs) poses a major step in the direction of interpretable deep learning for tabular datasets. In this paper, we propose a new subclass of NAMs that utilize a single-layer neural network construction of the Gaussian process via random Fourier features, which we call Gaussian Process Neural Additive Models (GP-NAM). GP-NAMs have the advantage of a convex objective function and number of trainable parameters that grows linearly with feature dimensions. It suffers no loss in performance compared with deeper NAM approaches because GPs are well-suited to learning complex non-parametric univariate functions. We demonstrate the performance of GP-NAM on several tabular datasets, showing that it achieves comparable performance in both classification and regression tasks with a massive reduction in the number of parameters. \ No newline at end of file diff --git a/data/2024/aaai/Gaze Target Detection by Merging Human Attention and Activity Cues b/data/2024/aaai/Gaze Target Detection by Merging Human Attention and Activity Cues new file mode 100644 index 0000000000..eb6febc727 --- /dev/null +++ b/data/2024/aaai/Gaze Target Detection by Merging Human Attention and Activity Cues @@ -0,0 +1 @@ +Despite achieving impressive performance, current methods for detecting gaze targets, which depend on visual saliency and spatial scene geometry, continue to face challenges when it comes to detecting gaze targets within intricate image backgrounds. One of the primary reasons for this lies in the oversight of the intricate connection between human attention and activity cues. In this study, we introduce an innovative approach that amalgamates the visual saliency detection with the body-part & object interaction both guided by the soft gaze attention. This fusion enables precise and dependable detection of gaze targets amidst intricate image backgrounds. Our approach attains state-of-the-art performance on both the Gazefollow benchmark and the GazeVideoAttn benchmark. In comparison to recent methods that rely on intricate 3D reconstruction of a single input image, our approach, which solely leverages 2D image information, still exhibits a substantial lead across all evaluation metrics, positioning it closer to human-level performance. These outcomes underscore the potent effectiveness of our proposed method in the gaze target detection task. \ No newline at end of file diff --git a/data/2024/aaai/Gaze from Origin: Learning for Generalized Gaze Estimation by Embedding the Gaze Frontalization Process b/data/2024/aaai/Gaze from Origin: Learning for Generalized Gaze Estimation by Embedding the Gaze Frontalization Process new file mode 100644 index 0000000000..94dad26a63 --- /dev/null +++ b/data/2024/aaai/Gaze from Origin: Learning for Generalized Gaze Estimation by Embedding the Gaze Frontalization Process @@ -0,0 +1 @@ +Gaze estimation aims to accurately estimate the direction or position at which a person is looking. With the development of deep learning techniques, a number of gaze estimation methods have been proposed and achieved state-of-the-art performance. However, these methods are limited to within-dataset settings, whose performance drops when tested on unseen datasets. We argue that this is caused by infinite and continuous gaze labels. To alleviate this problem, we propose using gaze frontalization as an auxiliary task to constrain gaze estimation. 
Based on this, we propose a novel gaze domain generalization framework named Gaze Frontalization-based Auxiliary Learning (GFAL) Framework which embeds the gaze frontalization process, i.e., guiding the feature so that the eyeball can rotate and look at the front (camera), without any target domain information during training. Experimental results show that our proposed framework is able to achieve state-of-the-art performance on gaze domain generalization task, which is competitive with or even superior to the SOTA gaze unsupervised domain adaptation methods. \ No newline at end of file diff --git a/data/2024/aaai/Gaze-Based Interaction Adaptation for People with Involuntary Head Movements (Student Abstract) b/data/2024/aaai/Gaze-Based Interaction Adaptation for People with Involuntary Head Movements (Student Abstract) new file mode 100644 index 0000000000..ef284edea5 --- /dev/null +++ b/data/2024/aaai/Gaze-Based Interaction Adaptation for People with Involuntary Head Movements (Student Abstract) @@ -0,0 +1 @@ +Gaze estimation is an important research area in computer vision and machine learning. Eye-tracking and gaze-based interactions have made assistive technology (AT) more accessible to people with physical limitations. However, a non-negligible proportion of existing AT users, including those having dyskinetic cerebral palsy (CP) or severe intellectual disabilities (ID), have difficulties in using eye trackers due to their involuntary body movements. In this paper, we propose an adaptation method pertaining to head movement prediction and fixation smoothing to stabilize our target users' gaze points on the screen and improve their user experience (UX) in gaze-based interaction. Our empirical experimentation shows that our method significantly shortens the users' selection time and increases their selection accuracy. \ No newline at end of file diff --git a/data/2024/aaai/General Commerce Intelligence: Glocally Federated NLP-Based Engine for Privacy-Preserving and Sustainable Personalized Services of Multi-Merchants b/data/2024/aaai/General Commerce Intelligence: Glocally Federated NLP-Based Engine for Privacy-Preserving and Sustainable Personalized Services of Multi-Merchants new file mode 100644 index 0000000000..13391abfd9 --- /dev/null +++ b/data/2024/aaai/General Commerce Intelligence: Glocally Federated NLP-Based Engine for Privacy-Preserving and Sustainable Personalized Services of Multi-Merchants @@ -0,0 +1 @@ +One of the most crucial capabilities in the commercial sector is a personalized prediction of a customer's next purchase. We present a novel method of creating a commerce intelligence engine that caters to multiple merchants intended for the UB Platform, managed by e-payment company Harex InfoTech. To cultivate this intelligence, we utilized payment receipt data and created a Natural Language Processing (NLP)-based commerce model using a Transformer to accommodate multinational and merchant trade. Our model, called General Commerce Intelligence (GCI), provides a range of services for merchants, including product recommendations, product brainstorming, product bundling, event promotions, collaborative marketing, target marketing, and demand fore-casting etc. To bolster user privacy and foster sustainable business collaboration, especially among micro-, small-, and medium-sized enterprises (MSMEs), the GCI model was trained through federated learning, especially with glocalization. 
This study delves into the structure, development, and assessment of GCI, showcasing its transformative capacity to implement User Centric AI and re-shape the global commerce landscape to benefit MSMEs. \ No newline at end of file diff --git a/data/2024/aaai/Generalisation through Negation and Predicate Invention b/data/2024/aaai/Generalisation through Negation and Predicate Invention new file mode 100644 index 0000000000..5cb93d8921 --- /dev/null +++ b/data/2024/aaai/Generalisation through Negation and Predicate Invention @@ -0,0 +1,2 @@ +The ability to generalise from a small number of examples is a fundamental challenge in machine learning. To tackle this challenge, we introduce an inductive logic programming (ILP) approach that combines negation and predicate invention. +Combining these two features allows an ILP system to generalise better by learning rules with universally quantified body-only variables. We implement our idea in NOPI, which can learn normal logic programs with predicate invention, including Datalog programs with stratified negation. Our experimental results on multiple domains show that our approach can improve predictive accuracies and learning times. \ No newline at end of file diff --git a/data/2024/aaai/Generalising Planning Environment Redesign b/data/2024/aaai/Generalising Planning Environment Redesign new file mode 100644 index 0000000000..73974e49de --- /dev/null +++ b/data/2024/aaai/Generalising Planning Environment Redesign @@ -0,0 +1 @@ +In Environment Design, one interested party seeks to affect another agent's decisions by applying changes to the environment. Most research on planning environment (re)design assumes the interested party's objective is to facilitate the recognition of goals and plans, and search over the space of environment modifications to find the minimal set of changes that simplify those tasks and optimise a particular metric. This search space is usually intractable, so existing approaches devise metric-dependent pruning techniques for performing search more efficiently. This results in approaches that are not able to generalise across different objectives and/or metrics. In this paper, we argue that the interested party could have objectives and metrics that are not necessarily related to recognising agents' goals or plans. Thus, to generalise the task of Planning Environment Redesign, we develop a general environment redesign approach that is metric-agnostic and leverages recent research on top-quality planning to efficiently redesign planning environments according to any interested party's objective and metric. Experiments over a set of environment redesign benchmarks show that our general approach outperforms existing approaches when using well-known metrics, such as facilitating the recognition of goals, as well as its effectiveness when solving environment redesign tasks that optimise a novel set of different metrics. \ No newline at end of file diff --git a/data/2024/aaai/Generalizable Fourier Augmentation for Unsupervised Video Object Segmentation b/data/2024/aaai/Generalizable Fourier Augmentation for Unsupervised Video Object Segmentation new file mode 100644 index 0000000000..7c96aa8383 --- /dev/null +++ b/data/2024/aaai/Generalizable Fourier Augmentation for Unsupervised Video Object Segmentation @@ -0,0 +1,4 @@ +The performance of existing unsupervised video object segmentation methods typically suffers from severe performance degradation on test videos when tested in out-of-distribution scenarios. 
The primary reason is that the test data in real- +world may not follow the independent and identically distribution (i.i.d.) assumption, leading to domain shift. In this paper, we propose a generalizable fourier augmentation method during training to improve the generalization ability of the model. To achieve this, we perform Fast Fourier Transform (FFT) over the intermediate spatial domain features in each layer to yield corresponding frequency representations, including amplitude components (encoding scene-aware styles such as texture, color, contrast of the scene) and phase components (encoding rich semantics). We produce a variety of style features via Gaussian sampling to augment the training data, thereby improving the generalization capability of the model. To further improve the cross-domain generalization +performance of the model, we design a phase feature update strategy via exponential moving average using phase features from past frames in an online update manner, which could help the model to learn cross-domain-invariant features. Extensive experiments show that our proposed method achieves +the state-of-the-art performance on popular benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Generalizable Policy Improvement via Reinforcement Sampling (Student Abstract) b/data/2024/aaai/Generalizable Policy Improvement via Reinforcement Sampling (Student Abstract) new file mode 100644 index 0000000000..748df5ec47 --- /dev/null +++ b/data/2024/aaai/Generalizable Policy Improvement via Reinforcement Sampling (Student Abstract) @@ -0,0 +1 @@ +Current policy gradient techniques excel in refining policies over sampled states but falter when generalizing to unseen states. To address this, we introduce Reinforcement Sampling (RS), a novel method leveraging a generalizable action value function to sample improved decisions. RS is able to improve the decision quality whenever the action value estimation is accurate. It works by improving the agent's decision on the fly on the states the agent is visiting. Compared with the historically experienced states in which conventional policy gradient methods improve the policy, the currently visited states are more relevant to the agent. Our method sufficiently exploits the generalizability of the value function on unseen states and sheds new light on the future development of generalizable reinforcement learning. \ No newline at end of file diff --git a/data/2024/aaai/Generalization Analysis of Machine Learning Algorithms via the Worst-Case Data-Generating Probability Measure b/data/2024/aaai/Generalization Analysis of Machine Learning Algorithms via the Worst-Case Data-Generating Probability Measure new file mode 100644 index 0000000000..536cb4f7fc --- /dev/null +++ b/data/2024/aaai/Generalization Analysis of Machine Learning Algorithms via the Worst-Case Data-Generating Probability Measure @@ -0,0 +1 @@ +In this paper, the worst-case probability measure over the data is introduced as a tool for characterizing the generalization capabilities of machine learning algorithms. More specifically, the worst-case probability measure is a Gibbs probability measure and the unique solution to the maximization of the expected loss under a relative entropy constraint with respect to a reference probability measure. 
Fundamental generalization metrics, such as the sensitivity of the expected loss, the sensitivity of the empirical risk, and the generalization gap are shown to have closed-form expressions involving the worst-case data-generating probability measure. Existing results for the Gibbs algorithm, such as characterizing the generalization gap as a sum of mutual information and lautum information, up to a constant factor, are recovered. A novel parallel is established between the worst-case data-generating probability measure and the Gibbs algorithm. Specifically, the Gibbs probability measure is identified as a fundamental commonality of the model space and the data space for machine learning algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Generalize for Future: Slow and Fast Trajectory Learning for CTR Prediction b/data/2024/aaai/Generalize for Future: Slow and Fast Trajectory Learning for CTR Prediction new file mode 100644 index 0000000000..063c38ed39 --- /dev/null +++ b/data/2024/aaai/Generalize for Future: Slow and Fast Trajectory Learning for CTR Prediction @@ -0,0 +1 @@ +Deep neural networks (DNNs) have achieved significant advancements in click-through rate (CTR) prediction by demonstrating strong generalization on training data. However, in real-world scenarios, the assumption of independent and identically distributed (i.i.d.) conditions, which is fundamental to this problem, is often violated due to temporal distribution shifts. This violation can lead to suboptimal model performance when optimizing empirical risk without access to future data, resulting in overfitting on the training data and convergence to a single sharp minimum. To address this challenge, we propose a novel model updating framework called Slow and Fast Trajectory Learning (SFTL) network. SFTL aims to mitigate the discrepancy between past and future domains while quickly adapting to recent changes in small temporal drifts. This mechanism entails two interactions among three complementary learners: (i) the Working Learner, which updates model parameters using modern optimizers (e.g., Adam, Adagrad) and serves as the primary learner in the recommendation system, (ii) the Slow Learner, which is updated in each temporal domain by directly assigning the model weights of the working learner, and (iii) the Fast Learner, which is updated in each iteration by assigning exponentially moving average weights of the working learner. Additionally, we propose a novel rank-based trajectory loss to facilitate interaction between the working learner and trajectory learner, aiming to adapt to temporal drift and enhance performance in the current domain compared to the past. We provide theoretical understanding and conduct extensive experiments on real-world CTR prediction datasets to validate the effectiveness and efficiency of SFTL in terms of both convergence speed and model performance. The results demonstrate the superiority of SFTL over existing approaches. \ No newline at end of file diff --git a/data/2024/aaai/Generalized Bradley-Terry Models for Score Estimation from Paired Comparisons b/data/2024/aaai/Generalized Bradley-Terry Models for Score Estimation from Paired Comparisons new file mode 100644 index 0000000000..fb42a143f8 --- /dev/null +++ b/data/2024/aaai/Generalized Bradley-Terry Models for Score Estimation from Paired Comparisons @@ -0,0 +1 @@ +Many applications, e.g. in content recommendation, sports, or recruitment, leverage the comparisons of alternatives to score those alternatives. 
The classical Bradley-Terry model and its variants have been widely used to do so. The historical model considers binary comparisons (victory/defeat) between alternatives, while more recent developments allow finer comparisons to be taken into account. In this article, we introduce a probabilistic model encompassing a broad variety of paired comparisons that can take discrete or continuous values. We do so by considering a well-behaved subset of the exponential family, which we call the family of generalized Bradley-Terry (GBT) models, as it includes the classical Bradley-Terry model and many of its variants. Remarkably, we prove that all GBT models are guaranteed to yield a strictly convex negative log-likelihood. Moreover, assuming a Gaussian prior on alternatives' scores, we prove that the maximum a posteriori (MAP) of GBT models, whose existence, uniqueness and fast computation are thus guaranteed, varies monotonically with respect to comparisons (the more A beats B, the better the score of A) and is Lipschitz-resilient with respect to each new comparison (a single new comparison can only have a bounded effect on all the estimated scores). These desirable properties make GBT models appealing for practical use. We illustrate some features of GBT models on simulations. \ No newline at end of file diff --git a/data/2024/aaai/Generalized Planning for the Abstraction and Reasoning Corpus b/data/2024/aaai/Generalized Planning for the Abstraction and Reasoning Corpus new file mode 100644 index 0000000000..966179c1aa --- /dev/null +++ b/data/2024/aaai/Generalized Planning for the Abstraction and Reasoning Corpus @@ -0,0 +1 @@ +The Abstraction and Reasoning Corpus (ARC) is a general artificial intelligence benchmark that poses difficulties for pure machine learning methods due to its requirement for fluid intelligence with a focus on reasoning and abstraction. In this work, we introduce an ARC solver, Generalized Planning for Abstract Reasoning (GPAR). It casts an ARC problem as a generalized planning (GP) problem, where a solution is formalized as a planning program with pointers. We express each ARC problem using the standard Planning Domain Definition Language (PDDL) coupled with external functions representing object-centric abstractions. We show how to scale up GP solvers via domain knowledge specific to ARC in the form of restrictions over the actions model, predicates, arguments and valid structure of planning programs. Our experiments demonstrate that GPAR outperforms the state-of-the-art solvers on the object-centric tasks of the ARC, showing the effectiveness of GP and the expressiveness of PDDL to model ARC problems. The challenges provided by the ARC benchmark motivate research to advance existing GP solvers and understand new relations with other planning computational models. Code is available at github.com/you68681/GPAR. \ No newline at end of file diff --git a/data/2024/aaai/Generalized Planning in PDDL Domains with Pretrained Large Language Models b/data/2024/aaai/Generalized Planning in PDDL Domains with Pretrained Large Language Models new file mode 100644 index 0000000000..c181202be9 --- /dev/null +++ b/data/2024/aaai/Generalized Planning in PDDL Domains with Pretrained Large Language Models @@ -0,0 +1 @@ +Recent work has considered whether large language models (LLMs) can function as planners: given a task, generate a plan. 
We investigate whether LLMs can serve as generalized planners: given a domain and training tasks, generate a program that efficiently produces plans for other tasks in the domain. In particular, we consider PDDL domains and use GPT-4 to synthesize Python programs. We also consider (1) Chain-of-Thought (CoT) summarization, where the LLM is prompted to summarize the domain and propose a strategy in words before synthesizing the program; and (2) automated debugging, where the program is validated with respect to the training tasks, and in case of errors, the LLM is re-prompted with four types of feedback. We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines. Overall, we find that GPT-4 is a surprisingly powerful generalized planner. We also conclude that automated debugging is very important, that CoT summarization has non-uniform impact, that GPT-4 is far superior to GPT-3.5, and that just two training tasks are often sufficient for strong generalization. \ No newline at end of file diff --git a/data/2024/aaai/Generalized Variational Inference via Optimal Transport b/data/2024/aaai/Generalized Variational Inference via Optimal Transport new file mode 100644 index 0000000000..327986eaa0 --- /dev/null +++ b/data/2024/aaai/Generalized Variational Inference via Optimal Transport @@ -0,0 +1 @@ +Variational Inference (VI) has gained popularity as a flexible approximate inference scheme for computing posterior distributions in Bayesian models. Original VI methods use Kullback-Leibler (KL) divergence to construct variational objectives. However, KL divergence has zero-forcing behavior and is completely agnostic to the metric of the underlying data distribution, resulting in poor approximations. To alleviate this issue, we propose a new variational objective by using Optimal Transport (OT) distance, which is a metric-aware divergence, to measure the difference between approximate posteriors and priors. The superior performance of the OT distance enables us to learn more accurate approximations. We further enhance the objective by gradually including the OT term using a hyperparameter λ for over-parameterized models. We develop a Variational inference method with OT (VOT), which presents a gradient-based black-box framework for solving Bayesian models, even when the density function of the approximate distribution is not available. We provide a consistency analysis of the approximate posteriors and demonstrate the practical effectiveness on Bayesian neural networks and variational autoencoders. \ No newline at end of file diff --git a/data/2024/aaai/Generalizing across Temporal Domains with Koopman Operators b/data/2024/aaai/Generalizing across Temporal Domains with Koopman Operators new file mode 100644 index 0000000000..8b9fbfce77 --- /dev/null +++ b/data/2024/aaai/Generalizing across Temporal Domains with Koopman Operators @@ -0,0 +1 @@ +In the field of domain generalization, the task of constructing a predictive model capable of generalizing to a target domain without access to target data remains challenging. This problem becomes further complicated when considering evolving dynamics between domains. While various approaches have been proposed to address this issue, a comprehensive understanding of the underlying generalization theory is still lacking. In this study, we contribute novel theoretical results showing that aligning conditional distributions leads to a reduction of the generalization bound.
Our analysis serves as a key motivation for solving the Temporal Domain Generalization (TDG) problem through the application of Koopman Neural Operators, resulting in Temporal Koopman Networks (TKNets). By employing Koopman Neural Operators, we effectively address the time-evolving distributions encountered in TDG using the principles of Koopman theory, where measurement functions are sought to establish linear transition relations between evolving domains. Through empirical evaluations conducted on synthetic and real-world datasets, we validate the effectiveness of our proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Generating Diagnostic and Actionable Explanations for Fair Graph Neural Networks b/data/2024/aaai/Generating Diagnostic and Actionable Explanations for Fair Graph Neural Networks new file mode 100644 index 0000000000..1d4386352d --- /dev/null +++ b/data/2024/aaai/Generating Diagnostic and Actionable Explanations for Fair Graph Neural Networks @@ -0,0 +1 @@ +A plethora of fair graph neural networks (GNNs) have been proposed to promote algorithmic fairness for high-stakes real-life contexts. Meanwhile, explainability is generally proposed to help machine learning practitioners debug models by providing human-understandable explanations. However, little explainability work has been devoted to generating explanations for fairness diagnosis in GNNs. From the explainability perspective, this paper explores two questions: what subgraph patterns cause the biased behavior of GNNs, and what actions can practitioners take to rectify the bias? By answering these two questions, this paper aims to produce compact, diagnostic, and actionable explanations that account for the discriminatory behavior. Specifically, we formulate the problem of generating diagnostic and actionable explanations as a multi-objective combinatorial optimization problem. To solve the problem, a dedicated multi-objective evolutionary algorithm is presented to ensure GNNs' explainability and fairness in one go. In particular, an influenced-nodes-based gradient approximation is developed to boost the computational efficiency of the evolutionary algorithm. We provide a theoretical analysis to illustrate the effectiveness of the proposed framework. Extensive experiments have been conducted to demonstrate the superiority of the proposed method in terms of classification performance, fairness, and interpretability.
We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space, using a small reference set of images, a technique that we call SeedSelect. SeedSelect does not require retraining or finetuning the diffusion model. We assess the faithfulness, quality and diversity of SeedSelect in creating rare objects and generating complex formations like hand images, and find it consistently achieves superior performance. We further show the advantage of SeedSelect in semantic data augmentation. Generating semantically appropriate images can successfully improve performance in few-shot recognition benchmarks, for classes from the head and from the tail of the training data of diffusion models. \ No newline at end of file diff --git a/data/2024/aaai/Generating Universal Adversarial Perturbations for Quantum Classifiers b/data/2024/aaai/Generating Universal Adversarial Perturbations for Quantum Classifiers new file mode 100644 index 0000000000..16fa665f3d --- /dev/null +++ b/data/2024/aaai/Generating Universal Adversarial Perturbations for Quantum Classifiers @@ -0,0 +1 @@ +Quantum Machine Learning (QML) has emerged as a promising field of research, aiming to leverage the capabilities of quantum computing to enhance existing machine learning methodologies. Recent studies have revealed that, like their classical counterparts, QML models based on Parametrized Quantum Circuits (PQCs) are also vulnerable to adversarial attacks. Moreover, the existence of Universal Adversarial Perturbations (UAPs) in the quantum domain has been demonstrated theoretically in the context of quantum classifiers. In this work, we introduce QuGAP: a novel framework for generating UAPs for quantum classifiers. We conceptualize the notion of additive UAPs for PQC-based classifiers and theoretically demonstrate their existence. We then utilize generative models (QuGAP-A) to craft additive UAPs and experimentally show that quantum classifiers are susceptible to such attacks. Moreover, we formulate a new method for generating unitary UAPs (QuGAP-U) using quantum generative models and a novel loss function based on fidelity constraints. We evaluate the performance of the proposed framework and show that our method achieves state-of-the-art misclassification rates, while maintaining high fidelity between legitimate and adversarial samples. \ No newline at end of file diff --git a/data/2024/aaai/Generation of Visual Representations for Multi-Modal Mathematical Knowledge b/data/2024/aaai/Generation of Visual Representations for Multi-Modal Mathematical Knowledge new file mode 100644 index 0000000000..d24523fd07 --- /dev/null +++ b/data/2024/aaai/Generation of Visual Representations for Multi-Modal Mathematical Knowledge @@ -0,0 +1 @@ +In this paper we introduce MaRE, a tool designed to generate representations in multiple modalities for a given mathematical problem while ensuring the correctness and interpretability of the transformations between different representations. The theoretical foundation for this tool is Representational Systems Theory (RST), a mathematical framework for studying the structure and transformations of representations. In MaRE’s web front-end user interface, a set of probability equations in Bayesian Notation can be rigorously transformed into Area Diagrams, Contingency Tables, and Probability Trees with just one click, utilising a back-end engine based on RST. 
A table of the cognitive costs that a representation places on a particular user profile, based on the cognitive Representational Interpretive Structure Theory (RIST), is produced at the same time. MaRE is general and domain independent, applicable to other representations encoded in RST. It may enhance mathematical education and research, facilitating multi-modal knowledge representation and discovery. \ No newline at end of file diff --git a/data/2024/aaai/Generative Calibration of Inaccurate Annotation for Label Distribution Learning b/data/2024/aaai/Generative Calibration of Inaccurate Annotation for Label Distribution Learning new file mode 100644 index 0000000000..74b8d56319 --- /dev/null +++ b/data/2024/aaai/Generative Calibration of Inaccurate Annotation for Label Distribution Learning @@ -0,0 +1 @@ +Label distribution learning (LDL) is an effective learning paradigm for handling label ambiguity. Applying LDL typically requires datasets annotated with label distributions. However, obtaining supervised data for LDL is a challenging task. Due to the randomness of label annotation, the annotator can produce inaccurate annotation results for an instance, affecting the accuracy and generalization ability of the LDL model. To address this problem, we propose a generative approach to calibrate the inaccurate annotation for LDL using variational inference techniques. Specifically, we assume that instances with similar features share similar latent label distributions. The feature vectors and label distributions are generated by a Gaussian mixture and a Dirichlet mixture, respectively. The relationship between them is established through a shared categorical variable, which effectively utilizes the label distributions of instances with similar features and achieves a more accurate label distribution through the generative approach. Furthermore, we use a confusion matrix to model the factors that contribute to the inaccuracy during the annotation process, which captures the relationship between label distributions and inaccurate label distributions. Finally, the label distribution is used to calibrate the available information in the noisy dataset to obtain the ground-truth label distribution. \ No newline at end of file diff --git a/data/2024/aaai/Generative Model Perception Rectification Algorithm for Trade-Off between Diversity and Quality b/data/2024/aaai/Generative Model Perception Rectification Algorithm for Trade-Off between Diversity and Quality new file mode 100644 index 0000000000..98dd3e6d2f --- /dev/null +++ b/data/2024/aaai/Generative Model Perception Rectification Algorithm for Trade-Off between Diversity and Quality @@ -0,0 +1 @@ +How to balance the diversity and quality of results from generative models through perception rectification poses a significant challenge. Abnormal perception in generative models is typically caused by two factors: inadequate model structure and imbalanced data distribution. In response to this issue, we propose the dynamic model perception rectification algorithm (DMPRA) for generalized generative models. The core idea is to gain a comprehensive perception of the data in the generative model by appropriately highlighting the low-density samples in the perception space, also known as the minor group samples. The entire process can be summarized as "search-evaluation-adjustment".
To identify low-density regions in the data manifold within the perception space of generative models, we introduce a filtering method based on extended neighborhood sampling. Based on the informational value of samples from low-density regions, our proposed mechanism generates informative weights to assess the significance of these samples in correcting the models' perception. By using dynamic adjustment, DMPRA ensures simultaneous enhancement of diversity and quality in the presence of imbalanced data distribution. Experimental results indicate that the algorithm has effectively improved Generative Adversarial Nets (GANs), Normalizing Flows (Flows), Variational Auto-Encoders (VAEs), and Diffusion Models (Diffusion). \ No newline at end of file diff --git a/data/2024/aaai/Generative Model for Decision Trees b/data/2024/aaai/Generative Model for Decision Trees new file mode 100644 index 0000000000..6cbc176abe --- /dev/null +++ b/data/2024/aaai/Generative Model for Decision Trees @@ -0,0 +1 @@ +Decision trees are among the most popular supervised models due to their interpretability and knowledge representation resembling human reasoning. Commonly-used decision tree induction algorithms are based on greedy top-down strategies. Although these approaches are known to be an efficient heuristic, the resulting trees are only locally optimal and tend to have overly complex structures. On the other hand, optimal decision tree algorithms attempt to create an entire decision tree at once to achieve global optimality. We place our proposal between these approaches by designing a generative model for decision trees. Our method first learns a latent decision tree space through a variational architecture using pre-trained decision tree models. Then, it adopts a genetic procedure to explore such latent space to find a compact decision tree with good predictive performance. We compare our proposal against classical tree induction methods, optimal approaches, and ensemble models. The results show that our proposal can generate accurate and shallow, i.e., interpretable, decision trees. \ No newline at end of file diff --git a/data/2024/aaai/Generative Model-Based Feature Knowledge Distillation for Action Recognition b/data/2024/aaai/Generative Model-Based Feature Knowledge Distillation for Action Recognition new file mode 100644 index 0000000000..478f4ac688 --- /dev/null +++ b/data/2024/aaai/Generative Model-Based Feature Knowledge Distillation for Action Recognition @@ -0,0 +1 @@ +Knowledge distillation (KD), a technique widely employed in computer vision, has emerged as a de facto standard for improving the performance of small neural networks. However, prevailing KD-based approaches in video tasks primarily focus on designing loss functions and fusing cross-modal information. This overlooks the spatial-temporal feature semantics, resulting in limited advancements in model compression. Addressing this gap, our paper introduces an innovative knowledge distillation framework, with the generative model for training a lightweight student model. In particular, the framework is organized into two steps: the initial phase is Feature Representation, wherein a generative model-based attention module is trained to represent feature semantics; Subsequently, the Generative-based Feature Distillation phase encompasses both Generative Distillation and Attention Distillation, with the objective of transferring attention-based feature semantics with the generative model. 
The efficacy of our approach is demonstrated through comprehensive experiments on diverse popular datasets, which show considerable improvements on the video action recognition task. Moreover, the effectiveness of our proposed framework is validated on the more intricate video action detection task. Our code is available at https://github.com/aaai-24/Generative-based-KD. \ No newline at end of file diff --git a/data/2024/aaai/Generative Multi-Modal Knowledge Retrieval with Large Language Models b/data/2024/aaai/Generative Multi-Modal Knowledge Retrieval with Large Language Models new file mode 100644 index 0000000000..57a991fbb3 --- /dev/null +++ b/data/2024/aaai/Generative Multi-Modal Knowledge Retrieval with Large Language Models @@ -0,0 +1 @@ +Knowledge retrieval with multi-modal queries plays a crucial role in supporting knowledge-intensive multi-modal applications. However, existing methods face challenges in terms of their effectiveness and training efficiency, especially when it comes to training and integrating multiple retrievers to handle multi-modal queries. In this paper, we propose an innovative end-to-end generative framework for multi-modal knowledge retrieval. Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases, even when trained with limited data. We retrieve knowledge via a two-step process: 1) generating knowledge clues related to the queries, and 2) obtaining the relevant documents by searching databases using the knowledge clues. In particular, we first introduce an object-aware prefix-tuning technique to guide multi-grained visual learning. Then, we align multi-grained visual features into the textual feature space of the LLM, employing the LLM to capture cross-modal interactions. Subsequently, we construct instruction data with a unified format for model training. Finally, we propose a knowledge-guided generation strategy to impose prior constraints in the decoding steps, thereby promoting the generation of distinctive knowledge clues. Through experiments conducted on three benchmarks, we demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
Based on this, we conduct extensive experiments across two multi-modal tracking tasks, three baseline methods, and four challenging benchmarks. The experimental results demonstrate that the proposed generative-based fusion mechanism achieves state-of-the-art performance by setting new records on GTOT, LasHeR and RGBD1K. Code will be available at https://github.com/Zhangyong-Tang/GMMT. \ No newline at end of file diff --git a/data/2024/aaai/Generator Assisted Mixture of Experts for Feature Acquisition in Batch b/data/2024/aaai/Generator Assisted Mixture of Experts for Feature Acquisition in Batch new file mode 100644 index 0000000000..597b7c0303 --- /dev/null +++ b/data/2024/aaai/Generator Assisted Mixture of Experts for Feature Acquisition in Batch @@ -0,0 +1,2 @@ +Given a set of observations, feature acquisition is about finding the subset of unobserved features which would enhance accuracy. Such problems have been explored in a sequential setting in prior work, where the model receives feedback from every new feature acquired and chooses either to explore more features or to predict. However, sequential acquisition is not feasible in some settings where time is of the essence. We consider the problem of feature acquisition in batch, where the subset of features to be queried is chosen based on the currently observed features, then acquired as a batch, followed by prediction. We solve this problem using several technical innovations. First, we use a feature generator to draw a subset of the synthetic features for some examples, which reduces the cost of oracle queries. Second, to make the feature acquisition problem tractable for the large set of heterogeneous observed features, we partition the data into buckets by borrowing tools from locality-sensitive hashing and then train a mixture of experts model. Third, we design a tractable lower bound of the original objective.
+We use a greedy algorithm combined with model training to solve the underlying problem. Experiments on four datasets show that our approach outperforms existing methods in terms of the trade-off between accuracy and feature acquisition cost.
In our research, we view the iterative updating of molecule conformations in the diffusion process as consistent with molecular dynamics, and we introduce a novel molecule generation method named Geometric-Facilitated Molecular Diffusion (GFMDiff). For the first challenge, we introduce a Dual-track Transformer Network (DTN) to fully excavate global spatial relationships and learn high-quality representations that contribute to accurate predictions of features and geometries. As for the second challenge, we design a Geometric-facilitated Loss (GFLoss) which intervenes in the formation of bonds during training, instead of directly embedding edges into the latent space. Comprehensive experiments on current benchmarks demonstrate the superiority of GFMDiff. \ No newline at end of file diff --git a/data/2024/aaai/Geometry-Guided Domain Generalization for Monocular 3D Object Detection b/data/2024/aaai/Geometry-Guided Domain Generalization for Monocular 3D Object Detection new file mode 100644 index 0000000000..58ae33a1c2 --- /dev/null +++ b/data/2024/aaai/Geometry-Guided Domain Generalization for Monocular 3D Object Detection @@ -0,0 +1 @@ +Monocular 3D object detection (M3OD) is important for autonomous driving. However, existing deep learning-based methods easily suffer from performance degradation in real-world scenarios due to the substantial domain gap between training and testing. M3OD's domain gaps are complex, including camera intrinsic parameters, extrinsic parameters, image appearance, etc. Existing works primarily focus on the domain gaps of camera intrinsic parameters, ignoring other key factors. Moreover, at the feature level, conventional domain-invariant learning methods generally cause the negative transfer issue, as they ignore the dependency between geometry tasks and domains. To tackle these issues, in this paper, we propose MonoGDG, a geometry-guided domain generalization framework for M3OD, which effectively addresses the domain gap at both camera and feature levels. Specifically, MonoGDG consists of two major components. One is geometry-based image reprojection, which mitigates the impact of camera discrepancy by unifying intrinsic parameters, randomizing camera orientations, and unifying the field-of-view range. The other is geometry-dependent feature disentanglement, which overcomes the negative transfer problems by incorporating domain-shared and domain-specific features. Additionally, we leverage a depth-disentangled domain discriminator and a domain-aware geometry regression attention mechanism to account for the geometry-domain dependency. Extensive experiments on multiple autonomous driving benchmarks demonstrate that our method achieves state-of-the-art performance in domain generalization for M3OD.
Intelligent tutoring has attracted tremendous attention and is a particularly challenging setting for applying OPE to human-involved systems, because student subgroups can favor different pedagogical policies and because of the costly procedure in which policies have to be induced fully offline and then deployed directly in the upcoming semester. In this work, we formulate on-demand pedagogical policy selection (ODPS) to tackle the challenges for OPE in intelligent tutoring. We propose a pipeline, EduPlanner, as a concrete solution for ODPS. Our pipeline results in a theoretically unbiased estimator, and enables efficient and customized policy selection by identifying subgroups over both historical data and on-arrival initial logs. We evaluate our approach on the Probability ITS that has been used in real classrooms for over eight years. Our study shows significant improvements in students' learning outcomes with EduPlanner, especially for those in low-performing subgroups. \ No newline at end of file diff --git a/data/2024/aaai/GigaHumanDet: Exploring Full-Body Detection on Gigapixel-Level Images b/data/2024/aaai/GigaHumanDet: Exploring Full-Body Detection on Gigapixel-Level Images new file mode 100644 index 0000000000..6399908679 --- /dev/null +++ b/data/2024/aaai/GigaHumanDet: Exploring Full-Body Detection on Gigapixel-Level Images @@ -0,0 +1 @@ +Performing person detection in super-high-resolution images has been a challenging task. For such a task, modern detectors, which usually encode a box using center and width/height, struggle with accuracy due to two factors: 1) Human characteristic: people come in various postures, and the center, having a high degree of freedom, makes it difficult to capture robust visual patterns; 2) Image characteristic: due to the vast scale diversity of the input (gigapixel-level), distance regression (for width and height) is hard to pinpoint, especially for a person of substantial scale who is near the camera. To address these challenges, we propose GigaHumanDet, an innovative solution aimed at further enhancing detection accuracy for gigapixel-level images. GigaHumanDet employs the corner modeling method to avoid the potential issues of a high degree of freedom in center pinpointing. To better distinguish similar-looking persons and enforce instance consistency of corner pairs, an instance-guided learning approach is designed to capture discriminative individual semantics. Further, we devise reliable shape-aware bodyness equipped with a multi-precision strategy as the human corner matching guidance to be appropriately adapted to the single-view large scene. Experimental results on the PANDA and STCrowd datasets show the superiority and strong applicability of our design. Notably, our model achieves 82.4% in terms of AP, outperforming current state-of-the-art methods by more than 10%.
To address this issue, we propose a novel model using Graphs for Forecasting Irregularly Sampled Time Series with missing values, which we call GraFITi. GraFITi first converts the time series to a Sparsity Structure Graph, which is a sparse bipartite graph, and then reformulates the forecasting problem as an edge weight prediction task in the graph. It uses the power of Graph Neural Networks to learn the graph and predict the target edge weights. GraFITi has been tested on three real-world and one synthetic irregularly sampled time series datasets with missing values and compared with various state-of-the-art models. The experimental results demonstrate that GraFITi improves the forecasting accuracy by up to 17% and reduces the run time by up to a factor of 5 compared to the state-of-the-art forecasting models. \ No newline at end of file diff --git a/data/2024/aaai/Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation b/data/2024/aaai/Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation new file mode 100644 index 0000000000..3e25be11d6 --- /dev/null +++ b/data/2024/aaai/Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation @@ -0,0 +1 @@ +Recently, the Table Structure Recognition (TSR) task, which aims to identify table structure in machine-readable formats, has received increasing interest in the community. Despite impressive success, most methods based on a single table component cannot perform well on irregular table cases affected by not only complicated inner structure but also exterior capture distortion. In this paper, we frame this as the Complex TSR problem, where the performance degeneration of existing methods is attributable to their inefficient component usage and redundant post-processing. To mitigate it, we shift our perspective from table component extraction towards the efficient leverage of multiple components, which awaits further exploration in the field. Specifically, we propose a seminal method, termed GrabTab, equipped with a newly proposed Component Deliberator, to handle various types of tables in a unified framework. Thanks to its progressive deliberation mechanism, our GrabTab can flexibly accommodate most complex tables, with reasonable components selected but without complicated post-processing involved. Quantitative experimental results on public benchmarks demonstrate that our method significantly outperforms the state-of-the-art, especially under more challenging scenes.
The proposed method uses backpropagation with a straight-through operator on a dense DT representation, to jointly optimize all tree parameters. Our approach outperforms existing methods on binary classification benchmarks and achieves competitive results for multi-class tasks. The implementation is available under: https://github.com/s-marton/GradTree \ No newline at end of file diff --git a/data/2024/aaai/Gradient-Guided Modality Decoupling for Missing-Modality Robustness b/data/2024/aaai/Gradient-Guided Modality Decoupling for Missing-Modality Robustness new file mode 100644 index 0000000000..5d9c39be9c --- /dev/null +++ b/data/2024/aaai/Gradient-Guided Modality Decoupling for Missing-Modality Robustness @@ -0,0 +1 @@ +Multimodal learning with incomplete input data (missing modality) is very practical and challenging. In this work, we conduct an in-depth analysis of this challenge and find that modality dominance has a significant negative impact on the model training, greatly degrading the missing modality performance. Motivated by Grad-CAM, we introduce a novel indicator, gradients, to monitor and reduce modality dominance which widely exists in the missing-modality scenario. In aid of this indicator, we present a novel Gradient-guided Modality Decoupling (GMD) method to decouple the dependency on dominating modalities. Specifically, GMD removes the conflicted gradient components from different modalities to achieve this decoupling, significantly improving the performance. In addition, to flexibly handle modal-incomplete data, we design a parameter-efficient Dynamic Sharing (DS) framework which can adaptively switch on/off the network parameters based on whether one modality is available. We conduct extensive experiments on three popular multimodal benchmarks, including BraTS 2018 for medical segmentation, CMU-MOSI, and CMU-MOSEI for sentiment analysis. The results show that our method can significantly outperform the competitors, showing the effectiveness of the proposed solutions. Our code is released here: https://github.com/HaoWang420/Gradient-guided-Modality-Decoupling. \ No newline at end of file diff --git a/data/2024/aaai/Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing b/data/2024/aaai/Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing new file mode 100644 index 0000000000..fa74e25aa0 --- /dev/null +++ b/data/2024/aaai/Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing @@ -0,0 +1 @@ +GAN-based image attribute editing firstly leverages GAN Inversion to project real images into the latent space of GAN and then manipulates corresponding latent codes. Recent inversion methods mainly utilize additional high-bit features to improve image details preservation, as low-bit codes cannot faithfully reconstruct source images, leading to the loss of details. However, during editing, existing works fail to accurately complement the lost details and suffer from poor editability. The main reason is they inject all the lost details indiscriminately at one time, which inherently induces the position and quantity of details to overfit source images, resulting in inconsistent content and artifacts in edited images. This work argues that details should be gradually injected into both the reconstruction and editing process in a multi-stage coarse-to-fine manner for better detail preservation and high editability. 
Therefore, a novel dual-stream framework is proposed to accurately complement details at each stage. The Reconstruction Stream is employed to embed coarse-to-fine lost details into residual features and then adaptively add them to the GAN generator. In the Editing Stream, residual features are accurately aligned by our Selective Attention mechanism and then injected into the editing process in a multi-stage manner. Extensive experiments have shown the superiority of our framework in both reconstruction accuracy and editing quality compared with existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Gramformer: Learning Crowd Counting via Graph-Modulated Transformer b/data/2024/aaai/Gramformer: Learning Crowd Counting via Graph-Modulated Transformer new file mode 100644 index 0000000000..eb4ddbddaf --- /dev/null +++ b/data/2024/aaai/Gramformer: Learning Crowd Counting via Graph-Modulated Transformer @@ -0,0 +1 @@ +The Transformer has been popular in recent crowd counting work since it breaks the limited receptive field of traditional CNNs. However, since crowd images always contain a large number of similar patches, the self-attention mechanism in the Transformer tends to find a homogenized solution where the attention maps of almost all patches are identical. In this paper, we address this problem by proposing Gramformer: a graph-modulated transformer that enhances the network by adjusting the attention and input node features, respectively, on the basis of two different types of graphs. Firstly, an attention graph is proposed to diversify attention maps so that they attend to complementary information. The graph is built upon the dissimilarities between patches, modulating the attention in an anti-similarity fashion. Secondly, a feature-based centrality encoding is proposed to discover the centrality positions or importance of nodes. We encode them with a proposed centrality indices scheme to modulate the node features and similarity relationships. Extensive experiments on four challenging crowd counting datasets have validated the competitiveness of the proposed method. Code is available at https://github.com/LoraLinH/Gramformer. \ No newline at end of file diff --git a/data/2024/aaai/Graph Anomaly Detection via Prototype-Aware Label Propagation (Student Abstract) b/data/2024/aaai/Graph Anomaly Detection via Prototype-Aware Label Propagation (Student Abstract) new file mode 100644 index 0000000000..05baed8d65 --- /dev/null +++ b/data/2024/aaai/Graph Anomaly Detection via Prototype-Aware Label Propagation (Student Abstract) @@ -0,0 +1 @@ +Detecting anomalies on attributed graphs is a challenging task since labelled anomalies are highly labour-intensive to obtain, requiring specialized domain knowledge, which makes anomalous samples far less available than normal ones. Moreover, graphs contain complex structure information as well as attribute information, so anomalies can be hidden in the structure space, the attribute space, or a mix of both. In this paper, we propose a novel model for graph anomaly detection named ProGAD. Specifically, ProGAD takes advantage of label propagation to infer high-quality pseudo labels by considering the structure and attribute inconsistencies between normal and abnormal samples. Meanwhile, ProGAD introduces the prior knowledge of the class distribution to correct and refine pseudo labels with a prototype-aware strategy. Experiments demonstrate that ProGAD achieves strong performance compared with the current state-of-the-art methods.
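To make the label-propagation step that the ProGAD abstract above builds on concrete, here is a minimal NumPy sketch of symmetric-normalized label propagation over an attributed graph's adjacency matrix. The prototype-aware correction is abstracted away, and the function name, the damping factor alpha, and the iteration count are illustrative assumptions rather than the paper's implementation.

import numpy as np

def propagate_labels(A, Y, alpha=0.85, iters=50):
    # A: (n, n) symmetric adjacency matrix; Y: (n, c) one-hot seed labels (zero rows for unlabeled nodes).
    d = A.sum(axis=1)
    d[d == 0] = 1.0                               # guard isolated nodes
    S = A / np.sqrt(np.outer(d, d))               # D^{-1/2} A D^{-1/2}
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1.0 - alpha) * Y   # spread labels, stay anchored to the seeds
    row_sums = F.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return F / row_sums                           # row-normalized pseudo-label scores

A prototype-aware variant could, for example, re-anchor Y between sweeps using class prototypes; that refinement is exactly the part left unspecified in this sketch.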
\ No newline at end of file diff --git a/data/2024/aaai/Graph Anomaly Detection with Diffusion Model-Based Graph Enhancement (Student Abstract) b/data/2024/aaai/Graph Anomaly Detection with Diffusion Model-Based Graph Enhancement (Student Abstract) new file mode 100644 index 0000000000..64cced81b4 --- /dev/null +++ b/data/2024/aaai/Graph Anomaly Detection with Diffusion Model-Based Graph Enhancement (Student Abstract) @@ -0,0 +1 @@ +Graph anomaly detection has gained significant research interest across various domains. Due to the lack of labeled data, contrastive learning has been applied in detecting anomalies and various scales of contrastive strategies have been initiated. However, these methods might force two instances (e.g., node-level and subgraph-level representations) with different category labels to be consistent during model training, which can adversely impact the model robustness. To tackle this problem, we present a novel contrastive learning framework with the Diffusion model-based graph Enhancement module for Graph Anomaly Detection, DEGAD. In this framework, we design a diffusion model-based graph enhancement module to manipulate neighbors to generate enhanced graphs, which can efficiently alleviate the inconsistent problem. Further, based on the enhanced graphs, we present a multi-scale contrastive module to discriminate anomalies. Experimental results demonstrate the superiority of our model. \ No newline at end of file diff --git a/data/2024/aaai/Graph Bayesian Optimization for Multiplex Influence Maximization b/data/2024/aaai/Graph Bayesian Optimization for Multiplex Influence Maximization new file mode 100644 index 0000000000..b3275f7ec2 --- /dev/null +++ b/data/2024/aaai/Graph Bayesian Optimization for Multiplex Influence Maximization @@ -0,0 +1,4 @@ +Influence maximization (IM) is the problem of identifying a limited number of initial influential users within a social network to maximize the number of influenced users. However, previous research has mostly focused on individual information propagation, neglecting the simultaneous and interactive dissemination of multiple information items. In reality, when users encounter a piece of information, such as a smartphone product, they often associate it with related products in their minds, such as earphones or computers from the same brand. Additionally, information platforms frequently recommend related content to users, amplifying this cascading effect and leading to multiplex influence diffusion. + +This paper first formulates the Multiplex Influence Maximization (Multi-IM) problem using multiplex diffusion models with an information association mechanism. In this problem, the seed set is a combination of influential users and information. To effectively manage the combinatorial complexity, we propose Graph Bayesian Optimization for Multi-IM (GBIM). The multiplex diffusion process is thoroughly investigated using a highly effective global kernelized attention message-passing module. This module, in conjunction with Bayesian linear regression (BLR), produces a scalable surrogate model. A data acquisition module incorporating the exploration-exploitation trade-off is developed to optimize the seed set further. +Extensive experiments on synthetic and real-world datasets have proven our proposed framework effective. The code is available at https://github.com/zirui-yuan/GBIM. 
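As a rough illustration of the Bayesian-linear-regression surrogate and exploration-exploitation acquisition mentioned in the GBIM abstract above, the following is a minimal sketch. The kernelized attention message-passing features are abstracted into a given feature matrix, and the function names, priors, and UCB-style acquisition are assumptions made for illustration, not GBIM's actual design.

import numpy as np

def blr_posterior(Phi, y, alpha=1.0, beta=25.0):
    # Phi: (n, d) features of already-evaluated seed sets; y: (n,) observed influence spreads.
    d = Phi.shape[1]
    S_inv = alpha * np.eye(d) + beta * (Phi.T @ Phi)
    S = np.linalg.inv(S_inv)            # posterior covariance of the weights
    m = beta * (S @ (Phi.T @ y))        # posterior mean of the weights
    return m, S

def ucb_acquisition(phi_c, m, S, beta=25.0, kappa=2.0):
    # Score one candidate seed set by predictive mean plus an exploration bonus.
    mu = phi_c @ m
    var = 1.0 / beta + phi_c @ S @ phi_c
    return mu + kappa * np.sqrt(var)

The candidate with the highest acquisition score would be queried next and its observed spread appended to (Phi, y), closing the Bayesian optimization loop.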
\ No newline at end of file diff --git a/data/2024/aaai/Graph Clustering Methods Derived from Column Subset Selection (Student Abstract) b/data/2024/aaai/Graph Clustering Methods Derived from Column Subset Selection (Student Abstract) new file mode 100644 index 0000000000..8c9ab6c45d --- /dev/null +++ b/data/2024/aaai/Graph Clustering Methods Derived from Column Subset Selection (Student Abstract) @@ -0,0 +1 @@ +Spectral clustering is a powerful clustering technique. It leverages the spectral properties of graphs to partition data points into meaningful clusters. The most common criterion for evaluating multi-way spectral clustering is NCut. Column Subset Selection is an important optimization technique in the domain of feature selection and dimension reduction which aims to identify a subset of columns of a given data matrix that can be used to approximate the entire matrix. We show that column subset selection can be used to compute spectral clustering and use this to obtain new graph clustering algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Graph Context Transformation Learning for Progressive Correspondence Pruning b/data/2024/aaai/Graph Context Transformation Learning for Progressive Correspondence Pruning new file mode 100644 index 0000000000..d5cf1f71d2 --- /dev/null +++ b/data/2024/aaai/Graph Context Transformation Learning for Progressive Correspondence Pruning @@ -0,0 +1 @@ +Most existing correspondence pruning methods concentrate only on gathering as much context information as possible while neglecting effective ways to utilize such information. To tackle this dilemma, in this paper we propose the Graph Context Transformation Network (GCT-Net), which enhances context information to conduct consensus guidance for progressive correspondence pruning. Specifically, we design the Graph Context Enhance Transformer, which first generates the graph network and then transforms it into multi-branch graph contexts. Moreover, it employs self-attention and cross-attention to magnify the characteristics of each graph context, emphasizing the unique as well as shared essential information. To further apply the recalibrated graph contexts to the global domain, we propose the Graph Context Guidance Transformer. This module adopts a confidence-based sampling strategy to temporarily screen high-confidence vertices for guiding accurate classification by searching for global consensus between the screened vertices and the remaining ones. The extensive experimental results on outlier removal and relative pose estimation clearly demonstrate the superior performance of GCT-Net compared to state-of-the-art methods across outdoor and indoor datasets.
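For the column-subset-selection abstract above, the following is only an illustrative sketch of one way a selected column subset can induce a graph partition (pivoted QR on a normalized similarity matrix, then assignment to the selected "anchor" columns). It is not the authors' algorithm; the anchor-assignment step in particular is an assumption made for the example.

import numpy as np
from scipy.linalg import qr

def css_partition(W, k):
    # W: (n, n) symmetric non-negative similarity matrix; k: number of clusters.
    d = W.sum(axis=1)
    d[d == 0] = 1.0
    S = W / np.sqrt(np.outer(d, d))           # normalized similarity D^{-1/2} W D^{-1/2}
    _, _, piv = qr(S, pivoting=True)          # pivoted QR ranks columns by explanatory power
    anchors = piv[:k]                         # the selected column subset
    return np.argmax(S[:, anchors], axis=1)   # assign each node to its strongest anchor column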
By analyzing GCL with the structural causal model (SCM), we discover that traditional GCL may not learn invariant representations well due to the non-causal information contained in the graph. How can we fix it and encourage the current GCL to learn better invariant representations? The SCM offers two requirements and motivates us to propose a novel GCL method. Particularly, we introduce the spectral graph augmentation to simulate the intervention upon non-causal factors. Then we design the invariance objective and independence objective to better capture the causal factors. Specifically, (i) the invariance objective encourages the encoder to capture the invariant information contained in causal variables, and (ii) the independence objective aims to reduce the influence of confounders on the causal variables. Experimental results demonstrate the effectiveness of our approach on node classification tasks. \ No newline at end of file diff --git a/data/2024/aaai/Graph Disentangled Contrastive Learning with Personalized Transfer for Cross-Domain Recommendation b/data/2024/aaai/Graph Disentangled Contrastive Learning with Personalized Transfer for Cross-Domain Recommendation new file mode 100644 index 0000000000..e092c988fa --- /dev/null +++ b/data/2024/aaai/Graph Disentangled Contrastive Learning with Personalized Transfer for Cross-Domain Recommendation @@ -0,0 +1,2 @@ +Cross-Domain Recommendation (CDR) has been proven to effectively alleviate the data sparsity problem in Recommender Systems (RS). Recent CDR methods often disentangle user features into domain-invariant and domain-specific features for efficient cross-domain knowledge transfer. Despite showcasing robust performance, three crucial aspects remain unexplored for existing disentangled CDR approaches: i) The significance nuances of the interaction behaviors are ignored in generating disentangled features; ii)
+The user features are disentangled without reference to the individual items to be recommended; iii) The general knowledge transfer overlooks the user's personality when interacting with diverse items. To this end, we propose a Graph Disentangled Contrastive framework for CDR (GDCCDR) with personalized transfer by meta-networks. An adaptive parameter-free filter is proposed to gauge the significance of diverse interactions, thereby facilitating more refined disentangled representations. In light of the success of Contrastive Learning (CL) in RS, we propose two CL-based constraints for item-aware disentanglement. Proximate CL ensures the coherence of domain-invariant features between domains, while eliminatory CL strives to disentangle features within each domain using mutual information between users and items. Finally, for domain-invariant features, we adopt meta-networks to achieve personalized transfer. Experimental results on four real-world datasets demonstrate the superiority of GDCCDR over state-of-the-art methods.
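The causal graph-contrastive-learning abstract above names an invariance objective and an independence objective; the snippet below is a hedged sketch of what such objectives can look like at the representation level (mean cosine alignment between views and a cross-covariance penalty between causal and confounder features). The exact formulations in the paper may differ; these forms are assumptions made for illustration.

import numpy as np

def invariance_loss(z1, z2, eps=1e-8):
    # z1, z2: (n, d) embeddings of the two augmented views of the same nodes.
    z1 = z1 / (np.linalg.norm(z1, axis=1, keepdims=True) + eps)
    z2 = z2 / (np.linalg.norm(z2, axis=1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(z1 * z2, axis=1)))   # 1 - mean cosine similarity

def independence_loss(z_causal, z_conf):
    # Penalize linear dependence between causal features and confounder features.
    zc = z_causal - z_causal.mean(axis=0)
    zs = z_conf - z_conf.mean(axis=0)
    cov = zc.T @ zs / max(len(zc) - 1, 1)
    return float(np.sum(cov ** 2))                          # squared cross-covariance penalty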
\ No newline at end of file diff --git a/data/2024/aaai/Graph Invariant Learning with Subgraph Co-mixup for Out-of-Distribution Generalization b/data/2024/aaai/Graph Invariant Learning with Subgraph Co-mixup for Out-of-Distribution Generalization new file mode 100644 index 0000000000..3752de1ba6 --- /dev/null +++ b/data/2024/aaai/Graph Invariant Learning with Subgraph Co-mixup for Out-of-Distribution Generalization @@ -0,0 +1,3 @@ +Graph neural networks (GNNs) have been demonstrated to perform well in graph representation learning, but they often lack generalization capability when tackling out-of-distribution (OOD) data. Graph invariant learning methods, backed by the invariance principle across multiple predefined environments, have shown effectiveness in dealing with this issue. However, existing methods heavily rely on well-predefined or accurately generated environment partitions, which are hard to obtain in practice, leading to sub-optimal OOD generalization performance.
+In this paper, we propose a novel graph invariant learning method based on an invariant and variant patterns co-mixup strategy, which is capable of jointly generating mixed multiple environments and capturing invariant patterns from the mixed graph data. Specifically, we first adopt a subgraph extractor to identify invariant subgraphs. Subsequently, we design one novel co-mixup strategy, i.e., jointly conducting environment mixup and invariant mixup. For the environment mixup, we mix the variant environment-related subgraphs so as to generate sufficiently diverse multiple environments, which is important to guarantee the quality of the graph invariant learning. For the invariant mixup, we mix the invariant subgraphs, further encouraging the model to capture invariant patterns behind graphs while getting rid of spurious correlations for OOD generalization. We demonstrate that the proposed environment mixup and invariant mixup can mutually promote each other.
+Extensive experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms the state of the art under various distribution shifts. \ No newline at end of file diff --git a/data/2024/aaai/Graph Learning in 4D: A Quaternion-Valued Laplacian to Enhance Spectral GCNs b/data/2024/aaai/Graph Learning in 4D: A Quaternion-Valued Laplacian to Enhance Spectral GCNs new file mode 100644 index 0000000000..e30afa8343 --- /dev/null +++ b/data/2024/aaai/Graph Learning in 4D: A Quaternion-Valued Laplacian to Enhance Spectral GCNs @@ -0,0 +1 @@ +We introduce QuaterGCN, a spectral Graph Convolutional Network (GCN) with quaternion-valued weights at whose core lies the Quaternionic Laplacian, a quaternion-valued Laplacian matrix by whose proposal we generalize two widely-used Laplacian matrices: the classical Laplacian (defined for undirected graphs) and the complex-valued Sign-Magnetic Laplacian (proposed within the spectral GCN SigMaNet to handle digraphs with weights of arbitrary sign). In addition to its generality, QuaterGCN is the only Laplacian to completely preserve the (di)graph topology that we are aware of, as it can handle graphs and digraphs containing antiparallel pairs of edges (digons) of different weight without reducing them to a single (directed or undirected) edge as done by other Laplacians. Experimental results show the superior performance of QuaterGCN compared to other state-of-the-art GCNs, particularly in scenarios where the information the digons carry is crucial to successfully address the task at hand.
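Returning to the Subgraph Co-mixup abstract above: the actual method mixes variant and invariant subgraphs themselves, which is hard to show compactly, so the sketch below works only at the representation level under that simplifying assumption. The function name and the Beta-distributed mixing coefficient are illustrative, not the paper's design.

import numpy as np

def co_mixup(h_inv, h_var, rng, alpha=2.0):
    # h_inv, h_var: (n, d) invariant / variant subgraph embeddings for a batch of graphs.
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(h_inv))
    mixed_env = lam * h_var + (1.0 - lam) * h_var[perm]   # environment mixup: diversify environments
    mixed_inv = lam * h_inv + (1.0 - lam) * h_inv[perm]   # invariant mixup: blend invariant patterns
    return mixed_inv, mixed_env, lam, perm

# Example usage: rng = np.random.default_rng(0); co_mixup(np.zeros((8, 16)), np.ones((8, 16)), rng)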
\ No newline at end of file diff --git a/data/2024/aaai/Graph Neural Networks with Soft Association between Topology and Attribute b/data/2024/aaai/Graph Neural Networks with Soft Association between Topology and Attribute new file mode 100644 index 0000000000..bdd4e21bd4 --- /dev/null +++ b/data/2024/aaai/Graph Neural Networks with Soft Association between Topology and Attribute @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have shown great performance in learning representations for graph-structured data. However, recent studies have found that the interference between topology and attribute can lead to distorted node representations. Most GNNs are designed based on homophily assumptions; thus, they cannot be applied to graphs with heterophily. This research critically analyzes the propagation principles of various GNNs and the corresponding challenges from an optimization perspective. A novel GNN called Graph Neural Networks with Soft Association between Topology and Attribute (GNN-SATA) is proposed. Different embeddings are utilized to gain insights into attributes and structures while establishing their interconnections through soft association. Further, as integral components of the soft association, a Graph Pruning Module (GPM) and Graph Augmentation Module (GAM) are developed. These modules dynamically remove or add edges to the adjacency relationships to make the model better fit graphs with homophily or heterophily. Experimental results on homophilic and heterophilic graph datasets convincingly demonstrate that the proposed GNN-SATA effectively captures more accurate adjacency relationships and outperforms state-of-the-art approaches. Especially on the heterophilic graph dataset Squirrel, GNN-SATA achieves a 2.81% improvement in accuracy, utilizing merely 27.19% of the original number of adjacency relationships. Our code is released at https://github.com/wwwfadecom/GNN-SATA. \ No newline at end of file diff --git a/data/2024/aaai/Graph Neural Prompting with Large Language Models b/data/2024/aaai/Graph Neural Prompting with Large Language Models new file mode 100644 index 0000000000..bfa9ac72a4 --- /dev/null +++ b/data/2024/aaai/Graph Neural Prompting with Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) have shown remarkable generalization capability with exceptional performance in various language modeling tasks. However, they still exhibit inherent limitations in precisely capturing and returning grounded knowledge. While existing work has explored utilizing knowledge graphs (KGs) to enhance language modeling via joint training and customized model architectures, applying this to LLMs is problematic owing to their large number of parameters and high computational cost. Therefore, how to enhance pre-trained LLMs using grounded knowledge, e.g., retrieval-augmented generation, remains an open question. In this work, we propose Graph Neural Prompting (GNP), a novel plug-and-play method to assist pre-trained LLMs in learning beneficial knowledge from KGs. GNP encompasses various designs, including a standard graph neural network encoder, a cross-modality pooling module, a domain projector, and a self-supervised link prediction objective. Extensive experiments on multiple datasets demonstrate the superiority of GNP on both commonsense and biomedical reasoning tasks across different LLM sizes and settings. Code is available at https://github.com/meettyj/GNP.
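As a rough illustration of the plug-and-play idea, the sketch below encodes a retrieved KG subgraph with a simple one-layer GNN, pools it, and projects it into the LLM embedding space as a soft prompt; the module names, shapes, and mean-aggregation scheme are assumptions for illustration, not GNP's exact architecture.

```python
# Hedged sketch: encode a KG subgraph, pool it, project it into the LLM's embedding
# space, and prepend it as a soft prompt in front of the frozen LLM's token embeddings.
import torch
import torch.nn as nn

class GraphPrompt(nn.Module):
    def __init__(self, node_dim: int, llm_dim: int, prompt_len: int = 4):
        super().__init__()
        self.msg = nn.Linear(node_dim, node_dim)                 # simple message transform
        self.proj = nn.Linear(node_dim, prompt_len * llm_dim)    # "domain projector"-style mapping
        self.prompt_len, self.llm_dim = prompt_len, llm_dim

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.msg(adj @ x / deg) + x)              # one mean-aggregation GNN layer
        g = h.mean(dim=0)                                        # pool nodes into a graph vector
        return self.proj(g).view(self.prompt_len, self.llm_dim)

prompt = GraphPrompt(node_dim=128, llm_dim=768)(torch.randn(10, 128), torch.ones(10, 10))
# `prompt` (4 x 768) would be concatenated in front of the frozen LLM's input embeddings.
```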
\ No newline at end of file diff --git a/data/2024/aaai/Graph Reasoning Transformers for Knowledge-Aware Question Answering b/data/2024/aaai/Graph Reasoning Transformers for Knowledge-Aware Question Answering new file mode 100644 index 0000000000..f0150b638b --- /dev/null +++ b/data/2024/aaai/Graph Reasoning Transformers for Knowledge-Aware Question Answering @@ -0,0 +1 @@ +Augmenting Language Models (LMs) with structured knowledge graphs (KGs) aims to leverage structured world knowledge to enhance the capability of LMs to complete knowledge-intensive tasks. However, existing methods are unable to effectively utilize the structured knowledge in a KG due to their inability to capture the rich relational semantics of knowledge triplets. Moreover, the modality gap between natural language text and KGs has become a challenging obstacle when aligning and fusing cross-modal information. To address these challenges, we propose a novel knowledge-augmented question answering (QA) model, namely, Graph Reasoning Transformers (GRT). Different from conventional node-level methods, the GRT treats knowledge triplets as atomic knowledge and utilizes a triplet-level graph encoder to capture triplet-level graph features. Furthermore, to alleviate the negative effect of the modality gap on joint reasoning, we propose a representation alignment pretraining to align the cross-modal representations and introduce a cross-modal information fusion module with attention bias to enable fine-grained information fusion. Extensive experiments conducted on three knowledge-intensive QA benchmarks show that the GRT outperforms the state-of-the-art KG-augmented QA systems, demonstrating the effectiveness and adaptability of our proposed model. \ No newline at end of file diff --git a/data/2024/aaai/Graph of Thoughts: Solving Elaborate Problems with Large Language Models b/data/2024/aaai/Graph of Thoughts: Solving Elaborate Problems with Large Language Models new file mode 100644 index 0000000000..cdd82e3ebc --- /dev/null +++ b/data/2024/aaai/Graph of Thoughts: Solving Elaborate Problems with Large Language Models @@ -0,0 +1,19 @@ +We introduce Graph of Thoughts (GoT): a framework that +advances prompting capabilities in large language models +(LLMs) beyond those offered by paradigms such as +Chain-of-Thought or Tree of Thoughts (ToT). The key idea and +primary advantage of GoT is the ability to model the information +generated by an LLM as an arbitrary graph, where units of +information ("LLM thoughts") are vertices, and edges correspond +to dependencies between these vertices. This approach enables +combining arbitrary LLM thoughts into synergistic outcomes, +distilling the essence of whole networks of thoughts, +or enhancing thoughts using feedback loops. We illustrate +that GoT offers advantages over state of the art on different +tasks, for example increasing the quality of sorting by 62% +over ToT, while simultaneously reducing costs by >31%. +We ensure that GoT is extensible with new thought +transformations and thus can be used to spearhead new prompting +schemes.
This work brings the LLM reasoning closer to human +thinking or brain mechanisms such as recurrence, both +of which form complex networks. \ No newline at end of file diff --git a/data/2024/aaai/Graph-Aware Contrasting for Multivariate Time-Series Classification b/data/2024/aaai/Graph-Aware Contrasting for Multivariate Time-Series Classification new file mode 100644 index 0000000000..de00d523cf --- /dev/null +++ b/data/2024/aaai/Graph-Aware Contrasting for Multivariate Time-Series Classification @@ -0,0 +1 @@ +Contrastive learning, as a self-supervised learning paradigm, has become popular for Multivariate Time-Series (MTS) classification. It ensures consistency across different views of unlabeled samples and then learns effective representations for these samples. Existing contrastive learning methods mainly focus on achieving temporal consistency with temporal augmentation and contrasting techniques, aiming to preserve temporal patterns against perturbations for MTS data. However, they overlook spatial consistency that requires the stability of individual sensors and their correlations. As MTS data typically originate from multiple sensors, ensuring spatial consistency becomes essential for the overall performance of contrastive learning on MTS data. Thus, we propose Graph-Aware Contrasting for spatial consistency across MTS data. Specifically, we propose graph augmentations including node and edge augmentations to preserve the stability of sensors and their correlations, followed by graph contrasting with both node- and graph-level contrasting to extract robust sensor- and global-level features. We further introduce multi-window temporal contrasting to ensure temporal consistency in the data for each sensor. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on various MTS classification tasks. The code is available at https://github.com/Frank-Wang-oss/TS-GAC. \ No newline at end of file diff --git a/data/2024/aaai/Graph-Based Prediction and Planning Policy Network (GP3Net) for Scalable Self-Driving in Dynamic Environments Using Deep Reinforcement Learning b/data/2024/aaai/Graph-Based Prediction and Planning Policy Network (GP3Net) for Scalable Self-Driving in Dynamic Environments Using Deep Reinforcement Learning new file mode 100644 index 0000000000..f0e6a054dc --- /dev/null +++ b/data/2024/aaai/Graph-Based Prediction and Planning Policy Network (GP3Net) for Scalable Self-Driving in Dynamic Environments Using Deep Reinforcement Learning @@ -0,0 +1 @@ +Recent advancements in motion planning for Autonomous Vehicles (AVs) show great promise in using expert driver behaviors in non-stationary driving environments. However, learning only from expert drivers lacks the generalizability needed to recover from domain shifts and near-failure scenarios due to the dynamic behavior of traffic participants and weather conditions. A deep Graph-based Prediction and Planning Policy Network (GP3Net) framework is proposed for non-stationary environments that encodes the interactions between traffic participants with contextual information and provides decisions for safe maneuvers for the AV. A spatio-temporal graph models the interactions between traffic participants for predicting the future trajectories of those participants. The predicted trajectories are utilized to generate a future occupancy map around the AV with uncertainties embedded to anticipate the evolving non-stationary driving environments.
Then the contextual information and future occupancy maps are fed into the policy network of the GP3Net framework, which is trained using the Proximal Policy Optimization (PPO) algorithm. The performance of the proposed GP3Net is evaluated on standard CARLA benchmarking scenarios with domain shifts of traffic patterns (urban, highway, and mixed). The results show that the GP3Net outperforms previous state-of-the-art imitation learning-based planning models for different towns. Further, in unseen weather conditions, GP3Net completes the desired route with fewer traffic infractions. Finally, the results emphasize the advantage of including the prediction module to enhance safety measures in non-stationary environments. \ No newline at end of file diff --git a/data/2024/aaai/Grey-Box Bayesian Optimization for Sensor Placement in Assisted Living Environments b/data/2024/aaai/Grey-Box Bayesian Optimization for Sensor Placement in Assisted Living Environments new file mode 100644 index 0000000000..2d1848ca96 --- /dev/null +++ b/data/2024/aaai/Grey-Box Bayesian Optimization for Sensor Placement in Assisted Living Environments @@ -0,0 +1 @@ +Optimizing the configuration and placement of sensors is crucial for reliable fall detection, indoor localization, and activity recognition in assisted living spaces. We propose a novel, sample-efficient approach to find a high-quality sensor placement in an arbitrary indoor space based on grey-box Bayesian optimization and simulation-based evaluation. Our key technical contribution lies in capturing domain-specific knowledge about the spatial distribution of activities and incorporating it into the iterative selection of query points in Bayesian optimization. Considering two simulated indoor environments and a real-world dataset containing human activities and sensor triggers, we show that our proposed method performs better than state-of-the-art black-box optimization techniques in identifying high-quality sensor placements, leading to an accurate activity recognition model in terms of F1-score, while also requiring a significantly lower (51.3% on average) number of expensive function queries. \ No newline at end of file diff --git a/data/2024/aaai/GridFormer: Point-Grid Transformer for Surface Reconstruction b/data/2024/aaai/GridFormer: Point-Grid Transformer for Surface Reconstruction new file mode 100644 index 0000000000..8c0a2b01c8 --- /dev/null +++ b/data/2024/aaai/GridFormer: Point-Grid Transformer for Surface Reconstruction @@ -0,0 +1 @@ +Implicit neural networks have emerged as a crucial technology in 3D surface reconstruction. To reconstruct continuous surfaces from discrete point clouds, encoding the input points into regular grid features (plane or volume) has been commonly employed in existing approaches. However, these methods typically use the grid as an index for uniformly scattering point features. Compared with the irregular point features, the regular grid features may sacrifice some reconstruction details but improve efficiency. To take full advantage of these two types of features, we introduce a novel and high-efficiency attention mechanism between the grid and point features named Point-Grid Transformer (GridFormer). This mechanism treats the grid as a transfer point connecting the space and point cloud. Our method maximizes the spatial expressiveness of grid features and maintains computational efficiency. Furthermore, optimizing predictions over the entire space could potentially result in blurred boundaries.
To address this issue, we further propose a boundary optimization strategy incorporating margin binary cross-entropy loss and boundary sampling. This approach enables us to achieve a more precise representation of the object structure. Our experiments validate that our method is effective and outperforms the state-of-the-art approaches on widely used benchmarks by producing more precise geometry reconstructions. The code is available at https://github.com/list17/GridFormer. \ No newline at end of file diff --git a/data/2024/aaai/GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection b/data/2024/aaai/GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection new file mode 100644 index 0000000000..39086e3435 --- /dev/null +++ b/data/2024/aaai/GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection @@ -0,0 +1 @@ +Visual grounding, a crucial vision-language task involving the understanding of the visual context based on the query expression, requires the model to capture the interactions between objects, as well as various spatial and attribute information. However, the annotation data for the visual grounding task is limited due to its time-consuming and labor-intensive annotation process, resulting in the trained models being constrained from generalizing their capability to a broader domain. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from existing models trained on image-text pairs and pure object detection data, both of which are more conveniently obtainable and offer a broader domain compared to visual grounding annotation data. GroundVLP proposes a fusion mechanism that combines the heatmap from GradCAM and the object proposals of open-vocabulary detectors. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets, surpassing prior zero-shot state-of-the-art by approximately 28% on the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to or even better than some non-VLP-based supervised models on the Flickr30k entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP. \ No newline at end of file diff --git a/data/2024/aaai/Guiding a Harsh-Environments Robust Detector via RAW Data Characteristic Mining b/data/2024/aaai/Guiding a Harsh-Environments Robust Detector via RAW Data Characteristic Mining new file mode 100644 index 0000000000..8bf2075284 --- /dev/null +++ b/data/2024/aaai/Guiding a Harsh-Environments Robust Detector via RAW Data Characteristic Mining @@ -0,0 +1 @@ +Consumer-grade cameras capture the RAW physical description of a scene and then process the image signals to obtain high-quality RGB images that are faithful to human visual perception. Conventionally, dense prediction scenes require high-precision recognition of objects in RGB images. However, it is challenging for predictions on RGB data to exhibit the expected adaptability and robustness in harsh environments. By capitalizing on the broader color gamut and higher bit depth offered by RAW data, in this paper, we demonstrate that RAW data can significantly improve the accuracy and robustness of object detectors in harsh environments.
Firstly, we propose a general Pipeline for RAW Detection (PRD), along with a preprocessing strategy tailored to RAW data. Secondly, we design the RAW Corruption Benchmark (RCB) to address the dearth of benchmarks that reflect realistic scenarios in harsh environments. Thirdly, we demonstrate the significant improvement of RAW images in object detection for low-light and corrupted scenes. Specifically, our experiments indicate that PRD (using FCOS) outperforms RGB detection by 13.9 mAP on LOD-Snow without generating restored images. Finally, we introduce a new nonlinear method called Functional Regularization (FR), which can effectively mine the unique characteristics of RAW data. The code is available at https://github.com/DreamerCCC/RawMining. \ No newline at end of file diff --git a/data/2024/aaai/GxVAEs: Two Joint VAEs Generate Hit Molecules from Gene Expression Profiles b/data/2024/aaai/GxVAEs: Two Joint VAEs Generate Hit Molecules from Gene Expression Profiles new file mode 100644 index 0000000000..aedaaddafb --- /dev/null +++ b/data/2024/aaai/GxVAEs: Two Joint VAEs Generate Hit Molecules from Gene Expression Profiles @@ -0,0 +1 @@ +The de novo generation of hit-like molecules that show bioactivity and drug-likeness is an important task in computer-aided drug discovery. Although artificial intelligence can generate molecules with desired chemical properties, most previous studies have ignored the influence of disease-related cellular environments. This study proposes a novel deep generative model called GxVAEs to generate hit-like molecules from gene expression profiles by leveraging two joint variational autoencoders (VAEs). The first VAE, ProfileVAE, extracts latent features from gene expression profiles. The extracted features serve as the conditions that guide the second VAE, which is called MolVAE, in generating hit-like molecules. GxVAEs bridge the gap between molecular generation and the cellular environment in a biological system, and produce molecules that are biologically meaningful in the context of specific diseases. Experiments and case studies on the generation of therapeutic molecules show that GxVAEs outperforms current state-of-the-art baselines and yields hit-like molecules with potential bioactivity and drug-like properties. We were able to successfully generate potential molecular structures with therapeutic effects for various diseases from patients’ disease profiles. \ No newline at end of file diff --git a/data/2024/aaai/H2GFormer: Horizontal-to-Global Voxel Transformer for 3D Semantic Scene Completion b/data/2024/aaai/H2GFormer: Horizontal-to-Global Voxel Transformer for 3D Semantic Scene Completion new file mode 100644 index 0000000000..ed48a2b024 --- /dev/null +++ b/data/2024/aaai/H2GFormer: Horizontal-to-Global Voxel Transformer for 3D Semantic Scene Completion @@ -0,0 +1 @@ +3D Semantic Scene Completion (SSC) has emerged as a novel task in vision-based holistic 3D scene understanding. Its objective is to densely predict the occupancy and category of each voxel in a 3D scene based on input from either LiDAR or images. Currently, many transformer-based semantic scene completion frameworks employ simple yet popular Cross-Attention and Self-Attention mechanisms to integrate and infer dense geometric and semantic information of voxels. However, they overlook the distinctions among voxels in the scene, especially in outdoor scenarios where the horizontal direction contains more variations.
Moreover, voxels located at object boundaries and within the interior of objects exhibit varying levels of positional significance. To address this issue, we propose a transformer-based SSC framework called H2GFormer that incorporates a horizontal-to-global approach. This framework takes into full consideration the variations of voxels in the horizontal direction and the characteristics of voxels on object boundaries. We introduce a horizontal window-to-global attention (W2G) module that effectively fuses semantic information by first diffusing it horizontally from reliably visible voxels and then propagating the semantic understanding to global voxels, ensuring a more reliable fusion of semantic-aware features. Moreover, an Internal-External Position Awareness Loss (IoE-PALoss) is utilized during network training to emphasize the critical positions within the transition regions between objects. The experiments conducted on the SemanticKITTI dataset demonstrate that H2GFormer exhibits superior performance in both geometric and semantic completion tasks. Our code is available at https://github.com/Ryanwy1/H2GFormer. \ No newline at end of file diff --git a/data/2024/aaai/HACDR-Net: Heterogeneous-Aware Convolutional Network for Diabetic Retinopathy Multi-Lesion Segmentation b/data/2024/aaai/HACDR-Net: Heterogeneous-Aware Convolutional Network for Diabetic Retinopathy Multi-Lesion Segmentation new file mode 100644 index 0000000000..c723b1b76c --- /dev/null +++ b/data/2024/aaai/HACDR-Net: Heterogeneous-Aware Convolutional Network for Diabetic Retinopathy Multi-Lesion Segmentation @@ -0,0 +1 @@ +Diabetic Retinopathy (DR), the leading cause of blindness in diabetic patients, is diagnosed based on the condition of multiple retinal lesions. As a difficult task in medical image segmentation, DR multi-lesion segmentation faces the following main concerns. On the one hand, retinal lesions vary in location, shape, and size. On the other hand, because some lesions occupy only a very small part of the entire fundus image, the high proportion of background leads to difficulties in lesion segmentation. To solve the above problems, we propose a heterogeneous-aware convolutional network (HACDR-Net) that comprises heterogeneous cross-convolution, heterogeneous modulated deformable convolution, and optional near-far-aware convolution. Our network introduces an adaptive aggregation module to summarize the heterogeneous feature maps and capture diverse lesion areas in the heterogeneous receptive field along the channel and spatial dimensions. In addition, to solve the problem of the highly imbalanced proportion of focal areas, we design a new medical image segmentation loss function, Noise Adjusted Loss (NALoss). NALoss balances the predictive feature distribution of background and lesion by jointly exploiting Gaussian noise and hard example mining, thus enhancing awareness of lesions. We conduct experiments on the public datasets IDRiD and DDR, and the experimental results show that the proposed method achieves better performance than other state-of-the-art methods. The code is open-sourced on github.com/xqh180110910537/HACDR-Net.
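The abstract does not spell out NALoss, so the following sketch should be read as one plausible interpretation rather than the paper's definition: a pixel-wise BCE in which background logits are perturbed with Gaussian noise and only the hardest background pixels are kept, in the spirit of joint noise injection and hard example mining.

```python
# Highly speculative sketch of a noise-adjusted, hard-example-mined loss for
# imbalanced lesion segmentation. Not the paper's actual NALoss definition.
import torch
import torch.nn.functional as F

def noise_adjusted_bce(logits, targets, sigma=0.1, hard_ratio=0.25):
    """logits, targets: (N,) flattened pixel predictions and {0,1} lesion labels."""
    noisy = logits + sigma * torch.randn_like(logits) * (targets == 0)  # perturb background only
    loss = F.binary_cross_entropy_with_logits(noisy, targets.float(), reduction="none")
    pos = loss[targets == 1]                                  # keep every lesion pixel
    neg = loss[targets == 0]
    k = max(1, int(hard_ratio * neg.numel()))
    hard_neg, _ = neg.topk(k)                                 # mine the hardest background pixels
    return torch.cat([pos, hard_neg]).mean()

loss = noise_adjusted_bce(torch.randn(1024), (torch.rand(1024) > 0.9).long())
```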
\ No newline at end of file diff --git a/data/2024/aaai/HAGO-Net: Hierarchical Geometric Massage Passing for Molecular Representation Learning b/data/2024/aaai/HAGO-Net: Hierarchical Geometric Massage Passing for Molecular Representation Learning new file mode 100644 index 0000000000..9b65d699eb --- /dev/null +++ b/data/2024/aaai/HAGO-Net: Hierarchical Geometric Massage Passing for Molecular Representation Learning @@ -0,0 +1 @@ +Molecular representation learning has emerged as a game-changer at the intersection of AI and chemistry, with great potential in applications such as drug design and materials discovery. A substantial obstacle in successfully applying molecular representation learning is the difficulty of effectively and completely characterizing and learning molecular geometry, which has not been well addressed to date. To overcome this challenge, we propose a novel framework that features a novel geometric graph, termed HAGO-Graph, and a specifically designed geometric graph learning model, HAGO-Net. In the framework, the foundation is HAGO-Graph, which enables a complete characterization of molecular geometry in a hierarchical manner. Specifically, we leverage the concept of n-body in physics to characterize geometric patterns at multiple spatial scales. We then specifically design a message passing scheme, HAGO-MPS, and implement the scheme as a geometric graph neural network, HAGO-Net, to effectively learn the representation of HAGO-Graph by horizontal and vertical aggregation. We further prove that DHAGO-Net, the derivative function of HAGO-Net, is an equivariant model. The proposed models are validated by extensive comparisons on four challenging benchmarks. Notably, the models exhibited state-of-the-art performance in molecular chirality identification and property prediction, including on five properties of the QM9 dataset. The models also achieved competitive results on the molecular dynamics prediction task. \ No newline at end of file diff --git a/data/2024/aaai/HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors b/data/2024/aaai/HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors new file mode 100644 index 0000000000..2164b093ed --- /dev/null +++ b/data/2024/aaai/HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors @@ -0,0 +1 @@ +Mainstream human activity recognition (HAR) algorithms are developed based on RGB cameras, which usually suffer from issues with illumination, fast motion, privacy preservation, and large energy consumption. Meanwhile, the biologically inspired event cameras have attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, low power, etc. As it is a newly emerging sensor, there is not even a realistic large-scale dataset for HAR. Considering its great practical value, in this paper, we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provide extensive baselines for future works to compare against. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event stream based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, and then encodes and fuses the dual-view representations using Transformer networks.
Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets fully validate the effectiveness of our model. Both the dataset and source code will be released at https://github.com/Event-AHU/HARDVS. \ No newline at end of file diff --git a/data/2024/aaai/HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting b/data/2024/aaai/HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting new file mode 100644 index 0000000000..84aec85d92 --- /dev/null +++ b/data/2024/aaai/HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting @@ -0,0 +1 @@ +Multivariate time series (MTS) prediction has been widely adopted in various scenarios. Recently, some methods have employed patching to enhance local semantics and improve model performance. However, fixed-length patches are prone to losing temporal boundary information, such as complete peaks and periods. Moreover, existing methods mainly focus on modeling long-term dependencies across patches, while paying little attention to other dimensions (e.g., short-term dependencies within patches and complex interactions among cross-variable patches). To address these challenges, we propose a pure MLP-based HDMixer, aiming to acquire patches with richer semantic information and to efficiently model hierarchical interactions. Specifically, we design a Length-Extendable Patcher (LEP) tailored to MTS, which enriches the boundary information of patches and alleviates semantic incoherence in the series. Subsequently, we devise a Hierarchical Dependency Explorer (HDE) based on pure MLPs. This explorer effectively models short-term dependencies within patches, long-term dependencies across patches, and complex interactions among variables. Extensive experiments on 9 real-world datasets demonstrate the superiority of our approach. The code is available at https://github.com/hqh0728/HDMixer. \ No newline at end of file diff --git a/data/2024/aaai/HDformer: A Higher-Dimensional Transformer for Detecting Diabetes Utilizing Long-Range Vascular Signals b/data/2024/aaai/HDformer: A Higher-Dimensional Transformer for Detecting Diabetes Utilizing Long-Range Vascular Signals new file mode 100644 index 0000000000..2315966379 --- /dev/null +++ b/data/2024/aaai/HDformer: A Higher-Dimensional Transformer for Detecting Diabetes Utilizing Long-Range Vascular Signals @@ -0,0 +1 @@ +Diabetes mellitus is a global concern, and early detection can prevent serious complications. 50% of those with diabetes live undiagnosed, disproportionately afflicting low-income groups. Non-invasive methods have emerged for timely detection; however, their limited accuracy constrains clinical usage. In this research, we present a novel Higher Dimensional Transformer (HDformer), the first Transformer-based architecture which utilizes long-range photoplethysmography (PPG) to detect diabetes. The long-range PPG maximizes signal contextual information when compared to the less-than-30-second signals commonly used in existing research. To increase the computational efficiency of HDformer’s long-range processing, a new attention module, Time Square Attention (TSA), is invented to achieve linear computational complexity with respect to the token volume while retaining the local/global dependencies. TSA converts the 1D inputs into 2D representations, grouping the adjacent points into a single 2D token.
It then generates dynamic patches and feeds them into a gated mixture-of-experts (MoE) network, optimizing the learning on different attention areas. HDformer achieves state-of-the-art results (sensitivity 98.4, accuracy 97.3, specificity 92.8, AUC 0.929) on the standard MIMIC-III dataset, surpassing existing research. Furthermore, we develop an end-to-end solution where a low-cost wearable is prototyped to connect with the HDformer in the Cloud via a mobile app. This scalable, convenient, and affordable approach provides instantaneous detection and continuous monitoring for individuals. It aids doctors in easily screening for diabetes and safeguards underprivileged communities. The enhanced versatility of HDformer allows for efficient processing and learning of long-range signals in general one-dimensional time-series sequences, particularly for all biomedical waveforms. \ No newline at end of file diff --git a/data/2024/aaai/HEAP: Unsupervised Object Discovery and Localization with Contrastive Grouping b/data/2024/aaai/HEAP: Unsupervised Object Discovery and Localization with Contrastive Grouping new file mode 100644 index 0000000000..a182ab65b5 --- /dev/null +++ b/data/2024/aaai/HEAP: Unsupervised Object Discovery and Localization with Contrastive Grouping @@ -0,0 +1 @@ +Unsupervised object discovery and localization aims to detect or segment objects in an image without any supervision. Recent efforts have demonstrated a notable potential to identify salient foreground objects by utilizing self-supervised transformer features. However, their scope is limited to patch-level features within an image, neglecting region/image-level and cross-image relationships at a broader scale. Moreover, these methods cannot differentiate various semantics from multiple instances. To address these problems, we introduce a Hierarchical mErging framework via contrAstive grouPing (HEAP). Specifically, a novel lightweight head with a cross-attention mechanism is designed to adaptively group intra-image patches into semantically coherent regions based on correlation among self-supervised features. Further, to ensure the distinguishability among various regions, we introduce a region-level contrastive clustering loss to pull closer similar regions across images. Also, an image-level contrastive loss is introduced to push foreground and background representations apart, with which foreground objects and background are accordingly discovered. HEAP facilitates efficient hierarchical image decomposition, which contributes to more accurate object discovery while also enabling differentiation among objects of various classes. Extensive experimental results on semantic segmentation retrieval, unsupervised object discovery, and saliency detection tasks demonstrate that HEAP achieves state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/HGE: Embedding Temporal Knowledge Graphs in a Product Space of Heterogeneous Geometric Subspaces b/data/2024/aaai/HGE: Embedding Temporal Knowledge Graphs in a Product Space of Heterogeneous Geometric Subspaces new file mode 100644 index 0000000000..271c8ff735 --- /dev/null +++ b/data/2024/aaai/HGE: Embedding Temporal Knowledge Graphs in a Product Space of Heterogeneous Geometric Subspaces @@ -0,0 +1 @@ +Temporal knowledge graphs represent temporal facts (s,p,o,τ) relating a subject s and an object o via a relation label p at time τ, where τ could be a time point or a time interval.
Temporal knowledge graphs may exhibit static temporal patterns at distinct points in time and dynamic temporal patterns between different timestamps. In order to learn a rich set of static and dynamic temporal patterns and apply them for inference, several embedding approaches have been suggested in the literature. However, as most of them resort to single underlying embedding spaces, their capability to model all kinds of temporal patterns is severely limited by having to adhere to the geometric properties of a single embedding space. We lift this limitation with an embedding approach that maps temporal facts into a product space of several heterogeneous geometric subspaces with distinct geometric properties, i.e., Complex, Dual, and Split-complex spaces. In addition, we propose a temporal-geometric attention mechanism to integrate information from different geometric subspaces conveniently according to the captured relational and temporal information. Experimental results on standard temporal benchmark datasets show that our approach compares favorably against state-of-the-art models. \ No newline at end of file diff --git a/data/2024/aaai/HGPrompt: Bridging Homogeneous and Heterogeneous Graphs for Few-Shot Prompt Learning b/data/2024/aaai/HGPrompt: Bridging Homogeneous and Heterogeneous Graphs for Few-Shot Prompt Learning new file mode 100644 index 0000000000..b8045108ef --- /dev/null +++ b/data/2024/aaai/HGPrompt: Bridging Homogeneous and Heterogeneous Graphs for Few-Shot Prompt Learning @@ -0,0 +1,3 @@ +Graph neural networks (GNNs) and heterogeneous graph neural networks (HGNNs) are prominent techniques for homogeneous and heterogeneous graph representation learning, yet their performance in an end-to-end supervised framework greatly depends on the availability of task-specific supervision. To reduce the labeling cost, pre-training on self-supervised pretext tasks has become a popular paradigm, but there is often a gap between the pre-trained model and downstream tasks, stemming from the divergence in their objectives. To bridge the gap, prompt learning has risen as a promising direction especially in few-shot settings, without the need to fully fine-tune the pre-trained model. While there has been some early exploration of prompt-based learning on graphs, these efforts primarily deal with homogeneous graphs, ignoring the heterogeneous graphs that are prevalent in downstream applications. In this paper, we propose HGPROMPT, a +novel pre-training and prompting framework to unify not only pre-training and downstream tasks but also homogeneous and heterogeneous graphs via a dual-template design. Moreover, we propose a dual-prompt design in HGPROMPT to assist a downstream task in locating the most relevant prior, bridging the gaps caused by not only feature variations but also heterogeneity differences across tasks. Finally, we thoroughly evaluate and analyze HGPROMPT through extensive experiments +on three public datasets. \ No newline at end of file diff --git a/data/2024/aaai/HISR: Hybrid Implicit Surface Representation for Photorealistic 3D Human Reconstruction b/data/2024/aaai/HISR: Hybrid Implicit Surface Representation for Photorealistic 3D Human Reconstruction new file mode 100644 index 0000000000..e4aded37e5 --- /dev/null +++ b/data/2024/aaai/HISR: Hybrid Implicit Surface Representation for Photorealistic 3D Human Reconstruction @@ -0,0 +1 @@ +Neural reconstruction and rendering strategies have demonstrated state-of-the-art performances due, in part, to their ability to preserve high-level shape details.
Existing approaches, however, represent objects as either implicit surface functions or neural volumes and still struggle to recover shapes with heterogeneous materials, in particular human skin, hair, or clothes. To this end, we present a new hybrid implicit surface representation to model human shapes. This representation is composed of two surface layers that represent opaque and translucent regions on the clothed human body. We segment different regions automatically using visual cues and learn to reconstruct two signed distance functions (SDFs). We perform surface-based rendering on opaque regions (e.g., body, face, clothes) to preserve high-fidelity surface normals and volume rendering on translucent regions (e.g., hair). Experiments demonstrate that our approach obtains state-of-the-art results on 3D human reconstructions, and also shows competitive performances on other objects. \ No newline at end of file diff --git a/data/2024/aaai/HONGAT: Graph Attention Networks in the Presence of High-Order Neighbors b/data/2024/aaai/HONGAT: Graph Attention Networks in the Presence of High-Order Neighbors new file mode 100644 index 0000000000..71cc1c6033 --- /dev/null +++ b/data/2024/aaai/HONGAT: Graph Attention Networks in the Presence of High-Order Neighbors @@ -0,0 +1 @@ +Graph Attention Networks (GATs), which compute node representations from their lower-order neighbors, are a state-of-the-art architecture for representation learning with graphs. In practice, however, the high-order neighbors that turn out to be useful remain largely unexploited in GATs. Efforts on this issue remain limited. This paper proposes a simple and effective high-order neighbor GAT (HONGAT) model to both effectively exploit informative high-order neighbors and address over-smoothing at the decision boundary of nodes. Two tightly coupled novel technologies, namely common-neighbor similarity and a new masking matrix, are introduced. Specifically, high-order neighbors are fully explored by generic high-order common-neighbor-based similarity; since the typical averaging range no longer works well for preventing severe over-smoothing, a new masking mechanism without any extra hyperparameter is employed. Extensive empirical evaluation on real-world datasets clearly shows the necessity of the new algorithm's ability to explore high-order neighbors, which yields significant gains over previous state-of-the-art graph attention methods. \ No newline at end of file diff --git a/data/2024/aaai/HOP to the Next Tasks and Domains for Continual Learning in NLP b/data/2024/aaai/HOP to the Next Tasks and Domains for Continual Learning in NLP new file mode 100644 index 0000000000..6247e88fb3 --- /dev/null +++ b/data/2024/aaai/HOP to the Next Tasks and Domains for Continual Learning in NLP @@ -0,0 +1 @@ +Continual Learning (CL) aims to learn a sequence of problems (i.e., tasks and domains) by transferring knowledge acquired on previous problems, whilst avoiding forgetting of past ones. Different from previous approaches which focused on CL for one NLP task or domain in a specific use-case, in this paper, we address a more general CL setting to learn from a sequence of problems in a unique framework.
Our method, HOP, permits hopping across tasks and domains by addressing the CL problem along three directions: (i) we employ a set of adapters to generalize a large pre-trained model to unseen problems, (ii) we compute high-order moments over the distribution of embedded representations to distinguish independent and correlated statistics across different tasks and domains, and (iii) we process this enriched information with auxiliary heads specialized for each end problem. An extensive experimental campaign on 4 NLP applications, 5 benchmarks, and 2 CL setups demonstrates the effectiveness of HOP. \ No newline at end of file diff --git a/data/2024/aaai/HORIZON: High-Resolution Semantically Controlled Panorama Synthesis b/data/2024/aaai/HORIZON: High-Resolution Semantically Controlled Panorama Synthesis new file mode 100644 index 0000000000..c86b568287 --- /dev/null +++ b/data/2024/aaai/HORIZON: High-Resolution Semantically Controlled Panorama Synthesis @@ -0,0 +1 @@ +Panorama synthesis endeavors to craft captivating 360-degree visual landscapes, immersing users in the heart of virtual worlds. Nevertheless, contemporary panoramic synthesis techniques grapple with the challenge of semantically guiding the content generation process. Although recent breakthroughs in visual synthesis have unlocked the potential for semantic control in 2D flat images, a direct application of these methods to panorama synthesis yields distorted content. In this study, we unveil an innovative framework for generating high-resolution panoramas, adeptly addressing the issues of spherical distortion and edge discontinuity through sophisticated spherical modeling. Our pioneering approach empowers users with semantic control, harnessing both image and text inputs, while concurrently streamlining the generation of high-resolution panoramas using parallel decoding. We rigorously evaluate our methodology on a diverse array of indoor and outdoor datasets, establishing its superiority over recent related work, in terms of both quantitative and qualitative performance metrics. Our research elevates the controllability, efficiency, and fidelity of panorama synthesis to new levels. \ No newline at end of file diff --git a/data/2024/aaai/HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation b/data/2024/aaai/HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation new file mode 100644 index 0000000000..72f05d1c11 --- /dev/null +++ b/data/2024/aaai/HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation @@ -0,0 +1 @@ +Point-supervised Temporal Action Localization (PSTAL) is an emerging research direction for label-efficient learning. However, current methods mainly focus on optimizing the network either at the snippet-level or the instance-level, neglecting the inherent reliability of point annotations at both levels. In this paper, we propose a Hierarchical Reliability Propagation (HR-Pro) framework, which consists of two reliability-aware stages: Snippet-level Discrimination Learning and Instance-level Completeness Learning; both stages explore the efficient propagation of high-confidence cues in point annotations. For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class.
We then employ a Reliability-aware Attention Block to capture both intra-video and inter-video dependencies of snippets, resulting in more discriminative and robust snippet representations. For instance-level learning, we propose a point-based proposal generation approach as a means of connecting snippets and instances, which produces high-confidence proposals for further optimization at the instance level. Through multi-level reliability-aware learning, we obtain more reliable confidence scores and more accurate temporal boundaries of predicted proposals. Our HR-Pro achieves state-of-the-art performance on multiple challenging benchmarks, including an impressive average mAP of 60.3% on THUMOS14. Notably, our HR-Pro largely surpasses all previous point-supervised methods, and even outperforms several competitive fully-supervised methods. Code will be available at https://github.com/pipixin321/HR-Pro. \ No newline at end of file diff --git a/data/2024/aaai/Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling b/data/2024/aaai/Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling new file mode 100644 index 0000000000..960523870b --- /dev/null +++ b/data/2024/aaai/Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling @@ -0,0 +1 @@ +Hands are the main medium through which people interact with the world. Generating proper 3D motion for hand-object interaction is vital for applications such as virtual reality and robotics. Although grasp tracking or object manipulation synthesis can produce coarse hand motion, this kind of motion is inevitably noisy and full of jitter. To address this problem, we propose a data-driven method for coarse motion refinement. First, we design a hand-centric representation to describe the dynamic spatial-temporal relation between hands and objects. Compared to the object-centric representation, our hand-centric representation is straightforward and does not require an ambiguous projection process that converts object-based prediction into hand motion. Second, to capture the dynamic clues of hand-object interaction, we propose a new architecture that models the spatial and temporal structure in a hierarchical manner. Extensive experiments demonstrate that our method outperforms previous methods by a noticeable margin. \ No newline at end of file diff --git a/data/2024/aaai/Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning b/data/2024/aaai/Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning new file mode 100644 index 0000000000..6988bf3550 --- /dev/null +++ b/data/2024/aaai/Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning @@ -0,0 +1 @@ +Safety in goal-directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories, which have demonstrated good performance primarily in short-horizon tasks. In this paper, we are specifically interested in solving temporally extended decision-making problems, such as robots cleaning different areas in a house while avoiding slippery and unsafe areas (e.g., stairs) and retaining enough charge to move to a charging dock, in the presence of complex safety constraints.
Our key contribution is a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper-level constrained search agent (which computes a reward-maximizing policy from a given start to a far-away goal state while satisfying cost constraints) with a low-level goal-conditioned RL agent (which estimates cost and reward values to move between nearby states). A major advantage of CoSHRL is that it can handle constraints on the cost value distribution (e.g., on Conditional Value at Risk, CVaR) and can adjust to flexible constraint thresholds without retraining. We perform extensive experiments with different types of safety constraints to demonstrate the utility of our approach over leading approaches in constrained and hierarchical RL. \ No newline at end of file diff --git a/data/2024/aaai/Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation b/data/2024/aaai/Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation new file mode 100644 index 0000000000..3744cb1a5a --- /dev/null +++ b/data/2024/aaai/Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation @@ -0,0 +1 @@ +Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. Successful existing models have employed various techniques to avoid this problem, most of which require data augmentation or aim to make the average soft assignment across the dataset the same for each cluster. We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments. Using a Bayesian framework, we derive an intuitive optimization objective that can be straightforwardly included in the training of the encoder network. Tested on four image datasets, it avoids collapse more robustly than other methods and leads to more accurate clustering. We also conduct further experiments and analyses justifying our choice to regularize the hard cluster assignments. Code is available at https://github.com/Lou1sM/online_hard_clustering. \ No newline at end of file diff --git a/data/2024/aaai/Hardness of Random Reordered Encodings of Parity for Resolution and CDCL b/data/2024/aaai/Hardness of Random Reordered Encodings of Parity for Resolution and CDCL new file mode 100644 index 0000000000..3f04afc877 --- /dev/null +++ b/data/2024/aaai/Hardness of Random Reordered Encodings of Parity for Resolution and CDCL @@ -0,0 +1 @@ +Parity reasoning is challenging for Conflict-Driven Clause Learning (CDCL) SAT solvers. This has been observed even for simple formulas encoding two contradictory parity constraints with different variable orders (Chew and Heule 2020). We provide an analytical explanation for their hardness by showing that they require exponential resolution refutations with high probability when the variable order is chosen at random. We obtain this result by proving that these formulas, which are known to be Tseitin formulas, have Tseitin graphs of linear treewidth with high probability. Since such Tseitin formulas require exponential resolution refutations, our result follows.
We generalize this argument to a new class of formulas that capture a basic form of parity reasoning involving a sum of two random parity constraints with random orders. Even when the variable order for the sum is chosen favorably, these formulas remain hard for resolution. In contrast, we prove that they have short DRAT refutations. We show experimentally that the running time of CDCL SAT solvers on both classes of formulas grows exponentially with their treewidth. \ No newline at end of file diff --git a/data/2024/aaai/Harmonious Mobility for Robots that Work with and around People b/data/2024/aaai/Harmonious Mobility for Robots that Work with and around People new file mode 100644 index 0000000000..4fb4551bce --- /dev/null +++ b/data/2024/aaai/Harmonious Mobility for Robots that Work with and around People @@ -0,0 +1 @@ +The integration of advances from machine learning and computer vision with the classical autonomy stack has brought successful robot deployments in fulfilment, manufacturing, and transportation. However, unstructured and dynamic environments such as pedestrian spaces and streets, workplaces, and homes pose additional challenges such as modeling human behavior, understanding user perceptions, and ensuring human safety and comfort. My work addresses such challenges to enable robots to fluently work with and around people to increase productivity and assist users. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing Edge Information for Improved Robustness in Vision Transformers b/data/2024/aaai/Harnessing Edge Information for Improved Robustness in Vision Transformers new file mode 100644 index 0000000000..0434b33e6f --- /dev/null +++ b/data/2024/aaai/Harnessing Edge Information for Improved Robustness in Vision Transformers @@ -0,0 +1 @@ +Deep Neural Networks (DNNs) have demonstrated remarkable accuracy in vision classification tasks. However, they exhibit vulnerability to added perturbations known as adversarial attacks. Previous studies hypothesize that this vulnerability might stem from the fact that high-accuracy DNNs heavily rely on irrelevant and non-robust features, such as textures and the background. In this work, we reveal that edge information extracted from images can provide relevant and robust features related to shapes and the foreground. These features assist pretrained DNNs in achieving improved adversarial robustness without compromising their accuracy on clean images. A lightweight and plug-and-play EdgeNet is proposed, which can be seamlessly integrated into existing pretrained DNNs, including Vision Transformers, a recent family of state-of-the-art models for vision classification. Our EdgeNet can process edges derived from either clean natural images or noisy adversarial images, yielding robust features which can be injected into the intermediate layers of the frozen backbone DNNs. The cost of obtaining such edges using conventional edge detection algorithms (e.g., the Canny edge detector) is marginal, and the cost of training the EdgeNet is equivalent to that of fine-tuning the backbone network with techniques such as Adapter.
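The overall recipe suggests a sketch along the following lines (hedged: the input path, module sizes, and injection point are placeholder assumptions, not the paper's exact design): extract Canny edges with OpenCV, encode them with a small trainable adapter, and add the resulting tokens to an intermediate layer of a frozen ViT.

```python
# Hedged sketch: cheap Canny edges encoded by a small trainable adapter whose output
# would be added to the token features of a frozen ViT at an intermediate block.
import cv2
import torch
import torch.nn as nn

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)          # placeholder image path
edges = cv2.Canny(gray, 100, 200)                               # robust, low-cost edge map
edge_t = torch.from_numpy(edges).float().unsqueeze(0).unsqueeze(0) / 255.0  # (1, 1, H, W)

edge_net = nn.Sequential(                                       # lightweight adapter (trainable)
    nn.Conv2d(1, 32, kernel_size=3, stride=16, padding=1), nn.GELU(),
    nn.Conv2d(32, 768, kernel_size=1),                           # hypothetical ViT hidden size 768
)
edge_feat = edge_net(edge_t).flatten(2).transpose(1, 2)          # (1, num_patches, 768)
# During fine-tuning, only the adapter (Adapter-style) is updated; the ViT stays frozen,
# and edge_feat is added to the token features at a chosen intermediate layer.
```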
\ No newline at end of file diff --git a/data/2024/aaai/Harnessing Holistic Discourse Features and Triadic Interaction for Sentiment Quadruple Extraction in Dialogues b/data/2024/aaai/Harnessing Holistic Discourse Features and Triadic Interaction for Sentiment Quadruple Extraction in Dialogues new file mode 100644 index 0000000000..5a2b079c81 --- /dev/null +++ b/data/2024/aaai/Harnessing Holistic Discourse Features and Triadic Interaction for Sentiment Quadruple Extraction in Dialogues @@ -0,0 +1 @@ +Dialogue Aspect-based Sentiment Quadruple (DiaASQ) is a newly emerged task aiming to extract the sentiment quadruple (i.e., targets, aspects, opinions, and sentiments) from conversations. While showing promising performance, the prior DiaASQ approach unfortunately struggles with the key challenges of DiaASQ, including insufficient modeling of discourse features and a lack of explicit modeling for quadruple extraction, which hinders further task improvement. To this end, we introduce a novel framework that not only capitalizes on comprehensive discourse feature modeling, but also captures the intrinsic interaction for optimal quadruple extraction. On the one hand, drawing upon multiple discourse features, our approach constructs a token-level heterogeneous graph and enhances token interactions through a heterogeneous attention network. We further propose a novel triadic scorer, strengthening weak token relations within a quadruple, thereby enhancing the cohesion of the quadruple extraction. Experimental results on the DiaASQ benchmark showcase that our model significantly outperforms existing baselines across both English and Chinese datasets. Our code is available at https://bit.ly/3v27pqA. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models b/data/2024/aaai/Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models new file mode 100644 index 0000000000..c986ae05ad --- /dev/null +++ b/data/2024/aaai/Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models @@ -0,0 +1,3 @@ +Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at accelerating parallel workloads and dense vector-matrix multiplications. Potentially more efficient neural network models utilizing sparsity and recurrence cannot leverage the full power of SIMD processors and are thus at a severe disadvantage compared to today's prominent parallel architectures like Transformers and CNNs, thereby hindering the path towards more sustainable AI. +To overcome this limitation, we explore sparse and recurrent model training on a massively parallel multiple instruction multiple data (MIMD) architecture with distributed local memory. We implement a training routine based on backpropagation through time (BPTT) for the brain-inspired class of Spiking Neural Networks (SNNs) that feature binary sparse activations. We observe a massive advantage in using sparse activation tensors with a MIMD processor, the Intelligence Processing Unit (IPU), compared to GPUs. On training workloads, our results demonstrate 5-10x throughput gains compared to A100 GPUs and up to 38x gains for higher levels of activation sparsity, without a significant slowdown in training convergence or reduction in final model performance.
Furthermore, our results show highly promising trends for both single and multi IPU configurations as we scale up to larger model sizes. +Our work paves the way towards more efficient, non-standard models via AI training hardware beyond GPUs, and competitive large scale SNN models. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing Network Effect for Fake News Mitigation: Selecting Debunkers via Self-Imitation Learning b/data/2024/aaai/Harnessing Network Effect for Fake News Mitigation: Selecting Debunkers via Self-Imitation Learning new file mode 100644 index 0000000000..b2b13bd5d1 --- /dev/null +++ b/data/2024/aaai/Harnessing Network Effect for Fake News Mitigation: Selecting Debunkers via Self-Imitation Learning @@ -0,0 +1 @@ +This study aims to minimize the influence of fake news on social networks by deploying debunkers to propagate true news. This is framed as a reinforcement learning problem, where, at each stage, one user is selected to propagate true news. A challenging issue is episodic reward where the "net" effect of selecting individual debunkers cannot be discerned from the interleaving information propagation on social networks, and only the collective effect from mitigation efforts can be observed. Existing Self-Imitation Learning (SIL) methods have shown promise in learning from episodic rewards, but are ill-suited to the real-world application of fake news mitigation because of their poor sample efficiency. To learn a more effective debunker selection policy for fake news mitigation, this study proposes NAGASIL - Negative sampling and state Augmented Generative Adversarial Self-Imitation Learning, which consists of two improvements geared towards fake news mitigation: learning from negative samples, and an augmented state representation to capture the "real" environment state by integrating the current observed state with the previous state-action pairs from the same campaign. Experiments on two social networks show that NAGASIL yields superior performance to standard GASIL and state-of-the-art fake news mitigation models. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing the Power of Beta Scoring in Deep Active Learning for Multi-Label Text Classification b/data/2024/aaai/Harnessing the Power of Beta Scoring in Deep Active Learning for Multi-Label Text Classification new file mode 100644 index 0000000000..955fb4de54 --- /dev/null +++ b/data/2024/aaai/Harnessing the Power of Beta Scoring in Deep Active Learning for Multi-Label Text Classification @@ -0,0 +1 @@ +Within the scope of natural language processing, the domain of multi-label text classification is uniquely challenging due to its expansive and uneven label distribution. The complexity deepens due to the demand for an extensive set of annotated data for training an advanced deep learning model, especially in specialized fields where the labeling task can be labor-intensive and often requires domain-specific knowledge. Addressing these challenges, our study introduces a novel deep active learning strategy, capitalizing on the Beta family of proper scoring rules within the Expected Loss Reduction framework. It computes the expected increase in scores using the Beta Scoring Rules, which are then transformed into sample vector representations. These vector representations guide the diverse selection of informative sample, directly linking this process to the model's expected proper score. 
Comprehensive evaluations across both synthetic and real datasets reveal our method's capability to often outperform established acquisition techniques in multi-label text classification, presenting encouraging outcomes across various architectural and dataset scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing the Power of SVD: An SVA Module for Enhanced Signal Classification b/data/2024/aaai/Harnessing the Power of SVD: An SVA Module for Enhanced Signal Classification new file mode 100644 index 0000000000..f69af3feb3 --- /dev/null +++ b/data/2024/aaai/Harnessing the Power of SVD: An SVA Module for Enhanced Signal Classification @@ -0,0 +1 @@ +Deep learning methods have achieved outstanding performance in various signal tasks. However, due to degraded signals in real electromagnetic environments, it is crucial to seek methods that can improve the representation of signal features. In this paper, a Singular Value decomposition-based Attention (SVA) module is proposed to explore the structure of signal data and adaptively enhance intrinsic features. Using a deep neural network as a base model, SVA performs feature semantic subspace learning through a decomposition layer and combines it with an attention layer to achieve adaptive enhancement of signal features. Moreover, we consider the gradient explosion problem brought by SVA and optimize SVA to improve the stability of training. Extensive experimental results demonstrate that applying SVA to a generalized classification model can significantly improve its representational ability, making its recognition performance competitive with, or even better than, the state-of-the-art task-specific models. \ No newline at end of file diff --git a/data/2024/aaai/Hawkes-Enhanced Spatial-Temporal Hypergraph Contrastive Learning Based on Criminal Correlations b/data/2024/aaai/Hawkes-Enhanced Spatial-Temporal Hypergraph Contrastive Learning Based on Criminal Correlations new file mode 100644 index 0000000000..bad00195a9 --- /dev/null +++ b/data/2024/aaai/Hawkes-Enhanced Spatial-Temporal Hypergraph Contrastive Learning Based on Criminal Correlations @@ -0,0 +1 @@ +Crime prediction is a crucial yet challenging task within urban computing, which benefits public safety and resource optimization. Over the years, various models have been proposed, and spatial-temporal hypergraph learning models have recently shown outstanding performance. However, three correlations underlying crime are ignored, thus hindering the performance of previous models. Specifically, there are two spatial correlations and one temporal correlation, i.e., (1) co-occurrence of different types of crimes (type spatial correlation), (2) the closer to the crime center, the more dangerous the neighborhood area is (neighbor spatial correlation), and (3) the closer two timestamps are, the more relevant the events are (Hawkes temporal correlation). To this end, we propose the Hawkes-enhanced Spatial-Temporal Hypergraph Contrastive Learning framework (HCL), which mines the aforementioned correlations via two specific strategies. Concretely, contrastive learning strategies are designed for the two spatial correlations, and Hawkes process modeling is adopted for the temporal correlation. Extensive experiments demonstrate the promising capacities of HCL from four aspects, i.e., superiority, transferability, effectiveness, and sensitivity.
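To make the Hawkes temporal correlation concrete, an exponential excitation kernel simply down-weights pairs of events as their time gap grows; a minimal sketch (the kernel form and decay rate here are illustrative, not HCL's exact formulation):

    import numpy as np

    def hawkes_weights(timestamps, decay=0.1):
        """Pairwise relevance w[i, j] = exp(-decay * |t_i - t_j|): closer events matter more."""
        t = np.asarray(timestamps, dtype=float)
        gaps = np.abs(t[:, None] - t[None, :])
        return np.exp(-decay * gaps)

    # Example: three crime events at hours 0, 1, and 10.
    print(hawkes_weights([0.0, 1.0, 10.0]).round(3))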
\ No newline at end of file diff --git a/data/2024/aaai/Hear You Say You: An Efficient Framework for Marine Mammal Sounds' Classification b/data/2024/aaai/Hear You Say You: An Efficient Framework for Marine Mammal Sounds' Classification new file mode 100644 index 0000000000..c86bfb81b9 --- /dev/null +++ b/data/2024/aaai/Hear You Say You: An Efficient Framework for Marine Mammal Sounds' Classification @@ -0,0 +1 @@ +Marine mammals and their ecosystem face significant threats from, for example, military active sonar and marine transportation. To mitigate this harm, early detection and classification of marine mammals are essential. While recent efforts have utilized spectrogram analysis and machine learning techniques, there remain challenges in their efficiency. Therefore, we propose a novel knowledge distillation framework, named XCFSMN, for this problem. We construct a teacher model that fuses the features extracted from an X-vector extractor, a DenseNet and Cross-Covariance attended compact Feed-Forward Sequential Memory Network (cFSMN). The teacher model transfers knowledge to a simpler cFSMN model through a temperature-cooling strategy for efficient learning. Compared to multiple convolutional neural network backbones and transformers, the proposed framework achieves state-of-the-art efficiency and performance. The improved model size is approximately 20 times smaller and the inference time can be 10 times shorter without affecting the model’s accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Heterogeneous Test-Time Training for Multi-Modal Person Re-identification b/data/2024/aaai/Heterogeneous Test-Time Training for Multi-Modal Person Re-identification new file mode 100644 index 0000000000..46240c4b7e --- /dev/null +++ b/data/2024/aaai/Heterogeneous Test-Time Training for Multi-Modal Person Re-identification @@ -0,0 +1 @@ +Multi-modal person re-identification (ReID) seeks to mitigate challenging lighting conditions by incorporating diverse modalities. Most existing multi-modal ReID methods concentrate on leveraging complementary multi-modal information via fusion or interaction. However, the relationships among heterogeneous modalities and the domain traits of unlabeled test data are rarely explored. In this paper, we propose a Heterogeneous Test-time Training (HTT) framework for multi-modal person ReID. We first propose a Cross-identity Inter-modal Margin (CIM) loss to amplify the differentiation among distinct identity samples. Moreover, we design a Multi-modal Test-time Training (MTT) strategy to enhance the generalization of the model by leveraging the relationships in the heterogeneous modalities and the information existing in the test data. Specifically, in the training stage, we utilize the CIM loss to further enlarge the distance between anchor and negative by forcing the inter-modal distance to maintain the margin, resulting in an enhancement of the discriminative capacity of the ultimate descriptor. Subsequently, since the test data contains characteristics of the target domain, we adapt the MTT strategy to optimize the network before the inference by using self-supervised tasks designed based on relationships among modalities. Experimental results on benchmark multi-modal ReID datasets RGBNT201, Market1501-MM, RGBN300, and RGBNT100 validate the effectiveness of the proposed method. The codes can be found at https://github.com/ziwang1121/HTT. 
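One way to read the CIM loss described above is as a margin constraint on cross-identity, inter-modal pairs; a schematic PyTorch sketch (the distance, reduction, and margin value are assumptions, not the authors' exact formulation):

    import torch
    import torch.nn.functional as F

    def cim_loss(anchor_feats, negative_feats, margin=0.3):
        """Penalize cross-identity inter-modal pairs whose distance falls below the margin.

        anchor_feats:   (B, D) features from one modality (e.g., RGB)
        negative_feats: (B, D) different-identity features from another modality (e.g., NIR/TIR)
        """
        a = F.normalize(anchor_feats, dim=1)
        n = F.normalize(negative_feats, dim=1)
        dist = (a - n).pow(2).sum(dim=1).sqrt()      # Euclidean distance per pair
        return F.relu(margin - dist).mean()          # zero loss once the margin is maintained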
\ No newline at end of file diff --git a/data/2024/aaai/HiFi-Gas: Hierarchical Federated Learning Incentive Mechanism Enhanced Gas Usage Estimation b/data/2024/aaai/HiFi-Gas: Hierarchical Federated Learning Incentive Mechanism Enhanced Gas Usage Estimation new file mode 100644 index 0000000000..fe61c6d406 --- /dev/null +++ b/data/2024/aaai/HiFi-Gas: Hierarchical Federated Learning Incentive Mechanism Enhanced Gas Usage Estimation @@ -0,0 +1 @@ +Gas usage estimation plays a critical role in various aspects of the power generation and delivery business, including budgeting, resource planning, and environmental preservation. Federated Learning (FL) has demonstrated its potential in enhancing the accuracy and reliability of gas usage estimation by enabling distributedly owned data to be leveraged, while ensuring privacy and confidentiality. However, to effectively motivate stakeholders to contribute their high-quality local data and computational resources for this purpose, incentive mechanism design is key. In this paper, we report our experience designing and deploying the Hierarchical FL Incentive mechanism for Gas usage estimation (HiFi-Gas) system. It is designed to cater to the unique structure of gas companies and their affiliated heating stations. HiFi-Gas provides effective incentivization in a hierarchical federated learning framework that consists of a horizontal federated learning (HFL) component for effective collaboration among gas companies and multiple vertical federated learning (VFL) components for the gas company and its affiliated heating stations. To motivate active participation and ensure fairness among gas companies and heating stations, we incorporate a multi-dimensional contribution-aware reward distribution function that considers both data quality and model contributions. Since its deployment in the ENN Group in December 2022, HiFi-Gas has successfully provided incentives for gas companies and heating stations to actively participate in FL training, resulting in more than 12% higher average gas usage estimation accuracy and substantial gas procurement cost savings. This implementation marks the first successful deployment of a hierarchical FL incentive approach in the energy industry. \ No newline at end of file diff --git a/data/2024/aaai/HiHPQ: Hierarchical Hyperbolic Product Quantization for Unsupervised Image Retrieval b/data/2024/aaai/HiHPQ: Hierarchical Hyperbolic Product Quantization for Unsupervised Image Retrieval new file mode 100644 index 0000000000..f295351bdd --- /dev/null +++ b/data/2024/aaai/HiHPQ: Hierarchical Hyperbolic Product Quantization for Unsupervised Image Retrieval @@ -0,0 +1 @@ +Existing unsupervised deep product quantization methods primarily aim for the increased similarity between different views of the identical image, whereas the delicate multi-level semantic similarities preserved between images are overlooked. Moreover, these methods predominantly focus on the Euclidean space for computational convenience, compromising their ability to map the multi-level semantic relationships between images effectively. To mitigate these shortcomings, we propose a novel unsupervised product quantization method dubbed Hierarchical Hyperbolic Product Quantization (HiHPQ), which learns quantized representations by incorporating hierarchical semantic similarity within hyperbolic geometry. 
Specifically, we propose a hyperbolic product quantizer, where the hyperbolic codebook attention mechanism and the quantized contrastive learning on the hyperbolic product manifold are introduced to expedite quantization. Furthermore, we propose a hierarchical semantics learning module, designed to enhance the distinction between similar and non-matching images for a query by utilizing the extracted hierarchical semantics as an additional training supervision. Experiments on benchmark image datasets show that our proposed method outperforms state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Hidden Follower Detection: How Is the Gaze-Spacing Pattern Embodied in Frequency Domain? b/data/2024/aaai/Hidden Follower Detection: How Is the Gaze-Spacing Pattern Embodied in Frequency Domain? new file mode 100644 index 0000000000..3e42f12b2e --- /dev/null +++ b/data/2024/aaai/Hidden Follower Detection: How Is the Gaze-Spacing Pattern Embodied in Frequency Domain? @@ -0,0 +1 @@ +Spatiotemporal social behavior analysis is a technique that studies the social behavior patterns of objects and estimates their risks based on their trajectories. In social public scenarios such as train stations, hidden following behavior has become one of the most challenging issues due to its probability of evolving into violent events, which exceeds 25%. In recent years, research on hidden following detection (HFD) has focused on differences in time series between hidden followers and normal pedestrians under two temporal characteristics: gaze and spatial distance. However, the time-domain representation of time series is irreversible and usually causes the loss of critical information. In this paper, we deeply study the expression efficiency of time/frequency-domain features of time series. By exploring how features can be recovered back to the source time series, we establish a fidelity estimation method for feature expression and a selection model for frequency-domain features based on the signal-to-distortion ratio (SDR). Experimental results demonstrate that feature fidelity and HFD performance are positively correlated, and that the fidelity and HFD performance of frequency-domain features are significantly better than those of time-domain features. On both real and simulated datasets, the accuracy of the proposed method is increased by 3%, and the gaze-only module is improved by 10%. Related research has explored new methods for optimal feature selection based on fidelity, new patterns for efficient feature expression of hidden following behavior, and the mechanism of multimodal collaborative identification. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchical Aligned Multimodal Learning for NER on Tweet Posts b/data/2024/aaai/Hierarchical Aligned Multimodal Learning for NER on Tweet Posts new file mode 100644 index 0000000000..93a4a70ce9 --- /dev/null +++ b/data/2024/aaai/Hierarchical Aligned Multimodal Learning for NER on Tweet Posts @@ -0,0 +1 @@ +Mining structured knowledge from tweets using named entity recognition (NER) can be beneficial for many downstream applications such as recommendation and intention understanding. With tweet posts tending to be multimodal, multimodal named entity recognition (MNER) has attracted more attention. In this paper, we propose a novel approach, which can dynamically align the image and text sequence and achieve multi-level cross-modal learning to augment textual word representation for MNER improvement.
To be specific, our framework can be split into three main stages: the first stage focuses on intra-modality representation learning to derive the implicit global and local knowledge of each modality, the second evaluates the relevance between the text and its accompanying image and integrates different grained visual information based on the relevance, the third enforces semantic refinement via iterative cross-modal interactions and co-attention. We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchical Multi-Marginal Optimal Transport for Network Alignment b/data/2024/aaai/Hierarchical Multi-Marginal Optimal Transport for Network Alignment new file mode 100644 index 0000000000..e93d793364 --- /dev/null +++ b/data/2024/aaai/Hierarchical Multi-Marginal Optimal Transport for Network Alignment @@ -0,0 +1 @@ +Finding node correspondence across networks, namely multi-network alignment, is an essential prerequisite for joint learning on multiple networks. Despite great success in aligning networks in pairs, the literature on multi-network alignment is sparse due to the exponentially growing solution space and lack of high-order discrepancy measures. To fill this gap, we propose a hierarchical multi-marginal optimal transport framework named HOT for multi-network alignment. To handle the large solution space, multiple networks are decomposed into smaller aligned clusters via the fused Gromov-Wasserstein (FGW) barycenter. To depict high-order relationships across multiple networks, the FGW distance is generalized to the multi-marginal setting, based on which networks can be aligned jointly. A fast proximal point method is further developed with guaranteed convergence to a local optimum. Extensive experiments and analysis show that our proposed HOT achieves significant improvements over the state-of-the-art in both effectiveness and scalability. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchical Planning and Learning for Robots in Stochastic Settings Using Zero-Shot Option Invention b/data/2024/aaai/Hierarchical Planning and Learning for Robots in Stochastic Settings Using Zero-Shot Option Invention new file mode 100644 index 0000000000..d247b9343a --- /dev/null +++ b/data/2024/aaai/Hierarchical Planning and Learning for Robots in Stochastic Settings Using Zero-Shot Option Invention @@ -0,0 +1 @@ +This paper addresses the problem of inventing and using hierarchical representations for stochastic robot-planning problems. Rather than using hand-coded state or action representations as input, it presents new methods for learning how to create a high-level action representation for long-horizon, sparse reward robot planning problems in stochastic settings with unknown dynamics. After training, this system yields a robot-specific but environment independent planning system. Given new problem instances in unseen stochastic environments, it first creates zero-shot options (without any experience on the new environment) with dense pseudo-rewards and then uses them to solve the input problem in a hierarchical planning and refinement process. Theoretical results identify sufficient conditions for completeness of the presented approach. 
Extensive empirical analysis shows that even in settings that go beyond these sufficient conditions, this approach convincingly outperforms baselines by 2x in terms of solution time with orders of magnitude improvement in solution quality. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchical and Incremental Structural Entropy Minimization for Unsupervised Social Event Detection b/data/2024/aaai/Hierarchical and Incremental Structural Entropy Minimization for Unsupervised Social Event Detection new file mode 100644 index 0000000000..ad3cd58b3d --- /dev/null +++ b/data/2024/aaai/Hierarchical and Incremental Structural Entropy Minimization for Unsupervised Social Event Detection @@ -0,0 +1 @@ +As a trending approach for social event detection, graph neural network (GNN)-based methods enable a fusion of natural language semantics and the complex social network structural information, thus showing SOTA performance. However, GNN-based methods can miss useful message correlations. Moreover, they require manual labeling for training and predetermining the number of events for prediction. In this work, we address social event detection via graph structural entropy (SE) minimization. While keeping the merits of the GNN-based methods, the proposed framework, HISEvent, constructs more informative message graphs, is unsupervised, and does not require the number of events given a priori. Specifically, we incrementally explore the graph neighborhoods using 1-dimensional (1D) SE minimization to supplement the existing message graph with edges between semantically related messages. We then detect events from the message graph by hierarchically minimizing 2-dimensional (2D) SE. Our proposed 1D and 2D SE minimization algorithms are customized for social event detection and effectively tackle the efficiency problem of the existing SE minimization algorithms. Extensive experiments show that HISEvent consistently outperforms GNN-based methods and achieves the new SOTA for social event detection under both closed- and open-set settings while being efficient and robust. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchize Pareto Dominance in Multi-Objective Stochastic Linear Bandits b/data/2024/aaai/Hierarchize Pareto Dominance in Multi-Objective Stochastic Linear Bandits new file mode 100644 index 0000000000..66f7939038 --- /dev/null +++ b/data/2024/aaai/Hierarchize Pareto Dominance in Multi-Objective Stochastic Linear Bandits @@ -0,0 +1 @@ +Multi-objective Stochastic Linear bandit (MOSLB) plays a critical role in the sequential decision-making paradigm, however, most existing methods focus on the Pareto dominance among different objectives without considering any priority. In this paper, we study bandit algorithms under mixed Pareto-lexicographic orders, which can reflect decision makers' preferences. We adopt the Grossone approach to deal with these orders and develop the notion of Pareto-lexicographic optimality to evaluate the learners' performance. Our work represents a first attempt to address these important and realistic orders in bandit algorithms. To design algorithms under these orders, the upper confidence bound (UCB) policy and the prior free lexicographical filter are adapted to approximate the optimal arms at each round. Moreover, the framework of the algorithms involves two stages in pursuit of the balance between exploration and exploitation. Theoretical analysis as well as numerical experiments demonstrate the effectiveness of our algorithms. 
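The Pareto component of such mixed orders is straightforward to state in code; a small helper for keeping only non-dominated arms given estimated reward vectors (illustrative only; the paper's algorithms additionally use UCB-style confidence bounds and a lexicographic filter):

    import numpy as np

    def dominates(u, v):
        """u Pareto-dominates v if it is no worse in every objective and better in at least one."""
        u, v = np.asarray(u), np.asarray(v)
        return np.all(u >= v) and np.any(u > v)

    def pareto_front(reward_estimates):
        """Return indices of arms whose estimated reward vectors are not dominated by any other arm."""
        front = []
        for i, u in enumerate(reward_estimates):
            if not any(dominates(v, u) for j, v in enumerate(reward_estimates) if j != i):
                front.append(i)
        return front

    # Example with three arms and two objectives: the third arm is dominated by the first.
    print(pareto_front([[0.5, 0.9], [0.6, 0.7], [0.4, 0.6]]))  # -> [0, 1]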
\ No newline at end of file diff --git a/data/2024/aaai/High Significant Fault Detection in Azure Core Workload Insights b/data/2024/aaai/High Significant Fault Detection in Azure Core Workload Insights new file mode 100644 index 0000000000..a919fc2d7d --- /dev/null +++ b/data/2024/aaai/High Significant Fault Detection in Azure Core Workload Insights @@ -0,0 +1 @@ +Azure Core workload insights have time-series data with different metric units. Faults or Anomalies are observed in these time-series data owing to faults observed with respect to metric name, resources region, dimensions, and its dimension value associated with the data. For Azure Core, an important task is to highlight faults or anomalies to the user on a dashboard that they can perceive easily. The number of anomalies reported should be highly significant and in a limited number, e.g., 5-20 anomalies reported per hour. The reported anomalies will have significant user perception and high reconstruction error in any time-series forecasting model. Hence, our task is to automatically identify 'high significant anomalies' and their associated information for user perception. \ No newline at end of file diff --git a/data/2024/aaai/High-Dimensional Analysis for Generalized Nonlinear Regression: From Asymptotics to Algorithm b/data/2024/aaai/High-Dimensional Analysis for Generalized Nonlinear Regression: From Asymptotics to Algorithm new file mode 100644 index 0000000000..0e6cbaca8b --- /dev/null +++ b/data/2024/aaai/High-Dimensional Analysis for Generalized Nonlinear Regression: From Asymptotics to Algorithm @@ -0,0 +1 @@ +Overparameterization often leads to benign overfitting, where deep neural networks can be trained to overfit the training data but still generalize well on unseen data. However, it lacks a generalized asymptotic framework for nonlinear regressions and connections to conventional complexity notions. In this paper, we propose a generalized high-dimensional analysis for nonlinear regression models, including various nonlinear feature mapping methods and subsampling. Specifically, we first provide an implicit regularization parameter and asymptotic equivalents related to a classical complexity notion, i.e., effective dimension. We then present a high-dimensional analysis for nonlinear ridge regression and extend it to ridgeless regression in the under-parameterized and over-parameterized regimes, respectively. We find that the limiting risks decrease with the effective dimension. Motivated by these theoretical findings, we propose an algorithm, namely RFRed, to improve generalization ability. Finally, we validate our theoretical findings and the proposed algorithm through several experiments. \ No newline at end of file diff --git a/data/2024/aaai/High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field b/data/2024/aaai/High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field new file mode 100644 index 0000000000..1fd71ca721 --- /dev/null +++ b/data/2024/aaai/High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field @@ -0,0 +1 @@ +One crucial aspect of 3D head avatar reconstruction lies in the details of facial expressions. 
Although recent NeRF-based photo-realistic 3D head avatar methods achieve high-quality avatar rendering, they still encounter challenges in retaining intricate facial expression details because they overlook the potential of specific expression variations at different spatial positions when conditioning the radiance field. Motivated by this observation, we introduce a novel Spatially-Varying Expression (SVE) conditioning. The SVE can be obtained by a simple MLP-based generation network, encompassing both spatial positional features and global expression information. Benefiting from the rich and diverse information of the SVE at different positions, the proposed SVE-conditioned NeRF can deal with intricate facial expressions and achieve realistic rendering and geometry details of high-fidelity 3D head avatars. Additionally, to further elevate the geometric and rendering quality, we introduce a new coarse-to-fine training strategy, including a geometry initialization strategy at the coarse stage and an adaptive importance sampling strategy at the fine stage. Extensive experiments indicate that our method outperforms other state-of-the-art (SOTA) methods in rendering and geometry quality on mobile phone-collected and public datasets. Code and data can be found at https://github.com/minghanqin/AvatarSVE. \ No newline at end of file diff --git a/data/2024/aaai/High-Fidelity Diffusion-Based Image Editing b/data/2024/aaai/High-Fidelity Diffusion-Based Image Editing new file mode 100644 index 0000000000..17fac845a9 --- /dev/null +++ b/data/2024/aaai/High-Fidelity Diffusion-Based Image Editing @@ -0,0 +1 @@ +Diffusion models have attained remarkable success in the domains of image generation and editing. It is widely recognized that employing more inversion and denoising steps in diffusion models leads to improved image reconstruction quality. However, the editing performance of diffusion models tends to remain unsatisfactory even with increasing denoising steps. The deficiency in editing could be attributed to the conditional Markovian property of the editing process, where errors accumulate throughout denoising steps. To tackle this challenge, we first propose an innovative framework where a rectifier module is incorporated to modulate diffusion model weights with residual features from the original images, thereby providing compensatory information to bridge the fidelity gap. Furthermore, we introduce a novel learning paradigm aimed at minimizing error propagation during the editing process, which trains the editing procedure in a manner similar to denoising score-matching. Extensive experiments demonstrate that our proposed framework and training strategy achieve high-fidelity reconstruction and editing results across various levels of denoising steps, while exhibiting exceptional performance in terms of both quantitative metrics and qualitative assessments. Lastly, we explore our model's generalization through several applications like image-to-image translation and out-of-domain image editing. \ No newline at end of file diff --git a/data/2024/aaai/High-Fidelity Gradient Inversion in Distributed Learning b/data/2024/aaai/High-Fidelity Gradient Inversion in Distributed Learning new file mode 100644 index 0000000000..6204b06aca --- /dev/null +++ b/data/2024/aaai/High-Fidelity Gradient Inversion in Distributed Learning @@ -0,0 +1 @@ +Distributed learning frameworks aim to train global models by sharing gradients among clients while preserving the data privacy of each individual client.
However, extensive research has demonstrated that these learning frameworks do not absolutely ensure the privacy, as training data can be reconstructed from shared gradients. Nevertheless, the existing privacy-breaking attack methods have certain limitations. Some are applicable only to small models, while others can only recover images in small batch size and low resolutions, or with low fidelity. Furthermore, when there are some data with the same label in a training batch, existing attack methods usually perform poorly. In this work, we successfully address the limitations of existing attacks by two steps. Firstly, we model the coefficient of variation (CV) of features and design an evolutionary algorithm based on the minimum CV to accurately reconstruct the labels of all training data. After that, we propose a stepwise gradient inversion attack, which dynamically adapts the objective function, thereby effectively and rationally promoting the convergence of attack results towards an optimal solution. With these two steps, our method is able to recover high resolution images (224*224 pixel, from ImageNet and Web) with high fidelity in distributed learning scenarios involving complex models and larger batch size. Experiment results demonstrate the superiority of our approach, reveal the potential vulnerabilities of the distributed learning paradigm, and emphasize the necessity of developing more secure mechanisms. Source code is available at https://github.com/MiLab-HITSZ/2023YeHFGradInv. \ No newline at end of file diff --git a/data/2024/aaai/High-Order Structure Based Middle-Feature Learning for Visible-Infrared Person Re-identification b/data/2024/aaai/High-Order Structure Based Middle-Feature Learning for Visible-Infrared Person Re-identification new file mode 100644 index 0000000000..312b4ae873 --- /dev/null +++ b/data/2024/aaai/High-Order Structure Based Middle-Feature Learning for Visible-Infrared Person Re-identification @@ -0,0 +1 @@ +Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same persons captured by visible (VIS) and infrared (IR) cameras. Existing VI-ReID methods ignore high-order structure information of features while being relatively difficult to learn a reasonable common feature space due to the large modality discrepancy between VIS and IR images. To address the above problems, we propose a novel high-order structure based middle-feature learning network (HOS-Net) for effective VI-ReID. Specifically, we first leverage a short- and long-range feature extraction (SLE) module to effectively exploit both short-range and long-range features. Then, we propose a high-order structure learning (HSL) module to successfully model the high-order relationship across different local features of each person image based on a whitened hypergraph network. This greatly alleviates model collapse and enhances feature representations. Finally, we develop a common feature space learning (CFL) module to learn a discriminative and reasonable common feature space based on middle features generated by aligning features from different modalities and ranges. In particular, a modality-range identity-center contrastive (MRIC) loss is proposed to reduce the distances between the VIS, IR, and middle features, smoothing the training process. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets show that our HOS-Net achieves superior state-of-the-art performance. Our code is available at https://github.com/Jaulaucoeng/HOS-Net. 
\ No newline at end of file diff --git a/data/2024/aaai/High-Quality Real-Time Rendering Using Subpixel Sampling Reconstruction b/data/2024/aaai/High-Quality Real-Time Rendering Using Subpixel Sampling Reconstruction new file mode 100644 index 0000000000..72bcdfecc2 --- /dev/null +++ b/data/2024/aaai/High-Quality Real-Time Rendering Using Subpixel Sampling Reconstruction @@ -0,0 +1 @@ +Generating high-quality, realistic rendering images for real-time applications generally requires tracing a few samples-per-pixel (spp) and using deep learning-based approaches to denoise the resulting low-spp images. Existing denoising methods necessitate a substantial time expenditure when rendering at high resolutions due to the physically-based sampling and network inference time burdens. In this paper, we propose a novel Monte Carlo sampling strategy to accelerate the sampling process and a corresponding denoiser, subpixel sampling reconstruction (SSR), to obtain high-quality images. Extensive experiments demonstrate that our method significantly outperforms previous approaches in denoising quality and reduces overall time costs, enabling real-time rendering capabilities at 2K resolution. \ No newline at end of file diff --git a/data/2024/aaai/History Matters: Temporal Knowledge Editing in Large Language Model b/data/2024/aaai/History Matters: Temporal Knowledge Editing in Large Language Model new file mode 100644 index 0000000000..2c382a6bf3 --- /dev/null +++ b/data/2024/aaai/History Matters: Temporal Knowledge Editing in Large Language Model @@ -0,0 +1 @@ +The imperative task of revising or updating the knowledge stored within large language models arises from two distinct sources: intrinsic errors inherent in the model which should be corrected and outdated knowledge due to external shifts in the real world which should be updated. Prevailing efforts in model editing conflate these two distinct categories of edits arising from distinct reasons and directly modify the original knowledge in models into new knowledge. However, we argue that preserving the model's original knowledge remains pertinent. Specifically, if a model's knowledge becomes outdated due to evolving worldly dynamics, it should retain recollection of the historical knowledge while integrating the newfound knowledge. In this work, we introduce the task of Temporal Knowledge Editing (TKE) and establish a benchmark AToKe (Assessment of TempOral Knowledge Editing) to evaluate current model editing methods. We find that while existing model editing methods are effective at making models remember new knowledge, the edited model catastrophically forgets historical knowledge. To address this gap, we propose a simple and general framework termed Multi-Editing with Time Objective (METO) for enhancing existing editing models, which edits both historical and new knowledge concurrently and optimizes the model's prediction for the time of each fact. Our assessments demonstrate that while AToKe is still difficult, METO maintains the effectiveness of learning new knowledge and meanwhile substantially improves the performance of edited models on utilizing historical knowledge. 
\ No newline at end of file diff --git a/data/2024/aaai/Homophily-Related: Adaptive Hybrid Graph Filter for Multi-View Graph Clustering b/data/2024/aaai/Homophily-Related: Adaptive Hybrid Graph Filter for Multi-View Graph Clustering new file mode 100644 index 0000000000..4e99231cbc --- /dev/null +++ b/data/2024/aaai/Homophily-Related: Adaptive Hybrid Graph Filter for Multi-View Graph Clustering @@ -0,0 +1 @@ +Recently there is a growing focus on graph data, and multi-view graph clustering has become a popular area of research interest. Most of the existing methods are only applicable to homophilous graphs, yet the extensive real-world graph data can hardly fulfill the homophily assumption, where the connected nodes tend to belong to the same class. Several studies have pointed out that the poor performance on heterophilous graphs is actually due to the fact that conventional graph neural networks (GNNs), which are essentially low-pass filters, discard information other than the low-frequency information on the graph. Nevertheless, on certain graphs, particularly heterophilous ones, neglecting high-frequency information and focusing solely on low-frequency information impedes the learning of node representations. To break this limitation, our motivation is to perform graph filtering that is closely related to the homophily degree of the given graph, with the aim of fully leveraging both low-frequency and high-frequency signals to learn distinguishable node embedding. In this work, we propose Adaptive Hybrid Graph Filter for Multi-View Graph Clustering (AHGFC). Specifically, a graph joint process and graph joint aggregation matrix are first designed by using the intrinsic node features and adjacency relationship, which makes the low and high-frequency signals on the graph more distinguishable. Then we design an adaptive hybrid graph filter that is related to the homophily degree, which learns the node embedding based on the graph joint aggregation matrix. After that, the node embedding of each view is weighted and fused into a consensus embedding for the downstream task. Experimental results show that our proposed model performs well on six datasets containing homophilous and heterophilous graphs. \ No newline at end of file diff --git a/data/2024/aaai/Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models b/data/2024/aaai/Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models new file mode 100644 index 0000000000..629780096e --- /dev/null +++ b/data/2024/aaai/Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models @@ -0,0 +1 @@ +Recently, Large Language Models (LLMs) have shown impressive abilities in code generation. However, existing LLMs' decoding strategies are designed for Natural Language (NL) generation, overlooking the differences between NL and programming languages (PL). Due to this oversight, a better decoding strategy for code generation remains an open question. In this paper, we conduct the first systematic study to explore a decoding strategy specialized in code generation. With an analysis of loss distributions of code tokens, we find that code tokens can be divided into two categories: challenging tokens that are difficult to predict and confident tokens that can be easily inferred. Among them, the challenging tokens mainly appear at the beginning of a code block. 
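The passage that follows introduces AdapT sampling, which adjusts the temperature per token; as a rough preview, per-token temperature scaling can be sketched like this (the challenging-token test and both temperature values are placeholders, not the paper's settings):

    import torch

    def adaptive_temperature_sample(logits, is_challenging, t_high=1.2, t_low=0.6):
        """Sample one token, using a higher temperature when the position is deemed challenging.

        logits:         (vocab,) next-token logits
        is_challenging: bool flag, e.g. the position starts a code block or the model is uncertain
        """
        temperature = t_high if is_challenging else t_low
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)

    # A crude uncertainty-based flag: treat low-confidence positions as challenging.
    def looks_challenging(logits, confidence_threshold=0.5):
        return torch.softmax(logits, dim=-1).max().item() < confidence_threshold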
Inspired by the above findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling, which dynamically adjusts the temperature coefficient when decoding different tokens. We apply a larger temperature when sampling for challenging tokens, allowing LLMs to explore diverse choices. We employ a smaller temperature for confident tokens avoiding the influence of tail randomness noises. We apply AdapT sampling to LLMs with different sizes and conduct evaluations on two popular datasets. Results show that AdapT sampling significantly outperforms state-of-the-art decoding strategy. \ No newline at end of file diff --git a/data/2024/aaai/How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes b/data/2024/aaai/How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes new file mode 100644 index 0000000000..448ab6235a --- /dev/null +++ b/data/2024/aaai/How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes @@ -0,0 +1 @@ +Question generation (QG) is a natural language processing task with an abundance of potential benefits and use cases in the educational domain. In order for this potential to be realized, QG systems must be designed and validated with pedagogical needs in mind. However, little research has assessed or designed QG approaches with the input of real teachers or students. This paper applies a large language model-based QG approach where questions are generated with learning goals derived from Bloom's taxonomy. The automatically generated questions are used in multiple experiments designed to assess how teachers use them in practice. The results demonstrate that teachers prefer to write quizzes with automatically generated questions, and that such quizzes have no loss in quality compared to handwritten versions. Further, several metrics indicate that automatically generated questions can even improve the quality of the quizzes created, showing the promise for large scale use of QG in the classroom setting. \ No newline at end of file diff --git a/data/2024/aaai/How to Evaluate Behavioral Models b/data/2024/aaai/How to Evaluate Behavioral Models new file mode 100644 index 0000000000..71a8383f1d --- /dev/null +++ b/data/2024/aaai/How to Evaluate Behavioral Models @@ -0,0 +1 @@ +Researchers building behavioral models, such as behavioral game theorists, use experimental data to evaluate predictive models of human behavior. However, there is little agreement about which loss function should be used in evaluations, with error rate, negative log-likelihood, cross-entropy, Brier score, and squared L2 error all being common choices. We attempt to offer a principled answer to the question of which loss functions should be used for this task, formalizing axioms that we argue loss functions should satisfy. We construct a family of loss functions, which we dub ``diagonal bounded Bregman divergences'', that satisfy all of these axioms. These rule out many loss functions used in practice, but notably include squared L2 error; we thus recommend its use for evaluating behavioral models. \ No newline at end of file diff --git a/data/2024/aaai/How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection b/data/2024/aaai/How to Evaluate the Generalization of Detection? 
A Benchmark for Comprehensive Open-Vocabulary Detection new file mode 100644 index 0000000000..266719c68f --- /dev/null +++ b/data/2024/aaai/How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection @@ -0,0 +1 @@ +Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, which do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models all fail on the new tasks except for simple object types, demonstrating the value of the proposed dataset in pinpointing the weakness of current OVD models and guiding future research. Furthermore, the proposed NMS-AP metric is verified by experiments to provide a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at https://github.com/om-ai-lab/OVDEval \ No newline at end of file diff --git a/data/2024/aaai/How to Make Knockout Tournaments More Popular? b/data/2024/aaai/How to Make Knockout Tournaments More Popular? new file mode 100644 index 0000000000..b82ffe82d1 --- /dev/null +++ b/data/2024/aaai/How to Make Knockout Tournaments More Popular? @@ -0,0 +1 @@ +Given a mapping from a set of players to the leaves of a complete binary tree (called a seeding), a knockout tournament is conducted as follows: every round, every two players with a common parent compete against each other, and the winner is promoted to the common parent; then, the leaves are deleted. When only one player remains, it is declared the winner. This is a popular competition format in sports, elections, and decision-making. Over the past decade, it has been studied intensively from both theoretical and practical points of view. Most frequently, the objective is to seed the tournament in a way that ``assists'' (or even guarantees) some particular player to win the competition. We introduce a new objective, which is very sensible from the perspective of the directors of the competition: maximize the profit or popularity of the tournament. Specifically, we associate a ``score'' with every possible match, and aim to seed the tournament to maximize the sum of the scores of the matches that take place. We focus on the case where we assume a total order on the players' strengths, and provide a wide spectrum of results on the computational complexity of the problem. \ No newline at end of file diff --git a/data/2024/aaai/How to Overcome Curse-of-Dimensionality for Out-of-Distribution Detection? b/data/2024/aaai/How to Overcome Curse-of-Dimensionality for Out-of-Distribution Detection? 
new file mode 100644 index 0000000000..4df931f6f1 --- /dev/null +++ b/data/2024/aaai/How to Overcome Curse-of-Dimensionality for Out-of-Distribution Detection? @@ -0,0 +1 @@ +Machine learning models deployed in the wild can be challenged by out-of-distribution (OOD) data from unknown classes. Recent advances in OOD detection rely on distance measures to distinguish samples that are relatively far away from the in-distribution (ID) data. Despite the promise, distance-based methods can suffer from the curse-of-dimensionality problem, which limits their efficacy in high-dimensional feature spaces. To combat this problem, we propose a novel framework, Subspace Nearest Neighbor (SNN), for OOD detection. In training, our method regularizes the model and its feature representation by leveraging the most relevant subset of dimensions (i.e., subspace). The subspace learning yields highly distinguishable distance measures between ID and OOD data. We provide comprehensive experiments and ablations to validate the efficacy of SNN. Compared to the current best distance-based method, SNN reduces the average FPR95 by 15.96% on the CIFAR-100 benchmark. \ No newline at end of file diff --git a/data/2024/aaai/How to Protect Copyright Data in Optimization of Large Language Models? b/data/2024/aaai/How to Protect Copyright Data in Optimization of Large Language Models? new file mode 100644 index 0000000000..38a14792f4 --- /dev/null +++ b/data/2024/aaai/How to Protect Copyright Data in Optimization of Large Language Models? @@ -0,0 +1,3 @@ +Large language models (LLMs) and generative AI have played a transformative role in computer research and applications. Controversy has arisen as to whether these models output copyrighted data, which can occur if the data the models are trained on is copyrighted. LLMs are built on the transformer neural network architecture, which in turn relies on a mathematical computation called Attention that uses the softmax function. + +In this paper, we observe that large language model training and optimization can be seen as a softmax regression problem. We then establish a method of efficiently performing softmax regression, in a way that prevents the regression function from generating copyrighted data. This establishes a theoretical method of training large language models in a way that avoids generating copyrighted data. \ No newline at end of file diff --git a/data/2024/aaai/How to Trade Off the Quantity and Capacity of Teacher Ensemble: Learning Categorical Distribution to Stochastically Employ a Teacher for Distillation b/data/2024/aaai/How to Trade Off the Quantity and Capacity of Teacher Ensemble: Learning Categorical Distribution to Stochastically Employ a Teacher for Distillation new file mode 100644 index 0000000000..5a38f880f9 --- /dev/null +++ b/data/2024/aaai/How to Trade Off the Quantity and Capacity of Teacher Ensemble: Learning Categorical Distribution to Stochastically Employ a Teacher for Distillation @@ -0,0 +1 @@ +We observe two phenomena with respect to quantity and capacity: 1) more teachers are not always better for multi-teacher knowledge distillation, and 2) a stronger teacher is not always better for single-teacher knowledge distillation.
To trade off the quantity and capacity of the teacher ensemble, in this paper, we propose a new distillation paradigm named Dynamic Knowledge Distillation (DynaKD) that learns an adaptive categorical distribution to stochastically employ a teacher from the ensemble at each step, transferring knowledge from the teacher ensemble into the student. DynaKD has three advantages: 1) it preserves the diversity of each teacher via a one-to-one distillation manner instead of several-for-one, 2) it makes the best of a powerful teacher via the multi-level assistant teachers in the ensemble, and 3) it can also dynamically determine the importance of each teacher for various tasks. To verify the effectiveness of the proposed approach, we conduct extensive experiments for BERT compression on the GLUE benchmark. Experimental results show that the proposed approach achieves state-of-the-art scores compared to previous compression approaches on five out of seven downstream tasks, including pushing MRPC F1 and accuracy to 92.2 (a 1.4-point absolute improvement) and RTE accuracy to 76.2 (a 2.8-point absolute improvement). Moreover, we also conduct extensive experiments for image classification on CIFAR-100. Similarly, DynaKD also achieves state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/How to Use the Metropolis Algorithm for Multi-Objective Optimization? b/data/2024/aaai/How to Use the Metropolis Algorithm for Multi-Objective Optimization? new file mode 100644 index 0000000000..1558cc0df1 --- /dev/null +++ b/data/2024/aaai/How to Use the Metropolis Algorithm for Multi-Objective Optimization? @@ -0,0 +1,7 @@ +The Metropolis algorithm can cope with local optima by accepting inferior solutions with suitably small probability. That this can work well was not only observed in empirical research, but also via mathematical runtime analyses on single-objective benchmarks. This paper takes several steps towards understanding, again via theoretical means, whether such advantages can also be obtained in multi-objective optimization. + +The original Metropolis algorithm has two components, one-bit mutation and the acceptance strategy, which allows accepting inferior solutions. When adjusting the acceptance strategy to multi-objective optimization in the way that an inferior solution that is accepted replaces its parent, then the Metropolis algorithm is not very efficient on our multi-objective version of the multimodal DLB benchmark called DLTB. With one-bit mutation, this multi-objective Metropolis algorithm cannot optimize the DLTB problem; with standard bit-wise mutation it needs at least Ω(n^5) time to cover the full Pareto front. In contrast, we show that many other multi-objective optimizers, namely the GSEMO, SMS-EMOA, and NSGA-II, only need time O(n^4). + +When the parent is kept when an inferior point is accepted, the multi-objective Metropolis algorithm with either one-bit or standard bit-wise mutation solves the DLTB problem efficiently, with one-bit mutation experimentally leading to better results than several other algorithms. + +Overall, our work suggests that the general mechanism of the Metropolis algorithm can be interesting in multi-objective optimization, but that the implementation details can have a huge impact on the performance.
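To make the two acceptance variants above concrete, here is a schematic bit-string version (the scalar worsening measure, temperature, and archive handling are simplifications for illustration, not the analyzed algorithms):

    import math
    import random

    def metropolis_mo(objectives, n, steps=10_000, temperature=1.0, keep_parent=True):
        """Schematic multi-objective Metropolis on {0,1}^n with standard bit-wise mutation.

        objectives:  function mapping a bit list to a tuple of values (all to be maximized)
        keep_parent: if True, an accepted inferior offspring does not replace its parent
        """
        parent = [random.randint(0, 1) for _ in range(n)]
        archive = [(tuple(parent), objectives(parent))]
        for _ in range(steps):
            child = [b ^ (random.random() < 1.0 / n) for b in parent]   # flip each bit w.p. 1/n
            f_p, f_c = objectives(parent), objectives(child)
            worsening = sum(max(0.0, p - c) for p, c in zip(f_p, f_c))  # 0 if child is no worse anywhere
            accept = worsening == 0 or random.random() < math.exp(-worsening / temperature)
            if accept:
                archive.append((tuple(child), f_c))
                if not keep_parent or worsening == 0:
                    parent = child
        return archive  # a Pareto filter over the archive would give the approximated front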
\ No newline at end of file diff --git a/data/2024/aaai/HuTuMotion: Human-Tuned Navigation of Latent Motion Diffusion Models with Minimal Feedback b/data/2024/aaai/HuTuMotion: Human-Tuned Navigation of Latent Motion Diffusion Models with Minimal Feedback new file mode 100644 index 0000000000..1ea2c8fa6f --- /dev/null +++ b/data/2024/aaai/HuTuMotion: Human-Tuned Navigation of Latent Motion Diffusion Models with Minimal Feedback @@ -0,0 +1 @@ +We introduce HuTuMotion, an innovative approach for generating natural human motions that navigates latent motion diffusion models by leveraging few-shot human feedback. Unlike existing approaches that sample latent variables from a standard normal prior distribution, our method adapts the prior distribution to better suit the characteristics of the data, as indicated by human feedback, thus enhancing the quality of motion generation. Furthermore, our findings reveal that utilizing few-shot feedback can yield performance levels on par with those attained through extensive human feedback. This discovery emphasizes the potential and efficiency of incorporating few-shot human-guided optimization within latent diffusion models for personalized and style-aware human motion generation applications. The experimental results show the significantly superior performance of our method over existing state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Human-Guided Moral Decision Making in Text-Based Games b/data/2024/aaai/Human-Guided Moral Decision Making in Text-Based Games new file mode 100644 index 0000000000..b708ab5888 --- /dev/null +++ b/data/2024/aaai/Human-Guided Moral Decision Making in Text-Based Games @@ -0,0 +1 @@ +Training reinforcement learning (RL) agents to achieve desired goals while also acting morally is a challenging problem. Transformer-based language models (LMs) have shown some promise in moral awareness, but their use in different contexts is problematic because of the complexity and implicitness of human morality. In this paper, we build on text-based games, which are challenging environments for current RL agents, and propose the HuMAL (Human-guided Morality Awareness Learning) algorithm, which adaptively learns personal values through human-agent collaboration with minimal manual feedback. We evaluate HuMAL on the Jiminy Cricket benchmark, a set of text-based games with various scenes and dense morality annotations, using both simulated and actual human feedback. The experimental results demonstrate that with a small amount of human feedback, HuMAL can improve task performance and reduce immoral behavior in a variety of games and is adaptable to different personal values. \ No newline at end of file diff --git a/data/2024/aaai/Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking b/data/2024/aaai/Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking new file mode 100644 index 0000000000..0b7499083d --- /dev/null +++ b/data/2024/aaai/Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking @@ -0,0 +1 @@ +Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames. Most methods accomplish the task by explicitly or implicitly leveraging strong cues (i.e., spatial and appearance information), which exhibit powerful instance-level discrimination. However, when object occlusion and clustering occur, spatial and appearance information will become ambiguous simultaneously due to the high overlap among objects. 
In this paper, we demonstrate that this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues to compensate for strong cues. Along with velocity direction, we introduce the confidence and height state as potential weak cues. With superior performance, our method still maintains Simple, Online and Real-Time (SORT) characteristics. Also, our method shows strong generalization for diverse trackers and scenarios in a plug-and-play and training-free manner. Significant and consistent improvements are observed when applying our method to 5 different representative trackers. Further, with both strong and weak cues, our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack, where interaction and severe occlusion frequently happen with complex motions. The code and models are available at https://github.com/ymzis69/HybridSORT. \ No newline at end of file diff --git a/data/2024/aaai/Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-Free Multi-Exposure Image Fusion b/data/2024/aaai/Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-Free Multi-Exposure Image Fusion new file mode 100644 index 0000000000..121561d54b --- /dev/null +++ b/data/2024/aaai/Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-Free Multi-Exposure Image Fusion @@ -0,0 +1,2 @@ +Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels. Despite its advancements, the field grapples with challenges, notably the reliance on manual designs for network structures and loss functions, and the constraints of utilizing simulated reference images as ground truths. Consequently, current methodologies often suffer from color distortions and exposure artifacts, further complicating the quest for authentic image representation. In addressing these challenges, this paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for the automatic design of both network structures and loss functions. More specifically, we harness a unique dual-search mechanism rooted in a novel weighted structure refinement architecture search. Besides, a hybrid supervised contrast constraint seamlessly guides and integrates with the searching process, facilitating a more adaptive and comprehensive search for optimal loss functions. We achieve state-of-the-art performance in comparison to various competitive schemes, yielding a 10.61% and 4.38% improvement in Visual Information Fidelity (VIF) +for general and no-reference scenarios, respectively, while providing results with high contrast, rich details and colors. The code is available at https://github.com/RollingPlain/HSDS_MEF. \ No newline at end of file diff --git a/data/2024/aaai/HybridGait: A Benchmark for Spatial-Temporal Cloth-Changing Gait Recognition with Hybrid Explorations b/data/2024/aaai/HybridGait: A Benchmark for Spatial-Temporal Cloth-Changing Gait Recognition with Hybrid Explorations new file mode 100644 index 0000000000..3f272d49d3 --- /dev/null +++ b/data/2024/aaai/HybridGait: A Benchmark for Spatial-Temporal Cloth-Changing Gait Recognition with Hybrid Explorations @@ -0,0 +1 @@ +Existing gait recognition benchmarks mostly include minor clothing variations in laboratory environments, but lack persistent changes in appearance over time and space. 
In this paper, we propose the first in-the-wild benchmark CCGait for cloth-changing gait recognition, which incorporates diverse clothing changes, indoor and outdoor scenes, and multi-modal statistics over 92 days. To further address the coupling effect of clothing and viewpoint variations, we propose a hybrid approach HybridGait that exploits both temporal dynamics and the projected 2D information of 3D human meshes. Specifically, we introduce a Canonical Alignment Spatial-Temporal Transformer (CA-STT) module to encode human joint position-aware features, and fully exploit 3D dense priors via a Silhouette-guided Deformation with 3D-2D Appearance Projection (SilD) strategy. Our contributions are twofold: we provide a challenging benchmark CCGait that captures realistic appearance changes over expanded time and space, and we propose a hybrid framework HybridGait that outperforms prior works on the CCGait and Gait3D benchmarks. Our project page is available at https://github.com/HCVLab/HybridGait. \ No newline at end of file diff --git a/data/2024/aaai/Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection b/data/2024/aaai/Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection new file mode 100644 index 0000000000..f55174e98e --- /dev/null +++ b/data/2024/aaai/Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection @@ -0,0 +1 @@ +Open World Object Detection (OWOD) is a challenging and realistic task that extends beyond the scope of the standard Object Detection task. It involves detecting both known and unknown objects while integrating learned knowledge for future tasks. However, the level of "unknownness" varies significantly depending on the context. For example, a tree is typically considered part of the background in a self-driving scene, but it may be significant in a household context. We argue that this contextual information should already be embedded within the known classes. In other words, there should be a semantic or latent structure relationship between the known and unknown items to be discovered. Motivated by this observation, we propose Hyp-OW, a method that learns and models a hierarchical representation of known items through a SuperClass Regularizer. Leveraging this representation allows us to effectively detect unknown objects using a similarity distance-based relabeling module. Extensive experiments on benchmark datasets demonstrate the effectiveness of Hyp-OW, achieving improvements in both known and unknown detection (up to 6 percent). These findings are particularly pronounced in our newly designed benchmark, where a strong hierarchical structure exists between known and unknown objects. \ No newline at end of file diff --git a/data/2024/aaai/HyperCube: Implicit Field Representations of Voxelized 3D Models (Student Abstract) b/data/2024/aaai/HyperCube: Implicit Field Representations of Voxelized 3D Models (Student Abstract) new file mode 100644 index 0000000000..fd551daf83 --- /dev/null +++ b/data/2024/aaai/HyperCube: Implicit Field Representations of Voxelized 3D Models (Student Abstract) @@ -0,0 +1 @@ +Implicit field representations offer an effective way of generating 3D object shapes. They leverage an implicit decoder (IM-NET) trained to take a 3D point coordinate concatenated with a shape encoding and to output a value indicating whether the point is outside the shape. 
This approach enables the efficient rendering of visually plausible objects but also has some significant limitations, resulting in a cumbersome training procedure and empty spaces within the rendered mesh. In this paper, we introduce a new HyperCube architecture based on interval arithmetic that enables direct processing of 3D voxels, trained using a hypernetwork paradigm to enforce model convergence. The code is available at https://github.com/mproszewska/hypercube. \ No newline at end of file diff --git a/data/2024/aaai/HyperEditor: Achieving Both Authenticity and Cross-Domain Capability in Image Editing via Hypernetworks b/data/2024/aaai/HyperEditor: Achieving Both Authenticity and Cross-Domain Capability in Image Editing via Hypernetworks new file mode 100644 index 0000000000..1c7b489bd8 --- /dev/null +++ b/data/2024/aaai/HyperEditor: Achieving Both Authenticity and Cross-Domain Capability in Image Editing via Hypernetworks @@ -0,0 +1 @@ +Editing real images authentically while also achieving cross-domain editing remains a challenge. Recent studies have focused on converting real images into latent codes and accomplishing image editing by manipulating these codes. However, merely manipulating the latent codes would constrain the edited images to the generator's image domain, hindering the attainment of diverse editing goals. In response, we propose an innovative image editing method called HyperEditor, which utilizes weight factors generated by hypernetworks to reassign the weights of the pre-trained StyleGAN2's generator. Guided by CLIP's cross-modal image-text semantic alignment, this innovative approach enables us to simultaneously accomplish authentic attribute editing and cross-domain style transfer, a capability not realized in previous methods. Additionally, we ascertain that modifying only the weights of specific layers in the generator can yield an equivalent editing result. Therefore, we introduce an adaptive layer selector, enabling our hypernetworks to autonomously identify the layers requiring output weight factors, which can further improve our hypernetworks' efficiency. Extensive experiments on abundant challenging datasets demonstrate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/HyperFast: Instant Classification for Tabular Data b/data/2024/aaai/HyperFast: Instant Classification for Tabular Data new file mode 100644 index 0000000000..1878d98bc9 --- /dev/null +++ b/data/2024/aaai/HyperFast: Instant Classification for Tabular Data @@ -0,0 +1 @@ +Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only in toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need for training a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. 
Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at https://github.com/AI-sandbox/HyperFast. \ No newline at end of file diff --git a/data/2024/aaai/Hyperbolic Graph Diffusion Model b/data/2024/aaai/Hyperbolic Graph Diffusion Model new file mode 100644 index 0000000000..abef8206a4 --- /dev/null +++ b/data/2024/aaai/Hyperbolic Graph Diffusion Model @@ -0,0 +1 @@ +Diffusion generative models (DMs) have achieved promising results in image and graph generation. However, real-world graphs, such as social networks, molecular graphs, and traffic graphs, generally share non-Euclidean topologies and hidden hierarchies. For example, the degree distributions of graphs are mostly power-law distributions. The current latent diffusion model embeds the hierarchical data in a Euclidean space, which leads to distortions and interferes with modeling the distribution. Instead, hyperbolic space has been found to be more suitable for capturing complex hierarchical structures due to its exponential growth property. In order to simultaneously utilize the data generation capabilities of diffusion models and the ability of hyperbolic embeddings to extract latent hierarchical distributions, we propose a novel graph generation method called Hyperbolic Graph Diffusion Model (HGDM), which consists of an auto-encoder to encode nodes into successive hyperbolic embeddings, and a DM that operates in the hyperbolic latent space. HGDM captures the crucial graph structure distributions by constructing a hyperbolic potential node space that incorporates edge information. Extensive experiments show that HGDM achieves better performance on generic graph and molecule generation benchmarks, with a 48% improvement in the quality of graph generation with highly hierarchical structures. \ No newline at end of file diff --git a/data/2024/aaai/Hypercorrelation Evolution for Video Class-Incremental Learning b/data/2024/aaai/Hypercorrelation Evolution for Video Class-Incremental Learning new file mode 100644 index 0000000000..e599005ff1 --- /dev/null +++ b/data/2024/aaai/Hypercorrelation Evolution for Video Class-Incremental Learning @@ -0,0 +1 @@ +Video class-incremental learning aims to recognize new actions while restricting the catastrophic forgetting of old ones, whose representative samples can only be saved in limited memory. Semantically variable subactions are susceptible to class confusion due to data imbalance. While existing methods address the problem by estimating and distilling the spatio-temporal knowledge, we further find that the refinement of hierarchical correlations is crucial for the alignment of spatio-temporal features. To enhance the adaptability to evolved actions, we propose a hierarchical aggregation strategy, in which hierarchical matching matrices are combined and jointly optimized to selectively store and retrieve relevant features from previous tasks. Meanwhile, a correlation refinement mechanism is presented to reinforce the bias on informative exemplars according to the online hypercorrelation distribution. 
Experimental results demonstrate the effectiveness of the proposed method on three standard video class-incremental learning benchmarks, outperforming state-of-the-art methods. Code is available at: https://github.com/Lsen991031/HCE \ No newline at end of file diff --git a/data/2024/aaai/Hypergraph Joint Representation Learning for Hypervertices and Hyperedges via Cross Expansion b/data/2024/aaai/Hypergraph Joint Representation Learning for Hypervertices and Hyperedges via Cross Expansion new file mode 100644 index 0000000000..b73e7a0925 --- /dev/null +++ b/data/2024/aaai/Hypergraph Joint Representation Learning for Hypervertices and Hyperedges via Cross Expansion @@ -0,0 +1 @@ +A hypergraph captures high-order information in structured data and has attracted much attention in machine learning and data mining. Existing approaches mainly learn representations for hypervertices by transforming a hypergraph to a standard graph, or learn representations for hypervertices and hyperedges in separate spaces. In this paper, we propose a hypergraph expansion method to transform a hypergraph to a standard graph while preserving high-order information. Different from previous hypergraph expansion approaches like clique expansion and star expansion, we transform both hypervertices and hyperedges in the hypergraph to vertices in the expanded graph, and construct connections between hypervertices or hyperedges, so that richer relationships can be used in graph learning. Based on the expanded graph, we propose a learning model to embed hypervertices and hyperedges in a joint representation space. Compared with the method of learning separate spaces for hypervertices and hyperedges, our method is able to capture common knowledge involved in hypervertices and hyperedges, and also improve the data efficiency and computational efficiency. To better leverage structure information, we minimize the graph reconstruction loss to preserve the structure information in the model. We perform experiments on both hypervertex classification and hyperedge classification tasks to demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Hypergraph Neural Architecture Search b/data/2024/aaai/Hypergraph Neural Architecture Search new file mode 100644 index 0000000000..3e448b5fc0 --- /dev/null +++ b/data/2024/aaai/Hypergraph Neural Architecture Search @@ -0,0 +1 @@ +In recent years, Hypergraph Neural Networks (HGNNs) have achieved considerable success by manually designing architectures, which are capable of extracting effective patterns with high-order interactions from non-Euclidean data. However, such a mechanism is extremely inefficient, demanding tremendous human effort to tune diverse model parameters. In this paper, we propose a novel Hypergraph Neural Architecture Search (HyperNAS) to automatically design the optimal HGNNs. The proposed model constructs a search space suitable for hypergraphs, and derives hypergraph architectures through differentiable search strategies. A hypergraph structure-aware distance criterion is introduced as a guideline for obtaining an optimal hypergraph architecture via the leave-one-out method. Experimental results for node classification on the benchmark Cora, Citeseer, and Pubmed citation networks and on hypergraph datasets show that HyperNAS outperforms existing HGNN models and graph NAS methods. 
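As a rough illustration of the differentiable search idea described in the HyperNAS abstract above (the candidate operation set, module names, and propagation matrix below are placeholder assumptions of ours, not the paper's actual search space), each edge of a search cell can mix candidate hypergraph operations with softmax-weighted architecture parameters that are learned jointly with the network weights:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConv(nn.Module):            # propagate over the hypergraph, then transform
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
    def forward(self, x, prop):
        return F.relu(self.lin(prop @ x))

class LinearOp(nn.Module):             # transform only, no propagation
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
    def forward(self, x, prop):
        return F.relu(self.lin(x))

class Identity(nn.Module):             # skip connection
    def __init__(self, dim):
        super().__init__()
    def forward(self, x, prop):
        return x

class MixedHyperOp(nn.Module):
    def __init__(self, dim, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))   # architecture parameters
    def forward(self, x, prop):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x, prop) for wi, op in zip(w, self.ops))

dim, n = 16, 8
cell = MixedHyperOp(dim, [HyperConv(dim), LinearOp(dim), Identity(dim)])
x = torch.randn(n, dim)                # node features
prop = torch.rand(n, n)                # stand-in for a normalized hypergraph propagation matrix
out = cell(x, prop)                    # after the search, the op with the largest alpha is kept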
\ No newline at end of file diff --git a/data/2024/aaai/Hypothesis Testing for Class-Conditional Noise Using Local Maximum Likelihood b/data/2024/aaai/Hypothesis Testing for Class-Conditional Noise Using Local Maximum Likelihood new file mode 100644 index 0000000000..3eff765bfb --- /dev/null +++ b/data/2024/aaai/Hypothesis Testing for Class-Conditional Noise Using Local Maximum Likelihood @@ -0,0 +1 @@ +In supervised learning, automatically assessing the quality of the labels before any learning takes place remains an open research question. In certain cases, hypothesis testing procedures have been proposed to assess whether a given instance-label dataset is contaminated with class-conditional label noise, as opposed to uniform label noise. The existing theory builds on the asymptotic properties of the Maximum Likelihood Estimate for parametric logistic regression. However, the parametric assumptions on top of which these approaches are constructed are often too strong and unrealistic in practice. To alleviate this problem, in this paper we propose an alternative path by showing how similar procedures can be followed when the underlying model is a product of Local Maximum Likelihood Estimation that leads to more flexible nonparametric logistic regression models, which in turn are less susceptible to model misspecification. This different view allows for wider applicability of the tests by offering users access to a richer model class. Similarly to existing works, we assume we have access to anchor points which are provided by the users. We introduce the necessary ingredients for the adaptation of the hypothesis tests to the case of nonparametric logistic regression and empirically compare against the parametric approach, presenting both synthetic and real-world case studies and discussing the advantages and limitations of the proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning b/data/2024/aaai/Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning new file mode 100644 index 0000000000..5aac0d91ee --- /dev/null +++ b/data/2024/aaai/Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning @@ -0,0 +1 @@ +Large language models (LLMs) show their powerful automatic reasoning and planning capability with a wealth of semantic knowledge about the human world. However, the grounding problem still hinders the applications of LLMs in the real-world environment. Existing studies try to fine-tune the LLM or utilize pre-defined behavior APIs to bridge the LLMs and the environment, which not only costs huge human effort to customize for every single task but also weakens the generality of LLMs. To autonomously ground the LLM onto the environment, we propose the Hypothesis, Verification, and Induction (HYVIN) framework to automatically and progressively ground the LLM with self-driven skill learning. HYVIN first employs the LLM to propose the hypothesis of sub-goals to achieve tasks and then verifies the feasibility of the hypothesis by interacting with the underlying environment. Once verified, HYVIN can then learn generalized skills with the guidance of these successfully grounded subgoals. These skills can be further utilized to accomplish more complex tasks that fail to pass the verification phase. 
Verified on the famous instruction-following task set BabyAI, HYVIN achieves performance on the most challenging tasks comparable to that of imitation learning methods that cost millions of demonstrations, proving the effectiveness of the learned skills and showing the feasibility and efficiency of our framework. \ No newline at end of file diff --git a/data/2024/aaai/I Open at the Close: A Deep Reinforcement Learning Evaluation of Open Streets Initiatives b/data/2024/aaai/I Open at the Close: A Deep Reinforcement Learning Evaluation of Open Streets Initiatives new file mode 100644 index 0000000000..c1e819182f --- /dev/null +++ b/data/2024/aaai/I Open at the Close: A Deep Reinforcement Learning Evaluation of Open Streets Initiatives @@ -0,0 +1 @@ +The open streets initiative "opens" streets to pedestrians and bicyclists by closing them to cars and trucks. The initiative, adopted by many cities across North America, increases community space in urban environments. But could open streets also make cities safer and less congested? We study this question by framing the choice of which streets to open as a reinforcement learning problem. In order to simulate the impact of opening streets, we first compare models for predicting vehicle collisions given network and temporal data. We find that a recurrent graph neural network, leveraging the graph structure and the short-term temporal dependence of the data, gives the best predictive performance. Then, with the ability to simulate collisions and traffic, we frame a reinforcement learning problem to find which streets to open. We compare the streets in the open streets initiative to those proposed by a Q-learning algorithm. We find that the streets proposed by the Q-learning algorithm have reliably better outcomes, while streets already selected by the open streets initiative have similar outcomes to randomly selected streets. We present our work as a step toward a principled choice of which streets to open for safer and less congested cities. \ No newline at end of file diff --git a/data/2024/aaai/I Prefer Not to Say: Protecting User Consent in Models with Optional Personal Data b/data/2024/aaai/I Prefer Not to Say: Protecting User Consent in Models with Optional Personal Data new file mode 100644 index 0000000000..4f8ffcef9e --- /dev/null +++ b/data/2024/aaai/I Prefer Not to Say: Protecting User Consent in Models with Optional Personal Data @@ -0,0 +1 @@ +We examine machine learning models in a setup where individuals have the choice to share optional personal information with a decision-making system, as seen in modern insurance pricing models. Some users consent to their data being used whereas others object and keep their data undisclosed. In this work, we show that the decision not to share data can be considered as information in itself that should be protected to respect users' privacy. This observation raises the overlooked problem of how to ensure that users who protect their personal data do not suffer any disadvantages as a result. To address this problem, we formalize protection requirements for models which only use the information for which active user consent was obtained. This excludes implicit information contained in the decision to share data or not. We offer the first solution to this problem by proposing the notion of Protected User Consent (PUC), which we prove to be loss-optimal under our protection requirement. 
We observe that privacy and performance are not fundamentally at odds with each other and that it is possible for a decision maker to benefit from additional data while respecting users' consent. To learn PUC-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we analyze the implications of PUC on challenging real datasets, tasks, and models. \ No newline at end of file diff --git a/data/2024/aaai/I-CEE: Tailoring Explanations of Image Classification Models to User Expertise b/data/2024/aaai/I-CEE: Tailoring Explanations of Image Classification Models to User Expertise new file mode 100644 index 0000000000..fd3691ca22 --- /dev/null +++ b/data/2024/aaai/I-CEE: Tailoring Explanations of Image Classification Models to User Expertise @@ -0,0 +1 @@ +Effectively explaining decisions of black-box machine learning models is critical to responsible deployment of AI systems that rely on them. Recognizing their importance, the field of explainable AI (XAI) provides several techniques to generate these explanations. Yet, there is relatively little emphasis on the user (the explainee) in this growing body of work and most XAI techniques generate "one-size-fits-all" explanations. To bridge this gap and move a step closer towards human-centered XAI, we present I-CEE, a framework that provides Image Classification Explanations tailored to User Expertise. Informed by existing work, I-CEE explains the decisions of image classification models by providing the user with an informative subset of training data (i.e., example images), corresponding local explanations, and model decisions. However, unlike prior work, I-CEE models the informativeness of the example images to depend on user expertise, resulting in different examples for different users. We posit that by tailoring the example set to user expertise, I-CEE can better facilitate users' understanding and simulatability of the model. To evaluate our approach, we conduct detailed experiments in both simulation and with human participants (N = 100) on multiple datasets. Experiments with simulated users show that I-CEE improves users' ability to accurately predict the model's decisions (simulatability) compared to baselines, providing promising preliminary results. Experiments with human participants demonstrate that our method significantly improves user simulatability accuracy, highlighting the importance of human-centered XAI. \ No newline at end of file diff --git a/data/2024/aaai/IBCA: An Intelligent Platform for Social Insurance Benefit Qualification Status Assessment b/data/2024/aaai/IBCA: An Intelligent Platform for Social Insurance Benefit Qualification Status Assessment new file mode 100644 index 0000000000..b12f3074c2 --- /dev/null +++ b/data/2024/aaai/IBCA: An Intelligent Platform for Social Insurance Benefit Qualification Status Assessment @@ -0,0 +1 @@ +Social insurance benefit qualification assessment is an important task to ensure that retirees enjoy their benefits according to the regulations. It also plays a key role in curbing social security fraud. In this paper, we report the deployment of the Intelligent Benefit Certification and Analysis (IBCA) platform, an AI-empowered platform for verifying the status of retirees to ensure proper disbursement of funds in Shandong province, China. 
Based on an improved Gated Recurrent Unit (GRU) neural network, IBCA aggregates missing value interpolation, temporal information, and global and local feature extraction to perform accurate retiree survival rate prediction. Based on the predicted results, a reliability assessment mechanism built on a Variational Auto-Encoder (VAE) and Monte-Carlo Dropout (MC Dropout) is then executed to assess the reliability of the predictions. Deployed since November 2019, the IBCA platform has been adopted by 12 cities across Shandong province, handling over 50 terabytes of data. It has empowered human resources and social services, civil affairs, and health care institutions to collaboratively provide high-quality public services. Under the IBCA platform, both the efficiency of resource utilization and the accuracy of benefit qualification assessment have been significantly improved. It has helped Dareway Software Co. Ltd earn over RMB 50 million in revenue. \ No newline at end of file diff --git a/data/2024/aaai/ICAR: Image-Based Complementary Auto Reasoning b/data/2024/aaai/ICAR: Image-Based Complementary Auto Reasoning new file mode 100644 index 0000000000..192c009dd3 --- /dev/null +++ b/data/2024/aaai/ICAR: Image-Based Complementary Auto Reasoning @@ -0,0 +1 @@ +Scene-aware Complementary Item Retrieval (CIR) is a challenging task which requires generating a set of compatible items across domains. Due to its subjectivity, it is difficult to set up a rigorous standard for both data collection and learning objectives. To address this challenging task, we propose a visual compatibility concept, composed of similarity (resemblance in color, geometry, texture, etc.) and complementarity (different items, like a table vs. a chair, completing a group). Based on this notion, we propose a compatibility learning framework, a category-aware Flexible Bidirectional Transformer (FBT), for visual "scene-based set compatibility reasoning" with cross-domain visual similarity input and auto-regressive complementary item generation. The FBT consists of an encoder with flexible masking, a category prediction arm, and an auto-regressive visual embedding prediction arm. The inputs to the FBT are cross-domain visual similarity invariant embeddings, making this framework quite generalizable. Furthermore, our proposed FBT model learns the inter-object compatibility from a large set of scene images in a self-supervised way. Compared with the SOTA methods, this approach achieves up to 5.3% and 9.6% improvement in FITB score and 22.3% and 31.8% SFID improvement on fashion and furniture, respectively. \ No newline at end of file diff --git a/data/2024/aaai/IGAMT: Privacy-Preserving Electronic Health Record Synthesization with Heterogeneity and Irregularity b/data/2024/aaai/IGAMT: Privacy-Preserving Electronic Health Record Synthesization with Heterogeneity and Irregularity new file mode 100644 index 0000000000..33c321f4fc --- /dev/null +++ b/data/2024/aaai/IGAMT: Privacy-Preserving Electronic Health Record Synthesization with Heterogeneity and Irregularity @@ -0,0 +1 @@ +Integrating electronic health records (EHR) into machine learning-driven clinical research and hospital applications is important, as it harnesses extensive and high-quality patient data to enhance outcome predictions and treatment personalization. 
Nonetheless, due to privacy and security concerns, the secondary use of EHR data, primarily for research purposes, is consistently governed and regulated, thereby constraining researchers' access to EHR data. Generating synthetic EHR data with deep learning methods is a viable and promising approach to mitigate privacy concerns, offering not only a supplementary resource for downstream applications but also sidestepping the confidentiality risks associated with real patient data. While prior efforts have concentrated on EHR data synthesis, significant challenges persist in the domain of generating synthetic EHR data: balancing the heterogeneity of real EHR data, including temporal and non-temporal features, addressing missing values and irregular measures, and ensuring the privacy of the real data used for model training. Existing works in this domain have only focused on solving one or two of the aforementioned challenges. In this work, we propose IGAMT, an innovative framework to generate privacy-preserved synthetic EHR data that not only maintains high quality with heterogeneous features, missing values, and irregular measures but also balances the privacy-utility trade-off. Extensive experiments prove that IGAMT significantly outperforms baseline architectures in terms of visual resemblance while achieving comparable performance in downstream applications. Ablation case studies also prove the effectiveness of the techniques applied in IGAMT. \ No newline at end of file diff --git a/data/2024/aaai/IINet: Implicit Intra-inter Information Fusion for Real-Time Stereo Matching b/data/2024/aaai/IINet: Implicit Intra-inter Information Fusion for Real-Time Stereo Matching new file mode 100644 index 0000000000..b2029266dd --- /dev/null +++ b/data/2024/aaai/IINet: Implicit Intra-inter Information Fusion for Real-Time Stereo Matching @@ -0,0 +1 @@ +Recently, there has been a growing interest in 3D CNN-based stereo matching methods due to their remarkable accuracy. However, the high complexity of 3D convolution makes it challenging to strike a balance between accuracy and speed. Notably, explicit 3D volumes contain considerable redundancy. In this study, we delve into a more compact 2D implicit network to eliminate redundancy and boost real-time performance. However, simply replacing explicit 3D networks with 2D implicit networks causes issues that can lead to performance degradation, including the loss of structural information, the quality decline of inter-image information, as well as inaccurate regression caused by low-level features. To address these issues, we first integrate intra-image information to fuse with inter-image information, facilitating propagation guided by structural cues. Subsequently, we introduce the Fast Multi-scale Score Volume (FMSV) and Confidence Based Filtering (CBF) to efficiently acquire accurate multi-scale, noise-free inter-image information. Furthermore, combined with the Residual Context-aware Upsampler (RCU), our Intra-Inter Fusing network is meticulously designed to enhance information transmission at both the feature level and the disparity level, thereby enabling accurate and robust regression. Experimental results affirm the superiority of our network in terms of both speed and accuracy compared to all other fast methods. 
\ No newline at end of file diff --git a/data/2024/aaai/INFORMEDQX: Informed Conflict Detection for Over-Constrained Problems b/data/2024/aaai/INFORMEDQX: Informed Conflict Detection for Over-Constrained Problems new file mode 100644 index 0000000000..6b1079211b --- /dev/null +++ b/data/2024/aaai/INFORMEDQX: Informed Conflict Detection for Over-Constrained Problems @@ -0,0 +1 @@ +Conflict detection is relevant in various application scenarios, ranging from interactive decision-making to the diagnosis of faulty knowledge bases. Conflicts can be regarded as sets of constraints that cause an inconsistency. In many scenarios (e.g., constraint-based configuration), conflicts are repeatedly determined for the same or similar sets of constraints. This misses out on the valuable opportunity for leveraging knowledge reuse and related potential performance improvements, which are extremely important, specifically in interactive constraint-based applications. In this paper, we show how to integrate knowledge reuse concepts into non-intrusive conflict detection. We introduce the InformedQX algorithm, which is a reuse-aware variant of QuickXPlain. The results of a related performance analysis with the Linux-2.6.3.33 configuration knowledge base show significant improvements in terms of runtime performance compared to QuickXPlain. \ No newline at end of file diff --git a/data/2024/aaai/IOFM: Using the Interpolation Technique on the Over-Fitted Models to Identify Clean-Annotated Samples b/data/2024/aaai/IOFM: Using the Interpolation Technique on the Over-Fitted Models to Identify Clean-Annotated Samples new file mode 100644 index 0000000000..3e2941b745 --- /dev/null +++ b/data/2024/aaai/IOFM: Using the Interpolation Technique on the Over-Fitted Models to Identify Clean-Annotated Samples @@ -0,0 +1 @@ +Most recent state-of-the-art algorithms for handling noisy label problems are based on the memorization effect, the phenomenon that deep neural networks (DNNs) memorize clean data before noisy ones. While the memorization effect can be a powerful tool, there are several cases where the memorization effect does not occur. Examples are imbalanced class distributions and heavy contamination of labels. To address this limitation, we introduce a whole new approach called interpolation with the over-fitted model (IOFM), which leverages over-fitted deep neural networks. The IOFM utilizes a new finding about over-fitted DNNs: for a given training sample, its neighborhoods chosen from the feature space are distributed differently in the original input space depending on the cleanness of the target sample. The IOFM has notable features in two aspects: 1) it yields superior results even when the training data are imbalanced or heavily noisy, and 2) since we utilize over-fitted deep neural networks, a fine-tuning procedure to select the optimal training epoch, which is an essential yet sensitive factor for the success of the memorization effect, is not required, and thus the IOFM can be used by non-experts. Through extensive experiments, we show that our method can serve as a promising alternative to existing solutions dealing with noisy labels, offering improved performance even in challenging situations. 
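The following toy sketch illustrates one way to operationalize the finding stated in the IOFM abstract above; the scoring rule, the use of k nearest neighbours, and the selection threshold are illustrative assumptions of ours, not the authors' actual procedure. For each sample, neighbours are taken in the feature space of an over-fitted network, and the spread of those neighbours in the raw input space is used as a (negated) cleanness score:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def clean_scores(feats, inputs, k=10):
    """feats: (N, d) activations from an over-fitted model; inputs: (N, p) raw inputs."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(feats)
    _, idx = nbrs.kneighbors(feats)             # idx[:, 0] is the sample itself
    scores = np.empty(len(feats))
    for i, neigh in enumerate(idx):
        neigh = neigh[1:]                       # drop the sample itself
        spread = np.linalg.norm(inputs[neigh] - inputs[i], axis=1).mean()
        scores[i] = -spread                     # small input-space spread -> likely clean
    return scores

# toy usage: keep the half of the data judged most likely to be clean
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))                  # raw inputs
feats = rng.normal(size=(200, 16))              # stand-in for over-fitted features
s = clean_scores(feats, X)
clean_idx = np.argsort(s)[-100:]                # indices of the presumed clean samples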
\ No newline at end of file diff --git a/data/2024/aaai/IPRemover: A Generative Model Inversion Attack against Deep Neural Network Fingerprinting and Watermarking b/data/2024/aaai/IPRemover: A Generative Model Inversion Attack against Deep Neural Network Fingerprinting and Watermarking new file mode 100644 index 0000000000..4b3d0a7cff --- /dev/null +++ b/data/2024/aaai/IPRemover: A Generative Model Inversion Attack against Deep Neural Network Fingerprinting and Watermarking @@ -0,0 +1 @@ +Training Deep Neural Networks (DNNs) can be expensive when data is difficult to obtain or labeling them requires significant domain expertise. Hence, it is crucial that the Intellectual Property (IP) of DNNs trained on valuable data be protected against IP infringement. DNN fingerprinting and watermarking are two lines of work in DNN IP protection. Recently proposed DNN fingerprinting techniques are able to detect IP infringement while preserving model performance by relying on the key assumption that the decision boundaries of independently trained models are intrinsically different from one another. In contrast, DNN watermarking embeds a watermark in a model and verifies IP infringement if an identical or similar watermark is extracted from a suspect model. The techniques deployed in fingerprinting and watermarking vary significantly because their underlying mechanisms are different. From an adversary's perspective, a successful IP removal attack should defeat both fingerprinting and watermarking. However, to the best of our knowledge, there is no work on such attacks in the literature yet. In this paper, we fill this gap by presenting an IP removal attack that can defeat both fingerprinting and watermarking. We consider the challenging data-free scenario whereby all data is inverted from the victim model. Under this setting, a stolen model only depends on the victim model. Experimental results demonstrate the success of our attack in defeating state-of-the-art DNN fingerprinting and watermarking techniques. This work reveals a novel attack surface that exploits generative model inversion attacks to bypass DNN IP defenses. This threat must be addressed by future defenses for reliable IP protection. \ No newline at end of file diff --git a/data/2024/aaai/IRPruneDet: Efficient Infrared Small Target Detection via Wavelet Structure-Regularized Soft Channel Pruning b/data/2024/aaai/IRPruneDet: Efficient Infrared Small Target Detection via Wavelet Structure-Regularized Soft Channel Pruning new file mode 100644 index 0000000000..af7d8ad172 --- /dev/null +++ b/data/2024/aaai/IRPruneDet: Efficient Infrared Small Target Detection via Wavelet Structure-Regularized Soft Channel Pruning @@ -0,0 +1 @@ +Infrared Small Target Detection (IRSTD) refers to detecting faint targets in infrared images, which has achieved notable progress with the advent of deep learning. However, the drive for improved detection accuracy has led to larger, intricate models with redundant parameters, causing storage and computation inefficiencies. In this pioneering study, we introduce the concept of utilizing network pruning to enhance the efficiency of IRSTD. Due to the challenge posed by low signal-to-noise ratios and the absence of detailed semantic information in infrared images, directly applying existing pruning techniques yields suboptimal performance. To address this, we propose a novel wavelet structure-regularized soft channel pruning method, giving rise to the efficient IRPruneDet model. 
Our approach involves representing the weight matrix in the wavelet domain and formulating a wavelet channel pruning strategy. We incorporate wavelet regularization to induce structural sparsity without incurring extra memory usage. Moreover, we design a soft channel reconstruction method that preserves important target information against premature pruning, thereby ensuring an optimal sparse structure while maintaining overall sparsity. Through extensive experiments on two widely-used benchmarks, our IRPruneDet method surpasses established techniques in both model complexity and accuracy. Specifically, when employing U-net as the baseline network, IRPruneDet achieves a 64.13% reduction in parameters and a 51.19% decrease in FLOPS, while improving IoU from 73.31% to 75.12% and nIoU from 70.92% to 74.30%. The code is available at https://github.com/hd0013/IRPruneDet. \ No newline at end of file diff --git a/data/2024/aaai/ISP-Teacher: Image Signal Process with Disentanglement Regularization for Unsupervised Domain Adaptive Dark Object Detection b/data/2024/aaai/ISP-Teacher: Image Signal Process with Disentanglement Regularization for Unsupervised Domain Adaptive Dark Object Detection new file mode 100644 index 0000000000..fc092e560d --- /dev/null +++ b/data/2024/aaai/ISP-Teacher: Image Signal Process with Disentanglement Regularization for Unsupervised Domain Adaptive Dark Object Detection @@ -0,0 +1 @@ +Object detection in dark conditions has always been a great challenge due to the complex formation process of low-light images. Currently, the mainstream methods usually adopt domain adaptation with a Teacher-Student architecture to solve the dark object detection problem, and they imitate dark conditions by using non-learnable data augmentation strategies on the annotated source daytime images. Note that these methods neglect to model the intrinsic imaging process, i.e. image signal processing (ISP), which is important for camera sensors to generate low-light images. To solve the above problems, in this paper, we propose a novel method named ISP-Teacher for dark object detection by exploring the Teacher-Student architecture from a new perspective (i.e. self-supervised learning based ISP degradation). Specifically, we first design a day-to-night transformation module that is consistent with the ISP pipeline of the camera sensors (ISP-DTM) to make the augmented images look more in line with natural low-light images captured by cameras, and the ISP-related parameters are learned in a self-supervised manner. Moreover, to avoid the conflict between the ISP degradation and detection tasks in a shared encoder, we propose a disentanglement regularization (DR) that minimizes the absolute value of the cosine similarity to disentangle the two tasks and push the two gradient vectors to be as orthogonal as possible. Extensive experiments conducted on two benchmarks show the effectiveness of our method in dark object detection. In particular, ISP-Teacher achieves an improvement of +2.4% AP and +3.3% AP over the SOTA method on the BDD100k and SHIFT datasets, respectively. The code can be found at https://github.com/zhangyin1996/ISP-Teacher. 
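A minimal sketch of the gradient-orthogonality idea behind the disentanglement regularization described above (the toy encoder, heads, losses, and the 0.1 weight below are placeholder assumptions, not the ISP-Teacher implementation): the regularizer is the absolute cosine similarity between the two task gradients on the shared encoder, which pushes them toward orthogonality.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(32, 64)              # shared encoder (placeholder)
det_head = nn.Linear(64, 10)             # detection head (placeholder)
isp_head = nn.Linear(64, 32)             # ISP-degradation head (placeholder)

def flat_grad(loss, params):
    # gradient of `loss` w.r.t. the shared parameters, flattened into one vector;
    # create_graph=True keeps it differentiable so the regularizer itself can be trained
    grads = torch.autograd.grad(loss, params, retain_graph=True, create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

x = torch.randn(8, 32)
labels = torch.randint(0, 10, (8,))
feat = encoder(x)
loss_det = F.cross_entropy(det_head(feat), labels)
loss_isp = F.mse_loss(isp_head(feat), x)          # self-supervised reconstruction stand-in

shared = list(encoder.parameters())
g_det = flat_grad(loss_det, shared)
g_isp = flat_grad(loss_isp, shared)
dr = torch.abs(F.cosine_similarity(g_det, g_isp, dim=0))   # |cos| between task gradients

total_loss = loss_det + loss_isp + 0.1 * dr        # 0.1 is an arbitrary illustrative weight
total_loss.backward()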
\ No newline at end of file diff --git a/data/2024/aaai/IT3D: Improved Text-to-3D Generation with Explicit View Synthesis b/data/2024/aaai/IT3D: Improved Text-to-3D Generation with Explicit View Synthesis new file mode 100644 index 0000000000..592d0f2514 --- /dev/null +++ b/data/2024/aaai/IT3D: Improved Text-to-3D Generation with Explicit View Synthesis @@ -0,0 +1 @@ +Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches. \ No newline at end of file diff --git a/data/2024/aaai/Identifiability of Direct Effects from Summary Causal Graphs b/data/2024/aaai/Identifiability of Direct Effects from Summary Causal Graphs new file mode 100644 index 0000000000..246146ac3a --- /dev/null +++ b/data/2024/aaai/Identifiability of Direct Effects from Summary Causal Graphs @@ -0,0 +1,2 @@ +Dynamic structural causal models (SCMs) are a powerful framework for reasoning in dynamic systems about direct effects, which measure how a change in one variable affects another variable while holding all other variables constant. The causal relations in a dynamic structural causal model can be qualitatively represented with an acyclic full-time causal graph. Assuming linearity and no hidden confounding and given the full-time causal graph, the direct causal effect is always identifiable. However, in many applications such a graph is not available for various reasons; nevertheless, experts have access to the summary causal graph of the full-time causal graph, which represents causal relations between time series while omitting temporal information and allowing cycles. This paper presents a complete identifiability result which characterizes all cases for which the direct effect +is graphically identifiable from a summary causal graph and gives two sound finite adjustment sets that can be used to estimate the direct effect whenever it is identifiable. 
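As a small, purely illustrative example of the linear setting mentioned in the abstract above (the toy structural equations and the choice of {Z} as the adjustment set are ours; the paper's contribution is the graphical criterion deciding when a valid adjustment set exists), once a valid adjustment set is known the direct effect can be read off as a regression coefficient:

import numpy as np

rng = np.random.default_rng(0)
n = 50_000
Z = rng.normal(size=n)                         # common cause of X and Y
X = 0.8 * Z + rng.normal(size=n)
Y = 0.5 * X + 1.2 * Z + rng.normal(size=n)     # true direct effect of X on Y is 0.5

# naive regression of Y on X alone is confounded by Z
b_naive = np.linalg.lstsq(np.c_[X, np.ones(n)], Y, rcond=None)[0][0]

# adjusting for Z recovers the direct effect
b_adj = np.linalg.lstsq(np.c_[X, Z, np.ones(n)], Y, rcond=None)[0][0]
print(round(b_naive, 2), round(b_adj, 2))      # roughly 1.09 (biased) vs 0.50 (direct effect)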
\ No newline at end of file diff --git a/data/2024/aaai/Identification for Tree-Shaped Structural Causal Models in Polynomial Time b/data/2024/aaai/Identification for Tree-Shaped Structural Causal Models in Polynomial Time new file mode 100644 index 0000000000..22c0e7bd52 --- /dev/null +++ b/data/2024/aaai/Identification for Tree-Shaped Structural Causal Models in Polynomial Time @@ -0,0 +1 @@ +Linear structural causal models (SCMs) are used to express and analyze the relationships between random variables. Direct causal effects are represented as directed edges and confounding factors as bidirected edges. Identifying the causal parameters from correlations between the nodes is an open problem in artificial intelligence. In this paper, we study SCMs whose directed component forms a tree. Van der Zander et al. give a PSPACE-algorithm for the identification problem in this case, which is a significant improvement over the general Gröbner basis approach, which has doubly-exponential time complexity in the number of structural parameters. However, they do not show that their algorithm is complete. In this work, we present a randomized polynomial-time algorithm, which solves the identification problem for tree-shaped SCMs. For every structural parameter, our algorithm decides whether it is generically identifiable, generically 2-identifiable, or generically unidentifiable. (No other cases can occur.) In the first two cases, it provides one or two fractional affine square root terms of polynomials (FASTPs) for the corresponding parameter, respectively. In particular, our algorithm is not only polynomial time, but also complete for tree-shaped SCMs. \ No newline at end of file diff --git a/data/2024/aaai/Identification of Causal Structure in the Presence of Missing Data with Additive Noise Model b/data/2024/aaai/Identification of Causal Structure in the Presence of Missing Data with Additive Noise Model new file mode 100644 index 0000000000..b936ec4bb8 --- /dev/null +++ b/data/2024/aaai/Identification of Causal Structure in the Presence of Missing Data with Additive Noise Model @@ -0,0 +1,3 @@ +Missing data are an unavoidable complication frequently encountered in many causal discovery tasks. +When a missing process depends on the missing values themselves (known as self-masking missingness), the recovery of the joint distribution becomes unattainable, and detecting the presence of such self-masking missingness remains a perplexing challenge. Consequently, due to the inability to reconstruct the original distribution and to discern the underlying missingness mechanism, simply applying existing causal discovery methods would lead to wrong conclusions. In this work, we find that recent advances in the additive noise model have the potential for learning causal structure in the presence of self-masking missingness. With this observation, we aim to investigate the identification problem of learning causal structure from missing data under an additive noise model with different missingness mechanisms, where the `no self-masking missingness' assumption can be eliminated appropriately. +Specifically, we first elegantly extend the scope of identifiability of the causal skeleton to the case with weak self-masking missingness (i.e., no other variable could be the cause of the self-masking indicators except itself). 
We further provide the sufficient and necessary identification conditions of the causal direction under the additive noise model and show that the causal structure can be identified up to an IN-equivalent pattern. We finally propose a practical algorithm based on the above theoretical results for learning the causal skeleton and causal direction. Extensive experiments on synthetic and real data demonstrate the efficiency and effectiveness of the proposed algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Identification of Causal Structure with Latent Variables Based on Higher Order Cumulants b/data/2024/aaai/Identification of Causal Structure with Latent Variables Based on Higher Order Cumulants new file mode 100644 index 0000000000..573baceb25 --- /dev/null +++ b/data/2024/aaai/Identification of Causal Structure with Latent Variables Based on Higher Order Cumulants @@ -0,0 +1 @@ +Causal discovery with latent variables is a crucial but challenging task. Despite the emergence of numerous methods aimed at addressing this challenge, they cannot fully identify the structure in which two observed variables are influenced by one latent variable and there might be a directed edge between them. Interestingly, we notice that this structure can be identified through the utilization of higher-order cumulants. By leveraging the higher-order cumulants of non-Gaussian data, we provide an analytical solution for estimating the causal coefficients or their ratios. With the estimated (ratios of) causal coefficients, we propose a novel approach to identify the existence of a causal edge between two observed variables subject to latent variable influence. In case such a causal edge exists, we introduce an asymmetry criterion to determine the causal direction. The experimental results demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching b/data/2024/aaai/Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching new file mode 100644 index 0000000000..272468684f --- /dev/null +++ b/data/2024/aaai/Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching @@ -0,0 +1 @@ +Image-text matching bridges vision and language and is a fundamental task in multimodal intelligence. Its key challenge lies in how to capture visual-semantic relevance. Fine-grained semantic interactions come from fragment alignments between image regions and text words. However, not all fragments contribute to image-text relevance, and many existing methods are devoted to mining the vital ones to measure the relevance accurately. How well an image and a text relate depends on the degree of semantic sharing between them. Treating the degree as an effect and fragments as its possible causes, we define those causes indispensable for the generation of the degree as necessary undertakers, i.e., if any of them did not occur, the relevance would no longer be valid. In this paper, we revisit image-text matching in the causal view and uncover inherent causal properties of relevance generation. 
Then we propose a novel theoretical prototype for estimating the probability-of-necessity of fragments, PN_f, for the degree of semantic sharing by means of causal inference, and further design a Necessary Undertaker Identification Framework (NUIF) for image-text matching, which explicitly formalizes the fragment's contribution to image-text relevance by modeling PN_f in two ways. Extensive experiments show that our method achieves state-of-the-art performance on the Flickr30K and MSCOCO benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Identifying Guarantors of War Veterans Using Robust-SEAL: A Case of the Korean War b/data/2024/aaai/Identifying Guarantors of War Veterans Using Robust-SEAL: A Case of the Korean War new file mode 100644 index 0000000000..106e466f18 --- /dev/null +++ b/data/2024/aaai/Identifying Guarantors of War Veterans Using Robust-SEAL: A Case of the Korean War @@ -0,0 +1 @@ +Most countries provide veterans with various benefits to reward their sacrifice. Unfortunately, many veterans have failed to prove their status due to loss of military records. Thus, some governments allow the verification of those veterans through "buddy statements" obtained from the people who can vouch for the buddy's participation in the war. However, it is still challenging for veterans to find guarantors directly. With this background, we suggest utilizing historical war records of combined operations to increase the pool of potential guarantors for the buddy statements. However, a combined operation network among troops can have missing edges and perturbations in the attributes of the troops due to inaccurate information. In this study, we learn from some recorded interactions, which might be incomplete and noisy, and predict missing linkages among the troops that might have interacted together in the war, by proposing Robust-SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction). It combines two Graph Neural Network (GNN) architectures: a robust Graph Convolutional Network, which considers the uncertainty of node attributes with a probabilistic approach, and SEAL, which improves the expressive power of the GNN with a labeling trick. Our proposed approach was applied to Korean War data with perturbations. For experimentation, we hid some actual interactions and found that Robust-SEAL restores missing interactions better than other GNN-based baselines. \ No newline at end of file diff --git a/data/2024/aaai/Identifying Reasons for Bias: An Argumentation-Based Approach b/data/2024/aaai/Identifying Reasons for Bias: An Argumentation-Based Approach new file mode 100644 index 0000000000..326b7c3127 --- /dev/null +++ b/data/2024/aaai/Identifying Reasons for Bias: An Argumentation-Based Approach @@ -0,0 +1 @@ +As algorithmic decision-making systems become more prevalent in society, ensuring the fairness of these systems is becoming increasingly important. Whilst there has been substantial research in building fair algorithmic decision-making systems, the majority of these methods require access to the training data, including personal characteristics, and are not transparent regarding which individuals are classified unfairly. In this paper, we propose a novel model-agnostic argumentation-based method to determine why an individual is classified differently in comparison to similar individuals. 
Our method uses a quantitative argumentation framework to represent attribute-value pairs of an individual and of those similar to them, and uses a well-known semantics to identify the attribute-value pairs of the individual that contribute most to their different classification. We evaluate our method on two datasets commonly used in the fairness literature and illustrate its effectiveness in the identification of bias. \ No newline at end of file diff --git a/data/2024/aaai/Identifying and Addressing Disparities in Public Libraries with Bayesian Latent Variable Modeling b/data/2024/aaai/Identifying and Addressing Disparities in Public Libraries with Bayesian Latent Variable Modeling new file mode 100644 index 0000000000..f131868e39 --- /dev/null +++ b/data/2024/aaai/Identifying and Addressing Disparities in Public Libraries with Bayesian Latent Variable Modeling @@ -0,0 +1,3 @@ +Public libraries are an essential public good. We ask: are urban library systems providing equitable service to all residents, in terms of the books they have access to and check out? If not, what causes disparities: heterogeneous book collections, resident behavior and access, and/or operational policies? Existing methods leverage only system-level outcome data (such as overall checkouts per branch), and so cannot distinguish between these factors. As a result, it is difficult to use their results to guide interventions to increase equitable access. We propose a Bayesian framework to characterize book checkout behavior across multiple branches of a library system, learning heterogeneous book popularity, overall branch demand, and usage of the online hold system, while controlling for book availability. + +In collaboration with the New York Public Library, we apply our framework to granular data consisting of over 400,000 checkouts during 2022. We first show that our model significantly outperforms baseline methods in predicting checkouts at the book-branch level. Next, we study spatial and socioeconomic disparities. We show that disparities are largely driven by disparate use of the online holds system, which allows library patrons to receive books from any other branch through an online portal. This system thus leads to a large outflow of popular books from branches in lower-income neighborhoods to those in higher-income ones. Finally, we illustrate the use of our model and insights to quantify the impact of potential interventions, such as changing how books are internally routed between branches to fulfill hold requests. \ No newline at end of file diff --git a/data/2024/aaai/Identifying, Mitigating, and Anticipating Bias in Algorithmic Decisions b/data/2024/aaai/Identifying, Mitigating, and Anticipating Bias in Algorithmic Decisions new file mode 100644 index 0000000000..cd8e077b59 --- /dev/null +++ b/data/2024/aaai/Identifying, Mitigating, and Anticipating Bias in Algorithmic Decisions @@ -0,0 +1 @@ +Today's machine learning (ML) applications predominantly adhere to a standard paradigm: the decision maker designs the algorithm by optimizing a model for some objective function. While this has proven to be a powerful approach in many domains, it comes with inherent side effects: the power over the algorithmic outcomes lies solely in the hands of the algorithm designer, and alternative objectives, such as fairness, are often disregarded. This is particularly problematic if the algorithm is used to make consequential decisions that affect people's lives.
My research focuses on developing principled methods to characterize and address the mismatch between these different objectives. \ No newline at end of file diff --git a/data/2024/aaai/Image Captioning with Multi-Context Synthetic Data b/data/2024/aaai/Image Captioning with Multi-Context Synthetic Data new file mode 100644 index 0000000000..0329ad4783 --- /dev/null +++ b/data/2024/aaai/Image Captioning with Multi-Context Synthetic Data @@ -0,0 +1 @@ +Image captioning requires numerous annotated image-text pairs, resulting in substantial annotation costs. Recently, large models (e.g., diffusion models and large language models) have excelled in producing high-quality images and text. This potential can be harnessed to create synthetic image-text pairs for training captioning models. Synthetic data can improve cost and time efficiency in data collection, allow for customization to specific domains, bootstrap generalization capability for zero-shot performance, and circumvent privacy concerns associated with real-world data. However, existing methods struggle to attain satisfactory performance solely through synthetic data. We identify the issue as follows: images generated from simple descriptions mostly capture a solitary perspective with limited context, failing to align with the intricate scenes prevalent in real-world imagery. To tackle this, we present an innovative pipeline that introduces multi-context data generation. Beginning with an initial text corpus, our approach employs a large language model to extract multiple sentences portraying the same scene from diverse viewpoints. These sentences are then condensed into a single sentence with multiple contexts. Subsequently, we generate intricate images using the condensed captions through diffusion models. Our model is exclusively trained on synthetic image-text pairs crafted through this process. The effectiveness of our pipeline is validated through experimental results in both the in-domain and cross-domain settings, where it achieves state-of-the-art performance on well-known datasets such as MSCOCO, Flickr30k, and NoCaps. \ No newline at end of file diff --git a/data/2024/aaai/Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually b/data/2024/aaai/Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually new file mode 100644 index 0000000000..3e43277c66 --- /dev/null +++ b/data/2024/aaai/Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually @@ -0,0 +1 @@ +Social media platforms are being increasingly used by malicious actors to share unsafe content, such as images depicting sexual activity, cyberbullying, and self-harm. Consequently, major platforms use artificial intelligence (AI) and human moderation to obfuscate such images to make them safer. Two critical needs for obfuscating unsafe images are that an accurate rationale for obfuscating image regions must be provided and that the sensitive regions should be obfuscated (e.g., by blurring) for users' safety. This process involves addressing two key problems: (1) obfuscating unsafe images requires the platform to provide an accurate rationale grounded in unsafe image-specific attributes, and (2) the unsafe regions in the image must be minimally obfuscated while still depicting the safe regions.
In this work, we address these key issues by first designing a visual reasoning model (VLM) conditioned on pre-trained unsafe image classifiers to provide an accurate rationale grounded in unsafe image attributes. We then propose a counterfactual explanation algorithm that minimally identifies and obfuscates unsafe regions for safe viewing: it uses an unsafe image classifier attribution matrix to guide segmentation toward a more optimal subregion segmentation, followed by an informed greedy search that determines, based on attribution scores, the minimum number of subregions required to modify the classifier's output. Extensive experiments on uncurated data from social networks emphasize the efficacy of our proposed method. We make our code available at: https://github.com/SecureAIAutonomyLab/ConditionalVLM \ No newline at end of file diff --git a/data/2024/aaai/Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network b/data/2024/aaai/Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network new file mode 100644 index 0000000000..a4de350f14 --- /dev/null +++ b/data/2024/aaai/Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network @@ -0,0 +1 @@ +Scene text recognition is inherently a vision-language task. However, previous works have predominantly focused either on extracting more robust visual features or on designing better language modeling. How to effectively and jointly model vision and language to mitigate heavy reliance on a single modality remains a problem. In this paper, aiming to enhance vision-language reasoning in scene text recognition, we present a balanced, unified and synchronized vision-language reasoning network (BUSNet). Firstly, revisiting the image as a language by balanced concatenation along the length dimension alleviates the issue of over-reliance on vision or language. Secondly, BUSNet learns an ensemble of unified external and internal vision-language models with shared weights by masked modality modeling (MMM). Thirdly, a novel vision-language reasoning module (VLRM) with synchronized vision-language decoding capacity is proposed. Additionally, BUSNet achieves improved performance through iterative reasoning, which utilizes the vision-language prediction as a new language input. Extensive experiments indicate that BUSNet achieves state-of-the-art performance on several mainstream benchmark datasets and more challenging datasets for both synthetic and real training data compared to recent outstanding methods. Code and dataset will be available at https://github.com/jjwei66/BUSNet. \ No newline at end of file diff --git a/data/2024/aaai/ImageCaptioner2: Image Captioner for Image Captioning Bias Amplification Assessment b/data/2024/aaai/ImageCaptioner2: Image Captioner for Image Captioning Bias Amplification Assessment new file mode 100644 index 0000000000..b06de009e6 --- /dev/null +++ b/data/2024/aaai/ImageCaptioner2: Image Captioner for Image Captioning Bias Amplification Assessment @@ -0,0 +1,2 @@ +Most pre-trained learning systems are known to suffer from bias, which typically emerges from the data, the model, or both. Measuring and quantifying bias and its sources is a challenging task and has been extensively studied in image +captioning.
Despite the significant effort in this direction, we observe that existing metrics lack consistency in the inclusion of the visual signal. In this paper, we introduce a new bias assessment metric, dubbed ImageCaptioner2, for image captioning. Instead of measuring the absolute bias in the model or the data, ImageCaptioner2 pays more attention to the bias introduced by the model w.r.t. the data bias, termed bias amplification. Unlike existing methods, which evaluate image captioning algorithms based only on the generated captions, ImageCaptioner2 incorporates the image while measuring the bias. In addition, we design a formulation for measuring the bias of generated captions as prompt-based image captioning instead of using language classifiers. Finally, we apply our ImageCaptioner2 metric across 11 different image captioning architectures on three different datasets, i.e., MS-COCO caption dataset, Artemis V1, and Artemis V2, and on three different protected attributes, i.e., gender, race, and emotions. Consequently, we verify the effectiveness of our ImageCaptioner2 metric by proposing Anonymous-Bench, which is a novel human evaluation paradigm for bias metrics. Our metric shows significant superiority over the recent bias metric, LIC, in terms of human alignment, where the correlation scores are 80% and 54% for our metric and LIC, respectively. The code and more details are available at https://eslambakr.github.io/imagecaptioner2.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/ImageSTEAM: Teacher Professional Development for Integrating Visual Computing into Middle School Lessons b/data/2024/aaai/ImageSTEAM: Teacher Professional Development for Integrating Visual Computing into Middle School Lessons new file mode 100644 index 0000000000..d3e8beeaf3 --- /dev/null +++ b/data/2024/aaai/ImageSTEAM: Teacher Professional Development for Integrating Visual Computing into Middle School Lessons @@ -0,0 +1 @@ +Artificial intelligence (AI) and its teaching in the K-12 grades have been championed as a vital need for the United States due to the technology's future prominence in the 21st century. However, there remain several barriers to effective AI lessons at these age groups, including the broad range of interdisciplinary knowledge needed and the lack of formal training or preparation for teachers to implement these lessons. In this experience report, we present ImageSTEAM, a teacher professional development program for creating lessons surrounding computer vision, machine learning, and computational photography/cameras, targeted at middle school (grades 6-8) classes. Teacher professional development workshops were conducted in the states of Arizona and Georgia from 2021-2023, where lessons were co-created with teachers to introduce various specific visual computing concepts while aligning to state and national standards. In addition, a variety of computer vision and image processing software, including custom-designed Python notebooks, was created for technology activities and demonstrations to be used in the classroom. Educational research showed that teachers improved their self-efficacy and outcomes for concepts in computer vision, machine learning, and artificial intelligence when participating in the program. Results from the professional development workshops highlight key opportunities and challenges in integrating this content into the standard curriculum, the benefits of a co-creation pedagogy, and the positive impact on teachers' and students' learning experiences.
The open-source program curriculum is available at www.imagesteam.org. \ No newline at end of file diff --git a/data/2024/aaai/Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning b/data/2024/aaai/Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..33cf893873 --- /dev/null +++ b/data/2024/aaai/Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at the critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. Particularly, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models. \ No newline at end of file diff --git a/data/2024/aaai/Imitation of Life: A Search Engine for Biologically Inspired Design b/data/2024/aaai/Imitation of Life: A Search Engine for Biologically Inspired Design new file mode 100644 index 0000000000..54d3613294 --- /dev/null +++ b/data/2024/aaai/Imitation of Life: A Search Engine for Biologically Inspired Design @@ -0,0 +1,3 @@ +Biologically Inspired Design (BID), or Biomimicry, is a problem-solving methodology that applies analogies from nature to solve engineering challenges. For example, Speedo engineers designed swimsuits based on shark skin. Finding relevant biological solutions for real-world problems poses significant challenges, both due to the limited biological knowledge engineers and designers typically possess and to the limited BID resources. Existing BID datasets are hand-curated and small, and scaling them up requires costly human annotations. + +In this paper, we introduce BARcode (Biological Analogy Retriever), a search engine for automatically mining bio-inspirations from the web at scale. 
Using advances in natural language understanding and data programming, BARcode identifies potential inspirations for engineering challenges. Our experiments demonstrate that BARcode can retrieve inspirations that are valuable to engineers and designers tackling real-world problems, as well as recover famous historical BID examples. We release data and code; we view BARcode as a step towards addressing the challenges that have historically hindered the practical application of BID to engineering innovation. \ No newline at end of file diff --git a/data/2024/aaai/Impartial Adversarial Distillation: Addressing Biased Data-Free Knowledge Distillation via Adaptive Constrained Optimization b/data/2024/aaai/Impartial Adversarial Distillation: Addressing Biased Data-Free Knowledge Distillation via Adaptive Constrained Optimization new file mode 100644 index 0000000000..9f167f1e76 --- /dev/null +++ b/data/2024/aaai/Impartial Adversarial Distillation: Addressing Biased Data-Free Knowledge Distillation via Adaptive Constrained Optimization @@ -0,0 +1 @@ +Data-Free Knowledge Distillation (DFKD) enables knowledge transfer from a pretrained teacher to a lightweight student without original training data. Existing works are limited by a strong assumption that samples used to pretrain the teacher model are balanced, which is, however, unrealistic for many real-world tasks. In this work, we investigate a pragmatic yet under-explored problem: how to perform DFKD from a teacher model pretrained on imbalanced data. We observe a seemingly counter-intuitive phenomenon, i.e., adversarial DFKD algorithms favour minority classes, while causing a disastrous impact on majority classes. We theoretically prove that a biased teacher could cause severe disparity across different groups of synthetic data in adversarial distillation, which further exacerbates the mode collapse of a generator and consequently degrades the overall accuracy of a distilled student model. To tackle this problem, we propose a class-adaptive regularization method, aiming to encourage impartial representation learning of a generator among different classes under a constrained learning formulation. We devise a primal-dual algorithm to solve the target optimization problem. Through extensive experiments, we show that our method mitigates the biased learning of majority classes in DFKD and improves the overall performance compared with baselines. Code will be available at https://github.com/ldpbuaa/ipad. \ No newline at end of file diff --git a/data/2024/aaai/Implications of Distance over Redistricting Maps: Central and Outlier Maps b/data/2024/aaai/Implications of Distance over Redistricting Maps: Central and Outlier Maps new file mode 100644 index 0000000000..849a3173f2 --- /dev/null +++ b/data/2024/aaai/Implications of Distance over Redistricting Maps: Central and Outlier Maps @@ -0,0 +1 @@ +In representative democracy, a redistricting map is chosen to partition an electorate into districts, each of which elects a representative. A valid redistricting map must satisfy a collection of constraints such as being compact, contiguous, and of almost-equal population. However, these constraints are loose enough to enable an enormous ensemble of valid redistricting maps. This enables a partisan legislature to gerrymander by choosing a map which unfairly favors it.
In this paper, we introduce an interpretable and tractable distance measure over redistricting maps that does not use election results and study its implications over the ensemble of redistricting maps. Specifically, we define a central map which may be considered "most typical" and give a rigorous justification for it by showing that it mirrors the Kemeny ranking in a scenario where we have a committee voting over a collection of redistricting maps to be drawn. We include running time and sample complexity analysis for our algorithms, including some negative results which hold using any algorithm. We further study outlier detection based on this distance measure and show that our framework can detect some gerrymandered maps. More precisely, we show that some maps widely considered to be gerrymandered lie very far away from our central maps in comparison to a large ensemble of valid redistricting maps. Since our distance measure does not rely on election results, this gives a significant advantage in gerrymandering detection that is lacking in all previous methods. \ No newline at end of file diff --git a/data/2024/aaai/Implicit Modeling of Non-rigid Objects with Cross-Category Signals b/data/2024/aaai/Implicit Modeling of Non-rigid Objects with Cross-Category Signals new file mode 100644 index 0000000000..e464ab8d08 --- /dev/null +++ b/data/2024/aaai/Implicit Modeling of Non-rigid Objects with Cross-Category Signals @@ -0,0 +1 @@ +Deep implicit functions (DIFs) have emerged as a potent and articulate means of representing 3D shapes. However, methods modeling object categories or non-rigid entities have mainly focused on single-object scenarios. In this work, we propose MODIF, a multi-object deep implicit function that jointly learns the deformation fields and instance-specific latent codes for multiple objects at once. Our emphasis is on non-rigid, non-interpenetrating entities such as organs. To effectively capture the interrelation between these entities and ensure precise, collision-free representations, our approach facilitates signaling between category-specific fields to adequately rectify shapes. We also introduce novel inter-object supervision: an attraction-repulsion loss is formulated to refine contact regions between objects. Our approach is demonstrated on various medical benchmarks, involving modeling different groups of intricate anatomical entities. Experimental results illustrate that our model can proficiently learn the shape representation of each organ and their relations to others, to the point that shapes missing from unseen instances can be consistently recovered by our method. Finally, MODIF can also propagate semantic information throughout the population via accurate point correspondences. \ No newline at end of file diff --git "a/data/2024/aaai/Improve Robustness of Reinforcement Learning against Observation Perturbations via l\342\210\236 Lipschitz Policy Networks" "b/data/2024/aaai/Improve Robustness of Reinforcement Learning against Observation Perturbations via l\342\210\236 Lipschitz Policy Networks" new file mode 100644 index 0000000000..2785cfc6b0 --- /dev/null +++ "b/data/2024/aaai/Improve Robustness of Reinforcement Learning against Observation Perturbations via l\342\210\236 Lipschitz Policy Networks" @@ -0,0 +1 @@ +Deep Reinforcement Learning (DRL) has achieved remarkable advances in sequential decision tasks. However, recent works have revealed that DRL agents are susceptible to slight perturbations in observations.
This vulnerability raises concerns regarding the effectiveness and robustness of deploying such agents in real-world applications. In this work, we propose a novel robust reinforcement learning method called SortRL, which improves the robustness of DRL policies against observation perturbations from the perspective of the network architecture. We employ a novel architecture for the policy network that incorporates global $l_\infty$ Lipschitz continuity and provide a convenient method to enhance policy robustness based on the output margin. Besides, a training framework is designed for SortRL, which solves given tasks while maintaining robustness against $l_\infty$ bounded perturbations on the observations. Several experiments are conducted to evaluate the effectiveness of our method, including classic control tasks and video games. The results demonstrate that SortRL achieves state-of-the-art robustness performance against different perturbation strengths. \ No newline at end of file diff --git a/data/2024/aaai/Improved Anonymous Multi-Agent Path Finding Algorithm b/data/2024/aaai/Improved Anonymous Multi-Agent Path Finding Algorithm new file mode 100644 index 0000000000..7fe73f1cc3 --- /dev/null +++ b/data/2024/aaai/Improved Anonymous Multi-Agent Path Finding Algorithm @@ -0,0 +1 @@ +We consider an Anonymous Multi-Agent Path-Finding (AMAPF) problem where the set of agents is confined to a graph, a set of goal vertices is given and each of these vertices has to be reached by some agent. The problem is to find an assignment of the goals to the agents as well as the collision-free paths, and we are interested in finding the solution with the optimal makespan. A well-established approach to solve this problem is to reduce it to a special type of a graph search problem, i.e., to the problem of finding a maximum flow on an auxiliary graph induced by the input one. The size of this auxiliary graph may be very large and the search on it may become a bottleneck. To this end, we suggest a specific search algorithm that leverages the idea of exploring the search space not by considering separate search states but rather bulks of them simultaneously. That is, we implicitly compress, store and expand bulks of the search states as single states, which results in a substantial reduction in runtime and memory. Empirically, the resultant AMAPF solver demonstrates superior performance compared to the state-of-the-art competitor and is able to solve all publicly available MAPF instances from the well-known MovingAI benchmark in less than 30 seconds. \ No newline at end of file diff --git a/data/2024/aaai/Improved Bandits in Many-to-One Matching Markets with Incentive Compatibility b/data/2024/aaai/Improved Bandits in Many-to-One Matching Markets with Incentive Compatibility new file mode 100644 index 0000000000..ab0ad6c5af --- /dev/null +++ b/data/2024/aaai/Improved Bandits in Many-to-One Matching Markets with Incentive Compatibility @@ -0,0 +1 @@ +Two-sided matching markets have been widely studied in the literature due to their rich applications. Since participants are usually uncertain about their preferences, online algorithms have recently been adopted to learn them through iterative interactions. An existing work initiates the study of this problem in a many-to-one setting with responsiveness. However, their results are far from optimal and lack guarantees of incentive compatibility.
We first extend an existing algorithm for the one-to-one setting to this more general setting and show it achieves a near-optimal bound for player-optimal regret. Nevertheless, due to the substantial requirement for collaboration, a single player's deviation could lead to a huge increase in its own cumulative rewards and a linear regret for others. In this paper, we aim to enhance the regret bound in many-to-one markets while ensuring incentive compatibility. We first propose the adaptively explore-then-deferred-acceptance (AETDA) algorithm for the responsiveness setting and derive an upper bound for player-optimal stable regret while demonstrating its guarantee of incentive compatibility. This result is a significant improvement over existing works and, to the best of our knowledge, constitutes the first player-optimal guarantee in matching markets that offers such robust assurances. We also consider broader substitutable preferences, one of the most general conditions ensuring the existence of a stable matching, which also covers responsiveness. We devise an online DA (ODA) algorithm and establish an upper bound for the player-pessimal stable regret for this setting. \ No newline at end of file diff --git a/data/2024/aaai/Improved Graph Contrastive Learning for Short Text Classification b/data/2024/aaai/Improved Graph Contrastive Learning for Short Text Classification new file mode 100644 index 0000000000..a076dbb4fd --- /dev/null +++ b/data/2024/aaai/Improved Graph Contrastive Learning for Short Text Classification @@ -0,0 +1,2 @@ +Text classification plays an important role in natural language processing and has many applications in real life. Short text classification, as one of its subtopics, has attracted increasing interest from researchers since it is more challenging due to its semantic sparsity and insufficient labeled data. Recent studies attempt to combine graph learning and contrastive learning to alleviate the above problems in short text classification. Despite their fruitful success, there are still several inherent limitations. First, the generation of augmented views may disrupt the semantic structure within the text and introduce negative effects due to noise permutation. Second, they ignore the clustering-friendly features in unlabeled data and fail to further utilize the prior information in the few valuable labeled data. To this end, we propose a novel model that utilizes improved Graph contrastIve learning for short text classiFicaTion (GIFT). Specifically, we construct a heterogeneous graph containing several component graphs by mining from an internal corpus and introducing an external knowledge graph. Then, we use singular value decomposition to generate augmented views for graph contrastive learning. Moreover, we employ constrained k-means on labeled texts to learn clustering-friendly features, which facilitate cluster-oriented contrastive learning and assist in obtaining better category boundaries. Extensive experimental results show that GIFT significantly outperforms previous state-of-the-art methods. Our code can be found at +https://github.com/KEAML-JLU/GIFT.
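As a rough illustration of the SVD-based view generation mentioned in the GIFT abstract above, the following sketch builds an augmented graph view as a low-rank reconstruction of an adjacency matrix; the function name, the rank hyperparameter, and the toy graph are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def svd_augmented_view(adj: np.ndarray, rank: int = 16) -> np.ndarray:
    """Build an augmented graph view as a low-rank SVD reconstruction of
    an adjacency matrix. Hypothetical helper for illustration only."""
    u, s, vt = np.linalg.svd(adj, full_matrices=False)
    # Keep only the top-`rank` singular components; this preserves the
    # dominant connectivity structure while discarding noisy detail.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

# Toy usage: a random symmetric adjacency matrix over 100 nodes.
rng = np.random.default_rng(0)
a = (rng.random((100, 100)) < 0.05).astype(float)
a = np.maximum(a, a.T)
view = svd_augmented_view(a, rank=16)  # second view for contrastive learning
```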
\ No newline at end of file diff --git a/data/2024/aaai/Improved MLP Point Cloud Processing with High-Dimensional Positional Encoding b/data/2024/aaai/Improved MLP Point Cloud Processing with High-Dimensional Positional Encoding new file mode 100644 index 0000000000..a7913069bc --- /dev/null +++ b/data/2024/aaai/Improved MLP Point Cloud Processing with High-Dimensional Positional Encoding @@ -0,0 +1 @@ +Multi-Layer Perceptron (MLP) models are the bedrock of contemporary point cloud processing. However, their complex network architectures obscure the source of their strength. We first develop an “abstraction and refinement” (ABS-REF) view for the neural modeling of point clouds. This view elucidates that whereas the early models focused on the ABS stage, the more recent techniques devise sophisticated REF stages to attain performance advantage in point cloud processing. We then borrow the concept of “positional encoding” from the transformer literature, and propose a High-dimensional Positional Encoding (HPE) module, which can be readily deployed in MLP-based architectures. We leverage our module to develop a suite of HPENets, which are MLP networks that follow the ABS-REF paradigm, albeit with a sophisticated HPE-based REF stage. The developed technique is extensively evaluated for 3D object classification, object part segmentation, semantic segmentation and object detection. We establish new state-of-the-art results of 87.6 mAcc on ScanObjectNN for object classification, 85.5 class mIoU on ShapeNetPart for object part segmentation, and 72.7 and 78.7 mIoU on the Area-5 and 6-fold experiments with S3DIS for semantic segmentation. The source code for this work is available at https://github.com/zouyanmei/HPENet. \ No newline at end of file diff --git a/data/2024/aaai/Improved Metric Distortion via Threshold Approvals b/data/2024/aaai/Improved Metric Distortion via Threshold Approvals new file mode 100644 index 0000000000..2fdd6f0b1f --- /dev/null +++ b/data/2024/aaai/Improved Metric Distortion via Threshold Approvals @@ -0,0 +1 @@ +We consider a social choice setting in which agents and alternatives are represented by points in a metric space, and the cost of an agent for an alternative is the distance between the corresponding points in the space. The goal is to choose a single alternative to (approximately) minimize the social cost (cost of all agents) or the maximum cost of any agent, when only limited information about the preferences of the agents is given. Previous work has shown that the best possible distortion one can hope to achieve is 3 when access to the ordinal preferences of the agents is given, even when the distances between alternatives in the metric space are known. We improve upon this bound of 3 by designing deterministic mechanisms that exploit a bit of cardinal information. We show that it is possible to achieve distortion 1+sqrt(2) by using the ordinal preferences of the agents, the distances between alternatives, and a threshold approval set per agent that contains all alternatives for which her cost is within an appropriately chosen factor of her cost for her most-preferred alternative. We show that this bound is the best possible for any deterministic mechanism in general metric spaces, and also provide improved bounds for the fundamental case of a line metric.
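The threshold approval set described in the metric distortion abstract above admits a direct sketch: for one agent, keep every alternative whose cost is within a chosen factor of the cost of her most-preferred alternative. The factor value below is a placeholder; the paper chooses it appropriately to obtain the 1+sqrt(2) distortion guarantee.

```python
from typing import Dict, List

def threshold_approval_set(costs: Dict[str, float], factor: float) -> List[str]:
    """Return all alternatives whose cost for this agent is within `factor`
    times her cost for her most-preferred alternative. `factor` is a
    placeholder value, not the paper's appropriately chosen one."""
    best = min(costs.values())
    return [alt for alt, c in costs.items() if c <= factor * best]

# Toy usage with made-up metric costs (distances) for one agent.
agent_costs = {"a": 2.0, "b": 2.5, "c": 7.0}
print(threshold_approval_set(agent_costs, factor=1.5))  # ['a', 'b']
```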
\ No newline at end of file diff --git a/data/2024/aaai/Improving Audio-Visual Segmentation with Bidirectional Generation b/data/2024/aaai/Improving Audio-Visual Segmentation with Bidirectional Generation new file mode 100644 index 0000000000..abc38f04ac --- /dev/null +++ b/data/2024/aaai/Improving Audio-Visual Segmentation with Bidirectional Generation @@ -0,0 +1 @@ +The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the contribution of each modality is implicitly or explicitly modeled. Nevertheless, the interconnections between different modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing the performance of AVS. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes reconstruction errors. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be challenging to capture using conventional optical flow methods. To showcase the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark. As a result, we establish a new state-of-the-art performance level on the AVS benchmark, particularly excelling in the challenging MS3 subset, which involves segmenting multiple sound sources. Code is released at: https://github.com/OpenNLPLab/AVS-bidirectional. \ No newline at end of file diff --git a/data/2024/aaai/Improving Automatic VQA Evaluation Using Large Language Models b/data/2024/aaai/Improving Automatic VQA Evaluation Using Large Language Models new file mode 100644 index 0000000000..5ad8efa3fe --- /dev/null +++ b/data/2024/aaai/Improving Automatic VQA Evaluation Using Large Language Models @@ -0,0 +1 @@ +Eight years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accuracy has been effective so far in the IID evaluation setting. However, our community is undergoing a shift towards open-ended generative models and OOD evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need to develop more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned large language models (LLMs) to build a better VQA metric. We formulate VQA evaluation as an answer-rating task where the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate that the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks. We hope wide adoption of our metric will contribute to better estimating the research progress on the VQA task. We plan to release the evaluation code and collected human judgments.
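A minimal sketch of the answer-rating formulation described in the VQA-evaluation abstract above: assemble a prompt that asks an instruction-tuned LLM to score a candidate answer against a set of reference answers. The prompt wording, the 1-5 scale, and the function name are assumptions for illustration, not the authors' released prompt.

```python
def build_rating_prompt(question: str, references: list, candidate: str) -> str:
    """Assemble an answer-rating prompt for an instruction-tuned LLM.
    The wording and the 1-5 scale are illustrative assumptions."""
    refs = "; ".join(references)
    return (
        "Rate how accurate the candidate answer is, given the question "
        "and the reference answers. Reply with a single number from 1 "
        "(completely wrong) to 5 (fully correct).\n"
        f"Question: {question}\n"
        f"Reference answers: {refs}\n"
        f"Candidate answer: {candidate}\n"
        "Rating:"
    )

prompt = build_rating_prompt(
    "What color is the bus?", ["yellow", "yellow and black"], "it is yellow"
)
# The prompt is then sent to the LLM; the returned rating serves as the
# accuracy score for this candidate answer.
```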
\ No newline at end of file diff --git a/data/2024/aaai/Improving Autonomous Separation Assurance through Distributed Reinforcement Learning with Attention Networks b/data/2024/aaai/Improving Autonomous Separation Assurance through Distributed Reinforcement Learning with Attention Networks new file mode 100644 index 0000000000..511816af68 --- /dev/null +++ b/data/2024/aaai/Improving Autonomous Separation Assurance through Distributed Reinforcement Learning with Attention Networks @@ -0,0 +1 @@ +Advanced Air Mobility (AAM) introduces a new, efficient mode of transportation with the use of vehicle autonomy and electrified aircraft to provide increasingly autonomous transportation between previously underserved markets. Safe and efficient navigation of low altitude aircraft through highly dense environments requires the integration of a multitude of complex observations, such as surveillance, knowledge of vehicle dynamics, and weather. The processing and reasoning on these observations pose challenges due to the various sources of uncertainty in the information while ensuring cooperation with a variable number of aircraft in the airspace. These challenges coupled with the requirement to make safety-critical decisions in real-time rule out the use of conventional separation assurance techniques. We present a decentralized reinforcement learning framework to provide autonomous self-separation capabilities within AAM corridors with the use of speed and vertical maneuvers. The problem is formulated as a Markov Decision Process and solved by developing a novel extension to the sample-efficient, off-policy soft actor-critic (SAC) algorithm. We introduce the use of attention networks for variable-length observation processing and a distributed computing architecture to achieve high training sample throughput as compared to existing approaches. A comprehensive numerical study shows that the proposed framework can ensure safe and efficient separation of aircraft in high density, dynamic environments with various sources of uncertainty. \ No newline at end of file diff --git a/data/2024/aaai/Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning b/data/2024/aaai/Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning new file mode 100644 index 0000000000..7280d0aae7 --- /dev/null +++ b/data/2024/aaai/Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning @@ -0,0 +1 @@ +Although image captioning models have made significant advancements in recent years, the majority of them heavily depend on high-quality datasets containing paired images and texts which are costly to acquire. Previous works leverage the CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings. However, not only does a modality gap exist between CLIP text and image features, but a discrepancy also arises between training and inference due to the unavailability of real-world images, which hinders the cross-modal alignment in text-only captioning. This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs. A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space. 
Furthermore, textual information is gathered to represent image features, yielding image features with richer semantics and bridging the modality gap. To unify training and inference, synthetic image features serve as the training prefix for the language decoder, while real images are used for inference. Additionally, salient objects in images are detected to assist in enhancing the learning of modality alignment. Experimental results demonstrate that our method obtains state-of-the-art performance on benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction b/data/2024/aaai/Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction new file mode 100644 index 0000000000..6e1e2a8bc3 --- /dev/null +++ b/data/2024/aaai/Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction @@ -0,0 +1 @@ +The generative diffusion prior captured from an off-the-shelf denoising diffusion generative model has recently attracted significant interest. However, attempts to adapt diffusion models to noisy inverse problems either fail to achieve satisfactory results or require a few thousand iterations to achieve high-quality reconstructions. In this work, we propose a diffusion-based image restoration with error contraction and error correction (DiffECC) method. Two strategies are introduced to contract the restoration error in the posterior sampling process. First, we combine existing CNN-based approaches with diffusion models to ensure data consistency from the beginning. Second, to amplify the error contraction effects of the noise, a restart sampling algorithm is designed. In the error correction strategy, the estimation-correction idea is proposed on both the data term and the prior term. Solving them iteratively within the diffusion sampling framework leads to superior image generation results. Experimental results for image restoration tasks such as super-resolution (SR), Gaussian deblurring, and motion deblurring demonstrate that our approach can reconstruct high-quality images compared with state-of-the-art sampling-based diffusion models. \ No newline at end of file diff --git a/data/2024/aaai/Improving Distinguishability of Class for Graph Neural Networks b/data/2024/aaai/Improving Distinguishability of Class for Graph Neural Networks new file mode 100644 index 0000000000..0c71255dc3 --- /dev/null +++ b/data/2024/aaai/Improving Distinguishability of Class for Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have received widespread attention and applications due to their excellent performance in graph representation learning. Most existing GNNs can only aggregate 1-hop neighbors in a GNN layer, so they usually stack multiple GNN layers to obtain more information from larger neighborhoods. However, many studies have shown that model performance degrades significantly as the number of GNN layers increases. In this paper, we first introduce the concept of distinguishability of class to indirectly evaluate the learned node representations, and verify the positive correlation between distinguishability of class and model performance. Then, we propose a Graph Neural Network guided by Distinguishability of class (Disc-GNN) to monitor the representation learning, so as to learn better node representations and improve model performance.
Specifically, we first perform inter-layer filtering and initial compensation based on Local Distinguishability of Class (LDC) in each layer, so that the learned node representations have the ability to distinguish different classes. Furthermore, we add a regularization term based on Global Distinguishability of Class (GDC) to achieve global optimization of model performance. Extensive experiments on six real-world datasets have shown the competitive performance of Disc-GNN against state-of-the-art methods on node classification and node clustering tasks. \ No newline at end of file diff --git a/data/2024/aaai/Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction b/data/2024/aaai/Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction new file mode 100644 index 0000000000..7d728a594e --- /dev/null +++ b/data/2024/aaai/Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction @@ -0,0 +1 @@ +In recent years, spectral graph neural networks, characterized by polynomial filters, have garnered increasing attention and have achieved remarkable performance in tasks such as node classification. These models typically assume that eigenvalues of the normalized Laplacian matrix are distinct from each other, thus expecting a polynomial filter to have a high fitting ability. However, this paper empirically observes that normalized Laplacian matrices frequently possess repeated eigenvalues. Moreover, we theoretically establish that the number of distinguishable eigenvalues plays a pivotal role in determining the expressive power of spectral graph neural networks. In light of this observation, we propose an eigenvalue correction strategy that can free polynomial filters from the constraints of repeated eigenvalue inputs. Concretely, the proposed eigenvalue correction strategy makes the distribution of eigenvalues more uniform, thus mitigating repeated eigenvalues and improving the fitting capacity and expressive power of polynomial filters. Extensive experimental results on both synthetic and real-world datasets demonstrate the superiority of our method. \ No newline at end of file diff --git a/data/2024/aaai/Improving Factual Error Correction by Learning to Inject Factual Errors b/data/2024/aaai/Improving Factual Error Correction by Learning to Inject Factual Errors new file mode 100644 index 0000000000..05aedf932e --- /dev/null +++ b/data/2024/aaai/Improving Factual Error Correction by Learning to Inject Factual Errors @@ -0,0 +1 @@ +Factual error correction (FEC) aims to revise factual errors in false claims with minimal editing, making them faithful to the provided evidence. This task is crucial for alleviating the hallucination problem encountered by large language models. Given the lack of paired data (i.e., false claims and their corresponding correct claims), existing methods typically adopt the ‘mask-then-correct’ paradigm. This paradigm relies solely on unpaired false claims and correct claims, thus being referred to as distantly supervised methods. These methods require a masker to explicitly identify factual errors within false claims before revising with a corrector. However, the absence of paired data to train the masker makes accurately pinpointing factual errors within claims challenging. To mitigate this, we propose to improve FEC by Learning to Inject Factual Errors (LIFE), a three-step distantly supervised method: ‘mask-corrupt-correct’.
Specifically, we first train a corruptor using the ‘mask-then-corrupt’ procedure, allowing it to deliberately introduce factual errors into correct text. The corruptor is then applied to correct claims, generating a substantial amount of paired data. After that, we filter out low-quality data and use the remaining data to train a corrector. Notably, our corrector does not require a masker, thus circumventing the bottleneck associated with explicit factual error identification. Our experiments on a public dataset verify the effectiveness of LIFE in two key aspects: Firstly, it outperforms the previous best-performing distantly supervised method by a notable margin of 10.59 points in SARI Final (19.3% improvement). Secondly, even compared to ChatGPT prompted with in-context examples, LIFE maintains an advantage of 7.16 points in SARI Final. \ No newline at end of file diff --git a/data/2024/aaai/Improving Faithfulness in Abstractive Text Summarization with EDUs Using BART (Student Abstract) b/data/2024/aaai/Improving Faithfulness in Abstractive Text Summarization with EDUs Using BART (Student Abstract) new file mode 100644 index 0000000000..396ef673b4 --- /dev/null +++ b/data/2024/aaai/Improving Faithfulness in Abstractive Text Summarization with EDUs Using BART (Student Abstract) @@ -0,0 +1 @@ +Abstractive text summarization uses the summarizer’s own words to capture the main information of a source document in a summary. While it is more challenging to automate than extractive text summarization, recent advancements in deep learning approaches and pre-trained language models have improved its performance. However, abstractive text summarization still has issues such as unfaithfulness. To address this problem, we propose a new approach that utilizes important Elementary Discourse Units (EDUs) to guide BART-based text summarization. Our approach showed improvements in truthfulness and source document coverage in comparison to previous studies. \ No newline at end of file diff --git a/data/2024/aaai/Improving GNN Calibration with Discriminative Ability: Insights and Strategies b/data/2024/aaai/Improving GNN Calibration with Discriminative Ability: Insights and Strategies new file mode 100644 index 0000000000..100f582fd4 --- /dev/null +++ b/data/2024/aaai/Improving GNN Calibration with Discriminative Ability: Insights and Strategies @@ -0,0 +1 @@ +The widespread adoption of Graph Neural Networks (GNNs) has led to an increasing focus on their reliability. To address the issue of underconfidence in GNNs, various calibration methods have been developed, yielding notable reductions in calibration error. However, we observe that existing approaches generally fail to enhance consistently, and in some cases even deteriorate, GNNs' ability to discriminate between correct and incorrect predictions. In this study, we advocate the significance of discriminative ability and the inclusion of relevant evaluation metrics. Our rationale is twofold: 1) Overlooking discriminative ability can inadvertently compromise the overall quality of the model; 2) Leveraging discriminative ability can significantly inform and improve calibration outcomes. Therefore, we thoroughly explore the reasons why existing calibration methods are ineffective for, and can even degrade, the discriminative ability of GNNs. Building upon these insights, we conduct GNN calibration experiments across multiple datasets using a straightforward example model, denoted as DC(GNN).
Its excellent performance confirms the potential of integrating discriminative ability as a key consideration in the calibration of GNNs, thereby establishing a pathway toward more effective and reliable network calibration. \ No newline at end of file diff --git a/data/2024/aaai/Improving Health Information Access in the World's Largest Maternal Mobile Health Program via Bandit Algorithms b/data/2024/aaai/Improving Health Information Access in the World's Largest Maternal Mobile Health Program via Bandit Algorithms new file mode 100644 index 0000000000..3dabd9d56d --- /dev/null +++ b/data/2024/aaai/Improving Health Information Access in the World's Largest Maternal Mobile Health Program via Bandit Algorithms @@ -0,0 +1 @@ +Harnessing the widespread availability of cell phones, many nonprofits have launched mobile health (mHealth) programs to deliver information via voice or text to beneficiaries in underserved communities, with maternal and infant health being a key area of such mHealth programs. Unfortunately, dwindling listenership is a major challenge, requiring targeted interventions using limited resources. This paper focuses on Kilkari, the world's largest mHealth program for maternal and child care -- with over 3 million active subscribers at a time -- launched by India's Ministry of Health and Family Welfare (MoHFW) and run by the non-profit ARMMAN. We present a system called CHAHAK that aims to reduce automated dropouts as well as boost engagement with the program through the strategic allocation of interventions to beneficiaries. Past work in a similar domain has focused on a much smaller-scale mHealth program and used Markovian restless multi-armed bandits to optimize a single limited intervention resource. However, this paper demonstrates the challenges of adopting a Markovian approach in Kilkari; therefore, CHAHAK instead relies on non-Markovian time-series restless bandits and optimizes a layered set of multiple interventions to improve listenership. We use real Kilkari data from the Odisha state in India to show CHAHAK's effectiveness in harnessing multiple interventions to boost listenership, benefiting marginalized communities. When deployed, CHAHAK will assist the largest maternal mHealth program to date. \ No newline at end of file diff --git a/data/2024/aaai/Improving IP Geolocation With Target-Centric IP Graph (Student Abstract) b/data/2024/aaai/Improving IP Geolocation With Target-Centric IP Graph (Student Abstract) new file mode 100644 index 0000000000..a234e8155d --- /dev/null +++ b/data/2024/aaai/Improving IP Geolocation With Target-Centric IP Graph (Student Abstract) @@ -0,0 +1 @@ +Accurate IP geolocation is indispensable for location-aware applications. While recent advances based on router-centric IP graphs are considered cutting-edge, one challenge remains: the prevalence of sparse IP graphs (14.24% with fewer than 10 nodes, 9.73% isolated) limits graph learning. To mitigate this issue, we designate the target host as the central node and aggregate multiple last-hop routers to construct the target-centric IP graph, instead of relying solely on the router with the smallest last-hop latency as in previous works. Experiments on three real-world datasets show that our method significantly improves geolocation accuracy compared to existing baselines.
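A minimal sketch of the target-centric construction described in the IP geolocation abstract above: the target host becomes the central node and every observed last-hop router is attached to it, instead of keeping only the router with the smallest last-hop latency. The adjacency-list format and field names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

def build_target_centric_graph(
    target_ip: str, last_hops: List[Tuple[str, float]]
) -> Dict[str, List[Tuple[str, float]]]:
    """Build a star-shaped IP graph with the target host as the central node
    and every observed last-hop router attached to it (edge weight = last-hop
    latency in ms). Illustrative sketch, not the authors' implementation."""
    graph: Dict[str, List[Tuple[str, float]]] = {target_ip: []}
    for router_ip, latency_ms in last_hops:
        graph[target_ip].append((router_ip, latency_ms))
        graph.setdefault(router_ip, []).append((target_ip, latency_ms))
    return graph

# Toy usage: three last-hop routers observed on different probe paths.
g = build_target_centric_graph(
    "203.0.113.7",
    [("198.51.100.1", 3.2), ("198.51.100.9", 4.8), ("192.0.2.14", 2.9)],
)
```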
\ No newline at end of file diff --git a/data/2024/aaai/Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis b/data/2024/aaai/Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis new file mode 100644 index 0000000000..2d76935904 --- /dev/null +++ b/data/2024/aaai/Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis @@ -0,0 +1 @@ +Large language models (LLMs) offer significant promise as a knowledge source for task learning. Prompt engineering has been shown to be effective for eliciting knowledge from an LLM, but alone it is insufficient for acquiring relevant, situationally grounded knowledge for an embodied agent learning novel tasks. We describe a cognitive-agent approach, STARS, that extends and complements prompt engineering, mitigating its limitations and thus enabling an agent to acquire new task knowledge matched to its native language capabilities, embodiment, environment, and user preferences. The STARS approach is to increase the response space of LLMs and deploy general strategies, embedded within the autonomous agent, to evaluate, repair, and select among candidate responses produced by the LLM. We describe the approach and experiments that show how an agent, by retrieving and evaluating a breadth of responses from the LLM, can achieve 77-94% task completion in one-shot learning without user oversight. The approach achieves 100% task completion when human oversight (such as an indication of preference) is provided. Further, the type of oversight largely shifts from explicit, natural language instruction to simple confirmation/disconfirmation of high-quality responses that have been vetted by the agent before presentation to a user. \ No newline at end of file diff --git a/data/2024/aaai/Improving Neural Network Generalization on Data-Limited Regression with Doubly-Robust Boosting b/data/2024/aaai/Improving Neural Network Generalization on Data-Limited Regression with Doubly-Robust Boosting new file mode 100644 index 0000000000..37ab42260a --- /dev/null +++ b/data/2024/aaai/Improving Neural Network Generalization on Data-Limited Regression with Doubly-Robust Boosting @@ -0,0 +1,7 @@ +Enhancing the generalization performance of neural networks given limited data availability remains a formidable challenge, due to the model selection trade-off between training error and generalization gap. +To handle this challenge, we formulate a posterior optimization problem specifically designed to reduce the generalization error of trained neural networks. +To operationalize this concept, we propose a Doubly-Robust Boosting machine (DRBoost), which consists of a statistical learner and a zero-order optimizer. +The statistical learner reduces the model capacity and thus the generalization gap; the zero-order optimizer minimizes the training error in a gradient-free manner. The two components cooperate to reduce the generalization error of a fully trained neural network in a doubly robust manner. +Furthermore, the statistical learner alleviates the multicollinearity in the discriminative layer and enhances the generalization performance. +The zero-order optimizer eliminates the reliance on gradient calculation and offers more flexibility in learning objective selection. +Experiments demonstrate that DRBoost improves the generalization performance of various prevalent neural network backbones effectively.
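To illustrate the gradient-free role that the DRBoost abstract above assigns to its zero-order optimizer, the sketch below minimizes a training loss by simple random-perturbation search; this is a generic stand-in under stated assumptions, not the DRBoost algorithm itself.

```python
import numpy as np

def zero_order_fit(weights: np.ndarray, loss_fn, steps: int = 200,
                   sigma: float = 0.05, seed: int = 0) -> np.ndarray:
    """Gradient-free minimization of a training loss by random-perturbation
    search: keep a perturbed weight vector whenever it lowers the loss.
    Generic illustration of zero-order optimization only."""
    rng = np.random.default_rng(seed)
    best, best_loss = weights.copy(), loss_fn(weights)
    for _ in range(steps):
        cand = best + sigma * rng.standard_normal(best.shape)
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best

# Toy usage: fit a linear output layer on random features without gradients.
rng = np.random.default_rng(1)
x, y = rng.standard_normal((64, 8)), rng.standard_normal(64)
w = zero_order_fit(np.zeros(8), lambda w: np.mean((x @ w - y) ** 2))
```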
\ No newline at end of file diff --git a/data/2024/aaai/Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge b/data/2024/aaai/Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge new file mode 100644 index 0000000000..be638c6bbb --- /dev/null +++ b/data/2024/aaai/Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge @@ -0,0 +1 @@ +Open Set Recognition (OSR) poses significant challenges in distinguishing known from unknown classes. In OSR, the overconfidence problem has become a persistent obstacle, where visual recognition models often misclassify unknown objects as known objects with high confidence. This issue stems from the fact that visual recognition models often lack the integration of common-sense knowledge, a feature that is naturally present in language-based models but lacking in visual recognition systems. In this paper, we propose a novel approach to enhance OSR performance by distilling common-sense knowledge into visual prompts. Utilizing text prompts that embody common-sense knowledge about known classes, the proposed visual prompt is learned by extracting semantic common-sense features and aligning them with image features from visual recognition models. The unique aspect of this work is the training of individual visual prompts for each class to encapsulate this common-sense knowledge. Our methodology is model-agnostic, capable of enhancing OSR across various visual recognition models, and computationally light as it focuses solely on training the visual prompts. This research introduces a method for addressing OSR, aiming at a more systematic integration of visual recognition systems with common-sense knowledge. The obtained results indicate an enhancement in recognition accuracy, suggesting the applicability of this approach in practical settings. \ No newline at end of file diff --git a/data/2024/aaai/Improving Open-Domain Dialogue Response Generation with Multi-Source Multilingual Commonsense Knowledge b/data/2024/aaai/Improving Open-Domain Dialogue Response Generation with Multi-Source Multilingual Commonsense Knowledge new file mode 100644 index 0000000000..3a82b5ba3f --- /dev/null +++ b/data/2024/aaai/Improving Open-Domain Dialogue Response Generation with Multi-Source Multilingual Commonsense Knowledge @@ -0,0 +1 @@ +Knowledge-grounded Dialogue Response Generation (KRG) can facilitate informative and fidelity dialogues using external knowledge. Prior monolingual works can only use the knowledge of the corresponding native language. Thus, due to the prohibitive costs of collecting and constructing external knowledge bases, the limited scale of accessible external knowledge always constrains the ability of KRG, especially in low-resource language scenarios. To this end, we propose a new task, Multi-Source Multilingual Knowledge-Grounded Response Generation (MMKRG), which simultaneously uses multiple knowledge sources of different languages. We notice that simply combining knowledge of different languages is inefficient due to the Cross-Conflict issue and Cross-Repetition issue. Thus, we propose a novel approach MMK-BART, which uses a simple but elegant Estimate-Cluster-Penalize mechanism to overcome the mentioned issues and adopts the multilingual language model mBART as the backbone. 
Meanwhile, based on the recent multilingual corpus XDailyDialog, we propose an MMKRG dataset MMK-DailyDialog, which has been aligned to the large-scale multilingual commonsense knowledge base ConceptNet and supports four languages (English, Chinese, German, and Italian). Extensive experiments have verified the effectiveness of our dataset and approach in monolingual, cross-lingual, and multilingual scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation b/data/2024/aaai/Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation new file mode 100644 index 0000000000..6008721f64 --- /dev/null +++ b/data/2024/aaai/Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation @@ -0,0 +1 @@ +Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have demonstrated significant potential. These methods predict pixel-level masks by directly matching pixels and phrases. However, they often neglect the modeling of semantic and visual relationships between phrase-level instances, limiting their ability for complex multi-modal reasoning in PNG. To tackle this issue, we propose XPNG, a “differentiation-refinement-localization” reasoning paradigm for accurately locating instances or regions. In XPNG, we introduce a Semantic Context Convolution (SCC) module to leverage semantic priors for generating distinctive features. This well-crafted module employs a combination of dynamic channel-wise convolution and pixel-wise convolution to embed semantic information and establish inter-object relationships guided by semantics. Subsequently, we propose a Visual Context Verification (VCV) module to provide visual cues, eliminating potential space biases introduced by semantics and further refining the visual features generated by the previous module. Extensive experiments on PNG benchmark datasets reveal that our approach achieves state-of-the-art performance, significantly outperforming existing methods by a considerable margin and yielding a 3.9-point improvement in overall metrics. Our codes and results are available at our project webpage: https://github.com/TianyuGoGO/XPNG. \ No newline at end of file diff --git a/data/2024/aaai/Improving Robustness for Joint Optimization of Camera Pose and Decomposed Low-Rank Tensorial Radiance Fields b/data/2024/aaai/Improving Robustness for Joint Optimization of Camera Pose and Decomposed Low-Rank Tensorial Radiance Fields new file mode 100644 index 0000000000..4cabc8a518 --- /dev/null +++ b/data/2024/aaai/Improving Robustness for Joint Optimization of Camera Pose and Decomposed Low-Rank Tensorial Radiance Fields @@ -0,0 +1,13 @@ +In this paper, we propose an algorithm that allows joint refinement of camera pose and scene geometry represented by decomposed low-rank tensor, using only 2D images as supervision. + +First, we conduct a pilot study based on a 1D signal and relate our findings to 3D scenarios, where the naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. + +Moreover, based on the analysis of the frequency spectrum, we propose to apply convolutional Gaussian filters on 2D and 3D radiance fields for a coarse-to-fine training schedule that enables joint camera pose optimization. 
+ +Leveraging the decomposition property in decomposed low-rank tensor, our method achieves an equivalent effect to brute-force 3D convolution with only incurring little computational overhead. + +To further improve the robustness and stability of joint optimization, we also propose techniques of smoothed 2D supervision, randomly scaled kernel parameters, and edge-guided loss mask. + +Extensive quantitative and qualitative evaluations demonstrate that our proposed framework achieves superior performance in novel view synthesis as well as rapid convergence for optimization. + +The source code is available at https://github.com/Nemo1999/Joint-TensoRF. \ No newline at end of file diff --git a/data/2024/aaai/Improving the Adversarial Transferability of Vision Transformers with Virtual Dense Connection b/data/2024/aaai/Improving the Adversarial Transferability of Vision Transformers with Virtual Dense Connection new file mode 100644 index 0000000000..c4e526ced7 --- /dev/null +++ b/data/2024/aaai/Improving the Adversarial Transferability of Vision Transformers with Virtual Dense Connection @@ -0,0 +1 @@ +With the great achievement of vision transformers (ViTs), transformer-based approaches have become the new paradigm for solving various computer vision tasks. However, recent research shows that similar to convolutional neural networks (CNNs), ViTs are still vulnerable to adversarial attacks. To explore the shared deficiency of models with different structures, researchers begin to analyze the cross-structure adversarial transferability, which is still under-explored. Therefore, in this work, we focus on the ViT attacks to improve the cross-structure transferability between the transformer-based and convolution-based models. Previous studies fail to thoroughly investigate the influence of the components inside the ViT models on adversarial transferability, leading to inferior performance. To overcome the drawback, we launch a motivating study by linearly down-scaling the gradients of components inside the ViT models to analyze their influence on adversarial transferability. Based on the motivating study, we find that the gradient of the skip connection most influences transferability and believe that back-propagating gradients from deeper blocks can enhance transferability. Therefore, we propose the Virtual Dense Connection method (VDC). Specifically, without changing the forward pass, we first recompose the original network to add virtual dense connections. Then we back-propagate gradients of deeper Attention maps and Multi-layer Perceptron (MLP) blocks via virtual dense connections when generating adversarial samples. Extensive experiments confirm the superiority of our proposed method over the state-of-the-art baselines, with an 8.2% improvement in transferability between ViT models and a 7.2% improvement in cross-structure transferability from ViTs to CNNs. \ No newline at end of file diff --git a/data/2024/aaai/Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning b/data/2024/aaai/Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning new file mode 100644 index 0000000000..4b4e1a3b76 --- /dev/null +++ b/data/2024/aaai/Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning @@ -0,0 +1 @@ +Knowledge-grounded dialogue (KGD) learns to generate an informative response based on a given dialogue context and external knowledge (e.g., knowledge graphs; KGs). 
Recently, the emergence of large language models (LLMs) and pre-training techniques has brought great success to knowledge-grounded dialogue. However, when building KGD systems in real applications, various real-world noises are inevitable. For example, the dialogue context might involve perturbations such as misspellings and abbreviations. In addition, KGs typically suffer from incompleteness and might also contain erroneous and outdated facts. Such real-world noises pose a challenge to the robustness of KGD systems and hinder their applications in the real world. In this paper, we propose an entity-based contrastive learning framework for improving the robustness of KGD. Specifically, we make use of the entity information in a KGD sample to create both its positive and negative samples, which involve semantically irrelevant and semantically relevant perturbations, respectively. The contrastive learning framework ensures the KGD model is aware of these two types of perturbations and thus can generate informative responses from potentially noisy inputs in real applications. Experimental results on three widely-used benchmark datasets show that our method achieves new state-of-the-art performance in terms of automatic evaluation scores, verifying its effectiveness and potential. Furthermore, we show that our method is able to generate better responses than comparison models in both the noisy and the few-shot settings. \ No newline at end of file diff --git a/data/2024/aaai/In-Hand 3D Object Reconstruction from a Monocular RGB Video b/data/2024/aaai/In-Hand 3D Object Reconstruction from a Monocular RGB Video new file mode 100644 index 0000000000..7e05234a5e --- /dev/null +++ b/data/2024/aaai/In-Hand 3D Object Reconstruction from a Monocular RGB Video @@ -0,0 +1 @@ +Our work aims to reconstruct a 3D object that is held and rotated by a hand in front of a static RGB camera. Previous methods that use implicit neural representations to recover the geometry of a generic hand-held object from multi-view images achieved compelling results in the visible part of the object. However, these methods falter in accurately capturing the shape within the hand-object contact region due to occlusion. In this paper, we propose a novel method that deals with surface reconstruction under occlusion by incorporating priors of 2D occlusion elucidation and physical contact constraints. For the former, we introduce an object amodal completion network to infer the 2D complete mask of objects under occlusion. To ensure the accuracy and view consistency of the predicted 2D amodal masks, we devise a joint optimization method for both amodal mask refinement and 3D reconstruction. For the latter, we impose penetration and attraction constraints on the local geometry in contact regions. We evaluate our approach on the HO3D and HOD datasets and demonstrate that it outperforms the state-of-the-art methods in terms of reconstruction surface quality, with an improvement of 52% on HO3D and 20% on HOD. Project webpage: https://east-j.github.io/ihor.
\ No newline at end of file diff --git a/data/2024/aaai/IncepSeqNet: Advancing Signal Classification with Multi-Shape Augmentation (Student Abstract) b/data/2024/aaai/IncepSeqNet: Advancing Signal Classification with Multi-Shape Augmentation (Student Abstract) new file mode 100644 index 0000000000..7cbe486615 --- /dev/null +++ b/data/2024/aaai/IncepSeqNet: Advancing Signal Classification with Multi-Shape Augmentation (Student Abstract) @@ -0,0 +1 @@ +This work proposes and analyzes IncepSeqNet, a new model combining the Inception Module with the innovative Multi-Shape Augmentation technique. IncepSeqNet excels in feature extraction from sequence signal data consisting of complex numbers to achieve superior classification accuracy across various SNR (Signal-to-Noise Ratio) environments. Experimental results demonstrate that IncepSeqNet outperforms existing models, particularly at low SNR levels. Furthermore, we have confirmed its applicability in practical 5G systems by using real-world signal data. \ No newline at end of file diff --git a/data/2024/aaai/Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding b/data/2024/aaai/Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding new file mode 100644 index 0000000000..043dbc80d7 --- /dev/null +++ b/data/2024/aaai/Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding @@ -0,0 +1 @@ +Incomplete multi-view clustering has become an important research problem, since multi-view data with missing values are ubiquitous in real-world applications. Although great efforts have been made for incomplete multi-view clustering, there are still some challenges: 1) most existing methods do not make full use of multi-view information to deal with missing values; 2) most methods just employ the consistent information within multi-view data but ignore the complementary information; 3) for existing incomplete multi-view clustering methods, incomplete multi-view representation learning and clustering are treated as independent processes, which leads to a performance gap. In this work, we propose a novel Incomplete Contrastive Multi-View Clustering method with high-confidence guiding (ICMVC). Firstly, we propose a multi-view consistency relation transfer plus graph convolutional network to tackle the missing values problem. Secondly, instance-level attention fusion and high-confidence guiding are proposed to exploit the complementary information, while instance-level contrastive learning for latent representation is designed to employ the consistent information. Thirdly, an end-to-end framework is proposed to integrate multi-view missing values handling, multi-view representation learning and clustering assignment for joint optimization. Experiments comparing with state-of-the-art approaches demonstrate the effectiveness and superiority of our method. Our code is publicly available at https://github.com/liunian-Jay/ICMVC. The version with supplementary material can be found at http://arxiv.org/abs/2312.08697.
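For readers unfamiliar with the instance-level contrastive term mentioned above, the following is a minimal InfoNCE-style sketch of a cross-view instance contrastive loss: representations of the same instance under two views form the positive pair, and the other instances in the batch act as negatives. It illustrates the general mechanism only and is not ICMVC's exact formulation.

import torch
import torch.nn.functional as F

def instance_contrastive_loss(z1, z2, temperature=0.5):
    # z1, z2: (N, d) latent codes of the same N instances under two views
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (N, N) cross-view cosine similarities
    targets = torch.arange(z1.size(0))            # positive pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

z_view1, z_view2 = torch.randn(8, 16), torch.randn(8, 16)
print(instance_contrastive_loss(z_view1, z_view2).item())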
\ No newline at end of file diff --git a/data/2024/aaai/Inconsistency-Based Data-Centric Active Open-Set Annotation b/data/2024/aaai/Inconsistency-Based Data-Centric Active Open-Set Annotation new file mode 100644 index 0000000000..7b6488a37e --- /dev/null +++ b/data/2024/aaai/Inconsistency-Based Data-Centric Active Open-Set Annotation @@ -0,0 +1 @@ +Active learning, a method to reduce labeling effort for training deep neural networks, is often limited by the assumption that all unlabeled data belong to known classes. This closed-world assumption fails in practical scenarios with unknown classes in the data, leading to active open-set annotation challenges. Existing methods struggle with this uncertainty. We introduce NEAT, a novel, computationally efficient, data-centric active learning approach for open-set data. NEAT differentiates and labels known classes from a mix of known and unknown classes, using a clusterability criterion and a consistency measure that detects inconsistencies between model predictions and the feature distribution. In contrast to recent learning-centric solutions, NEAT shows superior performance in active open-set annotation, as our experiments confirm. Additional details on the further evaluation metrics, implementation, and architecture of our method can be found in the public document at https://arxiv.org/pdf/2401.04923.pdf. \ No newline at end of file diff --git a/data/2024/aaai/Incorporating Serverless Computing into P2P Networks for ML Training: In-Database Tasks and Their Scalability Implications (Student Abstract) b/data/2024/aaai/Incorporating Serverless Computing into P2P Networks for ML Training: In-Database Tasks and Their Scalability Implications (Student Abstract) new file mode 100644 index 0000000000..7a8a5ecc30 --- /dev/null +++ b/data/2024/aaai/Incorporating Serverless Computing into P2P Networks for ML Training: In-Database Tasks and Their Scalability Implications (Student Abstract) @@ -0,0 +1 @@ +Distributed ML addresses challenges from increasing data and model complexities. Peer-to-peer (P2P) networks in distributed ML offer scalability and fault tolerance. However, they also encounter challenges related to resource consumption and communication overhead as the number of participating peers grows. This research introduces a novel architecture that combines serverless computing with P2P networks for distributed training. Serverless computing enhances this model with parallel processing and cost-effective scalability, suitable for resource-intensive tasks. Preliminary results show that peers can offload expensive computational tasks to serverless platforms. However, their inherent statelessness necessitates strong communication methods, suggesting a pivotal role for databases. To this end, we have enhanced an in-memory database to support ML training tasks. \ No newline at end of file diff --git a/data/2024/aaai/Independence of Irrelevant Alternatives under the Lens of Pairwise Distortion b/data/2024/aaai/Independence of Irrelevant Alternatives under the Lens of Pairwise Distortion new file mode 100644 index 0000000000..7dcdac4731 --- /dev/null +++ b/data/2024/aaai/Independence of Irrelevant Alternatives under the Lens of Pairwise Distortion @@ -0,0 +1,2 @@ +We give a quantitative analysis of the independence of irrelevant alternatives (IIA) axiom.
IIA says that the society's preference between x and y should depend only on individual preferences between x and y: we show that, in several contexts, if the individuals express their preferences about additional (``irrelevant'') alternatives, this information helps to better estimate which of x and y has higher social welfare. +Our contribution is threefold: (1) we provide a new tool to measure the impact of IIA on social welfare (pairwise distortion), based on the well-established notion of voting distortion, (2) we study the average impact of IIA in both general and metric settings, with experiments on synthetic and real data, and (3) we study the worst-case impact of IIA in the 1D-Euclidean metric space. \ No newline at end of file diff --git a/data/2024/aaai/Independency Adversarial Learning for Cross-Modal Sound Separation b/data/2024/aaai/Independency Adversarial Learning for Cross-Modal Sound Separation new file mode 100644 index 0000000000..e8b604fc4c --- /dev/null +++ b/data/2024/aaai/Independency Adversarial Learning for Cross-Modal Sound Separation @@ -0,0 +1 @@ +Sound mixture separation is still challenging due to heavy sound overlapping and disturbance from noise. Unsupervised separation further increases the difficulty. As sound overlapping always hinders accurate sound separation, we propose an Independency Adversarial Learning based Cross-Modal Sound Separation (IAL-CMS) approach, where IAL employs adversarial learning to minimize the correlation of separated sound elements, exploring high sound independence; CMS performs cross-modal sound separation, incorporating audio-visual consistent feature learning and interactive cross-attention learning to emphasize the semantic consistency among cross-modal features. Both audio-visual consistency and audio consistency are kept to guarantee accurate separation. The consistency and sound independence ensure the decomposition of overlapping mixtures into unrelated and distinguishable sound elements. The proposed approach is evaluated on MUSIC, VGGSound, and AudioSet. Extensive experiments certify that our approach outperforms existing approaches in supervised and unsupervised scenarios. \ No newline at end of file diff --git a/data/2024/aaai/IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context b/data/2024/aaai/IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context new file mode 100644 index 0000000000..7bcd503c58 --- /dev/null +++ b/data/2024/aaai/IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context @@ -0,0 +1,17 @@ +Hate speech (HS) is a growing concern in many parts of +the world, including India, where it has led to numerous instances of violence and discrimination. The development of +effective counter-narratives (CNs) is a critical step in combating hate speech, but there is a lack of research in this +area, especially in non-English languages. In this paper, we +introduce a new dataset, IndicCONAN, of counter-narratives +against hate speech in Hindi and Indian English. We propose a scalable human-in-the-loop approach for generating counter-narratives by an auto-regressive language model +through a machine generation - human correction cycle, where +the model uses augmented data from previous cycles to generate new training samples. These newly generated samples +are then reviewed and edited by annotators, leading to further +model refinement.
The dataset consists of over 2,500 examples of counter-narratives each in both English and Hindi corresponding to various hate speeches in the Indian context. We +also present a framework for generating CNs conditioned on +a specific CN type with a mean perplexity of 3.85 for English +and 3.70 for Hindi, a mean toxicity score of 0.04 for English +and 0.06 for Hindi, and a mean diversity of 0.08 for English +and 0.14 for Hindi. Our dataset and framework provide valuable resources for researchers and practitioners working to +combat hate speech in the Indian context. \ No newline at end of file diff --git a/data/2024/aaai/Inducing Clusters Deep Kernel Gaussian Process for Longitudinal Data b/data/2024/aaai/Inducing Clusters Deep Kernel Gaussian Process for Longitudinal Data new file mode 100644 index 0000000000..36443b8583 --- /dev/null +++ b/data/2024/aaai/Inducing Clusters Deep Kernel Gaussian Process for Longitudinal Data @@ -0,0 +1 @@ +We consider the problem of predictive modeling from irregularly and sparsely sampled longitudinal data with unknown, complex correlation structures and abrupt discontinuities. To address these challenges, we introduce a novel inducing clusters longitudinal deep kernel Gaussian Process (ICDKGP). ICDKGP approximates the data generating process by a zero-mean GP with a longitudinal deep kernel that models the unknown complex correlation structure in the data and a deterministic non-zero mean function to model the abrupt discontinuities. To improve the scalability and interpretability of ICDKGP, we introduce inducing clusters corresponding to centers of clusters in the training data. We formulate the training of ICDKGP as a constrained optimization problem and derive its evidence lower bound. We introduce a novel relaxation of the resulting problem which under rather mild assumptions yields a solution with error bounded relative to the original problem. We describe the results of extensive experiments demonstrating that ICDKGP substantially outperforms the state-of-the-art longitudinal methods on data with both smoothly and non-smoothly varying outcomes. \ No newline at end of file diff --git a/data/2024/aaai/Inertial Algorithm with Dry Fraction and Convolutional Sparse Coding for 3D Localization with Light Field Microscopy b/data/2024/aaai/Inertial Algorithm with Dry Fraction and Convolutional Sparse Coding for 3D Localization with Light Field Microscopy new file mode 100644 index 0000000000..35f207c7fc --- /dev/null +++ b/data/2024/aaai/Inertial Algorithm with Dry Fraction and Convolutional Sparse Coding for 3D Localization with Light Field Microscopy @@ -0,0 +1 @@ +Light field microscopy is a high-speed 3D imaging technique that records the light field from multiple angles by the microlens array (MLA), thus allowing us to obtain information about the light source from a single image only. For the fundamental problem of neuron localization, we improve the method of combining a depth-dependent dictionary with sparse coding in this paper. In order to obtain higher localization accuracy and good noise immunity, we propose an inertial proximal gradient acceleration algorithm with dry friction, Fast-IPGDF. By avoiding falling into a local minimum, our algorithm achieves better convergence and converges quite fast, which improves the speed and accuracy of obtaining the localization of the light source based on the matching depth of epipolar plane images (EPI).
We demonstrate the effectiveness of the algorithm for localizing non-scattered fluorescent beads in both noisy and non-noisy environments. The experimental results show that our method can achieve simultaneous localization of multiple point sources and effective localization in noisy environments. Compared to existing studies, our method shows significant improvements in both localization accuracy and speed. \ No newline at end of file diff --git a/data/2024/aaai/Inference and Learning in Dynamic Decision Networks Using Knowledge Compilation b/data/2024/aaai/Inference and Learning in Dynamic Decision Networks Using Knowledge Compilation new file mode 100644 index 0000000000..039fd9d41a --- /dev/null +++ b/data/2024/aaai/Inference and Learning in Dynamic Decision Networks Using Knowledge Compilation @@ -0,0 +1 @@ +Decision making under uncertainty in dynamic environments is a fundamental AI problem in which agents need to determine which decisions (or actions) to make at each time step to maximise their expected utility. Dynamic decision networks (DDNs) are an extension of dynamic Bayesian networks with decisions and utilities. DDNs can be used to compactly represent Markov decision processes (MDPs). We propose a novel algorithm called mapl-cirup that leverages knowledge compilation techniques developed for (dynamic) Bayesian networks to perform inference and gradient-based learning in DDNs. Specifically, we knowledge-compile the Bellman update present in DDNs into dynamic decision circuits and evaluate them within an (algebraic) model counting framework. In contrast to other exact symbolic MDP approaches, we obtain differentiable circuits that enable gradient-based parameter learning. \ No newline at end of file diff --git a/data/2024/aaai/Influential Exemplar Replay for Incremental Learning in Recommender Systems b/data/2024/aaai/Influential Exemplar Replay for Incremental Learning in Recommender Systems new file mode 100644 index 0000000000..0d845558b7 --- /dev/null +++ b/data/2024/aaai/Influential Exemplar Replay for Incremental Learning in Recommender Systems @@ -0,0 +1 @@ +Personalized recommender systems have found widespread applications for effective information filtering. Conventional models engage in knowledge mining within the static setting to reconstruct singular historical data. Nonetheless, the dynamics of real-world environments are in a constant state of flux, rendering acquired model knowledge inadequate for accommodating emergent trends and thus leading to notable recommendation performance decline. Given the typically prohibitive cost of exhaustive model retraining, it has emerged to study incremental learning for recommender systems with ever-growing data. In this paper, we propose an effective model-agnostic framework, namely INFluential Exemplar Replay (INFER). INFER facilitates recommender models in retaining the earlier assimilated knowledge, e.g., users' enduring preferences, while concurrently accommodating evolving trends manifested in users' new interaction behaviors. We commence with a vanilla implementation that centers on identifying the most representative data samples for effective consolidation of early knowledge. Subsequently, we propose an advanced solution, namely INFERONCE, to optimize the computational overhead associated with the vanilla implementation. 
Extensive experiments on four prototypical backbone models, two classic recommendation tasks, and four widely used benchmarks consistently demonstrate the effectiveness of our method as well as its compatibility for extending to several incremental recommender models. \ No newline at end of file diff --git a/data/2024/aaai/Information Design for Congestion Games with Unknown Demand b/data/2024/aaai/Information Design for Congestion Games with Unknown Demand new file mode 100644 index 0000000000..ddd1608c4d --- /dev/null +++ b/data/2024/aaai/Information Design for Congestion Games with Unknown Demand @@ -0,0 +1,4 @@ +We study a novel approach to information design in the standard traffic model of network congestion games. It captures the natural condition that the demand is unknown to the users of the network. A principal (e.g., a mobility service) commits to a signaling strategy, observes the realized demand and sends a (public) signal to agents (i.e., users of the network). Based on the induced belief about the demand, the users then form an equilibrium. We consider the algorithmic goal of the principal: Compute a signaling scheme that minimizes the expected total cost of the induced equilibrium. We concentrate on single-commodity networks and affine cost functions, for which we obtain the following results. + +First, we devise a fully polynomial-time approximation scheme (FPTAS) for the case that the demand can only take two values. It relies on several structural properties of the cost of the induced equilibrium as a function of the updated belief about the distribution of demands. We show that this function is piecewise linear for any number of demands, and monotonic for two demands. +Second, we give a complete characterization of the graph structures for which it is optimal to fully reveal the information about the realized demand. This signaling scheme turns out to be optimal for all cost functions and probability distributions over demands if and only if the graph is series-parallel. Third, we propose an algorithm that computes the optimal signaling scheme for any number of demands whose time complexity is polynomial in the number of supports that occur in a Wardrop equilibrium for some demand. Finally, we conduct a computational study that tests this algorithm on real-world instances. \ No newline at end of file diff --git a/data/2024/aaai/Input Margins Can Predict Generalization Too b/data/2024/aaai/Input Margins Can Predict Generalization Too new file mode 100644 index 0000000000..2a4f5c1a18 --- /dev/null +++ b/data/2024/aaai/Input Margins Can Predict Generalization Too @@ -0,0 +1,2 @@ +Understanding generalization in deep neural networks is an active area of research. A promising avenue of exploration has been that of margin measurements: the shortest distance to the decision boundary for a given sample or its representation internal to the network. While margins have been shown to be correlated with the generalization ability of a model when measured at its hidden representations (hidden margins), no such link between large margins and generalization has been established for input margins. We show that while input margins are not generally predictive of generalization, they can be if the search space is appropriately constrained. +We develop such a measure based on input margins, which we refer to as 'constrained margins'. 
The predictive power of this new measure is demonstrated on the 'Predicting Generalization in Deep Learning' (PGDL) dataset and contrasted with hidden representation margins. We find that constrained margins achieve highly competitive scores and outperform other margin measurements in general. This provides a novel insight on the relationship between generalization and classification margins, and highlights the importance of considering the data manifold for investigations of generalization in DNNs. \ No newline at end of file diff --git a/data/2024/aaai/Inspecting Prediction Confidence for Detecting Black-Box Backdoor Attacks b/data/2024/aaai/Inspecting Prediction Confidence for Detecting Black-Box Backdoor Attacks new file mode 100644 index 0000000000..6ef6cc18ae --- /dev/null +++ b/data/2024/aaai/Inspecting Prediction Confidence for Detecting Black-Box Backdoor Attacks @@ -0,0 +1,2 @@ +Backdoor attacks have been shown to be a serious security threat against deep learning models, and various defenses have been proposed to detect whether a model is backdoored or not. However, as indicated by a recent black-box attack, existing defenses can be easily bypassed by implanting the backdoor in the frequency domain. +To this end, we propose a new defense DTInspector against black-box backdoor attacks, based on a new observation related to the prediction confidence of learning models. That is, to achieve a high attack success rate with a small amount of poisoned data, backdoor attacks usually render a model exhibiting statistically higher prediction confidences on the poisoned samples. We provide both theoretical and empirical evidence for the generality of this observation. DTInspector then carefully examines the prediction confidences of data samples, and decides the existence of backdoor using the shortcut nature of backdoor triggers. Extensive evaluations on six backdoor attacks, four datasets, and three advanced attacking types demonstrate the effectiveness of the proposed defense. \ No newline at end of file diff --git a/data/2024/aaai/Instance-Aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning b/data/2024/aaai/Instance-Aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning new file mode 100644 index 0000000000..432eb7be00 --- /dev/null +++ b/data/2024/aaai/Instance-Aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning @@ -0,0 +1 @@ +Camera-based bird-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field. Under such a paradigm, accurate BEV representation construction relies on reliable depth estimation for multi-camera images. However, existing approaches exhaustively predict depths for every pixel without prioritizing objects, which are precisely the entities requiring detection in the 3D space. To this end, we propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector. First, a category-specific structural priors mining approach is proposed for enhancing the efficacy of monocular depth generation. Besides, a self-boosting learning strategy is further proposed to encourage the model to place more emphasis on challenging objects in computation-expensive temporal stereo matching. Together they provide advanced depth estimation results for high-quality BEV features construction, benefiting the ultimate 3D detection. 
The proposed method achieves state-of-the-art performance on the challenging nuScenes benchmark, and extensive experimental results demonstrate the effectiveness of our designs. \ No newline at end of file diff --git a/data/2024/aaai/Instance-Conditional Timescales of Decay for Non-Stationary Learning b/data/2024/aaai/Instance-Conditional Timescales of Decay for Non-Stationary Learning new file mode 100644 index 0000000000..9377a9f79d --- /dev/null +++ b/data/2024/aaai/Instance-Conditional Timescales of Decay for Non-Stationary Learning @@ -0,0 +1 @@ +Slow concept drift is a ubiquitous, yet under-studied problem in practical machine learning systems. In such settings, although recent data is more indicative of future data, naively prioritizing recent instances runs the risk of losing valuable information from the past. We propose an optimization-driven approach towards balancing instance importance over large training windows. First, we model instance relevance using a mixture of multiple timescales of decay, allowing us to capture rich temporal trends. Second, we learn an auxiliary scorer model that recovers the appropriate mixture of timescales as a function of the instance itself. Finally, we propose a nested optimization objective for learning the scorer, by which it maximizes forward transfer for the learned model. Experiments on a large real-world dataset of 39M photos over a 9-year period show up to 15% relative gains in accuracy compared to other robust learning baselines. We replicate our gains on two collections of real-world datasets for non-stationary learning, and extend our work to continual learning settings where, too, we beat SOTA methods by large margins. \ No newline at end of file diff --git a/data/2024/aaai/Instance-Wise Laplace Mechanism via Deep Reinforcement Learning (Student Abstract) b/data/2024/aaai/Instance-Wise Laplace Mechanism via Deep Reinforcement Learning (Student Abstract) new file mode 100644 index 0000000000..090379b2b7 --- /dev/null +++ b/data/2024/aaai/Instance-Wise Laplace Mechanism via Deep Reinforcement Learning (Student Abstract) @@ -0,0 +1,5 @@ +Recent research has shown a growing interest in per-instance differential privacy (pDP), highlighting the fact that each data instance within a dataset may incur distinct levels of privacy loss. +However, conventional additive noise mechanisms apply identical noise to all query outputs, thereby deteriorating data statistics. +In this study, we propose an instance-wise Laplace mechanism, which adds non-identical Laplace noises to the query output for each data instance. +A challenge arises from the complex interaction of additive noise, where the noise introduced to individual instances impacts the pDP of other instances, adding complexity and rendering straightforward solutions ineffective. +To tackle this problem, we introduce an instance-wise Laplace mechanism algorithm via deep reinforcement learning and validate its ability to better preserve data statistics on a real dataset, compared to the original Laplace mechanism.
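For context on the comparison above, the conventional Laplace mechanism used as the baseline adds identically distributed noise with scale sensitivity/epsilon to every query output; the sketch below shows that baseline together with a hypothetical per-instance variant in which the single scale is replaced by instance-specific scales. The per-instance scales here are arbitrary placeholders for illustration, not the paper's learned policy.

import numpy as np

def laplace_mechanism(query_outputs, sensitivity, epsilon, rng=np.random.default_rng(0)):
    scale = sensitivity / epsilon                            # identical noise scale for every output
    return query_outputs + rng.laplace(loc=0.0, scale=scale, size=query_outputs.shape)

counts = np.array([120.0, 35.0, 8.0])                        # toy query outputs
print(laplace_mechanism(counts, sensitivity=1.0, epsilon=0.5))

# Hypothetical instance-wise variant: one noise scale per output instead of a shared one.
per_instance_scales = np.array([1.0, 2.0, 4.0])
print(counts + np.random.default_rng(1).laplace(0.0, per_instance_scales, counts.shape))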
\ No newline at end of file diff --git a/data/2024/aaai/Integer Is Enough: When Vertical Federated Learning Meets Rounding b/data/2024/aaai/Integer Is Enough: When Vertical Federated Learning Meets Rounding new file mode 100644 index 0000000000..623435474d --- /dev/null +++ b/data/2024/aaai/Integer Is Enough: When Vertical Federated Learning Meets Rounding @@ -0,0 +1,7 @@ +Vertical Federated Learning (VFL) is a solution increasingly used by companies with the same user group but differing features, enabling them to collaboratively train a machine learning model. +VFL ensures that clients exchange intermediate results extracted by their local models, without sharing raw data. +However, in practice, VFL encounters several challenges, such as computational and communication overhead, privacy leakage risk, and adversarial attack. +Our study reveals that the usage of floating-point (FP) numbers is a common factor causing these issues, as they can be redundant and contain too much information. +To address this, we propose a new architecture called rounding layer, which converts intermediate results to integers. +Our theoretical analysis and empirical results demonstrate the benefits of the rounding layer in reducing computation and memory overhead, providing privacy protection, preserving model performance, and mitigating adversarial attacks. +We hope this paper inspires further research into novel architectures to address practical issues in VFL. \ No newline at end of file diff --git a/data/2024/aaai/Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision b/data/2024/aaai/Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision new file mode 100644 index 0000000000..511c266066 --- /dev/null +++ b/data/2024/aaai/Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision @@ -0,0 +1 @@ +Attribution algorithms are frequently employed to explain the decisions of neural network models. Integrated Gradients (IG) is an influential attribution method due to its strong axiomatic foundation. The algorithm is based on integrating the gradients along a path from a reference image to the input image. Unfortunately, it can be observed that gradients computed from regions where the output logit changes minimally along the path provide poor explanations for the model decision, which is called the saturation effect problem. In this paper, we propose an attribution algorithm called integrated decision gradients (IDG). The algorithm focuses on integrating gradients from the region of the path where the model makes its decision, i.e., the portion of the path where the output logit rapidly transitions from zero to its final value. This is practically realized by scaling each gradient by the derivative of the output logit with respect to the path. The algorithm thereby provides a principled solution to the saturation problem. Additionally, we minimize the errors within the Riemann sum approximation of the path integral by utilizing non-uniform subdivisions determined by adaptive sampling. In the evaluation on ImageNet, it is demonstrated that IDG outperforms IG, Left-IG, Guided IG, and adversarial gradient integration both qualitatively and quantitatively using standard insertion and deletion metrics across three common models. 
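A minimal sketch of the weighting idea described above: plain Integrated Gradients averages path gradients uniformly, whereas the decision-weighted variant re-scales each step by how fast the output logit changes along the path. The toy one-dimensional model and the finite-difference weighting below are illustrative assumptions, not the authors' implementation.

import numpy as np

def saturating_logit(x):                                     # toy model whose output saturates
    return np.tanh(4.0 * x)

def path_attribution(x, x0=0.0, steps=64, decision_weighted=False):
    alphas = np.linspace(0.0, 1.0, steps)
    points = x0 + alphas * (x - x0)                          # straight path from baseline to input
    grads = 4.0 * (1.0 - np.tanh(4.0 * points) ** 2)         # d(logit)/d(input) at each path point
    if decision_weighted:
        dlogit_dalpha = np.gradient(saturating_logit(points), alphas)
        weights = np.abs(dlogit_dalpha) / np.sum(np.abs(dlogit_dalpha))  # emphasize the decision region
    else:
        weights = np.full(steps, 1.0 / steps)                # plain IG: uniform Riemann weights
    return (x - x0) * np.sum(weights * grads)

print(path_attribution(2.0))                          # IG-style attribution, diluted by the saturated region
print(path_attribution(2.0, decision_weighted=True))  # weight concentrated where the logit actually changes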
\ No newline at end of file diff --git a/data/2024/aaai/Integrated Systems for Computational Scientific Discovery b/data/2024/aaai/Integrated Systems for Computational Scientific Discovery new file mode 100644 index 0000000000..0ef825aa03 --- /dev/null +++ b/data/2024/aaai/Integrated Systems for Computational Scientific Discovery @@ -0,0 +1,7 @@ +This paper poses the challenge of developing and evaluating integrated +systems for computational scientific discovery. We note some distinguishing +characteristics of discovery tasks, examine eight component abilities, +review previous successes at partial integration, and consider hurdles +the AI research community must leap to transform the vision for +integrated discovery into reality. In closing, we discuss promising +scientific domains in which to test such computational artifacts. \ No newline at end of file diff --git a/data/2024/aaai/Integrating Neural Pathways for Learning in Deep Reinforcement Learning Models b/data/2024/aaai/Integrating Neural Pathways for Learning in Deep Reinforcement Learning Models new file mode 100644 index 0000000000..3dd5897684 --- /dev/null +++ b/data/2024/aaai/Integrating Neural Pathways for Learning in Deep Reinforcement Learning Models @@ -0,0 +1 @@ +Considering that the human brain is the most powerful, generalizable, and energy-efficient computer we know of, it makes the most sense to look to neuroscience for ideas regarding deep learning model improvements. I propose one such idea, augmenting a traditional Advantage-Actor-Critic (A2C) model with additional learning signals akin to those in the brain. Pursuing this direction of research should hopefully result in a new reinforcement learning (RL) control paradigm that can learn from fewer examples, train with greater stability, and possibly consume less energy. \ No newline at end of file diff --git a/data/2024/aaai/Intelligent Calibration for Bias Reduction in Sentiment Corpora Annotation Process b/data/2024/aaai/Intelligent Calibration for Bias Reduction in Sentiment Corpora Annotation Process new file mode 100644 index 0000000000..82071d8623 --- /dev/null +++ b/data/2024/aaai/Intelligent Calibration for Bias Reduction in Sentiment Corpora Annotation Process @@ -0,0 +1 @@ +This paper focuses on the inherent anchoring bias present in sequential review-sentiment corpora annotation processes. It proposes employing a limited subset of meticulously chosen reviews at the outset of the process, as a means of calibration, effectively mitigating the phenomenon. Through extensive experimentation, we validate the phenomenon of sentiment bias in the annotation process and show that its magnitude can be influenced by pre-calibration. Furthermore, we show that the choice of the calibration set matters, hence the need for effective guidelines for choosing the reviews to be included in it. A comparison of annotators' performance with the proposed calibration to annotation processes that do not use calibration or use a randomly-picked calibration set reveals that indeed the calibration set picked is highly effective---it manages to substantially reduce the average absolute error compared to the other cases. Furthermore, the proposed selection guidelines are found to be highly robust in picking an effective calibration set also for domains different from the one based on which these rules were extracted.
\ No newline at end of file diff --git a/data/2024/aaai/Intentional Evolutionary Learning for Untrimmed Videos with Long Tail Distribution b/data/2024/aaai/Intentional Evolutionary Learning for Untrimmed Videos with Long Tail Distribution new file mode 100644 index 0000000000..a84c7e994a --- /dev/null +++ b/data/2024/aaai/Intentional Evolutionary Learning for Untrimmed Videos with Long Tail Distribution @@ -0,0 +1 @@ +Human intention understanding in untrimmed videos aims to watch a natural video and predict what the person’s intention is. Currently, exploration of predicting human intentions in untrimmed videos is far from enough. On the one hand, untrimmed videos with mixed actions and backgrounds have a significant long-tail distribution with concept drift characteristics. On the other hand, most methods can only perceive instantaneous intentions, but cannot determine the evolution of intentions. To solve the above challenges, we propose a loss based on Instance Confidence and Class Accuracy (ICCA), which aims to alleviate the prediction bias caused by the long-tail distribution with concept drift characteristics in video streams. In addition, we propose an intention-oriented evolutionary learning method to determine the intention evolution pattern (from what action to what action) and the time of evolution (when the action evolves). We conducted extensive experiments on two untrimmed video datasets (THUMOS14 and ActivityNET v1.3), and our method has achieved excellent results compared to SOTA methods. The code and supplementary materials are available at https://github.com/Jennifer123www/UntrimmedVideo. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Human-Centric Bias Mitigation b/data/2024/aaai/Interactive Human-Centric Bias Mitigation new file mode 100644 index 0000000000..97bf98a8bf --- /dev/null +++ b/data/2024/aaai/Interactive Human-Centric Bias Mitigation @@ -0,0 +1 @@ +Bias mitigation algorithms differ in their definition of bias and how they go about achieving that objective. Bias mitigation algorithms impact different cohorts differently and allowing end users and data scientists to understand the impact of these differences in order to make informed choices is a relatively unexplored domain. This demonstration presents an interactive bias mitigation pipeline that allows users to understand the cohorts impacted by their algorithm choice and provide feedback in order to provide a bias mitigated pipeline that most aligns with their goals. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Hyperparameter Optimization in Multi-Objective Problems via Preference Learning b/data/2024/aaai/Interactive Hyperparameter Optimization in Multi-Objective Problems via Preference Learning new file mode 100644 index 0000000000..2845168559 --- /dev/null +++ b/data/2024/aaai/Interactive Hyperparameter Optimization in Multi-Objective Problems via Preference Learning @@ -0,0 +1,9 @@ +Hyperparameter optimization (HPO) is important to leverage the full potential of machine learning (ML). +In practice, users are often interested in multi-objective (MO) problems, i.e., optimizing potentially conflicting objectives, like accuracy and energy consumption. +To tackle this, the vast majority of MO-ML algorithms return a Pareto front of non-dominated machine learning models to the user. +Optimizing the hyperparameters of such algorithms is non-trivial as evaluating a hyperparameter configuration entails evaluating the quality of the resulting Pareto front. 
+In the literature, there are known indicators that assess the quality of a Pareto front (e.g., hypervolume, R2) by quantifying different properties (e.g., volume, proximity to a reference point). However, choosing the indicator that leads to the desired Pareto front might be a hard task for a user. In this paper, we propose a human-centered interactive HPO approach tailored towards multi-objective ML, leveraging preference learning to extract desiderata from users that guide the optimization. +Instead of relying on the user guessing the most suitable indicator for their needs, our approach automatically learns an appropriate indicator. +Concretely, we leverage pairwise comparisons of distinct Pareto fronts to learn such an appropriate quality indicator. +Then, we optimize the hyperparameters of the underlying MO-ML algorithm towards this learned indicator using a state-of-the-art HPO approach. +In an experimental study targeting the environmental impact of ML, we demonstrate that our approach leads to substantially better Pareto fronts compared to optimizing based on a wrong indicator pre-selected by the user, and performs comparably when an advanced user knows which indicator to pick. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Mars Image Content-Based Search with Interpretable Machine Learning b/data/2024/aaai/Interactive Mars Image Content-Based Search with Interpretable Machine Learning new file mode 100644 index 0000000000..ec9d298f85 --- /dev/null +++ b/data/2024/aaai/Interactive Mars Image Content-Based Search with Interpretable Machine Learning @@ -0,0 +1 @@ +The NASA Planetary Data System (PDS) hosts millions of images of planets, moons, and other bodies collected throughout many missions. The ever-expanding nature of data and user engagement demands an interpretable content classification system to support scientific discovery and individual curiosity. In this paper, we leverage a prototype-based architecture to enable users to understand and validate the evidence used by a classifier trained on images from the Mars Science Laboratory (MSL) Curiosity rover mission. In addition to providing explanations, we investigate the diversity and correctness of evidence used by the content-based classifier. The work presented in this paper will be deployed on the PDS Image Atlas, replacing its non-interpretable counterpart. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Plan Selection Using Linear Temporal Logic, Disjunctive Action Landmarks, and Natural Language Instruction b/data/2024/aaai/Interactive Plan Selection Using Linear Temporal Logic, Disjunctive Action Landmarks, and Natural Language Instruction new file mode 100644 index 0000000000..3e03b77185 --- /dev/null +++ b/data/2024/aaai/Interactive Plan Selection Using Linear Temporal Logic, Disjunctive Action Landmarks, and Natural Language Instruction @@ -0,0 +1 @@ +We present Lemming – a visualization tool for the interactive selection of plans for a given problem, allowing the user to efficiently whittle down the set of plans and select their plan(s) of choice. We demonstrate four different user experiences for this process, three of them based on the principle of using disjunctive action landmarks as guidance to cut down the set of choice points for the user, and one on the use of linear temporal logic (LTL) to impart additional constraints into the plan set using natural language (NL) instruction.
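For concreteness, a natural-language instruction such as "eventually deliver the package and never enter the restricted area" could be turned into an LTL constraint of roughly the following form; the propositions are hypothetical and only illustrate the kind of formula used to prune the plan set:

\varphi \;=\; \Diamond\, \mathit{deliver\_package} \;\wedge\; \Box\, \neg\, \mathit{enter\_restricted\_area}

Only plans whose action sequences satisfy \varphi would remain selectable after such a constraint is imposed.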
\ No newline at end of file diff --git a/data/2024/aaai/Interactive Theorem Provers: Applications in AI, Opportunities, and Challenges b/data/2024/aaai/Interactive Theorem Provers: Applications in AI, Opportunities, and Challenges new file mode 100644 index 0000000000..462b41ba66 --- /dev/null +++ b/data/2024/aaai/Interactive Theorem Provers: Applications in AI, Opportunities, and Challenges @@ -0,0 +1,3 @@ +Interactive theorem provers (ITPs) are computer programs in which axioms and a conjecture are stated in a formal language, and a user provides the ITP with relatively high-level steps of a formal proof for the conjecture. Then, by invoking automated theorem provers, the ITP tries to generate low-level steps that fill the gaps between the steps provided by the user, thus forming a complete formal proof of the conjecture. The ITP also checks the entire formal proof against the axioms, thus confirming the soundness of all derivations in the formal proof. + +In this talk, I will discuss the existing opportunities and potential benefits to applying ITPs to reason about and verify AI concepts, algorithms, and software. I will also discuss the challenges we have to being able to apply ITPs in AI and reap those benefits. I will do so by discussing a number of my previous projects on the application of ITPs to different AI concepts, algorithms, and software systems. These projects span different areas of planning (classical planning, temporal planning, and planning under uncertainty) as well as algorithms with applications in algorithmic game theory, like general graph matching and online matching. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Visual Task Learning for Robots b/data/2024/aaai/Interactive Visual Task Learning for Robots new file mode 100644 index 0000000000..c3971e0bd1 --- /dev/null +++ b/data/2024/aaai/Interactive Visual Task Learning for Robots @@ -0,0 +1,8 @@ +We present a demonstrable framework for robots to learn novel visual concepts and visual tasks via in-situ linguistic interactions with human users. Previous approaches in computer vision have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts along with their attributes and representations to a concept hierarchy. We extend the approaches that focus on learning visual concept hierarchies and take this ability one step further to demonstrate novel task solving on robots along with the learned visual concepts. +To enable a visual concept learner to solve robotics tasks one-shot, we developed two distinct techniques. +Firstly, we propose a novel approach, Hi-Viscont(HIerarchical VISual CONcept learner for Task), which augments information of a novel concept, that is being taught, to its parent nodes within a concept hierarchy. +This information propagation allows all concepts in a hierarchy to update as novel concepts are taught in a continual learning setting. +Secondly, we represent a visual task as a scene graph with language annotations, allowing us to create novel permutations of a demonstrated task zero-shot in-situ. +Combining the two techniques, we present a demonstration on a real robot that learns visual task and concepts in one-shot from in-situ interactions with human users, and generalize to perform a novel visual task of the same type in zero-shot. 
+As shown by the studies in the main conference paper, our system achieves a success rate of 50% on solving the whole task correctly with generalization where the baseline performs at 14% without any ability to generalize to novel tasks and concepts. +We will demonstrate our working interactive learning pipeline at AAAI 2024 in person with our robot and other required hardware. \ No newline at end of file diff --git a/data/2024/aaai/InterpretARA: Enhancing Hybrid Automatic Readability Assessment with Linguistic Feature Interpreter and Contrastive Learning b/data/2024/aaai/InterpretARA: Enhancing Hybrid Automatic Readability Assessment with Linguistic Feature Interpreter and Contrastive Learning new file mode 100644 index 0000000000..c6377592c8 --- /dev/null +++ b/data/2024/aaai/InterpretARA: Enhancing Hybrid Automatic Readability Assessment with Linguistic Feature Interpreter and Contrastive Learning @@ -0,0 +1 @@ +The hybrid automatic readability assessment (ARA) models that combine deep and linguistic features have recently received rising attention due to their impressive performance. However, the utilization of linguistic features is not fully realized, as ARA models frequently concentrate excessively on numerical values of these features, neglecting valuable structural information embedded within them. This leads to limited contribution of linguistic features in these hybrid ARA models, and in some cases, it may even result in counterproductive outcomes. In this paper, we propose a novel hybrid ARA model named InterpretARA through introducing a linguistic interpreter to better comprehend the structural information contained in linguistic features, and leveraging the contrastive learning that enables the model to understand relative difficulty relationships among texts and thus enhances deep representations. Both document-level and segment-level deep representations are extracted and used for the readability assessment. A series of experiments are conducted over four English corpora and one Chinese corpus to demonstrate the effectiveness of the proposed model. Experimental results show that InterpretARA outperforms state-of-the-art models in most corpora, and the introduced linguistic interpreter can provide more useful information than existing ways for ARA. \ No newline at end of file diff --git a/data/2024/aaai/Interpretability Benchmark for Evaluating Spatial Misalignment of Prototypical Parts Explanations b/data/2024/aaai/Interpretability Benchmark for Evaluating Spatial Misalignment of Prototypical Parts Explanations new file mode 100644 index 0000000000..07e8fe2b68 --- /dev/null +++ b/data/2024/aaai/Interpretability Benchmark for Evaluating Spatial Misalignment of Prototypical Parts Explanations @@ -0,0 +1 @@ +Prototypical parts-based networks are becoming increasingly popular due to their faithful self-explanations. However, their similarity maps are calculated in the penultimate network layer. Therefore, the receptive field of the prototype activation region often depends on parts of the image outside this region, which can lead to misleading interpretations. We name this undesired behavior a spatial explanation misalignment and introduce an interpretability benchmark with a set of dedicated metrics for quantifying this phenomenon. In addition, we propose a method for misalignment compensation and apply it to existing state-of-the-art models. 
We show the expressiveness of our benchmark and the effectiveness of the proposed compensation methodology through extensive empirical studies. \ No newline at end of file diff --git a/data/2024/aaai/Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models b/data/2024/aaai/Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models new file mode 100644 index 0000000000..00ba9d339e --- /dev/null +++ b/data/2024/aaai/Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models @@ -0,0 +1 @@ +Many individuals are likely to face a legal dispute at some point in their lives, but their lack of understanding of how to navigate these complex issues often renders them vulnerable. The advancement of natural language processing opens new avenues for bridging this legal literacy gap through the development of automated legal aid systems. However, existing legal question answering (LQA) approaches often suffer from a narrow scope, being either confined to specific legal domains or limited to brief, uninformative responses. In this work, we propose an end-to-end methodology designed to generate long-form answers to any statutory law questions, utilizing a "retrieve-then-read" pipeline. To support this approach, we introduce and release the Long-form Legal Question Answering (LLeQA) dataset, comprising 1,868 expert-annotated legal questions in the French language, complete with detailed answers rooted in pertinent legal provisions. Our experimental results demonstrate promising performance on automatic evaluation metrics, but a qualitative analysis uncovers areas for refinement. As one of the only comprehensive, expert-annotated long-form LQA datasets, LLeQA has the potential to not only accelerate research towards resolving a significant real-world issue, but also act as a rigorous benchmark for evaluating NLP models in specialized domains. We publicly release our code, data, and models. \ No newline at end of file diff --git a/data/2024/aaai/Interpretable3D: An Ad-Hoc Interpretable Classifier for 3D Point Clouds b/data/2024/aaai/Interpretable3D: An Ad-Hoc Interpretable Classifier for 3D Point Clouds new file mode 100644 index 0000000000..a5debb9d55 --- /dev/null +++ b/data/2024/aaai/Interpretable3D: An Ad-Hoc Interpretable Classifier for 3D Point Clouds @@ -0,0 +1 @@ +3D decision-critical tasks urgently require research on explanations to ensure system reliability and transparency. Extensive explanatory research has been conducted on 2D images, but such research is lacking in the 3D field. Furthermore, the existing explanations for 3D models are post-hoc and can be misleading, as they separate explanations from the original model. To address these issues, we propose an ad-hoc interpretable classifier for 3D point clouds (i.e., Interpretable3D). As an intuitive case-based classifier, Interpretable3D can provide reliable ad-hoc explanations without any embarrassing nuances. It allows users to understand how queries are embedded within past observations in prototype sets. Interpretable3D has two iterative training steps: 1) updating one prototype with the mean of the embeddings within the same sub-class in Prototype Estimation, and 2) penalizing or rewarding the estimated prototypes in Prototype Optimization. The mean of embeddings has a clear statistical meaning, i.e., class sub-centers. Moreover, we update prototypes with their most similar observations in the last few epochs. 
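The Prototype Estimation step just described (each prototype updated to the mean of the embeddings of its sub-class, i.e., a class sub-center) reduces to a simple center computation. The sketch below assumes embeddings and sub-class assignments are already available, and the cosine-similarity classification rule is an illustrative assumption rather than the authors' released code.

```python
import numpy as np

def estimate_prototypes(embeddings, subclass_ids, num_prototypes):
    """Prototype Estimation as described above: each prototype becomes the mean
    of the embeddings assigned to its sub-class (a class sub-center).
    embeddings: (N, D) array; subclass_ids: (N,) ints in [0, num_prototypes)."""
    d = embeddings.shape[1]
    prototypes = np.zeros((num_prototypes, d))
    for k in range(num_prototypes):
        members = embeddings[subclass_ids == k]
        if len(members) > 0:
            prototypes[k] = members.mean(axis=0)  # mean embedding of the sub-class
    return prototypes

def classify(query_embedding, prototypes, prototype_labels):
    """Assign the label of the most similar prototype (cosine similarity, assumed)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return prototype_labels[int(np.argmax(p @ q))]
```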
Finally, Interpretable3D classifies new samples according to prototypes. We evaluate the performance of Interpretable3D on four popular point cloud models: DGCNN, PointNet2, PointMLP, and PointNeXt. Our Interpretable3D demonstrates comparable or superior performance compared to softmax-based black-box models in the tasks of 3D shape classification and part segmentation. Our code is released at: github.com/FengZicai/Interpretable3D. \ No newline at end of file diff --git a/data/2024/aaai/Interpreting Temporal Knowledge Graph Reasoning (Student Abstract) b/data/2024/aaai/Interpreting Temporal Knowledge Graph Reasoning (Student Abstract) new file mode 100644 index 0000000000..b97fc7ba5e --- /dev/null +++ b/data/2024/aaai/Interpreting Temporal Knowledge Graph Reasoning (Student Abstract) @@ -0,0 +1 @@ +Temporal knowledge graph reasoning is an essential task that holds immense value in diverse real-world applications. Existing studies mainly focus on leveraging structural and sequential dependencies, excelling in tasks like entity and link prediction. However, they confront a notable interpretability gap in their predictions, a pivotal facet for comprehending model behavior. In this study, we propose an innovative method, LSGAT, which not only exhibits remarkable precision in entity predictions but also enhances interpretability by identifying pivotal historical events influencing event predictions. LSGAT enables concise explanations for prediction outcomes, offering valuable insights into the otherwise enigmatic "black box" reasoning process. Through an exploration of the implications of the most influential events, it facilitates a deeper understanding of the underlying mechanisms governing predictions. \ No newline at end of file diff --git a/data/2024/aaai/Intersection of Artificial Intelligence and Medical Education (Student Abstract) b/data/2024/aaai/Intersection of Artificial Intelligence and Medical Education (Student Abstract) new file mode 100644 index 0000000000..8a07611bdf --- /dev/null +++ b/data/2024/aaai/Intersection of Artificial Intelligence and Medical Education (Student Abstract) @@ -0,0 +1 @@ +Can advanced AI-driven technologies transform the traditionally arduous educational process in medicine? This study takes a deep dive into how the publicly available OpenAI ChatGPT-3.5 performs in answering board-style questions designed for physicians training to become pathologists. Correctly answering 75% of 543 questions using an engaging and fast-paced format was an impressive performance. It underscores the potential as well as improvement opportunities of using interactive AI in future medical training. \ No newline at end of file diff --git a/data/2024/aaai/Intra- and Inter-group Optimal Transport for User-Oriented Fairness in Recommender Systems b/data/2024/aaai/Intra- and Inter-group Optimal Transport for User-Oriented Fairness in Recommender Systems new file mode 100644 index 0000000000..86f688d6e3 --- /dev/null +++ b/data/2024/aaai/Intra- and Inter-group Optimal Transport for User-Oriented Fairness in Recommender Systems @@ -0,0 +1 @@ +Recommender systems are typically biased toward a small group of users, leading to severe unfairness in recommendation performance, i.e., User-Oriented Fairness (UOF) issue. Existing research on UOF exhibits notable limitations in two phases of recommendation models. In the training phase, current methods fail to tackle the root cause of the UOF issue, which lies in the unfair training process between advantaged and disadvantaged users. 
In the evaluation phase, the current UOF metric lacks the ability to comprehensively evaluate varying cases of unfairness. In this paper, we aim to address the aforementioned limitations and ensure recommendation models treat user groups of varying activity levels equally. In the training phase, we propose a novel Intra- and Inter-GrOup Optimal Transport framework (II-GOOT) to alleviate the data sparsity problem for disadvantaged users and narrow the training gap between advantaged and disadvantaged users. In the evaluation phase, we introduce a novel metric called ?-UOF, which enables the identification and assessment of various cases of UOF. This helps prevent recommendation models from leading to unfavorable fairness outcomes, where both advantaged and disadvantaged users experience subpar recommendation performance. We conduct extensive experiments on three real-world datasets based on four backbone recommendation models to prove the effectiveness of ?-UOF and the efficiency of our proposed II-GOOT. \ No newline at end of file diff --git a/data/2024/aaai/Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning b/data/2024/aaai/Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..6b88cf82d1 --- /dev/null +++ b/data/2024/aaai/Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Efficient collaboration in the centralized training with decentralized execution (CTDE) paradigm remains a challenge in cooperative multi-agent systems. We identify divergent action tendencies among agents as a significant obstacle to CTDE's training efficiency, requiring a large number of training samples to achieve a unified consensus on agents' policies. This divergence stems from the lack of adequate team consensus-related guidance signals during credit assignment in CTDE. To address this, we propose Intrinsic Action Tendency Consistency, a novel approach for cooperative multi-agent reinforcement learning. It integrates intrinsic rewards, obtained through an action model, into a reward-additive CTDE (RA-CTDE) framework. We formulate an action model that enables surrounding agents to predict the central agent's action tendency. Leveraging these predictions, we compute a cooperative intrinsic reward that encourages agents to align their actions with their neighbors' predictions. We establish the equivalence between RA-CTDE and CTDE through theoretical analyses, demonstrating that CTDE's training process can be achieved using N individual targets. Building on this insight, we introduce a novel method to combine intrinsic rewards and RA-CTDE. Extensive experiments on challenging tasks in SMAC, MPE, and GRF benchmarks showcase the improved performance of our method. \ No newline at end of file diff --git a/data/2024/aaai/Intrinsic Phase-Preserving Networks for Depth Super Resolution b/data/2024/aaai/Intrinsic Phase-Preserving Networks for Depth Super Resolution new file mode 100644 index 0000000000..6de3979a22 --- /dev/null +++ b/data/2024/aaai/Intrinsic Phase-Preserving Networks for Depth Super Resolution @@ -0,0 +1 @@ +Depth map super-resolution (DSR) plays an indispensable role in 3D vision. 
We discover a non-trivial spectral phenomenon: the components of high-resolution (HR) and low-resolution (LR) depth maps manifest the same intrinsic phase, and the spectral phase of RGB is a superset of them, which suggests that a phase-aware filter can assist in the precise use of RGB cues. Motivated by this, we propose an intrinsic phase-preserving DSR paradigm, named IPPNet, to fully exploit inter-modality collaboration in a mutually guided way. In a nutshell, a novel Phase-Preserving Filtering Module (PPFM) is developed to generate dynamic phase-aware filters according to the LR depth flow to filter out erroneous noisy components contained in RGB and then conduct depth enhancement via the modulation of the phase-preserved RGB signal. By stacking multiple PPFM blocks, the proposed IPPNet is capable of reaching a highly competitive restoration performance. Extensive experiments on various benchmark datasets, e.g., NYU v2 and RGB-D-D, show SOTA performance and also demonstrate the validity of the proposed phase-preserving scheme. Code: https://github.com/neuralchen/IPPNet/. \ No newline at end of file diff --git a/data/2024/aaai/Introduction to the Special Track on Artificial Intelligence and COVID-19 (Abstract Reprint) b/data/2024/aaai/Introduction to the Special Track on Artificial Intelligence and COVID-19 (Abstract Reprint) new file mode 100644 index 0000000000..662c5c154c --- /dev/null +++ b/data/2024/aaai/Introduction to the Special Track on Artificial Intelligence and COVID-19 (Abstract Reprint) @@ -0,0 +1 @@ +The human race is facing one of the most meaningful public health emergencies in the modern era caused by the COVID-19 pandemic. This pandemic introduced various challenges, from lock-downs with significant economic costs to fundamentally altering the way of life for many people around the world. The battle to understand and control the virus is still at its early stages, yet meaningful insights have already been made. The uncertainty of why some patients are infected and experience severe symptoms, while others are infected but asymptomatic, and others are not infected at all, makes managing this pandemic very challenging. Furthermore, the development of treatments and vaccines relies on knowledge generated from an ever evolving and expanding information space. Given the availability of digital data in the modern era, artificial intelligence (AI) is a meaningful tool for addressing the various challenges introduced by this unexpected pandemic. Some of the challenges include: outbreak prediction, risk modeling including infection and symptom development, testing strategy optimization, drug development, treatment repurposing, vaccine development, and others. \ No newline at end of file diff --git a/data/2024/aaai/Invariant Random Forest: Tree-Based Model Solution for OOD Generalization b/data/2024/aaai/Invariant Random Forest: Tree-Based Model Solution for OOD Generalization new file mode 100644 index 0000000000..5732a52fd2 --- /dev/null +++ b/data/2024/aaai/Invariant Random Forest: Tree-Based Model Solution for OOD Generalization @@ -0,0 +1 @@ +Out-Of-Distribution (OOD) generalization is an essential topic in machine learning. However, recent research has focused only on corresponding methods for neural networks. This paper introduces a novel and effective solution for OOD generalization of decision tree models, named Invariant Decision Tree (IDT). 
IDT enforces a penalty term with regard to the unstable/varying behavior of a split across different environments during the growth of the tree. Its ensemble version, the Invariant Random Forest (IRF), is constructed. Our proposed method is motivated by a theoretical result under mild conditions, and validated by numerical tests with both synthetic and real datasets. The superior performance compared to non-OOD tree models implies that considering OOD generalization for tree models is absolutely necessary and should be given more attention. \ No newline at end of file diff --git a/data/2024/aaai/Inverse Weight-Balancing for Deep Long-Tailed Learning b/data/2024/aaai/Inverse Weight-Balancing for Deep Long-Tailed Learning new file mode 100644 index 0000000000..afe98a2804 --- /dev/null +++ b/data/2024/aaai/Inverse Weight-Balancing for Deep Long-Tailed Learning @@ -0,0 +1 @@ +The performance of deep learning models often degrades rapidly when faced with imbalanced data characterized by a long-tailed distribution. Researchers have found that the fully connected layer trained by cross-entropy loss has large weight-norms for classes with many samples, but not for classes with few samples. How to address the data imbalance problem with both the encoder and the classifier remains an under-researched problem. In this paper, we propose an inverse weight-balancing (IWB) approach to guide model training and alleviate the data imbalance problem in two stages. In the first stage, an encoder and classifier (the fully connected layer) are trained using conventional cross-entropy loss. In the second stage, with a fixed encoder, the classifier is finetuned through an adaptive distribution for IWB in the decision space. Unlike existing inverse image frequency methods that implement a multiplicative margin adjustment transformation in the classification layer, our approach can be interpreted as an adaptive distribution alignment strategy using not only the class-wise number distribution but also the sample-wise difficulty distribution in both encoder and classifier. Experiments show that our method can greatly improve performance on imbalanced datasets such as CIFAR100-LT with different imbalance factors, ImageNet-LT, and iNaturalist 2018. \ No newline at end of file diff --git a/data/2024/aaai/Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following b/data/2024/aaai/Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following new file mode 100644 index 0000000000..02e07550c0 --- /dev/null +++ b/data/2024/aaai/Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following @@ -0,0 +1 @@ +In this paper, we present our finding that prepending a Task-Agnostic Prefix Prompt (TAPP) to the input improves the instruction-following ability of various Large Language Models (LLMs) during inference. TAPP is different from canonical prompts for LLMs in that it is a fixed prompt prepended to the beginning of every input regardless of the target task for zero-shot generalization. We observe that both base LLMs (i.e., not fine-tuned to follow instructions) and instruction-tuned models benefit from TAPP, resulting in 34.58% and 12.26% improvement on average, respectively. This implies that the instruction-following ability of LLMs can be improved during inference time with a fixed prompt constructed with simple heuristics. 
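Mechanically, the TAPP idea just described amounts to prepending one fixed, task-agnostic prefix to every input before inference. The prefix text and the generate call below are placeholders for illustration, not the prompt or API used in the paper.

```python
# Minimal sketch of a Task-Agnostic Prefix Prompt (TAPP): the same fixed prefix
# is prepended to every input, regardless of the target task. Prefix wording
# and the model interface are placeholder assumptions.
TAPP_PREFIX = (
    "Below is an instruction that describes a task. "
    "Read it carefully and respond by following the instruction exactly.\n\n"
)

def build_prompt(user_input: str) -> str:
    """Prepend the fixed prefix; no task-specific prompt engineering involved."""
    return TAPP_PREFIX + user_input

def answer(model, user_input: str) -> str:
    # `model.generate` stands in for whatever text-generation API is available.
    return model.generate(build_prompt(user_input))
```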
We hypothesize that TAPP helps language models better estimate the output distribution by focusing more on the instruction of the target task during inference. In other words, this ability does not seem to be sufficiently activated not only in base LLMs but also in many instruction-fine-tuned LLMs. \ No newline at end of file diff --git a/data/2024/aaai/Investigation into Training Dynamics of Learned Optimizers (Student Abstract) b/data/2024/aaai/Investigation into Training Dynamics of Learned Optimizers (Student Abstract) new file mode 100644 index 0000000000..d16a8e8a80 --- /dev/null +++ b/data/2024/aaai/Investigation into Training Dynamics of Learned Optimizers (Student Abstract) @@ -0,0 +1 @@ +Modern machine learning heavily relies on optimization, and as deep learning models grow more complex and data-hungry, the search for efficient learning becomes crucial. Learned optimizers disrupt traditional handcrafted methods such as SGD and Adam by learning the optimization strategy itself, potentially speeding up training. However, the learned optimizers' dynamics are still not well understood. To remedy this, our work explores their optimization trajectories from the perspective of network architecture symmetries and proposed parameter update distributions. \ No newline at end of file diff --git a/data/2024/aaai/Invisible Backdoor Attack against 3D Point Cloud Classifier in Graph Spectral Domain b/data/2024/aaai/Invisible Backdoor Attack against 3D Point Cloud Classifier in Graph Spectral Domain new file mode 100644 index 0000000000..026c0ace9d --- /dev/null +++ b/data/2024/aaai/Invisible Backdoor Attack against 3D Point Cloud Classifier in Graph Spectral Domain @@ -0,0 +1 @@ +3D point clouds have been widely used in security-critical domains, such as self-driving and 3D face recognition. Backdoor attacks are a serious threat that usually compromise Deep Neural Networks (DNNs) in the training stage. Though a few 3D backdoor attacks are designed to achieve guaranteed attack efficiency, the deformations they introduce can alert human inspectors. To obtain invisible backdoored point clouds, this paper proposes a novel 3D backdoor attack, named IBAPC, which generates the backdoor trigger in the graph spectral domain. Its effectiveness is grounded in a property of graph spectral signals: they induce both the global structure and local points to share responsibility for the resulting deformation in the spatial domain. In detail, a new backdoor implanting function is proposed that transforms the point cloud into a graph spectral signal in which the backdoor trigger is embedded. Then, we design a backdoor training procedure which alternately updates the parameters of the backdoor implanting function and the victim 3D DNN. Finally, the backdoored 3D DNN and its associated backdoor implanting function are obtained upon finishing the backdoor training procedure. Experimental results suggest that IBAPC achieves SOTA attack stealthiness in three respects: objective distance measurement, subjective human evaluation, and graph spectral signal residual. At the same time, it obtains competitive attack efficiency. The code is available at https://github.com/f-lk/IBAPC. \ No newline at end of file diff --git a/data/2024/aaai/Is a Large Language Model a Good Annotator for Event Extraction? b/data/2024/aaai/Is a Large Language Model a Good Annotator for Event Extraction? new file mode 100644 index 0000000000..f2b21e6549 --- /dev/null +++ b/data/2024/aaai/Is a Large Language Model a Good Annotator for Event Extraction? 
@@ -0,0 +1 @@ +Event extraction is an important task in natural language processing that focuses on mining event-related information from unstructured text. Despite considerable advancements, it is still challenging to achieve satisfactory performance in this task, and issues like data scarcity and imbalance obstruct progress. In this paper, we introduce an innovative approach where we employ Large Language Models (LLMs) as expert annotators for event extraction. We strategically include sample data from the training dataset in the prompt as a reference, ensuring alignment between the data distribution of LLM-generated samples and that of the benchmark dataset. This enables us to craft an augmented dataset that complements existing benchmarks, alleviating the challenges of data imbalance and scarcity and thereby enhancing the performance of fine-tuned models. We conducted extensive experiments to validate the efficacy of our proposed method, and we believe that this approach holds great potential for propelling the development and application of more advanced and reliable event extraction systems in real-world scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery b/data/2024/aaai/Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery new file mode 100644 index 0000000000..2c1acdbb04 --- /dev/null +++ b/data/2024/aaai/Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery @@ -0,0 +1,3 @@ +Sparse recovery is ubiquitous in machine learning and signal processing. Due to the NP-hard nature of sparse recovery, existing methods are known to suffer either from restrictive (or even unknown) applicability conditions, or high computational cost. Recently, iterative regularization methods have emerged as a promising fast approach because they can achieve sparse recovery in one pass through early stopping, rather than the tedious grid-search used in the traditional methods. +However, most of those iterative methods are based on the l1 norm which requires restrictive applicability conditions and could fail in many cases. Therefore, achieving sparse recovery with iterative regularization methods under a wider range of conditions has yet to be further explored. +To address this issue, we propose a novel iterative regularization algorithm, IRKSN, based on the k-support norm regularizer rather than the l1 norm. We provide conditions for sparse recovery with IRKSN, and compare them with traditional conditions for recovery with l1 norm regularizers. Additionally, we give an early stopping bound on the model error of IRKSN with explicit constants, achieving the standard linear rate for sparse recovery. Finally, we illustrate the applicability of our algorithm on several experiments, including a support recovery experiment with a correlated design matrix. \ No newline at end of file diff --git a/data/2024/aaai/Iterative Token Evaluation and Refinement for Real-World Super-resolution b/data/2024/aaai/Iterative Token Evaluation and Refinement for Real-World Super-resolution new file mode 100644 index 0000000000..65af8593cf --- /dev/null +++ b/data/2024/aaai/Iterative Token Evaluation and Refinement for Real-World Super-resolution @@ -0,0 +1 @@ +Real-world image super-resolution (RWSR) is a long-standing problem as low-quality (LQ) images often have complex and unidentified degradations. 
Existing methods such as Generative Adversarial Networks (GANs) or continuous diffusion models present their own issues including GANs being difficult to train while continuous diffusion models requiring numerous inference steps. In this paper, we propose an Iterative Token Evaluation and Refinement (ITER) framework for RWSR, which utilizes a discrete diffusion model operating in the discrete token representation space, i.e., indexes of features extracted from a VQGAN codebook pre-trained with high-quality (HQ) images. We show that ITER is easier to train than GANs and more efficient than continuous diffusion models. Specifically, we divide RWSR into two sub-tasks, i.e., distortion removal and texture generation. Distortion removal involves simple HQ token prediction with LQ images, while texture generation uses a discrete diffusion model to iteratively refine the distortion removal output with a token refinement network. In particular, we propose to include a token evaluation network in the discrete diffusion process. It learns to evaluate which tokens are good restorations and helps to improve the iterative refinement results. Moreover, the evaluation network can first check status of the distortion removal output and then adaptively select total refinement steps needed, thereby maintaining a good balance between distortion removal and texture generation. Extensive experimental results show that ITER is easy to train and performs well within just 8 iterative steps. \ No newline at end of file diff --git a/data/2024/aaai/JoLT: Jointly Learned Representations of Language and Time-Series for Clinical Time-Series Interpretation (Student Abstract) b/data/2024/aaai/JoLT: Jointly Learned Representations of Language and Time-Series for Clinical Time-Series Interpretation (Student Abstract) new file mode 100644 index 0000000000..78ca869007 --- /dev/null +++ b/data/2024/aaai/JoLT: Jointly Learned Representations of Language and Time-Series for Clinical Time-Series Interpretation (Student Abstract) @@ -0,0 +1 @@ +Time-series and text data are prevalent in healthcare and frequently co-exist, yet they are typically modeled in isolation. Even studies that jointly model time-series and text, do so by converting time-series to images or graphs. We hypothesize that explicitly modeling time-series jointly with text can improve tasks such as summarization and question answering for time-series data, which have received little attention so far. To address this gap, we introduce JoLT to jointly learn desired representations from pre-trained time-series and text models. JoLT utilizes a Querying Transformer (Q-Former) to align the time-series and text representations. Our experiments on a large real-world electrocardiography dataset for medical time-series summarization show that JoLT outperforms state-of-the-art image captioning approaches. \ No newline at end of file diff --git a/data/2024/aaai/Joint Demosaicing and Denoising for Spike Camera b/data/2024/aaai/Joint Demosaicing and Denoising for Spike Camera new file mode 100644 index 0000000000..8394f63e41 --- /dev/null +++ b/data/2024/aaai/Joint Demosaicing and Denoising for Spike Camera @@ -0,0 +1 @@ +As a neuromorphic camera with high temporal resolution, spike camera can capture dynamic scenes with high-speed motion. Recently, spike camera with a color filter array (CFA) has been developed for color imaging. There are some methods for spike camera demosaicing to reconstruct color images from Bayer-pattern spike streams. 
However, the demosaicing results are degraded by severe noise in spike streams, to which previous works have paid little attention. In this paper, we propose an iterative joint demosaicing and denoising network (SJDD-Net) for spike cameras based on the observation model. Firstly, we design a color spike representation (CSR) to learn latent representation from Bayer-pattern spike streams. In CSR, we propose an offset-sharing deformable convolution module to align temporal features of color channels. Then we develop a spike noise estimator (SNE) to obtain features of the noise distribution. Finally, a color correlation prior (CCP) module is proposed to utilize the color correlation for better details. For training and evaluation, we designed a spike camera simulator to generate Bayer-pattern spike streams with synthesized noise. Besides, we captured real Bayer-pattern spike streams, building, to our knowledge, the first real-world captured dataset. Experimental results show that our method can restore clean images from Bayer-pattern spike streams. The source codes and dataset are available at https://github.com/csycdong/SJDD-Net. \ No newline at end of file diff --git a/data/2024/aaai/Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification b/data/2024/aaai/Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification new file mode 100644 index 0000000000..c939b012e5 --- /dev/null +++ b/data/2024/aaai/Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification @@ -0,0 +1 @@ +Determining the types of neurons within a nervous system plays a significant role in the analysis of brain connectomics and the investigation of neurological diseases. However, utilizing anatomical, physiological, or molecular characteristics of neurons is relatively inefficient and costly. With the advancements in electron microscopy imaging and analysis techniques for brain tissue, we are able to obtain whole-brain connectomes consisting of high-resolution neuronal morphology and connectivity information. However, few models are built based on such data for automated neuron classification. In this paper, we propose NeuNet, a framework that combines morphological information of neurons obtained from the skeleton and topological information between neurons obtained from the neural circuit. Specifically, NeuNet consists of three components, namely the Skeleton Encoder, the Connectome Encoder, and the Readout Layer. The Skeleton Encoder integrates the local information of neurons in a bottom-up manner, applying a one-dimensional convolution to the neural skeleton's point data; the Connectome Encoder uses a graph neural network to capture the topological information of the neural circuit; finally, the Readout Layer fuses the two types of information and outputs classification results. We reprocess and release two new datasets for the neuron classification task from volume electron microscopy (VEM) images of the human brain cortex and the Drosophila brain. Experiments on these two datasets demonstrate the effectiveness of our model with accuracies of 0.9169 and 0.9363, respectively. Code and data are available at: https://github.com/WHUminghui/NeuNet. 
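A rough PyTorch sketch of the three components just described (a 1D-convolutional Skeleton Encoder, a graph-based Connectome Encoder, and a Readout Layer that fuses the two). Layer widths, the adjacency-matrix message passing, and the class count are assumptions for illustration, not the released NeuNet code.

```python
import torch
import torch.nn as nn

class SkeletonEncoder(nn.Module):
    """Bottom-up encoding of a neuron's skeleton points with 1D convolutions."""
    def __init__(self, in_dim=3, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.ReLU())

    def forward(self, points):                    # points: (num_neurons, num_points, 3)
        h = self.conv(points.transpose(1, 2))     # (num_neurons, hidden, num_points)
        return h.max(dim=2).values                # permutation-invariant max pool

class ConnectomeEncoder(nn.Module):
    """One round of adjacency-based message passing over the neural circuit."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lin = nn.Linear(hidden, hidden)

    def forward(self, node_feats, adj):           # node_feats: (num_neurons, hidden)
        agg = adj @ node_feats                    # aggregate neighbour features
        return torch.relu(self.lin(agg) + node_feats)

class ReadoutLayer(nn.Module):
    """Fuse skeleton and circuit features and predict a neuron type."""
    def __init__(self, hidden=64, num_classes=10):
        super().__init__()
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, skel_feats, topo_feats):
        return self.head(torch.cat([skel_feats, topo_feats], dim=-1))
```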
\ No newline at end of file diff --git a/data/2024/aaai/Jointly Improving the Sample and Communication Complexities in Decentralized Stochastic Minimax Optimization b/data/2024/aaai/Jointly Improving the Sample and Communication Complexities in Decentralized Stochastic Minimax Optimization new file mode 100644 index 0000000000..54626d7fb0 --- /dev/null +++ b/data/2024/aaai/Jointly Improving the Sample and Communication Complexities in Decentralized Stochastic Minimax Optimization @@ -0,0 +1 @@ +We propose a novel single-loop decentralized algorithm, DGDA-VR, for solving the stochastic nonconvex strongly-concave minimax problems over a connected network of agents, which are equipped with stochastic first-order oracles to estimate their local gradients. DGDA-VR, incorporating variance reduction, achieves O(ε^−3) oracle complexity and O(ε^−2) communication complexity without resorting to multi-communication rounds – both are optimal, i.e., matching the lower bounds for this class of problems. Since DGDA-VR does not require multiple communication rounds, it is applicable to a broader range of decentralized computational environments. To the best of our knowledge, this is the first distributed method using a single communication round in each iteration to jointly optimize the oracle and communication complexities for the problem considered here. \ No newline at end of file diff --git a/data/2024/aaai/Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action Classification b/data/2024/aaai/Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action Classification new file mode 100644 index 0000000000..0eae1aac04 --- /dev/null +++ b/data/2024/aaai/Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action Classification @@ -0,0 +1 @@ +Tactile signals collected by wearable electronics are essential in modeling and understanding human behavior. One of the main applications of tactile signals is action classification, especially in healthcare and robotics. However, existing tactile classification methods fail to capture the spatial and temporal features of tactile signals simultaneously, which results in sub-optimal performances. In this paper, we design Spatio-Temporal Aware tactility Transformer (STAT) to utilize continuous tactile signals for action classification. We propose spatial and temporal embeddings along with a new temporal pretraining task in our model, which aims to enhance the transformer in modeling the spatio-temporal features of tactile signals. Specifically, the designed temporal pretraining task is to differentiate the time order of tubelet inputs to model the temporal properties explicitly. Experimental results on a public action classification dataset demonstrate that our model outperforms state-of-the-art methods in all metrics. \ No newline at end of file diff --git a/data/2024/aaai/Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons b/data/2024/aaai/Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons new file mode 100644 index 0000000000..cd9ca2d5fa --- /dev/null +++ b/data/2024/aaai/Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons @@ -0,0 +1 @@ +Pre-trained language models (PLMs) contain vast amounts of factual knowledge, but how the knowledge is stored in the parameters remains unclear. 
This paper delves into the complex task of understanding how factual knowledge is stored in multilingual PLMs, and introduces the Architecture-adapted Multilingual Integrated Gradients method, which successfully localizes knowledge neurons more precisely compared to current methods, and is more universal across various architectures and languages. Moreover, we conduct an in-depth exploration of knowledge neurons, leading to the following two important discoveries: (1) The discovery of Language-Independent Knowledge Neurons, which store factual knowledge in a form that transcends language. We design cross-lingual knowledge editing experiments, demonstrating that the PLMs can accomplish this task based on language-independent neurons; (2) The discovery of Degenerate Knowledge Neurons, a novel type of neuron showing that different knowledge neurons can store the same fact. Its property of functional overlap endows the PLMs with a robust mastery of factual knowledge. We design fact-checking experiments, proving that the degenerate knowledge neurons can help the PLMs to detect wrong facts. Experiments corroborate these findings, shedding light on the mechanisms of factual knowledge storage in multilingual PLMs, and contribute valuable insights to the field. The code is available at https://github.com/heng840/AMIG. \ No newline at end of file diff --git a/data/2024/aaai/KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning b/data/2024/aaai/KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning new file mode 100644 index 0000000000..8dfc05d160 --- /dev/null +++ b/data/2024/aaai/KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning @@ -0,0 +1 @@ +Large Language Models (LLMs) have demonstrated impressive performance in natural language processing tasks by leveraging chain of thought (CoT) that enables step-by-step thinking. Extending LLMs with multimodal capabilities has recently attracted interest, but it incurs computational costs and requires substantial hardware resources. To address these challenges, we propose KAM-CoT, a framework that integrates CoT reasoning, Knowledge Graphs (KGs), and multiple modalities for a comprehensive understanding of multimodal tasks. KAM-CoT adopts a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains a deeper contextual understanding, reducing hallucinations and enhancing the quality of answers. This knowledge-augmented CoT reasoning empowers the model to handle questions requiring external context, providing more informed answers. Experimental findings show KAM-CoT outperforms the state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness. 
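The KG grounding idea just described can be pictured, very loosely, as retrieving triples related to a question and injecting them into the reasoning context before the chain of thought is generated. The retrieval rule (entity string match) and the prompt layout below are simplifying assumptions, not KAM-CoT's two-stage training pipeline or its multimodal encoder.

```python
# Toy illustration of grounding a question in knowledge-graph triples before
# chain-of-thought reasoning. Retrieval rule and prompt layout are assumptions.
def retrieve_triples(question, kg_triples, limit=5):
    """kg_triples: iterable of (head, relation, tail) strings."""
    q = question.lower()
    hits = [t for t in kg_triples if t[0].lower() in q or t[2].lower() in q]
    return hits[:limit]

def build_grounded_prompt(question, kg_triples):
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in retrieve_triples(question, kg_triples))
    return (f"Facts from the knowledge graph:\n{facts}\n\n"
            f"Question: {question}\n"
            "Let's reason step by step, then give the final answer.")

kg = [("water", "boils at", "100 degrees Celsius"),
      ("copper", "conducts", "electricity")]
print(build_grounded_prompt("Why does copper wire carry current?", kg))
```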
\ No newline at end of file diff --git a/data/2024/aaai/KAMEL: Knowledge Aware Medical Entity Linkage to Automate Health Insurance Claims Processing b/data/2024/aaai/KAMEL: Knowledge Aware Medical Entity Linkage to Automate Health Insurance Claims Processing new file mode 100644 index 0000000000..d8942d7885 --- /dev/null +++ b/data/2024/aaai/KAMEL: Knowledge Aware Medical Entity Linkage to Automate Health Insurance Claims Processing @@ -0,0 +1 @@ +Automating the processing of health insurance claims to achieve "Straight-Through Processing" is one of the holy grails that all insurance companies pursue. One of the major impediments to this automation is the difficulty in establishing the relationship between the underwriting exclusions that a policy has and the incoming claim's diagnosis information. Typically, policy underwriting exclusions are captured in free-text such as "Respiratory illnesses are excluded due to a pre-existing asthma condition". A medical claim coming from a hospital would have the diagnosis represented using the International Classification of Disease (ICD) codes from the World Health Organization. The complex and labour-intensive task of establishing the relationship between free-text underwriting exclusions in health insurance policies and medical diagnosis codes from health insurance claims is critical for determining whether a claim should be rejected due to underwriting exclusions. In this work, we present a novel framework that leverages both explicit and implicit domain knowledge present in medical ontologies and pre-trained language models respectively, to effectively establish the relationship between free-text describing medical conditions present in underwriting exclusions and the ICD-10CM diagnosis codes in health insurance claims. Termed KAMEL (Knowledge Aware Medical Entity Linkage), our proposed framework addresses the limitations faced by prior approaches when evaluated on real-world health insurance claims data. Our proposed framework has been deployed at several multi-national health insurance providers to automate their health insurance claims processing. \ No newline at end of file diff --git a/data/2024/aaai/KG-TREAT: Pre-training for Treatment Effect Estimation by Synergizing Patient Data with Knowledge Graphs b/data/2024/aaai/KG-TREAT: Pre-training for Treatment Effect Estimation by Synergizing Patient Data with Knowledge Graphs new file mode 100644 index 0000000000..371985bdfd --- /dev/null +++ b/data/2024/aaai/KG-TREAT: Pre-training for Treatment Effect Estimation by Synergizing Patient Data with Knowledge Graphs @@ -0,0 +1 @@ +Treatment effect estimation (TEE) is the task of determining the impact of various treatments on patient outcomes. Current TEE methods fall short due to reliance on limited labeled data and challenges posed by sparse and high-dimensional observational patient data. To address the challenges, we introduce a novel pre-training and fine-tuning framework, KG-TREAT, which synergizes large-scale observational patient data with biomedical knowledge graphs (KGs) to enhance TEE. Unlike previous approaches, KG-TREAT constructs dual-focus KGs and integrates a deep bi-level attention synergy method for in-depth information fusion, enabling distinct encoding of treatment-covariate and outcome-covariate relationships. KG-TREAT also incorporates two pre-training tasks to ensure a thorough grounding and contextualization of patient data and KGs. 
Evaluation on four downstream TEE tasks shows KG-TREAT's superiority over existing methods, with an average improvement of 7% in Area under the ROC Curve (AUC) and 9% in Influence Function-based Precision of Estimating Heterogeneous Effects (IF-PEHE). The effectiveness of our estimated treatment effects is further affirmed by alignment with established randomized clinical trial findings. \ No newline at end of file diff --git a/data/2024/aaai/KGDM: A Diffusion Model to Capture Multiple Relation Semantics for Knowledge Graph Embedding b/data/2024/aaai/KGDM: A Diffusion Model to Capture Multiple Relation Semantics for Knowledge Graph Embedding new file mode 100644 index 0000000000..b24ca9d4bb --- /dev/null +++ b/data/2024/aaai/KGDM: A Diffusion Model to Capture Multiple Relation Semantics for Knowledge Graph Embedding @@ -0,0 +1 @@ +Knowledge graph embedding (KGE) is an efficient and scalable method for knowledge graph completion. However, most existing KGE methods suffer from the challenge of multiple relation semantics, which often degrades their performance. This is because most KGE methods learn fixed continuous vectors for entities (relations) and make deterministic entity predictions to complete the knowledge graph, which hardly captures multiple relation semantics. To tackle this issue, previous works try to learn complex probabilistic embeddings instead of fixed embeddings but suffer from heavy computational complexity. In contrast, this paper proposes a simple yet efficient framework namely the Knowledge Graph Diffusion Model (KGDM) to capture the multiple relation semantics in prediction. Its key idea is to cast the problem of entity prediction into conditional entity generation. Specifically, KGDM estimates the probabilistic distribution of target entities in prediction through Denoising Diffusion Probabilistic Models (DDPM). To bridge the gap between continuous diffusion models and discrete KGs, two learnable embedding functions are defined to map entities and relation to continuous vectors. To consider connectivity patterns of KGs, a Conditional Entity Denoiser model is introduced to generate target entities conditioned on given entities and relations. Extensive experiments demonstrate that KGDM significantly outperforms existing state-of-the-art methods in three benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/KGTS: Contrastive Trajectory Similarity Learning over Prompt Knowledge Graph Embedding b/data/2024/aaai/KGTS: Contrastive Trajectory Similarity Learning over Prompt Knowledge Graph Embedding new file mode 100644 index 0000000000..e5b9607deb --- /dev/null +++ b/data/2024/aaai/KGTS: Contrastive Trajectory Similarity Learning over Prompt Knowledge Graph Embedding @@ -0,0 +1 @@ +Trajectory similarity computation serves as a fundamental functionality of various spatial information applications. Although existing deep learning similarity computation methods offer better efficiency and accuracy than non-learning solutions, they are still immature in trajectory embedding and suffer from poor generality and heavy preprocessing for training. Targeting these limitations, we propose a novel framework named KGTS based on knowledge graph grid embedding, prompt trajectory embedding, and unsupervised contrastive learning for improved trajectory similarity computation. Specifically, we first embed map grids with a GRot embedding method to vigorously grasp the neighbouring relations of grids. 
Then, a prompt trajectory embedding network incorporates the resulting grid embedding and extracts trajectory structure and point order information. It is trained by unsupervised contrastive learning, which not only alleviates the heavy preprocessing burden but also provides exceptional generality with creatively designed strategies for positive sample generation. The prompt trajectory embedding adopts a customized prompt paradigm to mitigate the gap between the grid embedding and the trajectory embedding. Extensive experiments on two real-world trajectory datasets demonstrate the superior performance of KGTS over state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking b/data/2024/aaai/KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking new file mode 100644 index 0000000000..a1ca1e3c2b --- /dev/null +++ b/data/2024/aaai/KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking @@ -0,0 +1 @@ +Our life is populated with articulated objects. Current category-level articulation estimation works largely focus on predicting part-level 6D poses on static point cloud observations. In this paper, we tackle the problem of category-level online robust and real-time 6D pose tracking of articulated objects, where we propose KPA-Tracker, a novel 3D KeyPoint based Articulated object pose Tracker. Given an RGB-D image or a partial point cloud at the current frame as well as the estimated per-part 6D poses from the last frame, our KPA-Tracker can effectively update the poses with learned 3D keypoints between the adjacent frames. Specifically, we first canonicalize the input point cloud and formulate the pose tracking as an inter-frame pose increment estimation task. To learn consistent and separate 3D keypoints for every rigid part, we build KPA-Gen that outputs the high-quality ordered 3D keypoints in an unsupervised manner. During pose tracking on the whole video, we further propose a keypoint-based articulation tracking algorithm that mines keyframes as reference for accurate pose updating. We provide extensive experiments on validating our KPA-Tracker on various datasets ranging from synthetic point cloud observation to real-world scenarios, which demonstrates the superior performance and robustness of the KPA-Tracker. We believe that our work has the potential to be applied in many fields including robotics, embodied intelligence and augmented reality. All the datasets and codes are available at https://github.com/hhhhhar/KPA-Tracker. \ No newline at end of file diff --git a/data/2024/aaai/KeDuSR: Real-World Dual-Lens Super-Resolution via Kernel-Free Matching b/data/2024/aaai/KeDuSR: Real-World Dual-Lens Super-Resolution via Kernel-Free Matching new file mode 100644 index 0000000000..a77d9f47c4 --- /dev/null +++ b/data/2024/aaai/KeDuSR: Real-World Dual-Lens Super-Resolution via Kernel-Free Matching @@ -0,0 +1 @@ +Dual-lens super-resolution (SR) is a practical scenario for reference (Ref) based SR by utilizing the telephoto image (Ref) to assist the super-resolution of the low-resolution wide-angle image (LR input). Different from general RefSR, the Ref in dual-lens SR only covers the overlapped field of view (FoV) area. However, current dual-lens SR methods rarely utilize these specific characteristics and directly perform dense matching between the LR input and Ref. 
Due to the resolution gap between LR and Ref, the matching may miss the best-matched candidate and destroy the consistent structures in the overlapped FoV area. Different from them, we propose to first align the Ref with the center region (namely the overlapped FoV area) of the LR input by combining global warping and local warping to make the aligned Ref be sharp and consistent. Then, we formulate the aligned Ref and LR center as value-key pairs, and the corner region of the LR is formulated as queries. In this way, we propose a kernel-free matching strategy by matching between the LR-corner (query) and LR-center (key) regions, and the corresponding aligned Ref (value) can be warped to the corner region of the target. Our kernel-free matching strategy avoids the resolution gap between LR and Ref, which makes our network have better generalization ability. In addition, we construct a DuSR-Real dataset with (LR, Ref, HR) triples, where the LR and HR are well aligned. Experiments on three datasets demonstrate that our method outperforms the second-best method by a large margin. Our code and dataset are available at https://github.com/ZifanCui/KeDuSR. \ No newline at end of file diff --git a/data/2024/aaai/Keep the Faith: Faithful Explanations in Convolutional Neural Networks for Case-Based Reasoning b/data/2024/aaai/Keep the Faith: Faithful Explanations in Convolutional Neural Networks for Case-Based Reasoning new file mode 100644 index 0000000000..496caa005f --- /dev/null +++ b/data/2024/aaai/Keep the Faith: Faithful Explanations in Convolutional Neural Networks for Case-Based Reasoning @@ -0,0 +1 @@ +Explaining predictions of black-box neural networks is crucial when applied to decision-critical tasks. Thus, attribution maps are commonly used to identify important image regions, despite prior work showing that humans prefer explanations based on similar examples. To this end, ProtoPNet learns a set of class-representative feature vectors (prototypes) for case-based reasoning. During inference, similarities of latent features to prototypes are linearly classified to form predictions and attribution maps are provided to explain the similarity. In this work, we evaluate whether architectures for case-based reasoning fulfill established axioms required for faithful explanations using the example of ProtoPNet. We show that such architectures allow the extraction of faithful explanations. However, we prove that the attribution maps used to explain the similarities violate the axioms. We propose a new procedure to extract explanations for trained ProtoPNets, named ProtoPFaith. Conceptually, these explanations are Shapley values, calculated on the similarity scores of each prototype. They allow to faithfully answer which prototypes are present in an unseen image and quantify each pixel’s contribution to that presence, thereby complying with all axioms. The theoretical violations of ProtoPNet manifest in our experiments on three datasets (CUB-200-2011, Stanford Dogs, RSNA) and five architectures (ConvNet, ResNet, ResNet50, WideResNet50, ResNeXt50). Our experiments show a qualitative difference between the explanations given by ProtoPNet and ProtoPFaith. Additionally, we quantify the explanations with the Area Over the Perturbation Curve, on which ProtoPFaith outperforms ProtoPNet on all experiments by a factor >10^3. 
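Conceptually, the ProtoPFaith explanations just described are Shapley values computed on the similarity scores of each prototype. The sketch below is a generic exact Shapley computation by coalition enumeration (feasible only for a handful of prototypes), with the value function and the toy weighted-similarity logit supplied as assumptions; it is not the authors' implementation.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating coalitions.
    players: list of prototype indices; value_fn(subset) -> model score when
    only the prototypes in `subset` contribute (caller-defined)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):                       # coalition sizes 0 .. n-1
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                gain = value_fn(set(subset) | {p}) - value_fn(set(subset))
                phi[p] += weight * gain
    return phi

# Toy value function: the class logit is a weighted sum of prototype similarities.
sims = {0: 0.9, 1: 0.2, 2: 0.5}
weights = {0: 1.0, 1: 0.5, 2: 0.8}
logit = lambda s: sum(weights[p] * sims[p] for p in s)
print(shapley_values([0, 1, 2], logit))   # each prototype's contribution to the logit
```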
\ No newline at end of file diff --git a/data/2024/aaai/Kepler Light Curve Classification Using Deep Learning and Markov Transition Field (Student Abstract) b/data/2024/aaai/Kepler Light Curve Classification Using Deep Learning and Markov Transition Field (Student Abstract) new file mode 100644 index 0000000000..305536a555 --- /dev/null +++ b/data/2024/aaai/Kepler Light Curve Classification Using Deep Learning and Markov Transition Field (Student Abstract) @@ -0,0 +1,13 @@ +An exoplanet is a planet, which is not a part of our solar system. +Whether life exists in one or more of these exoplanets +has fascinated humans for centuries. NASA’s Kepler Space +Telescope has discovered more than 70% of known exoplanets +in our universe. However, manually determining whether a +Kepler light curve indicates an exoplanet or not becomes infeasible +with the large volume of data. Due to this, we propose +a deep learning-based strategy to automatically classify +a Kepler light curve. More specifically, we first convert the +light curve time series into its corresponding Markov Transition +Field (MTF) image and then classify it. Results show +that the accuracy of the proposed technique is 99.39%, which +is higher than all current state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Kernelized Normalizing Constant Estimation: Bridging Bayesian Quadrature and Bayesian Optimization b/data/2024/aaai/Kernelized Normalizing Constant Estimation: Bridging Bayesian Quadrature and Bayesian Optimization new file mode 100644 index 0000000000..6cc8dd6f05 --- /dev/null +++ b/data/2024/aaai/Kernelized Normalizing Constant Estimation: Bridging Bayesian Quadrature and Bayesian Optimization @@ -0,0 +1 @@ +In this paper, we study the problem of estimating the normalizing constant through queries to the black-box function f, which is the integration of the exponential function of f scaled by a problem parameter lambda. We assume f belongs to a reproducing kernel Hilbert space (RKHS), and show that to estimate the normalizing constant within a small relative error, the level of difficulty depends on the value of lambda: When lambda approaches zero, the problem is similar to Bayesian quadrature (BQ), while when lambda approaches infinity, the problem is similar to Bayesian optimization (BO). More generally, the problem varies between BQ and BO. We find that this pattern holds true even when the function evaluations are noisy, bringing new aspects to this topic. Our findings are supported by both algorithm-independent lower bounds and algorithmic upper bounds, as well as simulation studies conducted on a variety of benchmark functions. \ No newline at end of file diff --git a/data/2024/aaai/Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation b/data/2024/aaai/Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation new file mode 100644 index 0000000000..fdddbca153 --- /dev/null +++ b/data/2024/aaai/Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation @@ -0,0 +1 @@ +Previous 3D hand pose estimation methods primarily rely on a single modality, either RGB or depth, and the comprehensive utilization of the dual modalities has not been extensively explored. RGB and depth data provide complementary information and thus can be fused to enhance the robustness of 3D hand pose estimation. However, there exist two problems for applying existing fusion methods in 3D hand pose estimation: redundancy of dense feature fusion and ambiguity of visual features. 
First, pixel-wise feature interactions introduce high computational costs and wasted computation on invalid pixels. Second, visual features suffer from ambiguity due to color and texture similarities, as well as depth holes and noise caused by frequent hand movements, which interferes with modeling cross-modal correlations. In this paper, we propose Keypoint-Fusion for RGB-D based 3D hand pose estimation, which leverages the unique advantages of the dual modalities to mutually eliminate feature ambiguity, and performs cross-modal feature fusion in a more efficient way. Specifically, we focus cross-modal fusion on sparse yet informative spatial regions (i.e. keypoints). Meanwhile, by explicitly extracting relatively more reliable information as disambiguation evidence, the depth modality provides 3D geometric information for RGB feature pixels, and the RGB modality complements the precise edge information lost due to depth noise. Keypoint-Fusion achieves state-of-the-art performance on two challenging hand datasets, significantly decreasing the error compared with previous single-modal methods. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge Distillation from Single-Task Teachers to Multi-Task Student for End-to-End Autonomous Driving b/data/2024/aaai/Knowledge Distillation from Single-Task Teachers to Multi-Task Student for End-to-End Autonomous Driving new file mode 100644 index 0000000000..8eb66b0b1b --- /dev/null +++ b/data/2024/aaai/Knowledge Distillation from Single-Task Teachers to Multi-Task Student for End-to-End Autonomous Driving @@ -0,0 +1 @@ +In the domain of end-to-end autonomous driving, conventional sensor fusion techniques exhibit inadequacies, particularly when facing challenging scenarios with numerous dynamic agents. Imitation learning caps performance at the level of the expert and suffers from out-of-distribution issues. To overcome these limitations, we propose a transformer-based algorithm designed to fuse diverse representations from RGB-D cameras through knowledge distillation. This approach leverages insights from multi-task teachers to enhance the learning capabilities of single-task students, particularly in a Reinforcement Learning (RL) setting. Our model consists of two primary modules: the perception module, responsible for encoding observation data acquired from RGB-D cameras and performing tasks such as semantic segmentation, semantic depth cloud mapping (SDC), ego vehicle speed estimation, and traffic light state recognition; and the control module, which decodes these features, incorporating additional data, including a rough simulator for static and dynamic environments, to anticipate waypoints within a latent feature space. Vehicular controls (e.g., steering, throttle, and brake) are obtained directly from measurement features and environmental states using the RL agent and are further refined by a PID algorithm that dynamically follows waypoints. The model undergoes rigorous evaluation and comparative analysis on the CARLA simulator across various scenarios, encompassing normal to adversarial conditions. Our code is available at https://github.com/pagand/e2etransfuser/ to facilitate future studies. 
\ No newline at end of file diff --git a/data/2024/aaai/Knowledge Enhanced Representation Learning for Drug Discovery b/data/2024/aaai/Knowledge Enhanced Representation Learning for Drug Discovery new file mode 100644 index 0000000000..d137c24c71 --- /dev/null +++ b/data/2024/aaai/Knowledge Enhanced Representation Learning for Drug Discovery @@ -0,0 +1 @@ +Recent research on predicting the binding affinity between drug molecules and proteins uses representations learned, through unsupervised learning techniques, from large databases of molecule SMILES and protein sequences. While these representations have significantly enhanced the predictions, they are usually based on a limited set of modalities, and they do not exploit available knowledge about existing relations among molecules and proteins. Our study reveals that enhanced representations, derived from multimodal knowledge graphs describing relations among molecules and proteins, lead to state-of-the-art results in well-established benchmarks (first place in the leaderboard for the Therapeutics Data Commons benchmark "Drug-Target Interaction Domain Generalization Benchmark", with an improvement of 8 points with respect to the previous best result). Moreover, our results significantly surpass those achieved in standard benchmarks by using conventional pre-trained representations that rely only on sequence or SMILES data. We release our multimodal knowledge graphs, which integrate data from seven public data sources and contain over 30 million triples. Pretrained models from our proposed graphs and the benchmark task source code are also released. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge Graph Error Detection with Contrastive Confidence Adaption b/data/2024/aaai/Knowledge Graph Error Detection with Contrastive Confidence Adaption new file mode 100644 index 0000000000..169e1413aa --- /dev/null +++ b/data/2024/aaai/Knowledge Graph Error Detection with Contrastive Confidence Adaption @@ -0,0 +1 @@ +Knowledge graphs (KGs) often contain various errors. Previous works on detecting errors in KGs mainly rely on triplet embedding from graph structure. We conduct an empirical study and find that these works struggle to discriminate noise from semantically-similar correct triplets. In this paper, we propose a KG error detection model, CCA, that integrates both textual and graph structural information from triplet reconstruction to better distinguish semantics. We design interactive contrastive learning to capture the differences between textual and structural patterns. Furthermore, we construct realistic datasets with semantically-similar noise and adversarial noise. Experimental results demonstrate that CCA outperforms state-of-the-art baselines, especially on semantically-similar noise and adversarial noise. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge Guided Semi-supervised Learning for Quality Assessment of User Generated Videos b/data/2024/aaai/Knowledge Guided Semi-supervised Learning for Quality Assessment of User Generated Videos new file mode 100644 index 0000000000..6a84e170f6 --- /dev/null +++ b/data/2024/aaai/Knowledge Guided Semi-supervised Learning for Quality Assessment of User Generated Videos @@ -0,0 +1 @@ +Perceptual quality assessment of user generated content (UGC) videos is challenging due to the requirement of large-scale human-annotated videos for training. 
In this work, we address this challenge by first designing a self-supervised Spatio-Temporal Visual Quality Representation Learning (ST-VQRL) framework to generate robust quality-aware features for videos. Then, we propose a dual-model based Semi Supervised Learning (SSL) method specifically designed for the Video Quality Assessment (SSL-VQA) task, through a novel knowledge transfer of quality predictions between the two models. Our SSL-VQA method uses the ST-VQRL backbone to produce robust performance across various VQA datasets, including cross-database settings, despite being learned with limited human-annotated videos. Our model improves state-of-the-art performance by around 10% when trained only with limited data, and by around 15% when unlabelled data is also used in SSL. Source codes and checkpoints are available at https://github.com/Shankhanil006/SSL-VQA. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge Transfer via Compact Model in Federated Learning (Student Abstract) b/data/2024/aaai/Knowledge Transfer via Compact Model in Federated Learning (Student Abstract) new file mode 100644 index 0000000000..5ac9b1bbff --- /dev/null +++ b/data/2024/aaai/Knowledge Transfer via Compact Model in Federated Learning (Student Abstract) @@ -0,0 +1 @@ +Communication overhead remains a significant challenge in federated learning due to frequent global model updates. Essentially, the update of the global model can be viewed as knowledge transfer. We aim to transfer more knowledge through a compact model while reducing communication overhead. In our study, we introduce a federated learning framework where clients pre-train large models locally and the server initializes a compact model for communication. This compact model should be light in size but still have enough knowledge to refine the global model effectively. We facilitate the knowledge transfer from local to global models based on pre-training outcomes. Our experiments show that our approach significantly reduces communication overhead without sacrificing accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Aware Explainable Reciprocal Recommendation b/data/2024/aaai/Knowledge-Aware Explainable Reciprocal Recommendation new file mode 100644 index 0000000000..a4d773eb75 --- /dev/null +++ b/data/2024/aaai/Knowledge-Aware Explainable Reciprocal Recommendation @@ -0,0 +1 @@ +Reciprocal recommender systems (RRS) have been widely used in online platforms such as online dating and recruitment. They can simultaneously fulfill the needs of both parties involved in the recommendation process. Due to the inherent nature of the task, interaction data is relatively sparse compared to other recommendation tasks. Existing works mainly address this issue through content-based recommendation methods. However, these methods often implicitly model textual information from a unified perspective, making it challenging to capture the distinct intentions held by each party, which further leads to limited performance and a lack of interpretability. In this paper, we propose a Knowledge-Aware Explainable Reciprocal Recommender System (KAERR), which models metapaths between two parties independently, considering their respective perspectives and requirements. Various metapaths are fused using an attention-based mechanism, where the attention weights unveil dual-perspective preferences and provide recommendation explanations for both parties. 
Extensive experiments on two real-world datasets from diverse scenarios demonstrate that the proposed model outperforms state-of-the-art baselines, while also delivering compelling reasons for recommendations to both parties. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Aware Neuron Interpretation for Scene Classification b/data/2024/aaai/Knowledge-Aware Neuron Interpretation for Scene Classification new file mode 100644 index 0000000000..b3de82a2ec --- /dev/null +++ b/data/2024/aaai/Knowledge-Aware Neuron Interpretation for Scene Classification @@ -0,0 +1 @@ +Although neural models have achieved remarkable performance, they still face skepticism due to their lack of transparency. To this end, model prediction explanation is attracting more and more attention. However, current methods rarely incorporate external knowledge and still suffer from three limitations: (1) Neglecting concept completeness: merely selecting concepts may not be sufficient for prediction. (2) Lacking concept fusion: semantically-equivalent concepts are not merged. (3) Difficulty in manipulating model behavior: explanations are not verified on the original model. To address these issues, we propose a novel knowledge-aware neuron interpretation framework to explain model predictions for image scene classification. Specifically, for concept completeness, we present core concepts of a scene based on a knowledge graph, ConceptNet, to gauge the completeness of concepts. Our method, incorporating complete concepts, effectively provides better prediction explanations compared to baselines. Furthermore, for concept fusion, we introduce a knowledge graph-based method known as Concept Filtering, which yields a gain of over 23 percentage points on neuron behaviors for neuron interpretation. At last, we propose Model Manipulation, which aims to study whether the core concepts based on ConceptNet could be employed to manipulate model behavior. The results show that core concepts can effectively improve the performance of the original model by over 26%. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Aware Parameter Coaching for Personalized Federated Learning b/data/2024/aaai/Knowledge-Aware Parameter Coaching for Personalized Federated Learning new file mode 100644 index 0000000000..2fcf2e5473 --- /dev/null +++ b/data/2024/aaai/Knowledge-Aware Parameter Coaching for Personalized Federated Learning @@ -0,0 +1 @@ +Personalized Federated Learning (pFL) can effectively exploit the non-IID data from distributed clients by customizing personalized models. Existing pFL methods either simply take the local model as a whole for aggregation or require significant training overhead to induce the inter-client personalized weights, and thus clients cannot efficiently exploit the mutually relevant knowledge from each other. In this paper, we propose a knowledge-aware parameter coaching scheme where each client can swiftly and granularly refer to the parameters of other clients to guide local training, whereby accurate personalized client models can be efficiently produced without contradictory knowledge. Specifically, a novel regularizer is designed to conduct layer-wise parameter coaching via a relation cube, which is constructed based on the knowledge represented by the layered parameters among all clients. Then, we develop an optimization method to update the relation cube and the parameters of each client. 
It is theoretically demonstrated that the convergence of the proposed method can be guaranteed under both convex and non-convex settings. Extensive experiments are conducted over various datasets, which show that the proposed method can achieve better performance compared with the state-of-the-art baselines in terms of accuracy and convergence speed. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Enhanced Historical Document Segmentation and Recognition b/data/2024/aaai/Knowledge-Enhanced Historical Document Segmentation and Recognition new file mode 100644 index 0000000000..2d965ca497 --- /dev/null +++ b/data/2024/aaai/Knowledge-Enhanced Historical Document Segmentation and Recognition @@ -0,0 +1 @@ +Optical Character Recognition (OCR) of historical document images remains a challenging task because of the distorted input images, extensive number of uncommon characters, and the scarcity of labeled data, which impedes modern deep learning-based OCR techniques from achieving good recognition accuracy. Meanwhile, there exists a substantial amount of expert knowledge that can be utilized in this task. However, such knowledge is usually complicated and could only be accurately expressed with formal languages such as first-order logic (FOL), which is difficult to be directly integrated into deep learning models. This paper proposes KESAR, a novel Knowledge-Enhanced Document Segmentation And Recognition method for historical document images based on the Abductive Learning (ABL) framework. The segmentation and recognition models are enhanced by incorporating background knowledge for character extraction and prediction, followed by an efficient joint optimization of both models. We validate the effectiveness of KESAR on historical document datasets. The experimental results demonstrate that our method can simultaneously utilize knowledge-driven reasoning and data-driven learning, which outperforms the current state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Powered Recommendation for an Improved Diet Water Footprint b/data/2024/aaai/Knowledge-Powered Recommendation for an Improved Diet Water Footprint new file mode 100644 index 0000000000..2599e3ce08 --- /dev/null +++ b/data/2024/aaai/Knowledge-Powered Recommendation for an Improved Diet Water Footprint @@ -0,0 +1 @@ +According to WWF, 1.1 billion people lack access to water, and 2.7 billion experience water scarcity at least one month a year. By 2025, two-thirds of the world's population may be facing water shortages. This highlights the urgency of managing water usage efficiently, especially in water-intensive sectors like food. This paper proposes a recommendation engine, powered by knowledge graphs, aiming to facilitate sustainable and healthy food consumption. The engine recommends ingredient substitutes in user recipes that improve nutritional value and reduce environmental impact, particularly water footprint. The system architecture includes source identification, information extraction, schema alignment, knowledge graph construction, and user interface development. The research offers a promising tool for promoting healthier eating habits and contributing to water conservation efforts. 
\ No newline at end of file diff --git a/data/2024/aaai/Kumaraswamy Wavelet for Heterophilic Scene Graph Generation b/data/2024/aaai/Kumaraswamy Wavelet for Heterophilic Scene Graph Generation new file mode 100644 index 0000000000..8b1b055b2e --- /dev/null +++ b/data/2024/aaai/Kumaraswamy Wavelet for Heterophilic Scene Graph Generation @@ -0,0 +1 @@ +Graph neural networks (GNNs) have demonstrated their capabilities in the field of scene graph generation (SGG) by updating node representations from neighboring nodes. This process can be viewed as a form of low-pass filtering in the spatial domain, which smooths node feature representations and retains commonalities among nodes. However, spatial GNNs do not work well in the case of heterophilic SGG, in which fine-grained predicates are always connected to a large number of coarse-grained predicates. Blind smoothing undermines the discriminative information of the fine-grained predicates, resulting in failure to predict them accurately. To address the heterophily, our key idea is to design tailored filters via the wavelet transform in the spectral domain. First, we prove rigorously that when the heterophily on the scene graph increases, the spectral energy gradually shifts towards the high-frequency part. Inspired by this observation, we subsequently propose the Kumaraswamy Wavelet Graph Neural Network (KWGNN). KWGNN leverages complementary multi-group Kumaraswamy wavelets to cover all frequency bands. Finally, KWGNN adaptively generates band-pass filters and then integrates the filtering results to better accommodate varying levels of smoothness on the graph. Comprehensive experiments on the Visual Genome and Open Images datasets show that our method achieves state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation b/data/2024/aaai/LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation new file mode 100644 index 0000000000..f077a31295 --- /dev/null +++ b/data/2024/aaai/LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation @@ -0,0 +1 @@ +Recently, an enormous amount of research has emerged on multimodal knowledge graph completion (MKGC), which seeks to extract knowledge from multimodal data and predict the most plausible missing facts to complete a given multimodal knowledge graph (MKG). However, existing MKGC approaches largely ignore that visual information may introduce noise and lead to uncertainty when added to traditional KG embeddings, because the contribution of each associated image to an entity differs across link scenarios. Moreover, treating each triple independently when learning entity embeddings leads to the loss of local structural and whole-graph information. To address these challenges, we propose a novel link aware fusion and aggregation based multimodal knowledge graph completion model named LAFA, which is composed of a link aware fusion module and a link aware aggregation module. The link aware fusion module alleviates the noise of irrelevant visual information by calculating the importance between an entity and its associated images in different link scenarios, and fuses the visual and structural embeddings according to this importance through our proposed modality embedding fusion mechanism. 
The link aware aggregation module assigns neighbor structural information to a given central entity by calculating the importance between the entity and its neighbors, and aggregating the fused embeddings through a linear combination according to this importance. Extensive experiments on standard datasets validate that LAFA can obtain state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/LAMM: Label Alignment for Multi-Modal Prompt Learning b/data/2024/aaai/LAMM: Label Alignment for Multi-Modal Prompt Learning new file mode 100644 index 0000000000..5ebc915bcd --- /dev/null +++ b/data/2024/aaai/LAMM: Label Alignment for Multi-Modal Prompt Learning @@ -0,0 +1 @@ +With the success of pre-trained visual-language (VL) models such as CLIP in visual representation tasks, transferring pre-trained models to downstream tasks has become a crucial paradigm. Recently, the prompt tuning paradigm, which draws inspiration from natural language processing (NLP), has made significant progress in the VL field. However, preceding methods mainly focus on constructing prompt templates for text and visual inputs, neglecting the gap in class label representations between the VL models and downstream tasks. To address this challenge, we introduce an innovative label alignment method named LAMM, which can dynamically adjust the category embeddings of downstream datasets through end-to-end training. Moreover, to achieve a more appropriate label distribution, we propose a hierarchical loss, encompassing the alignment of the parameter space, feature space, and logits space. We conduct experiments on 11 downstream vision datasets and demonstrate that our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios, exhibiting an average accuracy improvement of 2.31% compared to the state-of-the-art methods on 16 shots. Moreover, our method outperforms other prompt tuning methods in continual learning. Importantly, our method is synergistic with existing prompt tuning methods and can boost performance on top of them. Our code and dataset will be publicly available at https://github.com/gaojingsheng/LAMM. \ No newline at end of file diff --git a/data/2024/aaai/LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training b/data/2024/aaai/LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training new file mode 100644 index 0000000000..4555c7a40b --- /dev/null +++ b/data/2024/aaai/LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training @@ -0,0 +1 @@ +Paraphrases are texts that convey the same meaning while using different words or sentence structures. They can be used as an automatic data augmentation tool for many Natural Language Processing tasks, especially when dealing with low-resource languages, where data shortage is a significant problem. To generate a paraphrase in multilingual settings, previous studies have leveraged knowledge from the machine translation field, i.e., forming a paraphrase through zero-shot machine translation in the same language. Despite good performance in human evaluation, those methods still require parallel translation datasets, thus making them inapplicable to languages that do not have parallel corpora. 
To mitigate this problem, we propose the first unsupervised multilingual paraphrasing model, LAMPAT (Low-rank Adaptation for Multilingual Paraphrasing using Adversarial Training), for which a monolingual dataset is sufficient to generate human-like and diverse sentences. Our experiments show that our method not only works well for English but also generalizes to unseen languages. Data and code are available at https://github.com/phkhanhtrinh23/LAMPAT. \ No newline at end of file diff --git a/data/2024/aaai/LDMVFI: Video Frame Interpolation with Latent Diffusion Models b/data/2024/aaai/LDMVFI: Video Frame Interpolation with Latent Diffusion Models new file mode 100644 index 0000000000..0abc1d2bd4 --- /dev/null +++ b/data/2024/aaai/LDMVFI: Video Frame Interpolation with Latent Diffusion Models @@ -0,0 +1 @@ +Existing works on video frame interpolation (VFI) mostly employ deep neural networks that are trained by minimizing the L1, L2, or deep feature space distance (e.g. VGG loss) between their outputs and ground-truth frames. However, recent works have shown that these metrics are poor indicators of perceptual VFI quality. Towards developing perceptually-oriented VFI methods, in this work we propose latent diffusion model-based VFI, LDMVFI. LDMVFI approaches VFI from a generative perspective by formulating it as a conditional generation problem. As the first effort to address VFI using latent diffusion models, we rigorously benchmark our method on common test sets used in the existing VFI literature. Our quantitative experiments and user study indicate that LDMVFI is able to interpolate video content with favorable perceptual quality compared to the state of the art, even in the high-resolution regime. Our code is available at https://github.com/danier97/LDMVFI. \ No newline at end of file diff --git a/data/2024/aaai/LDS2AE: Local Diffusion Shared-Specific Autoencoder for Multimodal Remote Sensing Image Classification with Arbitrary Missing Modalities b/data/2024/aaai/LDS2AE: Local Diffusion Shared-Specific Autoencoder for Multimodal Remote Sensing Image Classification with Arbitrary Missing Modalities new file mode 100644 index 0000000000..b3047d35a4 --- /dev/null +++ b/data/2024/aaai/LDS2AE: Local Diffusion Shared-Specific Autoencoder for Multimodal Remote Sensing Image Classification with Arbitrary Missing Modalities @@ -0,0 +1 @@ +Recent research on the joint classification of multimodal remote sensing data has achieved great success. However, due to the limitations imposed by imaging conditions, the case of missing modalities often occurs in practice. Most previous researchers regard classification under different missing modalities as independent tasks. They train a specific classification model for each fixed missing modality by extracting a multimodal joint representation, which cannot handle the classification of arbitrary (including multiple and random) missing modalities. In this work, we propose a local diffusion shared-specific autoencoder (LDS2AE), which solves the classification of arbitrary missing modalities with a single model. The LDS2AE captures the data distribution of different modalities to learn a multimodal shared feature for classification by designing a novel local diffusion autoencoder which consists of a modality-shared encoder and several modality-specific decoders. 
The modality-shared encoder is designed to extract the multimodal shared feature by employing the same parameters to map multimodal data into a shared subspace. The modality-specific decoders use the multimodal shared feature to reconstruct the image of each modality, which encourages the shared feature to capture the unique information of different modalities. In addition, we incorporate masked training into the diffusion autoencoder to achieve local diffusion, which significantly reduces the training cost of the model. The approach is tested on widely-used multimodal remote sensing datasets, demonstrating the effectiveness of the proposed LDS2AE in addressing the classification of arbitrary missing modalities. The code is available at https://github.com/Jiahuiqu/LDS2AE. \ No newline at end of file diff --git a/data/2024/aaai/LERE: Learning-Based Low-Rank Matrix Recovery with Rank Estimation b/data/2024/aaai/LERE: Learning-Based Low-Rank Matrix Recovery with Rank Estimation new file mode 100644 index 0000000000..deae0c9211 --- /dev/null +++ b/data/2024/aaai/LERE: Learning-Based Low-Rank Matrix Recovery with Rank Estimation @@ -0,0 +1 @@ +A fundamental task in computer vision, Low-Rank Matrix Recovery (LRMR) focuses on precisely recovering the inherent low-rank structure from incomplete data and/or corrupted measurements, given that the rank is known a priori or accurately estimated. However, it remains challenging for existing rank estimation methods to accurately estimate the rank of an ill-conditioned matrix. Also, existing LRMR optimization methods are heavily dependent on the chosen parameters, and are therefore difficult to adapt to different situations. Addressing these issues, a novel LEarning-based low-rank matrix recovery method with Rank Estimation (LERE) is proposed. More specifically, by considering the characteristics of the Gerschgorin disk's center and radius, a new heuristic decision rule in the Gerschgorin Disk Theorem is significantly enhanced so that the low-rank boundary can be exactly located, which leads to a marked improvement in the accuracy of rank estimation. According to the estimated rank, we select row and column sub-matrices from the observation matrix by uniformly random sampling. A 17-iteration feedforward-recurrent-mixed neural network is then adapted to learn the parameters in the sub-matrix recovery process. Finally, using the correlation between the row and column sub-matrices, LERE successfully recovers the underlying low-rank matrix. Overall, LERE is more efficient and robust than existing LRMR methods. Experimental results demonstrate that LERE surpasses state-of-the-art (SOTA) methods. The code for this work is accessible at https://github.com/zhengqinxu/LERE. \ No newline at end of file diff --git a/data/2024/aaai/LERMO: A Novel Web Game for AI-Enhanced Sign Language Recognition b/data/2024/aaai/LERMO: A Novel Web Game for AI-Enhanced Sign Language Recognition new file mode 100644 index 0000000000..cbb383cefd --- /dev/null +++ b/data/2024/aaai/LERMO: A Novel Web Game for AI-Enhanced Sign Language Recognition @@ -0,0 +1 @@ +Sign language is a visual and gestural communication system used by deaf and hearing-impaired people. Despite the numerous deep learning methods proposed for automatic interpretation, a gap persists in developing applications that effectively utilize these models to assist sign language studies and inclusion. We introduce LERMO (https://lermo.app/), a web game merging machine learning and gamification to enhance sign language fingerspelling. 
Inspired by Wordle™, LERMO offers an interactive word-guessing game where users can play using a video camera. We create a new dataset of labeled landmark fingerspelling and design our model for the speed and efficiency needed to run in a web browser. We survey approximately 40 users, who find LERMO user-friendly and innovative. Of those, 95% believe LERMO could be used to enhance fingerspelling skills. \ No newline at end of file diff --git a/data/2024/aaai/LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition b/data/2024/aaai/LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition new file mode 100644 index 0000000000..372f46bb2d --- /dev/null +++ b/data/2024/aaai/LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition @@ -0,0 +1 @@ +The Vision Transformer (ViT) excels in accuracy when handling high-resolution images, yet it confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements. To address this, we present the Localization and Focus Vision Transformer (LF-ViT). This model operates by strategically curtailing computational demands without impinging on performance. In the Localization phase, a reduced-resolution image is processed; if a definitive prediction remains elusive, our pioneering Neighborhood Global Class Attention (NGCA) mechanism is triggered, effectively identifying and spotlighting class-discriminative regions based on initial findings. Subsequently, in the Focus phase, the designated region of the original image is used to enhance recognition. Uniquely, LF-ViT employs consistent parameters across both phases, ensuring seamless end-to-end optimization. Our empirical tests affirm LF-ViT's prowess: it remarkably decreases DeiT-S's FLOPs by 63% and concurrently amplifies throughput twofold. The code of this project is at https://github.com/edgeai1/LF-ViT.git. \ No newline at end of file diff --git a/data/2024/aaai/LGMRec: Local and Global Graph Learning for Multimodal Recommendation b/data/2024/aaai/LGMRec: Local and Global Graph Learning for Multimodal Recommendation new file mode 100644 index 0000000000..8100ece8e5 --- /dev/null +++ b/data/2024/aaai/LGMRec: Local and Global Graph Learning for Multimodal Recommendation @@ -0,0 +1 @@ +Multimodal recommendation has gradually become the infrastructure of online media platforms, enabling them to provide personalized service to users through joint modeling of users' historical behaviors (e.g., purchases, clicks) and items' various modalities (e.g., visual and textual). The majority of existing studies typically focus on utilizing modal features or modal-related graph structures to learn users' local interests. Nevertheless, these approaches encounter two limitations: (1) Shared updates of user ID embeddings result in the consequential coupling between collaboration and multimodal signals; (2) Lack of exploration into robust global user interests to alleviate the sparse interaction problems faced by local interest modeling. To address these issues, we propose a novel Local and Global Graph Learning-guided Multimodal Recommender (LGMRec), which jointly models local and global user interests. Specifically, we present a local graph embedding module to independently learn collaborative-related and modality-related embeddings of users and items with local topological relations. 
Moreover, a global hypergraph embedding module is designed to capture global user and item embeddings by modeling insightful global dependency relations. The global embeddings acquired within the hypergraph embedding space can then be combined with two decoupled local embeddings to improve the accuracy and robustness of recommendations. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our LGMRec over various state-of-the-art recommendation baselines, showcasing its effectiveness in modeling both local and global user interests. \ No newline at end of file diff --git a/data/2024/aaai/LION: Implicit Vision Prompt Tuning b/data/2024/aaai/LION: Implicit Vision Prompt Tuning new file mode 100644 index 0000000000..9a84335de1 --- /dev/null +++ b/data/2024/aaai/LION: Implicit Vision Prompt Tuning @@ -0,0 +1,5 @@ +Despite recent promising performances across a range of vision tasks, vision Transformers still suffer from high computational costs. +Recently, vision prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. +However, the efficiency and effectiveness of existing models are still far from satisfactory due to the parameter cost of extensive prompt blocks and tricky prompt framework designs. +In this paper, we propose a light-weight prompt framework named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models with stable low memory costs for various complex tasks. +In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained backbone, with its parameters frozen. Moreover, according to the lottery ticket hypothesis, we further prune the parameters to relieve the computation burden in the implicit layers. Various experiments have validated that our LION obtains promising performances on a wide range of datasets. Most importantly, LION reduces the number of training parameters by up to 11.5% while obtaining higher performance than the state-of-the-art VPT, especially under challenging scenes. Furthermore, we find that our proposed LION has excellent generalization performance, making it an easy way to boost transfer learning in the future. \ No newline at end of file diff --git a/data/2024/aaai/LLM vs Small Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model b/data/2024/aaai/LLM vs Small Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model new file mode 100644 index 0000000000..be74780867 --- /dev/null +++ b/data/2024/aaai/LLM vs Small Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model @@ -0,0 +1 @@ +Personality detection aims to detect one's personality traits underlying social media posts. One challenge of this task is the scarcity of ground-truth personality traits, which are collected from self-report questionnaires. Most existing methods learn post features directly by fine-tuning pre-trained language models under the supervision of limited personality labels. This leads to inferior quality of post features and consequently affects the performance. In addition, they treat personality traits as one-hot classification labels, overlooking the semantic information within them. In this paper, we propose a large language model (LLM) based text augmentation enhanced personality detection model, which distills the LLM's knowledge to enhance the small model for personality detection, even when the LLM fails in this task. 
Specifically, we enable the LLM to generate post analyses (augmentations) from the semantic, sentiment, and linguistic aspects, which are critical for personality detection. By using contrastive learning to pull them together in the embedding space, the post encoder can better capture the psycho-linguistic information within the post representations, thus improving personality detection. Furthermore, we utilize the LLM to enrich the information of personality labels to enhance detection performance. Experimental results on the benchmark datasets demonstrate that our model outperforms the state-of-the-art methods on personality detection. \ No newline at end of file diff --git a/data/2024/aaai/LLM-Powered Synthetic Environments for Self-Driving Scenarios b/data/2024/aaai/LLM-Powered Synthetic Environments for Self-Driving Scenarios new file mode 100644 index 0000000000..4d88791d16 --- /dev/null +++ b/data/2024/aaai/LLM-Powered Synthetic Environments for Self-Driving Scenarios @@ -0,0 +1,2 @@ +This paper outlines a proposal exploring the potential use of Large Language Models (LLMs), particularly GPT-4, in crafting realistic synthetic environments for self-driving scenarios. The envisioned approach involves dynamic scene generation within game engines, leveraging LLMs to introduce challenging elements for autonomous vehicles. The proposed evaluation process outlines assessments such as realistic testing, safety metrics, and user interaction, aiming to set the stage for potential improvements in self-driving system performance. +The paper aims to contribute to the AI field by discussing how LLMs could be utilized to create valuable testing grounds for autonomous vehicles, potentially fostering the development of more robust self-driving technology. The envisioned impact is the eventual enhancement of road safety and the possible acceleration of the adoption of autonomous vehicles, paving the way for a future with safer and more efficient transportation. \ No newline at end of file diff --git a/data/2024/aaai/LLMEval: A Preliminary Study on How to Evaluate Large Language Models b/data/2024/aaai/LLMEval: A Preliminary Study on How to Evaluate Large Language Models new file mode 100644 index 0000000000..51b877a84c --- /dev/null +++ b/data/2024/aaai/LLMEval: A Preliminary Study on How to Evaluate Large Language Models @@ -0,0 +1,10 @@ +Recently, the evaluation of Large Language Models has emerged as a popular area of research. +The three crucial questions for LLM evaluation are "what, where, and how to evaluate". +However, the existing research mainly focuses on the first two questions, which are basically what tasks to give the LLM during testing and what kind of knowledge it should deal with. +As for the third question, which is about what standards to use, the types of evaluators, how to score, and how to rank, there has not been much discussion. +In this paper, we analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourced, and public annotators as well as GPT-4, with different scoring methods and ranking systems. +We propose a new dataset, LLMEval, and conduct evaluations on 20 LLMs. +A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results. +We perform comparisons and analyses of different settings and draw 10 conclusions that can provide insights for evaluating LLMs in the future. 
The dataset and the results are publicly available at +https://github.com/llmeval. +The version with the appendix is publicly available at https://arxiv.org/abs/2312.07398. \ No newline at end of file diff --git a/data/2024/aaai/LLMGuard: Guarding against Unsafe LLM Behavior b/data/2024/aaai/LLMGuard: Guarding against Unsafe LLM Behavior new file mode 100644 index 0000000000..821153865c --- /dev/null +++ b/data/2024/aaai/LLMGuard: Guarding against Unsafe LLM Behavior @@ -0,0 +1,2 @@ +Although the rise of Large Language Models (LLMs) in enterprise settings brings new opportunities and capabilities, it also brings challenges, such as the risk of generating inappropriate, biased, or misleading content that violates regulations and can raise legal concerns. +To alleviate this, we present "LLMGuard", a tool that monitors user interactions with an LLM application and flags content against specific behaviours or conversation topics. To do this robustly, LLMGuard employs an ensemble of detectors. \ No newline at end of file diff --git a/data/2024/aaai/LLMRG: Improving Recommendations through Large Language Model Reasoning Graphs b/data/2024/aaai/LLMRG: Improving Recommendations through Large Language Model Reasoning Graphs new file mode 100644 index 0000000000..1278b97b5e --- /dev/null +++ b/data/2024/aaai/LLMRG: Improving Recommendations through Large Language Model Reasoning Graphs @@ -0,0 +1 @@ +Recommendation systems aim to provide users with relevant suggestions, but often lack interpretability and fail to capture higher-level semantic relationships between user behaviors and profiles. In this paper, we propose a novel approach that leverages large language models (LLMs) to construct personalized reasoning graphs. These graphs link a user's profile and behavioral sequences through causal and logical inferences, representing the user's interests in an interpretable way. Our approach, LLM reasoning graphs (LLMRG), has four components: chained graph reasoning, divergent extension, self-verification and scoring, and knowledge base self-improvement. The resulting reasoning graph is encoded using graph neural networks, and serves as additional input to improve conventional recommender systems, without requiring extra user or item information. Our approach demonstrates how LLMs can enable more logical and interpretable recommender systems through personalized reasoning graphs. LLMRG allows recommendations to benefit from both engineered recommendation systems and LLM-derived reasoning graphs. We demonstrate the effectiveness of LLMRG on benchmarks and real-world scenarios in enhancing base recommendation models. \ No newline at end of file diff --git a/data/2024/aaai/LMD: Faster Image Reconstruction with Latent Masking Diffusion b/data/2024/aaai/LMD: Faster Image Reconstruction with Latent Masking Diffusion new file mode 100644 index 0000000000..0c32b250ea --- /dev/null +++ b/data/2024/aaai/LMD: Faster Image Reconstruction with Latent Masking Diffusion @@ -0,0 +1 @@ +As a class of fruitful approaches, diffusion probabilistic models (DPMs) have shown excellent advantages in high-resolution image reconstruction. On the other hand, masked autoencoders (MAEs), as popular self-supervised vision learners, have demonstrated simpler and more effective image reconstruction and transfer capabilities on downstream tasks. 
However, they all incur extremely high training costs, either due to inherent high temporal dependence (i.e., excessively long diffusion steps) or due to artificially low spatial dependence (i.e., a human-formulated high mask ratio, such as 0.75). To this end, this paper presents LMD, a faster image reconstruction framework with Latent Masking Diffusion. First, we propose to project and reconstruct images in latent space through a pre-trained variational autoencoder, which is theoretically more efficient than in the pixel-based space. Then, we combine the advantages of MAEs and DPMs to design a progressive masking diffusion model, which gradually increases the masking proportion by three different schedulers and reconstructs the latent features from simple to difficult, without sequentially performing denoising diffusion as in DPMs or using a fixed high masking ratio as in MAEs, so as to alleviate the problem of high training time consumption. Our approach allows for learning high-capacity models and accelerates their training (by 3x or more) while barely reducing the original accuracy. Inference speed on downstream tasks also significantly outperforms that of previous approaches. \ No newline at end of file diff --git a/data/2024/aaai/LR-XFL: Logical Reasoning-Based Explainable Federated Learning b/data/2024/aaai/LR-XFL: Logical Reasoning-Based Explainable Federated Learning new file mode 100644 index 0000000000..da2385d221 --- /dev/null +++ b/data/2024/aaai/LR-XFL: Logical Reasoning-Based Explainable Federated Learning @@ -0,0 +1 @@ +Federated learning (FL) is an emerging approach for training machine learning models collaboratively while preserving data privacy. The need for privacy protection makes it difficult for FL models to achieve global transparency and explainability. To address this limitation, we incorporate logic-based explanations into FL by proposing the Logical Reasoning-based eXplainable Federated Learning (LR-XFL) approach. Under LR-XFL, FL clients create local logic rules based on their local data and send them, along with model updates, to the FL server. The FL server connects the local logic rules through a proper logical connector that is derived based on properties of the client data, without requiring access to the raw data. In addition, the server also aggregates the local model updates with weight values determined by the quality of the clients’ local data as reflected by their uploaded logic rules. The results show that LR-XFL outperforms the most relevant baseline by 1.19%, 5.81% and 5.41% in terms of classification accuracy, rule accuracy and rule fidelity, respectively. The explicit rule evaluation and expression under LR-XFL enable human experts to validate and correct the rules on the server side, hence improving the global FL model’s robustness to errors. It has the potential to enhance the transparency of FL models for areas like healthcare and finance where both data privacy and explainability are important. 
\ No newline at end of file diff --git a/data/2024/aaai/LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network b/data/2024/aaai/LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network new file mode 100644 index 0000000000..123b0802f8 --- /dev/null +++ b/data/2024/aaai/LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network @@ -0,0 +1 @@ +Recently, regression-based methods, which predict parameterized text shapes for text localization, have gained popularity in scene text detection. However, the existing parameterized text shape methods still have limitations in modeling arbitrary-shaped texts due to ignoring the utilization of text-specific shape information. Moreover, the time consumption of the entire pipeline has been largely overlooked, leading to a suboptimal overall inference speed. To address these issues, we first propose a novel parameterized text shape method based on low-rank approximation. Unlike other shape representation methods that employ data-irrelevant parameterization, our approach utilizes singular value decomposition and reconstructs the text shape using a few eigenvectors learned from labeled text contours. By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation. Next, we propose a dual assignment scheme for speed acceleration. It adopts a sparse assignment branch to accelerate the inference speed, and meanwhile, provides ample supervised signals for training through a dense assignment branch. Building upon these designs, we implement an accurate and efficient arbitrary-shaped text detector named LRANet. Extensive experiments are conducted on several challenging benchmarks, demonstrating the superior accuracy and efficiency of LRANet compared to state-of-the-art methods. Code is available at: https://github.com/ychensu/LRANet.git \ No newline at end of file diff --git a/data/2024/aaai/LSTKC: Long Short-Term Knowledge Consolidation for Lifelong Person Re-identification b/data/2024/aaai/LSTKC: Long Short-Term Knowledge Consolidation for Lifelong Person Re-identification new file mode 100644 index 0000000000..ada193f5f9 --- /dev/null +++ b/data/2024/aaai/LSTKC: Long Short-Term Knowledge Consolidation for Lifelong Person Re-identification @@ -0,0 +1 @@ +Lifelong person re-identification (LReID) aims to train a unified model from diverse data sources step by step. The severe domain gaps between different training steps result in catastrophic forgetting in LReID, and existing methods mainly rely on data replay and knowledge distillation techniques to handle this issue. However, the former solution needs to store historical exemplars which inevitably impedes data privacy. The existing knowledge distillation-based models usually retain all the knowledge of the learned old models without any selections, which will inevitably include erroneous and detrimental knowledge that severely impacts the learning performance of the new model. To address these issues, we propose an exemplar-free LReID method named LongShort Term Knowledge Consolidation (LSTKC) that contains a Rectification-based Short-Term Knowledge Transfer module (R-STKT) and an Estimation-based Long-Term Knowledge Consolidation module (E-LTKC). 
For each learning iteration within one training step, R-STKT aims to filter and rectify the erroneous knowledge contained in the old model and transfer the rectified knowledge to facilitate the short-term learning of the new model. Meanwhile, once one training step is finished, E-LTKC proposes to further consolidate the learned long-term knowledge via adaptively fusing the parameters of models from different steps. Consequently, experimental results show that our LSTKC exceeds the state-of-the-art methods by 6.3%/9.4% and 7.9%/4.5%, 6.4%/8.0% and 9.0%/5.5% average mAP/R@1 on seen and unseen domains under two different training orders of the challenging LReID benchmark respectively. \ No newline at end of file diff --git a/data/2024/aaai/LaMAR: Laplacian Pyramid for Multimodal Adaptive Super Resolution (Student Abstract) b/data/2024/aaai/LaMAR: Laplacian Pyramid for Multimodal Adaptive Super Resolution (Student Abstract) new file mode 100644 index 0000000000..6cc8add2bc --- /dev/null +++ b/data/2024/aaai/LaMAR: Laplacian Pyramid for Multimodal Adaptive Super Resolution (Student Abstract) @@ -0,0 +1 @@ +Recent advances in image-to-image translation involve the integration of non-visual imagery in deep models. Non-visual sensors, although more costly, often produce low-resolution images. To combat this, methods using RGB images to enhance the resolution of these modalities have been introduced. Fusing these modalities to achieve high-resolution results demands models with millions of parameters and extended inference times. We present LaMAR, a lightweight model. It employs Laplacian image pyramids combined with a low-resolution thermal image for Guided Thermal Super Resolution. By decomposing the RGB image into a Laplacian pyramid, LaMAR preserves image details and avoids high-resolution feature map computations, ensuring efficiency. With faster inference times and fewer parameters, our model demonstrates state-of-the-art results. \ No newline at end of file diff --git a/data/2024/aaai/LaViP: Language-Grounded Visual Prompting b/data/2024/aaai/LaViP: Language-Grounded Visual Prompting new file mode 100644 index 0000000000..a4d005dd37 --- /dev/null +++ b/data/2024/aaai/LaViP: Language-Grounded Visual Prompting @@ -0,0 +1 @@ +We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks. By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder, eliminating the need to modify or add to the model's parameters. Due to this design choice, our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained. We will empirically demonstrate that, compared to prior art, grounding visual prompts with language enhances both the accuracy and speed of adaptation. Moreover, our algorithm excels in base-to-novel class generalization, overcoming limitations of visual prompting and exhibiting the capacity to generalize beyond seen classes. We thoroughly assess and evaluate our method across a variety of image recognition datasets, such as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning situations, including few-shot adaptation, base-to-novel class generalization, and transfer learning. 
\ No newline at end of file diff --git a/data/2024/aaai/Label Attentive Distillation for GNN-Based Graph Classification b/data/2024/aaai/Label Attentive Distillation for GNN-Based Graph Classification new file mode 100644 index 0000000000..0b6bb10aa6 --- /dev/null +++ b/data/2024/aaai/Label Attentive Distillation for GNN-Based Graph Classification @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have emerged as a powerful tool for modeling graph-structured data, exhibiting remarkable potential in applications such as social networks, recommendation systems, and molecular structures. However, the conventional GNNs perform node-level feature aggregation from neighbors without considering graph-label information, which leads to the misaligned embedding problem that may cause a detrimental effect on graph-level tasks such as graph classification. In this paper, we propose a novel label-attentive distillation method called LAD-GNN for graph representation learning to solve this problem. It alternatively trains a teacher model and a student GNN with a distillation-based approach. In the teacher model, a label-attentive encoder is proposed to encode the label information fusing with the node features to generate ideal embedding. In the student model, the ideal embedding is used as intermediate supervision to urge the student GNN to learn class-friendly node embedding to facilitate graph-level tasks. Generally, LAD-GNN is an enhanced GNN training approach that can be incorporated with arbitrary GNN backbone to improve performance without significant increase of computational cost. Extensive experiments with 7 GNN backbones based on 10 benchmark datasets show that LAD-GNN improves the SOTA GNNs in graph classification accuracy. The source codes of LAD-GNN are publicly available on https://github.com/XiaobinHong/LAD-GNN. \ No newline at end of file diff --git a/data/2024/aaai/Label-Efficient Few-Shot Semantic Segmentation with Unsupervised Meta-Training b/data/2024/aaai/Label-Efficient Few-Shot Semantic Segmentation with Unsupervised Meta-Training new file mode 100644 index 0000000000..0c0f214bd6 --- /dev/null +++ b/data/2024/aaai/Label-Efficient Few-Shot Semantic Segmentation with Unsupervised Meta-Training @@ -0,0 +1 @@ +The goal of this paper is to alleviate the training cost for few-shot semantic segmentation (FSS) models. Despite that FSS in nature improves model generalization to new concepts using only a handful of test exemplars, it relies on strong supervision from a considerable amount of labeled training data for base classes. However, collecting pixel-level annotations is notoriously expensive and time-consuming, and small-scale training datasets convey low information density that limits test-time generalization. To resolve the issue, we take a pioneering step towards label-efficient training of FSS models from fully unlabeled training data, or additionally a few labeled samples to enhance the performance. This motivates an approach based on a novel unsupervised meta-training paradigm. In particular, the approach first distills pre-trained unsupervised pixel embedding into compact semantic clusters from which a massive number of pseudo meta-tasks is constructed. To mitigate the noise in the pseudo meta-tasks, we further advocate a robust Transformer-based FSS model with a novel prototype-based cross-attention design. 
Extensive experiments have been conducted on two standard benchmarks, i.e., PASCAL-5i and COCO-20i, and the results show that our method produces impressive performance without any annotations, and is comparable to fully supervised competitors even using only 20% of the annotations. Our code is available at: https://github.com/SSSKYue/UMTFSS. \ No newline at end of file diff --git a/data/2024/aaai/Labels Need Prompts Too: Mask Matching for Natural Language Understanding Tasks b/data/2024/aaai/Labels Need Prompts Too: Mask Matching for Natural Language Understanding Tasks new file mode 100644 index 0000000000..6e412163ad --- /dev/null +++ b/data/2024/aaai/Labels Need Prompts Too: Mask Matching for Natural Language Understanding Tasks @@ -0,0 +1 @@ +Textual label names (descriptions) are typically semantically rich in many natural language understanding (NLU) tasks. In this paper, we incorporate the prompting methodology, which is widely used to enrich model input, into the label side for the first time. Specifically, we propose a Mask Matching method, which equips an input with a prompt and its label with another, and then makes predictions by matching their mask representations. We evaluate our method extensively on 8 NLU tasks with 14 datasets. The experimental results show that Mask Matching significantly outperforms its counterparts of fine-tuning and conventional prompt-tuning, setting up state-of-the-art performances in several datasets. Mask Matching is particularly good at handling NLU tasks with large label counts and informative label names. As pioneering efforts that investigate the label-side prompt, we also discuss open issues for future study. \ No newline at end of file diff --git a/data/2024/aaai/LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement b/data/2024/aaai/LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement new file mode 100644 index 0000000000..f1a2e2029c --- /dev/null +++ b/data/2024/aaai/LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement @@ -0,0 +1,3 @@ +Understanding road structures is crucial for autonomous driving. Intricate road structures are often depicted using lane graphs, which include centerline curves and connections forming a Directed Acyclic Graph (DAG). Accurate extraction of lane graphs relies on precisely estimating vertex and edge information within the DAG. +Recent research highlights Transformer-based language models' impressive sequence prediction abilities, making them effective for learning graph representations when graph data are encoded as sequences. However, existing studies focus mainly on modeling vertices explicitly, leaving edge information simply embedded in the network. +Consequently, these approaches fall short in the task of lane graph extraction. To address this, we introduce LaneGraph2Seq, a novel approach for lane graph extraction. It leverages a language model with vertex-edge encoding and connectivity enhancement. Our serialization strategy includes a vertex-centric depth-first traversal and a concise edge-based partition sequence. Additionally, we use classifier-free guidance combined with nucleus sampling to improve lane connectivity. We validate our method on prominent datasets, nuScenes and Argoverse 2, showcasing consistent and compelling results. 
Our LaneGraph2Seq approach demonstrates superior performance compared to state-of-the-art techniques in lane graph extraction. \ No newline at end of file diff --git a/data/2024/aaai/Language-Guided Transformer for Federated Multi-Label Classification b/data/2024/aaai/Language-Guided Transformer for Federated Multi-Label Classification new file mode 100644 index 0000000000..f44d534487 --- /dev/null +++ b/data/2024/aaai/Language-Guided Transformer for Federated Multi-Label Classification @@ -0,0 +1 @@ +Federated Learning (FL) is an emerging paradigm that enables multiple users to collaboratively train a robust model in a privacy-preserving manner without sharing their private data. Most existing approaches of FL only consider traditional single-label image classification, ignoring the impact when transferring the task to multi-label image classification. Nevertheless, it is still challenging for FL to deal with user heterogeneity in their local data distribution in the real-world FL scenario, and this issue becomes even more severe in multi-label image classification. Inspired by the recent success of Transformers in centralized settings, we propose a novel FL framework for multi-label classification. Since partial label correlation may be observed by local clients during training, direct aggregation of locally updated models would not produce satisfactory performances. Thus, we propose a novel FL framework of Language-Guided Transformer (FedLGT) to tackle this challenging task, which aims to exploit and transfer knowledge across different clients for learning a robust global model. Through extensive experiments on various multi-label datasets (e.g., FLAIR, MS-COCO, etc.), we show that our FedLGT is able to achieve satisfactory performance and outperforms standard FL techniques under multi-label FL scenarios. Code is available at https://github.com/Jack24658735/FedLGT. \ No newline at end of file diff --git a/data/2024/aaai/Large Language Models Are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales b/data/2024/aaai/Large Language Models Are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales new file mode 100644 index 0000000000..561f9a28a6 --- /dev/null +++ b/data/2024/aaai/Large Language Models Are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales @@ -0,0 +1 @@ +Machine reasoning has made great progress in recent years owing to large language models (LLMs). In the clinical domain, however, most NLP-driven projects mainly focus on clinical classification or reading comprehension, and under-explore clinical reasoning for disease diagnosis due to the expensive rationale annotation with clinicians. In this work, we present a "reasoning-aware" diagnosis framework that rationalizes the diagnostic process via prompt-based learning in a time- and labor-efficient manner, and learns to reason over the prompt-generated rationales. Specifically, we address the clinical reasoning for disease diagnosis, where the LLM generates diagnostic rationales providing its insight on presented patient data and the reasoning path towards the diagnosis, namely Clinical Chain-of-Thought (Clinical CoT). We empirically demonstrate LLMs/LMs' ability of clinical reasoning via extensive experiments and analyses on both rationale generation and disease diagnosis in various settings. 
We further propose a novel set of criteria for evaluating machine-generated rationales' potential for real-world clinical settings, facilitating and benefiting future research in this area. \ No newline at end of file diff --git a/data/2024/aaai/Large Language Models Are Neurosymbolic Reasoners b/data/2024/aaai/Large Language Models Are Neurosymbolic Reasoners new file mode 100644 index 0000000000..37ecb64f06 --- /dev/null +++ b/data/2024/aaai/Large Language Models Are Neurosymbolic Reasoners @@ -0,0 +1 @@ +A wide range of real-world applications is characterized by its symbolic nature, necessitating a strong capability for symbolic reasoning. This paper investigates the potential application of Large Language Models (LLMs) as symbolic reasoners. We focus on text-based games, significant benchmarks for agents with natural language capabilities, particularly in symbolic tasks like math, map reading, sorting, and applying common sense in text-based worlds. To facilitate these agents, we propose an LLM agent designed to tackle symbolic challenges and achieve in-game objectives. We begin by initializing the LLM agent and informing it of its role. The agent then receives observations and a set of valid actions from the text-based games, along with a specific symbolic module. With these inputs, the LLM agent chooses an action and interacts with the game environments. Our experimental results demonstrate that our method significantly enhances the capability of LLMs as automated agents for symbolic reasoning, and our LLM agent is effective in text-based games involving symbolic tasks, achieving an average performance of 88% across all tasks. \ No newline at end of file diff --git a/data/2024/aaai/Large Language Models as Planning Domain Generators (Student Abstract) b/data/2024/aaai/Large Language Models as Planning Domain Generators (Student Abstract) new file mode 100644 index 0000000000..16df4ed635 --- /dev/null +++ b/data/2024/aaai/Large Language Models as Planning Domain Generators (Student Abstract) @@ -0,0 +1,14 @@ +The creation of planning models, and in particular domain +models, is among the last bastions of tasks that require +extensive manual labor in AI planning; it is desirable to simplify +this process for the sake of making planning more +accessible. To this end, we investigate whether large language +models (LLMs) can be used to generate planning domain models +from textual descriptions. We propose a novel task for this +as well as a means of automated evaluation for generated +domains by comparing the sets of plans for domain instances. +Finally, we perform an empirical analysis of 7 large language +models, including coding and chat models across 9 different +planning domains. Our results show that LLMs, particularly +larger ones, exhibit some level of proficiency in generating +correct planning domains from natural language descriptions. \ No newline at end of file diff --git a/data/2024/aaai/Large Occluded Human Image Completion via Image-Prior Cooperating b/data/2024/aaai/Large Occluded Human Image Completion via Image-Prior Cooperating new file mode 100644 index 0000000000..fcd64409b8 --- /dev/null +++ b/data/2024/aaai/Large Occluded Human Image Completion via Image-Prior Cooperating @@ -0,0 +1 @@ +The completion of large occluded human body images poses a unique challenge for general image completion methods. The complex shape variations of human bodies make it difficult to establish a consistent understanding of their structures. 
Furthermore, as human vision is highly sensitive to human bodies, even slight artifacts can significantly compromise image fidelity. To address these challenges, we propose a large occluded human image completion (LOHC) model based on a novel image-prior cooperative completion strategy. Our model leverages human segmentation maps as a prior, and completes the image and prior simultaneously. Compared to the widely adopted prior-then-image completion strategy for object completion, this cooperative completion process fosters more effective interaction between the prior and image information. Our model consists of two stages. The first stage is a transformer-based auto-regressive network that predicts the overall structure of the missing area by generating a coarse completed image at a lower resolution. The second stage is a convolutional network that refines the coarse image. As the coarse result may not always be accurate, we propose a Dynamic Fusion Module (DFM) to selectively fuse the useful features from the coarse image with the original input at spatial and channel levels. Through extensive experiments, we demonstrate our method’s superior performance compared to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Large-Scale Multi-Robot Coverage Path Planning via Local Search b/data/2024/aaai/Large-Scale Multi-Robot Coverage Path Planning via Local Search new file mode 100644 index 0000000000..d4652c8d87 --- /dev/null +++ b/data/2024/aaai/Large-Scale Multi-Robot Coverage Path Planning via Local Search @@ -0,0 +1,2 @@ +We study graph-based Multi-Robot Coverage Path Planning (MCPP) that aims to compute coverage paths for multiple robots to cover all vertices of a given 2D grid terrain graph G. Existing graph-based MCPP algorithms first compute a tree cover on G---a forest of multiple trees that cover all vertices---and then employ the Spanning Tree Coverage (STC) paradigm to generate coverage paths on the decomposed graph D of the terrain graph G by circumnavigating the edges of the computed trees, aiming to optimize the makespan (i.e., the maximum coverage path cost among all robots). 
\ No newline at end of file diff --git a/data/2024/aaai/Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization b/data/2024/aaai/Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization new file mode 100644 index 0000000000..9a2bd90df1 --- /dev/null +++ b/data/2024/aaai/Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization @@ -0,0 +1 @@ +Distributionally robust optimization (DRO) is a powerful framework for training robust models against data distribution shifts. This paper focuses on constrained DRO, which has an explicit characterization of the robustness level. Existing studies on constrained DRO mostly focus on convex loss function, and exclude the practical and challenging case with non-convex loss function, e.g., neural network. This paper develops a stochastic algorithm and its performance analysis for non-convex constrained DRO. The computational complexity of our stochastic algorithm at each iteration is independent of the overall dataset size, and thus is suitable for large-scale applications. We focus on the general Cressie-Read family divergence defined uncertainty set which includes chi^2-divergences as a special case. We prove that our algorithm finds an epsilon-stationary point with an improved computational complexity than existing methods. Our method also applies to the smoothed conditional value at risk (CVaR) DRO. \ No newline at end of file diff --git a/data/2024/aaai/Latent Diffusion Transformer for Probabilistic Time Series Forecasting b/data/2024/aaai/Latent Diffusion Transformer for Probabilistic Time Series Forecasting new file mode 100644 index 0000000000..197bb040ed --- /dev/null +++ b/data/2024/aaai/Latent Diffusion Transformer for Probabilistic Time Series Forecasting @@ -0,0 +1 @@ +The probability prediction of multivariate time series is a notoriously challenging but practical task. This research proposes to condense high-dimensional multivariate time series forecasting into a problem of latent space time series generation, to improve the expressiveness of each timestamp and make forecasting more manageable. To solve the problem that the existing work is hard to extend to high-dimensional multivariate time series, we present a latent multivariate time series diffusion framework called Latent Diffusion Transformer (LDT), which consists of a symmetric statistics-aware autoencoder and a diffusion-based conditional generator, to implement this idea. Through careful design, the time series autoencoder can compress multivariate timestamp patterns into a concise latent representation by considering dynamic statistics. Then, the diffusion-based conditional generator is able to efficiently generate realistic multivariate timestamp values on a continuous latent space under a novel self-conditioning guidance which is modeled in a non-autoregressive way. Extensive experiments demonstrate that our model achieves state-of-the-art performance on many popular high-dimensional multivariate time series datasets. \ No newline at end of file diff --git a/data/2024/aaai/Latent Space Editing in Transformer-Based Flow Matching b/data/2024/aaai/Latent Space Editing in Transformer-Based Flow Matching new file mode 100644 index 0000000000..91c9998d96 --- /dev/null +++ b/data/2024/aaai/Latent Space Editing in Transformer-Based Flow Matching @@ -0,0 +1 @@ +This paper strives for image editing via generative models. 
Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but their latent structure and editing ability are as of yet unknown. Hence, we adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call u-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. Our code will be publicly available at https://taohu.me/lfm/ \ No newline at end of file diff --git a/data/2024/aaai/LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction b/data/2024/aaai/LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction new file mode 100644 index 0000000000..326b5ce84f --- /dev/null +++ b/data/2024/aaai/LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction @@ -0,0 +1 @@ +Data contamination in evaluation is getting increasingly prevalent with the emergence of language models pre-trained on super large, automatically crawled corpora. This problem leads to significant challenges in the accurate assessment of model capabilities and generalisations. In this paper, we propose LatestEval, an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations. LatestEval avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models. We develop the LatestEval automated pipeline to 1) gather the latest texts; 2) identify key information, and 3) construct questions targeting the information while removing the existing answers from the context. This encourages models to infer the answers themselves based on the remaining context, rather than just copy-paste. Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks, suggesting a significantly reduced risk of data contamination and leading to a more robust evaluation. Data and code are publicly available at: https://github.com/liyucheng09/LatestEval. 
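To make the LatestEval test-construction step above concrete, here is a toy Python sketch under strong simplifying assumptions: a single regex for four-digit years stands in for "key information", and a fixed cloze-style question replaces the paper's question generation. None of the function or variable names come from the LatestEval codebase.

```python
import re

def make_cloze_item(passage: str):
    """Pick a four-digit year as the 'key information', remove it from the
    context, and ask a question answerable only from the remaining text."""
    match = re.search(r"\b(19|20)\d{2}\b", passage)
    if match is None:
        return None
    answer = match.group(0)
    context = passage[:match.start()] + "____" + passage[match.end():]
    question = "Which year fills the blank in the passage above?"
    return {"context": context, "question": question, "answer": answer}

# Toy usage on a made-up, recently published sentence.
print(make_cloze_item("The observatory, completed in 2023, sits on a ridge above the valley."))
```

The actual pipeline additionally restricts passages to a recent publication window and targets richer kinds of key information than dates; this sketch only illustrates the remove-the-answer idea.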
\ No newline at end of file diff --git a/data/2024/aaai/Layer Attack Unlearning: Fast and Accurate Machine Unlearning via Layer Level Attack and Knowledge Distillation b/data/2024/aaai/Layer Attack Unlearning: Fast and Accurate Machine Unlearning via Layer Level Attack and Knowledge Distillation new file mode 100644 index 0000000000..3ad6913c2c --- /dev/null +++ b/data/2024/aaai/Layer Attack Unlearning: Fast and Accurate Machine Unlearning via Layer Level Attack and Knowledge Distillation @@ -0,0 +1 @@ +Recently, serious concerns have been raised about the privacy issues related to training datasets in machine learning algorithms when they include personal data. Various regulations in different countries, including the GDPR, grant individuals the right to have their personal data erased, known as ‘the right to be forgotten’ or ‘the right to erasure’. However, there has been less research on effectively and practically deleting the requested personal data from the training set while not jeopardizing the overall machine learning performance. In this work, we propose a novel machine unlearning paradigm at the layer level called layer attack unlearning, which is highly accurate and fast compared to existing machine unlearning algorithms. We introduce the Partial-PGD algorithm to efficiently locate the samples to forget. In addition, inspired by the Forward-Forward algorithm, we use only the last layer of the model for the unlearning process. Lastly, we use Knowledge Distillation (KD) to reliably learn the decision boundaries from the teacher using soft label information to improve accuracy. We conducted extensive experiments with SOTA machine unlearning models and demonstrated the effectiveness of our approach for accuracy and end-to-end unlearning performance. \ No newline at end of file diff --git a/data/2024/aaai/Layer Collaboration in the Forward-Forward Algorithm b/data/2024/aaai/Layer Collaboration in the Forward-Forward Algorithm new file mode 100644 index 0000000000..53bcc41896 --- /dev/null +++ b/data/2024/aaai/Layer Collaboration in the Forward-Forward Algorithm @@ -0,0 +1 @@ +Backpropagation, which uses the chain rule, is the de-facto standard algorithm for optimizing neural networks nowadays. Recently, Hinton (2022) proposed the forward-forward algorithm, a promising alternative that optimizes neural nets layer-by-layer, without propagating gradients throughout the network. Although such an approach has several advantages over backpropagation and shows promising results, the fact that each layer is being trained independently limits the optimization process. Specifically, it prevents the network's layers from collaborating to learn complex and rich features. In this work, we study layer collaboration in the forward-forward algorithm. We show that the current version of the forward-forward algorithm is suboptimal when considering information flow in the network, resulting in a lack of collaboration between layers of the network. We propose an improved version that supports layer collaboration to better utilize the network structure, while not requiring any additional assumptions or computations. We empirically demonstrate the efficacy of the proposed version when considering both information flow and objective metrics. Additionally, we provide a theoretical motivation for the proposed method, inspired by functional entropy theory. 
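For readers unfamiliar with the layer-local objective referenced above, here is a minimal numpy sketch of the per-layer "goodness" score from the forward-forward algorithm (Hinton 2022): each layer is trained so that positive samples have goodness above a threshold and negative samples below it. The layer-collaboration mechanism proposed in the paper is not shown, and the loss form below is one common choice rather than the paper's exact formulation.

```python
import numpy as np

def layer_goodness(weights, x):
    """Goodness of one layer: sum of squared ReLU activations per sample."""
    h = np.maximum(x @ weights, 0.0)
    return np.sum(h ** 2, axis=-1), h

def ff_layer_loss(weights, x_pos, x_neg, theta=2.0):
    """Logistic loss pushing positive goodness above theta and negative below it."""
    g_pos, _ = layer_goodness(weights, x_pos)
    g_neg, _ = layer_goodness(weights, x_neg)
    loss_pos = np.log1p(np.exp(-(g_pos - theta)))   # want g_pos > theta
    loss_neg = np.log1p(np.exp(g_neg - theta))      # want g_neg < theta
    return np.mean(loss_pos) + np.mean(loss_neg)
```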
\ No newline at end of file diff --git a/data/2024/aaai/Layer Compression of Deep Networks with Straight Flows b/data/2024/aaai/Layer Compression of Deep Networks with Straight Flows new file mode 100644 index 0000000000..fa6867a658 --- /dev/null +++ b/data/2024/aaai/Layer Compression of Deep Networks with Straight Flows @@ -0,0 +1,7 @@ +Very deep neural networks lead to significantly better performance on various real tasks. However, such depth usually causes slow inference and makes models hard to deploy on real-world devices. How to reduce the number of layers to save memory and accelerate inference is therefore an appealing topic. + In this work, we introduce an intermediate objective, a continuous-time network, before distilling deep networks into shallow networks. + First, we distill a given deep network into a continuous-time neural flow model, which can be discretized with an ODE solver and whose inference requires passing through the network multiple times. + By forcing the flow transport trajectory to be straight lines, we find that it is easier to compress the infinite-step model into a one-step neural flow model, which only requires passing through the flow model once. + Second, we refine the one-step flow model together with the final head layer via knowledge distillation; finally, we can replace the given deep network with this one-step flow network. + Empirically, we demonstrate that our method outperforms direct distillation and other baselines on different model architectures (e.g., ResNet, ViT) on image classification and semantic segmentation tasks. + We also show that our distilled model naturally serves as an early-exit dynamic inference model. \ No newline at end of file diff --git a/data/2024/aaai/Layer-Wise Representation Fusion for Compositional Generalization b/data/2024/aaai/Layer-Wise Representation Fusion for Compositional Generalization new file mode 100644 index 0000000000..d2e5d74c7e --- /dev/null +++ b/data/2024/aaai/Layer-Wise Representation Fusion for Compositional Generalization @@ -0,0 +1 @@ +Existing neural models have been shown to struggle with compositional generalization (CG), i.e., the ability to systematically generalize to unseen compositions of seen components. A key reason for failure on CG is that the syntactic and semantic representations of sequences in the uppermost layers of both the encoder and decoder are entangled. However, previous work concentrates on separating the learning of syntax and semantics instead of exploring the reasons behind the representation entanglement (RE) problem in order to solve it. We explain why it exists by analyzing the representation evolving mechanism from the bottom to the top of the Transformer layers. We find that the ``shallow'' residual connections within each layer fail to fuse previous layers' information effectively, leading to information forgetting between layers and, in turn, the RE problem. Inspired by this, we propose LRF, a novel Layer-wise Representation Fusion framework for CG, which learns to fuse previous layers' information back into the encoding and decoding process effectively by introducing a fuse-attention module at each encoder and decoder layer. LRF achieves promising results on two realistic benchmarks, empirically demonstrating the effectiveness of our proposal. Codes are available at https://github.com/thinkaboutzero/LRF. 
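A rough sketch of the layer-wise fusion idea described above, not the exact LRF module: instead of relying only on a plain residual connection, the current hidden state attends over the outputs of all previous layers and the result is folded back in. Shapes, scaling, and the additive combination are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_previous_layers(current, history):
    """current: (d,) hidden state of the present layer.
    history: list of (d,) outputs from all earlier layers.
    Returns a fused representation via simple dot-product attention."""
    H = np.stack(history)                         # (L, d) stack of earlier layers
    scores = H @ current / np.sqrt(current.size)  # similarity of each layer to the current state
    weights = softmax(scores)                     # attention over previous layers
    fused = weights @ H                           # weighted sum of earlier representations
    return current + fused                        # fold the fused memory back in
```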
\ No newline at end of file diff --git a/data/2024/aaai/Learn How to See: Collaborative Embodied Learning for Object Detection and Camera Adjusting b/data/2024/aaai/Learn How to See: Collaborative Embodied Learning for Object Detection and Camera Adjusting new file mode 100644 index 0000000000..38e54cedd5 --- /dev/null +++ b/data/2024/aaai/Learn How to See: Collaborative Embodied Learning for Object Detection and Camera Adjusting @@ -0,0 +1 @@ +Passive object detectors, trained on large-scale static datasets, often overlook the feedback from object detection to image acquisition. Embodied vision and active detection mitigate this issue by interacting with the environment. Nevertheless, the materialization of activeness hinges on resource-intensive data collection and annotation. To tackle these challenges, we propose a collaborative student-teacher framework. Technically, a replay buffer is built based on the trajectory data to encapsulate the relationship of state, action, and reward. In addition, the student network diverges from reinforcement learning by redefining sequential decision pathways using a GPT structure enriched with causal self-attention. Moreover, the teacher network establishes a subtle state-reward mapping based on adjacent benefit differences, providing reliable rewards for student adaptively self-tuning with the vast unlabeled replay buffer data. Additionally, an innovative yet straightforward benefit reference value is proposed within the teacher network, adding to its effectiveness and simplicity. Leveraging a flexible replay buffer and embodied collaboration between teacher and student, the framework learns to see before detection with shallower features and shorter inference steps. Experiments highlight significant advantages of our algorithm over state-of-the-art detectors. The code is released at https://github.com/lydonShen/STF. \ No newline at end of file diff --git a/data/2024/aaai/Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation b/data/2024/aaai/Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation new file mode 100644 index 0000000000..24fb67f6ed --- /dev/null +++ b/data/2024/aaai/Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation @@ -0,0 +1 @@ +We propose a novel unsupervised method to autoregressively generate videos from a single frame and a sparse motion input. Our trained model can generate unseen realistic object-to-object interactions. Although our model has never been given the explicit segmentation and motion of each object in the scene during training, it is able to implicitly separate their dynamics and extents. Key components in our method are the randomized conditioning scheme, the encoding of the input motion control, and the randomized and sparse sampling to enable generalization to out of distribution but realistic correlations. Our model, which we call YODA, has therefore the ability to move objects without physically touching them. Through extensive qualitative and quantitative evaluations on several datasets, we show that YODA is on par with or better than state of the art video generation prior work in terms of both controllability and video quality. 
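To make the "sparse motion input" mentioned above concrete, here is a toy encoding one might use; the paper's actual conditioning differs in detail, and the channel layout below is an assumption. User drags are rasterized into a sparse flow map plus a validity mask that a video generator could consume as conditioning.

```python
import numpy as np

def encode_sparse_motion(drags, hw=(64, 64)):
    """drags: list of ((y, x), (dy, dx)) pixel positions with displacement vectors.
    Returns a (3, H, W) array: two flow channels and one validity-mask channel."""
    h, w = hw
    control = np.zeros((3, h, w), dtype=np.float32)
    for (y, x), (dy, dx) in drags:
        control[0, y, x] = dy
        control[1, y, x] = dx
        control[2, y, x] = 1.0   # marks where a control signal is present
    return control

# Two hypothetical drags: push one object right, another slightly up.
control = encode_sparse_motion([((10, 12), (0.0, 5.0)), ((40, 30), (-3.0, 0.0))])
```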
\ No newline at end of file diff --git a/data/2024/aaai/Learn to Follow: Decentralized Lifelong Multi-Agent Pathfinding via Planning and Learning b/data/2024/aaai/Learn to Follow: Decentralized Lifelong Multi-Agent Pathfinding via Planning and Learning new file mode 100644 index 0000000000..dee14d1af6 --- /dev/null +++ b/data/2024/aaai/Learn to Follow: Decentralized Lifelong Multi-Agent Pathfinding via Planning and Learning @@ -0,0 +1 @@ +The Multi-agent Pathfinding (MAPF) problem generally asks for a set of conflict-free paths for a set of agents confined to a graph and is typically solved in a centralized fashion. Conversely, in this work, we investigate the decentralized MAPF setting, in which the central controller that possesses all the information on the agents' locations and goals is absent and the agents have to sequentially decide the actions on their own without having access to the full state of the environment. We focus on the practically important lifelong variant of MAPF, which involves continuously assigning new goals to the agents upon arrival at the previous ones. To address this complex problem, we propose a method that integrates two complementary approaches: planning with heuristic search and reinforcement learning through policy optimization. Planning is utilized to construct and re-plan individual paths. We enhance our planning algorithm with a dedicated technique tailored to avoid congestion and increase the throughput of the system. We employ reinforcement learning to discover the collision avoidance policies that effectively guide the agents along the paths. The policy is implemented as a neural network and is effectively trained without any reward-shaping or external guidance. We evaluate our method on a wide range of setups, comparing it to state-of-the-art solvers. The results show that our method consistently outperforms the learnable competitors, showing higher throughput and better ability to generalize to maps that were unseen at the training stage. Moreover, our solver outperforms a rule-based one in terms of throughput and is an order of magnitude faster than a state-of-the-art search-based solver. The code is available at https://github.com/AIRI-Institute/learn-to-follow. \ No newline at end of file diff --git a/data/2024/aaai/Learning Accurate and Bidirectional Transformation via Dynamic Embedding Transportation for Cross-Domain Recommendation b/data/2024/aaai/Learning Accurate and Bidirectional Transformation via Dynamic Embedding Transportation for Cross-Domain Recommendation new file mode 100644 index 0000000000..dfd0e69ec1 --- /dev/null +++ b/data/2024/aaai/Learning Accurate and Bidirectional Transformation via Dynamic Embedding Transportation for Cross-Domain Recommendation @@ -0,0 +1,2 @@ +With the rapid development of Internet and Web techniques, Cross-Domain Recommendation (CDR) models have been widely explored for resolving the data-sparsity +and cold-start problems. Meanwhile, most CDR models need to utilize explicit domain-shareable information (e.g., overlapped users or items) for knowledge transfer across domains. However, this assumption may not always be satisfied, since users and items are often entirely non-overlapping in real practice. The performance of many previous works will be severely impaired when such domain-shareable information is not available. 
To address the aforementioned issues, we propose the Joint Preference Exploration and Dynamic Embedding Transportation model (JPEDET) in this paper which is a novel framework for solving the CDR problem when users and items are non-overlapped. JPEDET includes two main modules, i.e., joint preference exploration module and dynamic embedding transportation module. The joint preference exploration module aims to fuse rating and review information for modelling user preferences. The dynamic embedding transportation module is set to share knowledge via neural ordinary equations for dual transformation across domains. Moreover, we innovatively propose the dynamic transport flow equipped with linear interpolation guidance on barycentric Wasserstein path for achieving accurate and bidirectional transformation. Our empirical study on Amazon datasets demonstrates that JPEDET significantly outperforms the state-of-the-art models under the CDR setting. \ No newline at end of file diff --git a/data/2024/aaai/Learning Broadcast Protocols b/data/2024/aaai/Learning Broadcast Protocols new file mode 100644 index 0000000000..0fdc37a52e --- /dev/null +++ b/data/2024/aaai/Learning Broadcast Protocols @@ -0,0 +1 @@ +The problem of learning a computational model from examples has been receiving growing attention. For the particularly challenging problem of learning models of distributed systems, existing results are restricted to models with a fixed number of interacting processes. In this work we look for the first time (to the best of our knowledge) at the problem of learning a distributed system with an arbitrary number of processes, assuming only that there exists a cutoff, i.e., a number of processes that is sufficient to produce all observable behaviors. Specifically, we consider fine broadcast protocols, these are broadcast protocols (BPs) with a finite cutoff and no hidden states. We provide a learning algorithm that can infer a correct BP from a sample that is consistent with a fine BP, and a minimal equivalent BP if the sample is sufficiently complete. On the negative side we show that (a) characteristic sets of exponential size are unavoidable, (b) the consistency problem for fine BPs is NP hard, and (c) that fine BPs are not polynomially predictable. \ No newline at end of file diff --git a/data/2024/aaai/Learning Cluster-Wise Anchors for Multi-View Clustering b/data/2024/aaai/Learning Cluster-Wise Anchors for Multi-View Clustering new file mode 100644 index 0000000000..97ee5a9ff7 --- /dev/null +++ b/data/2024/aaai/Learning Cluster-Wise Anchors for Multi-View Clustering @@ -0,0 +1 @@ +Due to its effectiveness and efficiency, anchor based multi-view clustering (MVC) has recently attracted much attention. Most existing approaches try to adaptively learn anchors to construct an anchor graph for clustering. However, they generally focus on improving the diversity among anchors by using orthogonal constraint and ignore the underlying semantic relations, which may make the anchors not representative and discriminative enough. To address this problem, we propose an adaptive Cluster-wise Anchor learning based MVC method, CAMVC for short. We first make an anchor cluster assumption that supposes the prior cluster structure of target anchors by pre-defining a consensus cluster indicator matrix. 
Based on the prior knowledge, an explicit cluster structure of latent anchors is enforced by learning diverse cluster centroids, which can explore both inter-cluster diversity and intra-cluster consistency of anchors, and improve the discriminability of the subspace representation. Extensive experimental results demonstrate the effectiveness and superiority of our proposed method compared with some state-of-the-art MVC approaches. \ No newline at end of file diff --git a/data/2024/aaai/Learning Continuous Implicit Field with Local Distance Indicator for Arbitrary-Scale Point Cloud Upsampling b/data/2024/aaai/Learning Continuous Implicit Field with Local Distance Indicator for Arbitrary-Scale Point Cloud Upsampling new file mode 100644 index 0000000000..2dfd31c4ce --- /dev/null +++ b/data/2024/aaai/Learning Continuous Implicit Field with Local Distance Indicator for Arbitrary-Scale Point Cloud Upsampling @@ -0,0 +1 @@ +Point cloud upsampling aims to generate dense and uniformly distributed point sets from a sparse point cloud, which plays a critical role in 3D computer vision. Previous methods typically split a sparse point cloud into several local patches, upsample patch points, and merge all upsampled patches. However, these methods often produce holes, outliers, or non-uniformity due to the splitting and merging process, which does not maintain consistency among local patches. To address these issues, we propose a novel approach that learns an unsigned distance field guided by local priors for point cloud upsampling. Specifically, we train a local distance indicator (LDI) that predicts the unsigned distance from a query point to a local implicit surface. Utilizing the learned LDI, we learn an unsigned distance field to represent the sparse point cloud with patch consistency. At inference time, we randomly sample queries around the sparse point cloud, and project these query points onto the zero-level set of the learned implicit field to generate a dense point cloud. We justify that the implicit field is naturally continuous, which inherently enables the application of arbitrary-scale upsampling without necessarily retraining for various scales. We conduct comprehensive experiments on both synthetic data and real scans, and report state-of-the-art results under widely used benchmarks. Project page: https://lisj575.github.io/APU-LDI \ No newline at end of file diff --git a/data/2024/aaai/Learning Deformable Hypothesis Sampling for Accurate PatchMatch Multi-View Stereo b/data/2024/aaai/Learning Deformable Hypothesis Sampling for Accurate PatchMatch Multi-View Stereo new file mode 100644 index 0000000000..5268346274 --- /dev/null +++ b/data/2024/aaai/Learning Deformable Hypothesis Sampling for Accurate PatchMatch Multi-View Stereo @@ -0,0 +1 @@ +This paper introduces a learnable Deformable Hypothesis Sampler (DeformSampler) to address the challenging issue of noisy depth estimation in faithful PatchMatch multi-view stereo (MVS). We observe that the heuristic depth hypothesis sampling modes employed by PatchMatch MVS solvers are insensitive to (i) the piece-wise smooth distribution of depths across the object surface and (ii) the implicit multi-modal distribution of depth prediction probabilities along the ray direction on the surface points. 
Accordingly, we develop DeformSampler to learn distribution-sensitive sample spaces to (i) propagate depths consistent with the scene's geometry across the object surface and (ii) fit a Laplace Mixture model that approaches the point-wise probabilities distribution of the actual depths along the ray direction. We integrate DeformSampler into a learnable PatchMatch MVS system to enhance depth estimation in challenging areas, such as piece-wise discontinuous surface boundaries and weakly-textured regions. Experimental results on DTU and Tanks & Temples datasets demonstrate its superior performance and generalization capabilities compared to state-of-the-art competitors. Code is available at https://github.com/Geo-Tell/DS-PMNet. \ No newline at end of file diff --git a/data/2024/aaai/Learning Dense Correspondence for NeRF-Based Face Reenactment b/data/2024/aaai/Learning Dense Correspondence for NeRF-Based Face Reenactment new file mode 100644 index 0000000000..effd764212 --- /dev/null +++ b/data/2024/aaai/Learning Dense Correspondence for NeRF-Based Face Reenactment @@ -0,0 +1 @@ +Face reenactment is challenging due to the need to establish dense correspondence between various face representations for motion transfer. Recent studies have utilized Neural Radiance Field (NeRF) as fundamental representation, which further enhanced the performance of multi-view face reenactment in photo-realism and 3D consistency. However, establishing dense correspondence between different face NeRFs is non-trivial, because implicit representations lack ground-truth correspondence annotations like mesh-based 3D parametric models (e.g., 3DMM) with index-aligned vertexes. Although aligning 3DMM space with NeRF-based face representations can realize motion control, it is sub-optimal for their limited face-only modeling and low identity fidelity. Therefore, we are inspired to ask: Can we learn the dense correspondence between different NeRF-based face representations without a 3D parametric model prior? To address this challenge, we propose a novel framework, which adopts tri-planes as fundamental NeRF representation and decomposes face tri-planes into three components: canonical tri-planes, identity deformations, and motion. In terms of motion control, our key contribution is proposing a Plane Dictionary (PlaneDict) module, which efficiently maps the motion conditions to a linear weighted addition of learnable orthogonal plane bases. To the best of our knowledge, our framework is the first method that achieves one-shot multi-view face reenactment without a 3D parametric model prior. Extensive experiments demonstrate that we produce better results in fine-grained motion control and identity preservation than previous methods. \ No newline at end of file diff --git a/data/2024/aaai/Learning Diffusions under Uncertainty b/data/2024/aaai/Learning Diffusions under Uncertainty new file mode 100644 index 0000000000..ca083f8435 --- /dev/null +++ b/data/2024/aaai/Learning Diffusions under Uncertainty @@ -0,0 +1 @@ +To infer a diffusion network based on observations from historical diffusion processes, existing approaches assume that observation data contain exact occurrence time of each node infection, or at least the eventual infection statuses of nodes in each diffusion process. They determine potential influence relationships between nodes by identifying frequent sequences, or statistical correlations, among node infections. 
In some real-world settings, such as the spread of epidemics, tracing exact infection times is often infeasible due to a high cost; even obtaining precise infection statuses of nodes is a challenging task, since observable symptoms such as headaches only partially reveal a node’s true status. In this work, we investigate how to effectively infer a diffusion network from observation data with uncertainty. Provided with only probabilistic information about node infection statuses, we formulate the problem of diffusion network inference as a constrained nonlinear regression w.r.t. the probabilistic data. An alternating maximization method is designed to solve this regression problem iteratively, and the improvement of solution quality in each iteration can be theoretically guaranteed. Empirical studies are conducted on both synthetic and real-world networks, and the results verify the effectiveness and efficiency of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Learning Discrete-Time Major-Minor Mean Field Games b/data/2024/aaai/Learning Discrete-Time Major-Minor Mean Field Games new file mode 100644 index 0000000000..ab4f38aff3 --- /dev/null +++ b/data/2024/aaai/Learning Discrete-Time Major-Minor Mean Field Games @@ -0,0 +1 @@ +Recent techniques based on Mean Field Games (MFGs) allow the scalable analysis of multi-player games with many similar, rational agents. However, standard MFGs remain limited to homogeneous players that weakly influence each other, and cannot model major players that strongly influence other players, severely limiting the class of problems that can be handled. We propose a novel discrete-time version of major-minor MFGs (M3FGs), along with a learning algorithm based on fictitious play and partitioning the probability simplex. Importantly, M3FGs generalize MFGs with common noise and can handle not only random exogenous environment states but also major players. A key challenge is that the mean field is stochastic and not deterministic as in standard MFGs. Our theoretical investigation verifies both the M3FG model and its algorithmic solution, showing firstly the well-posedness of the M3FG model starting from a finite game of interest, and secondly convergence and approximation guarantees of the fictitious play algorithm. Then, we empirically verify the obtained theoretical results, ablating some of the theoretical assumptions made, and show successful equilibrium learning in three example problems. Overall, we establish a learning framework for a novel and broad class of tractable games. \ No newline at end of file diff --git a/data/2024/aaai/Learning Discriminative Noise Guidance for Image Forgery Detection and Localization b/data/2024/aaai/Learning Discriminative Noise Guidance for Image Forgery Detection and Localization new file mode 100644 index 0000000000..c3cb3f5423 --- /dev/null +++ b/data/2024/aaai/Learning Discriminative Noise Guidance for Image Forgery Detection and Localization @@ -0,0 +1 @@ +This study introduces a new method for detecting and localizing image forgery by focusing on manipulation traces within the noise domain. We posit that nearly invisible noise in RGB images carries tampering traces, useful for distinguishing and locating forgeries. However, the advancement of tampering technology complicates the direct application of noise for forgery detection, as the noise inconsistency between forged and authentic regions is not fully exploited. 
To tackle this, we develop a two-step discriminative noise-guided approach to explicitly enhance the representation and use of noise inconsistencies, thereby fully exploiting noise information to improve the accuracy and robustness of forgery detection. Specifically, we first enhance the noise discriminability of forged regions compared to authentic ones using a de-noising network and a statistics-based constraint. Then, we merge a model-driven guided filtering mechanism with a data-driven attention mechanism to create a learnable and differentiable noise-guided filter. This sophisticated filter allows us to maintain the edges of forged regions learned from the noise. Comprehensive experiments on multiple datasets demonstrate that our method can reliably detect and localize forgeries, surpassing existing state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Learning Diverse Risk Preferences in Population-Based Self-Play b/data/2024/aaai/Learning Diverse Risk Preferences in Population-Based Self-Play new file mode 100644 index 0000000000..680fadbafe --- /dev/null +++ b/data/2024/aaai/Learning Diverse Risk Preferences in Population-Based Self-Play @@ -0,0 +1 @@ +Among the remarkable successes of Reinforcement Learning (RL), self-play algorithms have played a crucial role in solving competitive games. However, current self-play RL methods commonly optimize the agent to maximize the expected win-rates against its current or historical copies, resulting in a limited strategy style and a tendency to get stuck in local optima. To address this limitation, it is important to improve the diversity of policies, allowing the agent to break stalemates and enhance its robustness when facing with different opponents. In this paper, we present a novel perspective to promote diversity by considering that agents could have diverse risk preferences in the face of uncertainty. To achieve this, we introduce a novel reinforcement learning algorithm called Risk-sensitive Proximal Policy Optimization (RPPO), which smoothly interpolates between worst-case and best-case policy learning, enabling policy learning with desired risk preferences. Furthermore, by seamlessly integrating RPPO with population-based self-play, agents in the population optimize dynamic risk-sensitive objectives using experiences gained from playing against diverse opponents. Our empirical results demonstrate that our method achieves comparable or superior performance in competitive games and, importantly, leads to the emergence of diverse behavioral modes. Code is available at https://github.com/Jackory/RPBT. \ No newline at end of file diff --git a/data/2024/aaai/Learning Domain-Independent Heuristics for Grounded and Lifted Planning b/data/2024/aaai/Learning Domain-Independent Heuristics for Grounded and Lifted Planning new file mode 100644 index 0000000000..812d8b4740 --- /dev/null +++ b/data/2024/aaai/Learning Domain-Independent Heuristics for Grounded and Lifted Planning @@ -0,0 +1 @@ +We present three novel graph representations of planning tasks suitable for learning domain-independent heuristics using Graph Neural Networks (GNNs) to guide search. In particular, to mitigate the issues caused by large grounded GNNs we present the first method for learning domain-independent heuristics with only the lifted representation of a planning task. 
We also provide a theoretical analysis of the expressiveness of our models, showing that some are more powerful than STRIPS-HGN, the only other existing model for learning domain-independent heuristics. Our experiments show that our heuristics generalise to much larger problems than those in the training set, vastly surpassing STRIPS-HGN heuristics. \ No newline at end of file diff --git a/data/2024/aaai/Learning Efficient and Robust Multi-Agent Communication via Graph Information Bottleneck b/data/2024/aaai/Learning Efficient and Robust Multi-Agent Communication via Graph Information Bottleneck new file mode 100644 index 0000000000..35e7023409 --- /dev/null +++ b/data/2024/aaai/Learning Efficient and Robust Multi-Agent Communication via Graph Information Bottleneck @@ -0,0 +1 @@ +Efficient communication learning among agents has been shown to be crucial for cooperative multi-agent reinforcement learning (MARL), as it can promote the action coordination of agents and ultimately improve performance. Graph neural networks (GNNs) provide a general paradigm for communication learning, which considers agents and communication channels as nodes and edges in a graph, with action selection corresponding to node labeling. Under such a paradigm, an agent aggregates information from neighbor agents, which can reduce uncertainty in local decision-making and induce implicit action coordination. However, this communication paradigm is vulnerable to adversarial attacks and noise, and how to learn robust and efficient communication under perturbations has largely not been studied. To this end, this paper introduces a novel Multi-Agent communication mechanism via Graph Information bottleneck (MAGI), which can optimally balance the robustness and expressiveness of the message representation learned by agents. This communication mechanism is aimed at learning the minimal sufficient message representation for an agent by maximizing the mutual information (MI) between the message representation and the selected action, and simultaneously constraining the MI between the message representation and the agent feature. Empirical results demonstrate that MAGI is more robust and efficient than state-of-the-art GNN-based MARL methods. \ No newline at end of file diff --git a/data/2024/aaai/Learning Encodings for Constructive Neural Combinatorial Optimization Needs to Regret b/data/2024/aaai/Learning Encodings for Constructive Neural Combinatorial Optimization Needs to Regret new file mode 100644 index 0000000000..da3d43a180 --- /dev/null +++ b/data/2024/aaai/Learning Encodings for Constructive Neural Combinatorial Optimization Needs to Regret @@ -0,0 +1 @@ +Deep-reinforcement-learning (DRL) based neural combinatorial optimization (NCO) methods have demonstrated efficiency without relying on the guidance of optimal solutions. As the most mainstream among them, the learning constructive heuristic (LCH) achieves high-quality solutions through a rapid autoregressive solution construction process. However, these LCH-based methods are deficient in convergence, and there is still a performance gap compared to optimal solutions. Intuitively, learning to regret some steps in the solution construction process is helpful for training efficiency and network representations. This article proposes a novel regret-based mechanism for an advanced solution construction process. Our method can be applied as a plug-in to any existing LCH-based DRL-NCO method. 
Experimental results demonstrate the capability of our work to enhance the performance of various NCO models. Results also show that the proposed LCH-Regret outperforms the previous modification methods on several typical combinatorial optimization problems. The code and Supplementary File are available at https://github.com/SunnyR7/LCH-Regret. \ No newline at end of file diff --git a/data/2024/aaai/Learning Explicit Contact for Implicit Reconstruction of Hand-Held Objects from Monocular Images b/data/2024/aaai/Learning Explicit Contact for Implicit Reconstruction of Hand-Held Objects from Monocular Images new file mode 100644 index 0000000000..40feb8be9e --- /dev/null +++ b/data/2024/aaai/Learning Explicit Contact for Implicit Reconstruction of Hand-Held Objects from Monocular Images @@ -0,0 +1 @@ +Reconstructing hand-held objects from monocular RGB images is an appealing yet challenging task. In this task, contacts between hands and objects provide important cues for recovering the 3D geometry of the hand-held objects. Though recent works have employed implicit functions to achieve impressive progress, they ignore formulating contacts in their frameworks, which results in producing less realistic object meshes. In this work, we explore how to model contacts in an explicit way to benefit the implicit reconstruction of hand-held objects. Our method consists of two components: explicit contact prediction and implicit shape reconstruction. In the first part, we propose a new subtask of directly estimating 3D hand-object contacts from a single image. The part-level and vertex-level graph-based transformers are cascaded and jointly learned in a coarse-to-fine manner for more accurate contact probabilities. In the second part, we introduce a novel method to diffuse estimated contact states from the hand mesh surface to nearby 3D space and leverage diffused contact probabilities to construct the implicit neural representation for the manipulated object. Benefiting from estimating the interaction patterns between the hand and the object, our method can reconstruct more realistic object meshes, especially for object parts that are in contact with hands. Extensive experiments on challenging benchmarks show that the proposed method outperforms the current state of the arts by a great margin. Our code is publicly available at https://junxinghu.github.io/projects/hoi.html. \ No newline at end of file diff --git a/data/2024/aaai/Learning Fair Policies for Multi-Stage Selection Problems from Observational Data b/data/2024/aaai/Learning Fair Policies for Multi-Stage Selection Problems from Observational Data new file mode 100644 index 0000000000..4d69602d28 --- /dev/null +++ b/data/2024/aaai/Learning Fair Policies for Multi-Stage Selection Problems from Observational Data @@ -0,0 +1 @@ +We consider the problem of learning fair policies for multi-stage selection problems from observational data. This problem arises in several high-stakes domains such as company hiring, loan approval, or bail decisions where outcomes (e.g., career success, loan repayment, recidivism) are only observed for those selected. We propose a multi-stage framework that can be augmented with various fairness constraints, such as demographic parity or equal opportunity. This problem is a highly intractable infinite chance-constrained program involving the unknown joint distribution of covariates and outcomes. 
Motivated by the potential impact of selection decisions on people’s lives and livelihoods, we propose to focus on interpretable linear selection rules. Leveraging tools from causal inference and sample average approximation, we obtain an asymptotically consistent solution to this selection problem by solving a mixed binary conic optimization problem, which can be solved using standard off-the-shelf solvers. We conduct extensive computational experiments on a variety of datasets adapted from the UCI repository on which we show that our proposed approaches can achieve an 11.6% improvement in precision and a 38% reduction in the measure of unfairness compared to the existing selection policy. \ No newline at end of file diff --git a/data/2024/aaai/Learning GAI-Decomposable Utility Models for Multiattribute Decision Making b/data/2024/aaai/Learning GAI-Decomposable Utility Models for Multiattribute Decision Making new file mode 100644 index 0000000000..3a4572f56e --- /dev/null +++ b/data/2024/aaai/Learning GAI-Decomposable Utility Models for Multiattribute Decision Making @@ -0,0 +1 @@ +We propose an approach to learn a multiattribute utility function to model, explain or predict the value system of a Decision Maker. The main challenge of the modelling task is to describe human values and preferences in the presence of interacting attributes while keeping the utility function as simple as possible. We focus on the generalized additive decomposable utility model which allows interactions between attributes while preserving some additive decomposability of the evaluation model. We present a learning approach able to identify the factors of interacting attributes and to learn the utility functions defined on these factors. This approach relies on the determination of a sparse representation of the ANOVA decomposition of the multiattribute utility function using multiple kernel learning. It applies to both continuous and discrete attributes. Numerical tests are performed to demonstrate the practical efficiency of the learning approach. \ No newline at end of file diff --git a/data/2024/aaai/Learning Generalizable and Composable Abstractions for Transfer in Reinforcement Learning b/data/2024/aaai/Learning Generalizable and Composable Abstractions for Transfer in Reinforcement Learning new file mode 100644 index 0000000000..274704ab90 --- /dev/null +++ b/data/2024/aaai/Learning Generalizable and Composable Abstractions for Transfer in Reinforcement Learning @@ -0,0 +1 @@ +Reinforcement Learning (RL) in complex environments presents many challenges: agents require learning concise representations of both environments and behaviors for efficient reasoning and generalizing experiences to new, unseen situations. However, RL approaches can be sample-inefficient and difficult to scale, especially in long-horizon sparse reward settings. To address these issues, the goal of my doctoral research is to develop methods that automatically construct semantically meaningful state and temporal abstractions for efficient transfer and generalization. In my work, I develop hierarchical approaches for learning transferable, generalizable knowledge in the form of symbolically represented options, as well as for integrating search techniques with RL to solve new problems by efficiently composing the learned options. Empirical results show that the resulting approaches effectively learn and transfer knowledge, achieving superior sample efficiency compared to SOTA methods while also enhancing interpretability. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning Generalized Medical Image Segmentation from Decoupled Feature Queries b/data/2024/aaai/Learning Generalized Medical Image Segmentation from Decoupled Feature Queries new file mode 100644 index 0000000000..208b180dfa --- /dev/null +++ b/data/2024/aaai/Learning Generalized Medical Image Segmentation from Decoupled Feature Queries @@ -0,0 +1,5 @@ +Domain generalized medical image segmentation requires models to learn from multiple source domains and generalize well to arbitrary unseen target domains. Such a task is both technically challenging and clinically practical, due to the domain shift problem (i.e., images are collected from different hospitals and scanners). Existing methods focus on either learning shape-invariant representation or reaching consensus among the source domains. An ideal generalized representation is supposed to show similar pattern responses within the same channel for cross-domain images. +However, to deal with the significant distribution discrepancy, the network tends to capture similar patterns by multiple channels, while different cross-domain patterns are also allowed to rest in the same channel. +To address this issue, we propose to leverage channel-wise decoupled deep features as queries. With the aid of the cross-attention mechanism, the long-range dependency between deep and shallow features can be fully mined via self-attention and then guides the learning of the generalized representation. Besides, a relaxed deep whitening transformation is proposed to learn channel-wise decoupled features in a feasible way. The proposed decoupled +feature query (DFQ) scheme can be seamlessly integrated into the Transformer segmentation model in an end-to-end manner. +Extensive experiments show its state-of-the-art performance, notably outperforming the runner-up by 1.31% and 1.98% with the DSC metric on generalized fundus and prostate benchmarks, respectively. Source code is available at https://github.com/BiQiWHU/DFQ. \ No newline at end of file diff --git a/data/2024/aaai/Learning Generalized Segmentation for Foggy-Scenes by Bi-directional Wavelet Guidance b/data/2024/aaai/Learning Generalized Segmentation for Foggy-Scenes by Bi-directional Wavelet Guidance new file mode 100644 index 0000000000..2e0b3ca8fb --- /dev/null +++ b/data/2024/aaai/Learning Generalized Segmentation for Foggy-Scenes by Bi-directional Wavelet Guidance @@ -0,0 +1,12 @@ +Learning scene semantics that can be well generalized to foggy conditions is important for safety-crucial applications such as autonomous driving. +Existing methods need both annotated clear images and foggy images to train a curriculum domain adaptation model. +Unfortunately, these methods can only generalize to the target foggy domain that has been seen in the training stage, but the foggy domains vary a lot in both urban-scene styles and fog styles. +In this paper, we propose to learn scene segmentation well generalized to foggy-scenes under the domain generalization setting, which does not involve any foggy images in the training stage and can generalize to arbitrary unseen foggy scenes. +We argue that an ideal segmentation model that can be well generalized to foggy-scenes needs to simultaneously enhance the content, de-correlate the urban-scene style and de-correlate the fog style.
+As the content (e.g., scene semantic) rests more in low-frequency features while the style of urban-scene and fog rests more in high-frequency features, we propose a novel bi-directional wavelet guidance (BWG) mechanism to realize the above three objectives in a divide-and-conquer manner. +With the aid of Haar wavelet transformation, +the low frequency component is concentrated on the content enhancement self-attention, while the high frequency component is shifted to the style and fog self-attention for de-correlation purpose. +It is integrated into existing mask-level Transformer segmentation pipelines in a learnable fashion. +Large-scale experiments are conducted on four foggy-scene segmentation datasets under a variety of interesting settings. +The proposed method significantly outperforms existing directly-supervised, curriculum domain adaptation and domain generalization segmentation methods. +Source code is available at https://github.com/BiQiWHU/BWG. \ No newline at end of file diff --git a/data/2024/aaai/Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models b/data/2024/aaai/Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models new file mode 100644 index 0000000000..fdb8e4dea8 --- /dev/null +++ b/data/2024/aaai/Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models @@ -0,0 +1 @@ +Prompt learning has become a prevalent strategy for adapting vision-language foundation models to downstream tasks. As large language models (LLMs) have emerged, recent studies have explored the use of category-related descriptions as input to enhance prompt effectiveness. Nevertheless, conventional descriptions fall short of structured information that effectively represents the interconnections among entities or attributes linked to a particular category. To address this limitation and prioritize harnessing structured knowledge, this paper advocates for leveraging LLMs to build a graph for each description to model the entities and attributes describing the category, as well as their correlations. Preexisting prompt tuning methods exhibit inadequacies in managing this structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which enables simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts modeling overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Extensive experiments demonstrate that our HPT shows strong effectiveness and generalizes much better than existing SOTA methods. Our code is available at https://github.com/Vill-Lab/2024-AAAI-HPT. \ No newline at end of file diff --git a/data/2024/aaai/Learning Hybrid Dynamics Models with Simulator-Informed Latent States b/data/2024/aaai/Learning Hybrid Dynamics Models with Simulator-Informed Latent States new file mode 100644 index 0000000000..14b89e1c5c --- /dev/null +++ b/data/2024/aaai/Learning Hybrid Dynamics Models with Simulator-Informed Latent States @@ -0,0 +1 @@ +Dynamics model learning deals with the task of inferring unknown dynamics from measurement data and predicting the future behavior of the system. 
A typical approach to address this problem is to train recurrent models. However, predictions with these models are often not physically meaningful. Further, they suffer from deteriorated behavior over time due to accumulating errors. Often, simulators built on first principles are available that are physically meaningful by design. However, modeling simplifications typically cause inaccuracies in these models. Consequently, hybrid modeling is an emerging trend that aims to combine the best of both worlds. In this paper, we propose a new approach to hybrid modeling, where we inform the latent states of a learned model via a black-box simulator. This allows us to control the predictions via the simulator, preventing them from accumulating errors. This is especially challenging since, in contrast to previous approaches, access to the simulator's latent states is not available. We tackle the task by leveraging observers, a well-known concept from control theory, inferring unknown latent states from observations and dynamics over time. In our learning-based setting, we jointly learn the dynamics and an observer that infers the latent states via the simulator. Thus, the simulator constantly corrects the latent states, compensating for modeling mismatch caused by learning. To maintain flexibility, we train an RNN-based residuum for the latent states that cannot be informed by the simulator. \ No newline at end of file diff --git "a/data/2024/aaai/Learning Image Demoir\303\251ing from Unpaired Real Data" "b/data/2024/aaai/Learning Image Demoir\303\251ing from Unpaired Real Data" new file mode 100644 index 0000000000..14385c9555 --- /dev/null +++ "b/data/2024/aaai/Learning Image Demoir\303\251ing from Unpaired Real Data" @@ -0,0 +1 @@ +This paper focuses on addressing the issue of image demoiréing. Unlike the large volume of existing studies that rely on learning from paired real data, we attempt to learn a demoiréing model from unpaired real data, i.e., moiré images associated with irrelevant clean images. The proposed method, referred to as Unpaired Demoiréing (UnDeM), synthesizes pseudo moiré images from unpaired datasets, generating pairs with clean images for training demoiréing models. To achieve this, we divide real moiré images into patches and group them in compliance with their moiré complexity. We introduce a novel moiré generation framework to synthesize moiré images with diverse moiré features, resembling real moiré patches, and details akin to real moiré-free images. Additionally, we introduce an adaptive denoising method to eliminate the low-quality pseudo moiré images that adversely impact the learning of demoiréing models. We conduct extensive experiments on the commonly-used FHDMi and UHDM datasets. Results show that our UnDeM performs better than existing methods when using existing demoiréing models such as MBCNN and ESDNet-L. Code: https://github.com/zysxmu/UnDeM. \ No newline at end of file diff --git a/data/2024/aaai/Learning Invariant Inter-pixel Correlations for Superpixel Generation b/data/2024/aaai/Learning Invariant Inter-pixel Correlations for Superpixel Generation new file mode 100644 index 0000000000..6b5cc0f3c3 --- /dev/null +++ b/data/2024/aaai/Learning Invariant Inter-pixel Correlations for Superpixel Generation @@ -0,0 +1 @@ +Deep superpixel algorithms have made remarkable strides by substituting hand-crafted features with learnable ones.
Nevertheless, we observe that existing deep superpixel methods, serving as mid-level representation operations, remain sensitive to the statistical properties (e.g., color distribution, high-level semantics) embedded within the training dataset. Consequently, learnable features exhibit constrained discriminative capability, resulting in unsatisfactory pixel grouping performance, particularly in untrainable application scenarios. To address this issue, we propose the Content Disentangle Superpixel (CDS) algorithm to selectively separate the invariant inter-pixel correlations and statistical properties, i.e., style noise. Specifically, we first construct auxiliary modalities that are homologous to the original RGB image but have substantial stylistic variations. Then, driven by mutual information, we propose the local-grid correlation alignment across modalities to reduce the distribution discrepancy of adaptively selected features and learn invariant inter-pixel correlations. Afterwards, we perform global-style mutual information minimization to enforce the separation of invariant content and training data styles. The experimental results on four benchmark datasets demonstrate the superiority of our approach over existing state-of-the-art methods, regarding boundary adherence, generalization, and efficiency. Code and pre-trained model are available at https://github.com/rookiie/CDSpixel. \ No newline at end of file diff --git a/data/2024/aaai/Learning MDL Logic Programs from Noisy Data b/data/2024/aaai/Learning MDL Logic Programs from Noisy Data new file mode 100644 index 0000000000..46ea7a8942 --- /dev/null +++ b/data/2024/aaai/Learning MDL Logic Programs from Noisy Data @@ -0,0 +1 @@ +Many inductive logic programming approaches struggle to learn programs from noisy data. To overcome this limitation, we introduce an approach that learns minimal description length programs from noisy data, including recursive programs. Our experiments on several domains, including drug design, game playing, and program synthesis, show that our approach can outperform existing approaches in terms of predictive accuracy and scale to moderate amounts of noise. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multi-Modal Cross-Scale Deformable Transformer Network for Unregistered Hyperspectral Image Super-resolution b/data/2024/aaai/Learning Multi-Modal Cross-Scale Deformable Transformer Network for Unregistered Hyperspectral Image Super-resolution new file mode 100644 index 0000000000..3162fd3393 --- /dev/null +++ b/data/2024/aaai/Learning Multi-Modal Cross-Scale Deformable Transformer Network for Unregistered Hyperspectral Image Super-resolution @@ -0,0 +1 @@ +Hyperspectral image super-resolution (HSI-SR) is a technology to improve the spatial resolution of HSI. Existing fusion-based SR methods have shown great performance, but still have some problems as follows: 1) existing methods assume that the auxiliary image providing spatial information is strictly registered with the HSI, but images are difficult to register finely due to the shooting platforms, shooting viewpoints and the influence of atmospheric turbulence; 2) most of the methods are based on convolutional neural networks (CNNs), which are effective for local features but cannot utilize global features. To this end, we propose a multi-modal cross-scale deformable transformer network (M2DTN) to achieve unregistered HSI-SR.
Specifically, we formulate a spectrum-preserving, spatial-guided registration-SR unified model (SSRU) from the view of realistic degradation scenarios. According to SSRU, we propose a multi-modal registration deformable module (MMRD) to align features between different modalities by a deformation field. In order to efficiently utilize the unique information between different modalities, we design a multi-scale feature transformer (MSFT) to emphasize the spatial-spectral features at different scales. In addition, we propose the cross-scale feature aggregation module (CSFA) to accurately reconstruct the HSI by aggregating feature information at different scales. Experiments show that M2DTN outperforms state-of-the-art HSI-SR methods. Code is available at https://github.com/Jiahuiqu/M2DTN. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multi-Object Positional Relationships via Emergent Communication b/data/2024/aaai/Learning Multi-Object Positional Relationships via Emergent Communication new file mode 100644 index 0000000000..92b1f098a0 --- /dev/null +++ b/data/2024/aaai/Learning Multi-Object Positional Relationships via Emergent Communication @@ -0,0 +1 @@ +The study of emergent communication has been dedicated to interactive artificial intelligence. While existing work focuses on communication about single objects or complex image scenes, we argue that communicating relationships between multiple objects is important in more realistic tasks, but understudied. In this paper, we try to fill this gap and focus on emergent communication about positional relationships between two objects. We train agents in the referential game where observations contain two objects, and find that generalization is the major problem when the positional relationship is involved. The key factor affecting the generalization ability of the emergent language is the input variation between Speaker and Listener, which is realized by a random image generator in our work. Further, we find that the learned language can generalize well in a new multi-step MDP task where the positional relationship describes the goal, and performs better than raw-pixel images as well as pre-trained image features, verifying the strong generalization ability of discrete sequences. We also show that language transfer from the referential game performs better in the new task than learning language directly in this task, implying the potential benefits of pre-training in referential games. All in all, our experiments demonstrate the viability and merit of having agents learn to communicate positional relationships between multiple objects through emergent communication. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Gronding b/data/2024/aaai/Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Gronding new file mode 100644 index 0000000000..1f6cd466b0 --- /dev/null +++ b/data/2024/aaai/Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Gronding @@ -0,0 +1 @@ +Weakly Supervised temporal Article Grounding (WSAG) is a challenging and practical task in video understanding. Specifically, given a video and a relevant article, whose sentences are at different semantic scales, WSAG aims to localize corresponding video segments for all “groundable” sentences.
Compared to other grounding tasks, e.g., localizing one target segment with respect to a given sentence query, WSAG confronts an essential obstacle rooted in the intricate multi-scale information inherent within both textual and visual modalities. Existing methods overlook the modeling and alignment of such structured information present in multi-scale video segments and hierarchical textual content. To this end, we propose a Multi-Scale Video-Text Correspondence Learning (MVTCL) framework, which enhances the grounding performance in complex scenes by modeling multi-scale semantic correspondence both within and between modalities. Specifically, MVTCL initially aggregates video content spanning distinct temporal scales and leverages hierarchical textual relationships in both temporal and semantic dimensions via a semantic calibration module. Then a multi-scale contrastive learning module is introduced to generate more discriminative representations by selecting typical contexts and performing inter-video contrastive learning. Through the multi-scale semantic calibration architecture and supervision design, our method achieves new state-of-the-art performance on existing WSAG benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multi-Task Sparse Representation Based on Fisher Information b/data/2024/aaai/Learning Multi-Task Sparse Representation Based on Fisher Information new file mode 100644 index 0000000000..aec32bd342 --- /dev/null +++ b/data/2024/aaai/Learning Multi-Task Sparse Representation Based on Fisher Information @@ -0,0 +1 @@ +Multi-task learning deals with multiple related tasks simultaneously by sharing knowledge. In a typical deep multi-task learning model, all tasks use the same feature space and share the latent knowledge. If the tasks are weakly correlated or some features are negatively correlated, sharing all knowledge often leads to negative knowledge transfer among tasks. To overcome this issue, this paper proposes a Fisher sparse multi-task learning method. It can obtain a sparse sharing representation for each task. In such a way, tasks share features on a sparse subspace. Our method can ensure that the knowledge transferred among tasks is beneficial. Specifically, we first propose a sparse deep multi-task learning model, and then introduce a Fisher sparse module into traditional deep multi-task learning to learn the sparse variables of each task. By alternately updating the neural network parameters and sparse variables, a sparse sharing representation can be learned for each task. In addition, in order to reduce the computational overhead, a heuristic method is used to estimate the Fisher information of neural network parameters. Experimental results show that, compared with other methods, our proposed method can improve the performance for all tasks, and has high sparsity in multi-task learning. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing b/data/2024/aaai/Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing new file mode 100644 index 0000000000..9c82f2ab69 --- /dev/null +++ b/data/2024/aaai/Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing @@ -0,0 +1 @@ +The current neuron reconstruction pipeline for electron microscopy (EM) data usually includes automatic image segmentation followed by extensive human expert proofreading.
In this work, we aim to reduce human workload by predicting connectivity between over-segmented neuron pieces, taking both microscopy image and 3D morphology features into account, similar to the human proofreading workflow. To this end, we first construct a dataset, named FlyTracing, that contains millions of pairwise connections of segments spanning the whole fly brain, which is three orders of magnitude larger than existing datasets for neuron segment connection. To learn sophisticated biological imaging features from the connectivity annotations, we propose a novel connectivity-aware contrastive learning method to generate dense volumetric EM image embeddings. The learned embeddings can be easily incorporated with any point or voxel-based morphological representations for automatic neuron tracing. Extensive comparisons of different combination schemes of image and morphological representation in identifying split errors across the whole fly brain demonstrate the superiority of the proposed approach, especially for the locations that contain severe imaging artifacts, such as missing sections and misalignment. The dataset and code are available at https://github.com/Levishery/Flywire-Neuron-Tracing. \ No newline at end of file diff --git a/data/2024/aaai/Learning Neuro-Symbolic Abstractions for Robot Planning and Learning b/data/2024/aaai/Learning Neuro-Symbolic Abstractions for Robot Planning and Learning new file mode 100644 index 0000000000..01a62f6ffa --- /dev/null +++ b/data/2024/aaai/Learning Neuro-Symbolic Abstractions for Robot Planning and Learning @@ -0,0 +1 @@ +Although state-of-the-art hierarchical robot planning algorithms allow robots to efficiently compute long-horizon motion plans for achieving user-desired tasks, these methods typically rely upon environment-dependent state and action abstractions that need to be hand-designed by experts. On the other hand, non-hierarchical robot planning approaches fail to compute solutions for complex tasks that require reasoning over a long horizon. My research addresses these problems by proposing an approach for learning abstractions and developing hierarchical planners that efficiently use learned abstractions to boost robot planning performance and provide strong guarantees of reliability. \ No newline at end of file diff --git a/data/2024/aaai/Learning Not to Regret b/data/2024/aaai/Learning Not to Regret new file mode 100644 index 0000000000..c8962c1467 --- /dev/null +++ b/data/2024/aaai/Learning Not to Regret @@ -0,0 +1,7 @@ +The literature on game-theoretic equilibrium finding predominantly focuses on single games or their repeated play. +Nevertheless, numerous real-world scenarios feature playing a game sampled from a distribution of similar, but not identical games, such as playing poker with different public cards or trading correlated assets on the stock market. +As these similar games feature similar equilibria, we investigate a way to accelerate equilibrium finding on such a distribution. +We present a novel ``learning not to regret'' framework, enabling us to meta-learn a regret minimizer tailored to a specific distribution. +Our key contribution, Neural Predictive Regret Matching, is uniquely meta-learned to converge rapidly for the chosen distribution of games, while having regret minimization guarantees on any game. +We validated our algorithms' faster convergence on a distribution of river poker games.
+Our experiments show that the meta-learned algorithms outpace their non-meta-learned counterparts, achieving more than tenfold improvements. \ No newline at end of file diff --git a/data/2024/aaai/Learning Only When It Matters: Cost-Aware Long-Tailed Classification b/data/2024/aaai/Learning Only When It Matters: Cost-Aware Long-Tailed Classification new file mode 100644 index 0000000000..db58debb71 --- /dev/null +++ b/data/2024/aaai/Learning Only When It Matters: Cost-Aware Long-Tailed Classification @@ -0,0 +1,2 @@ +Most current long-tailed classification approaches assume the cost-agnostic scenario, where the training distribution of classes is long-tailed while the testing distribution of classes is balanced. Meanwhile, the misclassification costs of all instances are the same. On the other hand, in many real-world applications, it is more proper to assume that the training and testing distributions of classes are the same, while the misclassification cost of tail-class instances is varied. In this work, we model such a scenario as cost-aware long-tailed classification, in which the identification of high-cost tail instances and focusing learning on them thereafter is essential. In consequence, we propose the learning strategy of augmenting new instances based on adaptive region partition in the feature space. We conduct theoretical analysis to show that under the assumption +that the feature-space distance and the misclassification cost are correlated, the identification of high-cost tail instances can be realized by building region partitions with a low variance of risk within each region. The resulting AugARP approach could significantly outperform baseline approaches on both benchmark datasets and real-world product sales datasets. \ No newline at end of file diff --git a/data/2024/aaai/Learning Pattern-Based Extractors from Natural Language and Knowledge Graphs: Applying Large Language Models to Wikipedia and Linked Open Data b/data/2024/aaai/Learning Pattern-Based Extractors from Natural Language and Knowledge Graphs: Applying Large Language Models to Wikipedia and Linked Open Data new file mode 100644 index 0000000000..37cd766ea1 --- /dev/null +++ b/data/2024/aaai/Learning Pattern-Based Extractors from Natural Language and Knowledge Graphs: Applying Large Language Models to Wikipedia and Linked Open Data @@ -0,0 +1 @@ +Seq-to-seq transformer models have recently been successfully used for relation extraction, showing their flexibility, effectiveness, and scalability on that task. In this context, knowledge graphs aligned with Wikipedia such as DBpedia and Wikidata give us the opportunity to leverage existing texts and corresponding RDF graphs in order to extract, from these texts, the knowledge that is missing in the corresponding graphs and meanwhile improve their coverage. The goal of my thesis is to learn efficient extractors targeting specific RDF patterns and to do so by leveraging the latest language models and the dual base formed by Wikipedia on the one hand, and DBpedia and Wikidata on the other hand. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning Performance Maximizing Ensembles with Explainability Guarantees b/data/2024/aaai/Learning Performance Maximizing Ensembles with Explainability Guarantees new file mode 100644 index 0000000000..974e061327 --- /dev/null +++ b/data/2024/aaai/Learning Performance Maximizing Ensembles with Explainability Guarantees @@ -0,0 +1 @@ +In this paper, we propose a method for the optimal allocation of observations between an intrinsically explainable glass box model and a black box model. An optimal allocation is defined as one which, for any given explainability level (i.e. the proportion of observations for which the explainable model is the prediction function), maximizes the performance of the ensemble on the underlying task, and maximizes performance of the explainable model on the observations allocated to it, subject to the maximal ensemble performance condition. The proposed method is shown to produce such explainability optimal allocations on a benchmark suite of tabular datasets across a variety of explainable and black box model types. These learned allocations are found to consistently maintain ensemble performance at very high explainability levels (explaining 74% of observations on average), and in some cases even outperform both the component explainable and black box models while improving explainability. \ No newline at end of file diff --git a/data/2024/aaai/Learning Persistent Community Structures in Dynamic Networks via Topological Data Analysis b/data/2024/aaai/Learning Persistent Community Structures in Dynamic Networks via Topological Data Analysis new file mode 100644 index 0000000000..6ac989a8fd --- /dev/null +++ b/data/2024/aaai/Learning Persistent Community Structures in Dynamic Networks via Topological Data Analysis @@ -0,0 +1 @@ +Dynamic community detection methods often lack effective mechanisms to ensure temporal consistency, hindering the analysis of network evolution. In this paper, we propose a novel deep graph clustering framework with temporal consistency regularization on inter-community structures, inspired by the concept of minimal network topological changes within short intervals. Specifically, to address the representation collapse problem, we first introduce MFC, a matrix factorization-based deep graph clustering algorithm that preserves node embedding. Based on static clustering results, we construct probabilistic community networks and compute their persistence homology, a robust topological measure, to assess structural similarity between them. Moreover, a novel neural network regularization TopoReg is introduced to ensure the preservation of topological similarity between inter-community structures over time intervals. Our approach enhances temporal consistency and clustering accuracy on real-world datasets with both fixed and varying numbers of communities. It is also a pioneering application of TDA in temporally persistent community detection, offering an insightful contribution to the field of network analysis. Code and data are available at the public git repository: https://github.com/kundtx/MFC-TopoReg.
\ No newline at end of file diff --git a/data/2024/aaai/Learning Planning Domains from Non-redundant Fully-Observed Traces: Theoretical Foundations and Complexity Analysis b/data/2024/aaai/Learning Planning Domains from Non-redundant Fully-Observed Traces: Theoretical Foundations and Complexity Analysis new file mode 100644 index 0000000000..e5ebd0c33c --- /dev/null +++ b/data/2024/aaai/Learning Planning Domains from Non-redundant Fully-Observed Traces: Theoretical Foundations and Complexity Analysis @@ -0,0 +1,8 @@ +Domain learning is the task of finding an action model that can explain given observed plan executions, so-called traces. +It allows us to automate the identification of actions' preconditions and effects instead of relying on hand-modeled expert knowledge. +While previous research has put forth various techniques and covers multiple planning formalisms, the theoretical foundations of domain learning are still in their infancy. + +We investigate the most basic setting, that is, grounded classical planning without negative preconditions or conditional effects, with full observability of the state variables. +The given traces are assumed to be justified in the sense that either no single action or no set of actions can be removed without violating correctness of the plan. +Furthermore, we might be given additional constraints in the form of a propositional logical formula. +We show the consequences of these assumptions for the computational complexity of identifying a satisfactory planning domain. \ No newline at end of file diff --git a/data/2024/aaai/Learning Random Noise Salient Feature Fusion Siamese Network for Low-Resolution Object Tracking (Student Abstract) b/data/2024/aaai/Learning Random Noise Salient Feature Fusion Siamese Network for Low-Resolution Object Tracking (Student Abstract) new file mode 100644 index 0000000000..ecb95907e5 --- /dev/null +++ b/data/2024/aaai/Learning Random Noise Salient Feature Fusion Siamese Network for Low-Resolution Object Tracking (Student Abstract) @@ -0,0 +1 @@ +Despite Siamese trackers’ substantial potential, they offer sub-optimal tracking performance in low-resolution (LR) contexts. We introduce a Random Noise Salient Feature Fusion Learning Network to address this issue. This method integrates random noise-infused feature maps into a similarity-learning matching model. This integration acts as an effective regularization technique, enhancing the network’s generalization capabilities in LR environments. Additionally, by integrating attention mechanisms, we enhance the discriminative ability of the network, assigning more weights to important features. This directs the network’s focus toward the most salient regions of the feature map, ensuring improved accuracy without a significant increase in parameter overhead, and maintaining a high operating speed. To validate the effectiveness of our method, we performed qualitative and quantitative comparisons with state-of-the-art (SOTA) trackers. \ No newline at end of file diff --git a/data/2024/aaai/Learning Real-World Image De-weathering with Imperfect Supervision b/data/2024/aaai/Learning Real-World Image De-weathering with Imperfect Supervision new file mode 100644 index 0000000000..825c592df0 --- /dev/null +++ b/data/2024/aaai/Learning Real-World Image De-weathering with Imperfect Supervision @@ -0,0 +1 @@ +Real-world image de-weathering aims at removing various undesirable weather-related artifacts.
Owing to the impossibility of capturing image pairs concurrently, existing real-world de-weathering datasets often exhibit inconsistent illumination, position, and textures between the ground-truth images and the input degraded images, resulting in imperfect supervision. Such non-ideal supervision negatively affects the training process of learning-based de-weathering methods. In this work, we attempt to address the problem with a unified solution for various inconsistencies. Specifically, inspired by information bottleneck theory, we first develop a Consistent Label Constructor (CLC) to generate a pseudo-label as consistent as possible with the input degraded image while removing most weather-related degradation. In particular, multiple adjacent frames of the current input are also fed into CLC to enhance the pseudo-label. Then we combine the original imperfect labels and pseudo-labels to jointly supervise the de-weathering model by the proposed Information Allocation Strategy (IAS). During testing, only the de-weathering model is used for inference. Experiments on two real-world de-weathering datasets show that our method helps existing de-weathering models achieve better performance. Code is available at https://github.com/1180300419/imperfect-deweathering. \ No newline at end of file diff --git a/data/2024/aaai/Learning Reduced Fluid Dynamics b/data/2024/aaai/Learning Reduced Fluid Dynamics new file mode 100644 index 0000000000..6c391ebaac --- /dev/null +++ b/data/2024/aaai/Learning Reduced Fluid Dynamics @@ -0,0 +1 @@ +Predicting the state evolution of ultra high-dimensional, time-reversible fluid dynamic systems is a crucial but computationally expensive task. Existing physics-informed neural networks either incur high inference cost or cannot preserve the time-reversible nature of the underlying dynamical system. We propose a model-based approach to identify low-dimensional, time-reversible, nonlinear fluid dynamic systems. Our method utilizes the symplectic structure of reduced Eulerian fluids and uses stochastic Riemannian optimization to obtain a low-dimensional basis that minimizes the expected trajectory-wise dimension-reduction error over a given distribution of initial conditions. We show that such minimization is well-defined since the reduced trajectories are differentiable with respect to the subspace bases over the entire Grassmannian manifold, under proper choices of timestep sizes and numerical integrators. Finally, we propose a loss function measuring the trajectory-wise discrepancy between the original and reduced models. By tensor precomputation, we show that gradient information of such a loss function can be evaluated efficiently over a long trajectory without time-integrating the high-dimensional dynamic system. Through evaluations on a range of simulation benchmarks, we show that our method reduces the discrepancy by 50-90 percent over conventional reduced models and we outperform PINNs by exactly preserving the time reversibility. \ No newline at end of file diff --git a/data/2024/aaai/Learning Representations for Robust Human-Robot Interaction b/data/2024/aaai/Learning Representations for Robust Human-Robot Interaction new file mode 100644 index 0000000000..1fd2673394 --- /dev/null +++ b/data/2024/aaai/Learning Representations for Robust Human-Robot Interaction @@ -0,0 +1 @@ +For robots to robustly and flexibly interact with humans, they need to acquire skills to use across scenarios.
One way to enable the generalization of skills is to learn representations that are useful for downstream tasks. Learning a representation for interactions requires an understanding of what (e.g., objects) as well as how (e.g., actions, controls, and manners) to interact with. However, most existing language or visual representations mainly focus on objects. To enable robust human-robot interactions, we need a representation that is not just grounded at the object level but to reason at the action level. The ability to reason about an agent’s own actions and other’s actions will be crucial for long-tail interactions. My research focuses on leveraging the compositional nature of language and reward functions to learn representations that generalize to novel scenarios. Together with the information from multiple modalities, the learned representation can reason about task progress, future behaviors, and the goals/beliefs of an agent. The above ideas have been demonstrated in my research on building robots to understand language and engage in social interactions. \ No newline at end of file diff --git a/data/2024/aaai/Learning Representations on the Unit Sphere: Investigating Angular Gaussian and Von Mises-Fisher Distributions for Online Continual Learning b/data/2024/aaai/Learning Representations on the Unit Sphere: Investigating Angular Gaussian and Von Mises-Fisher Distributions for Online Continual Learning new file mode 100644 index 0000000000..9eff975333 --- /dev/null +++ b/data/2024/aaai/Learning Representations on the Unit Sphere: Investigating Angular Gaussian and Von Mises-Fisher Distributions for Online Continual Learning @@ -0,0 +1 @@ +We use the maximum a posteriori estimation principle for learning representations distributed on the unit sphere. We propose to use the angular Gaussian distribution, which corresponds to a Gaussian projected on the unit-sphere and derive the associated loss function. We also consider the von Mises-Fisher distribution, which is the conditional of a Gaussian in the unit-sphere. The learned representations are pushed toward fixed directions, which are the prior means of the Gaussians; allowing for a learning strategy that is resilient to data drift. This makes it suitable for online continual learning, which is the problem of training neural networks on a continuous data stream, where multiple classification tasks are presented sequentially so that data from past tasks are no longer accessible, and data from the current task can be seen only once. To address this challenging scenario, we propose a memory-based representation learning technique equipped with our new loss functions. Our approach does not require negative data or knowledge of task boundaries and performs well with smaller batch sizes while being computationally efficient. We demonstrate with extensive experiments that the proposed method outperforms the current state-of-the-art methods on both standard evaluation scenarios and realistic scenarios with blurry task boundaries. For reproducibility, we use the same training pipeline for every compared method and share the code at https://github.com/Nicolas1203/ocl-fd. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning Robust Rationales for Model Explainability: A Guidance-Based Approach b/data/2024/aaai/Learning Robust Rationales for Model Explainability: A Guidance-Based Approach new file mode 100644 index 0000000000..e4d449453d --- /dev/null +++ b/data/2024/aaai/Learning Robust Rationales for Model Explainability: A Guidance-Based Approach @@ -0,0 +1 @@ +Selective rationalization can be regarded as a straightforward self-explaining approach for enhancing model explainability in natural language processing tasks. It aims to provide explanations that are more accessible and understandable to non-technical users by first selecting subsets of input texts as rationales and then predicting based on chosen subsets. However, existing methods that follow this select-then-predict framework may suffer from the rationalization degeneration problem, resulting in sub-optimal or unsatisfactory rationales that do not align with human judgments. This problem may further lead to rationalization failure, resulting in meaningless rationales that ultimately undermine people's trust in the rationalization model. To address these challenges, we propose a Guidance-based Rationalization method (G-RAT) that effectively improves robustness against failure situations and the quality of rationales by using a guidance module to regularize selections and distributions. Experimental results on two synthetic settings prove that our method is robust to the rationalization degeneration and failure problems, while the results on two real datasets show its effectiveness in providing rationales in line with human judgments. The source code is available at https://github.com/shuaibo919/g-rat. \ No newline at end of file diff --git a/data/2024/aaai/Learning Safe Action Models with Partial Observability b/data/2024/aaai/Learning Safe Action Models with Partial Observability new file mode 100644 index 0000000000..3187d3d37a --- /dev/null +++ b/data/2024/aaai/Learning Safe Action Models with Partial Observability @@ -0,0 +1,5 @@ +A common approach for solving planning problems is to model them in a formal language such as the Planning Domain Definition Language (PDDL), and then use an appropriate PDDL planner. +Several algorithms for learning PDDL models from observations have been proposed but plans created with these learned models may not be sound. +We propose two algorithms for learning PDDL models that are guaranteed to be safe to use even when given observations that include partially observable states. +We analyze these algorithms theoretically, characterizing the sample complexity each algorithm requires to guarantee probabilistic completeness. +We also show experimentally that our algorithms are often better than FAMA, a state-of-the-art PDDL learning algorithm. \ No newline at end of file diff --git a/data/2024/aaai/Learning Small Decision Trees for Data of Low Rank-Width b/data/2024/aaai/Learning Small Decision Trees for Data of Low Rank-Width new file mode 100644 index 0000000000..c8e482387c --- /dev/null +++ b/data/2024/aaai/Learning Small Decision Trees for Data of Low Rank-Width @@ -0,0 +1,11 @@ +We consider the NP-hard problem of finding a smallest decision tree +representing a classification instance in terms of a partially defined +Boolean function. Small decision trees are desirable to provide an +interpretable model for the given data. 
We show that the problem is +fixed-parameter tractable when parameterized by the rank-width of the +incidence graph of the given classification instance. Our algorithm +proceeds by dynamic programming using an NLC decomposition obtained +from a rank-width decomposition. The key to the algorithm is a +succinct representation of partial solutions. This allows us to limit +the space and time requirements for each dynamic programming step in +terms of the parameter. \ No newline at end of file diff --git a/data/2024/aaai/Learning Small Decision Trees with Few Outliers: A Parameterized Perspective b/data/2024/aaai/Learning Small Decision Trees with Few Outliers: A Parameterized Perspective new file mode 100644 index 0000000000..b0ca8c55d2 --- /dev/null +++ b/data/2024/aaai/Learning Small Decision Trees with Few Outliers: A Parameterized Perspective @@ -0,0 +1,2 @@ +Decision trees is a fundamental tool in machine learning for representing, classifying, and generalizing data. It is desirable to construct ``small'' decision trees, by minimizing either the size (s) or the depth (d) of the decision tree (DT). Recently, the parameterized complexity of Decision Tree Learning has attracted a lot of attention. +We consider a generalization of Decision Tree Learning where given a classification instance E and an integer t, the task is to find a ``small'' DT that disagrees with E in at most t examples. We consider two problems: DTSO and DTDO, where the goal is to construct a DT minimizing s and d, respectively. We first establish that both DTSO and DTDO are W[1]-hard when parameterized by s+y and d+y, respectively, where y is the maximum number of features in which two differently labeled examples can differ. We complement this result by showing that these problems become FPT if we include the parameter t. We also consider the kernelization complexity of these problems and establish several positive and negative results for both DTSO and DTDO. \ No newline at end of file diff --git a/data/2024/aaai/Learning Spatially Collaged Fourier Bases for Implicit Neural Representation b/data/2024/aaai/Learning Spatially Collaged Fourier Bases for Implicit Neural Representation new file mode 100644 index 0000000000..2b42f65509 --- /dev/null +++ b/data/2024/aaai/Learning Spatially Collaged Fourier Bases for Implicit Neural Representation @@ -0,0 +1 @@ +Existing approaches to Implicit Neural Representation (INR) can be interpreted as a global scene representation via a linear combination of Fourier bases of different frequencies. However, such universal basis functions can limit the representation capability in local regions where a specific component is unnecessary, resulting in unpleasant artifacts. To this end, we introduce a learnable spatial mask that effectively dispatches distinct Fourier bases into respective regions. This translates into collaging Fourier patches, thus enabling an accurate representation of complex signals. Comprehensive experiments demonstrate the superior reconstruction quality of the proposed approach over existing baselines across various INR tasks, including image fitting, video representation, and 3D shape representation. Our method outperforms all other baselines, improving the image fitting PSNR by over 3dB and 3D reconstruction to 98.81 IoU and 0.0011 Chamfer Distance. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning Subject-Aware Cropping by Outpainting Professional Photos b/data/2024/aaai/Learning Subject-Aware Cropping by Outpainting Professional Photos new file mode 100644 index 0000000000..e9be4c99c5 --- /dev/null +++ b/data/2024/aaai/Learning Subject-Aware Cropping by Outpainting Professional Photos @@ -0,0 +1 @@ +How to frame (or crop) a photo often depends on the image subject and its context; e.g., a human portrait. Recent works have defined the subject-aware image cropping task as a nuanced and practical version of image cropping. We propose a weakly-supervised approach (GenCrop) to learn what makes a high-quality, subject-aware crop from professional stock images. Unlike supervised prior work, GenCrop requires no new manual annotations beyond the existing stock image collection. The key challenge in learning from this data, however, is that the images are already cropped and we do not know what regions were removed. Our insight is to combine a library of stock images with a modern, pre-trained text-to-image diffusion model. The stock image collection provides diversity, and its images serve as pseudo-labels for a good crop. The text-image diffusion model is used to out-paint (i.e., outward inpainting) realistic uncropped images. Using this procedure, we are able to automatically generate a large dataset of cropped-uncropped training pairs to train a cropping model. Despite being weakly-supervised, GenCrop is competitive with state-of-the-art supervised methods and significantly better than comparable weakly-supervised baselines on quantitative and qualitative evaluation metrics. \ No newline at end of file diff --git a/data/2024/aaai/Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection b/data/2024/aaai/Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection new file mode 100644 index 0000000000..b70899fa73 --- /dev/null +++ b/data/2024/aaai/Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection @@ -0,0 +1 @@ +Class-incremental object detection (CIOD) is a real-world desired capability, requiring an object detector to continuously adapt to new tasks without forgetting learned ones, with the main challenge being catastrophic forgetting. Many methods based on distillation and replay have been proposed to alleviate this problem. However, they typically learn on a pure visual backbone, neglecting the powerful representation capabilities of textual cues, which to some extent limits their performance. In this paper, we propose task-aware language-image representation to mitigate catastrophic forgetting, introducing a new paradigm for language-image-based CIOD. First of all, we demonstrate the significant advantage of language-image detectors in mitigating catastrophic forgetting. Secondly, we propose a learning task-aware language-image representation method that overcomes the existing drawback of directly utilizing the language-image detector for CIOD. More specifically, we learn the language-image representation of different tasks through an insulating approach in the training stage, while using the alignment scores produced by task-specific language-image representation in the inference stage. Through our proposed method, language-image detectors can be more practical for CIOD. 
We conduct extensive experiments on COCO 2017 and Pascal VOC 2007 and demonstrate that the proposed method achieves state-of-the-art results under various CIOD settings. \ No newline at end of file diff --git a/data/2024/aaai/Learning Temporal Resolution in Spectrogram for Audio Classification b/data/2024/aaai/Learning Temporal Resolution in Spectrogram for Audio Classification new file mode 100644 index 0000000000..6ed24475f3 --- /dev/null +++ b/data/2024/aaai/Learning Temporal Resolution in Spectrogram for Audio Classification @@ -0,0 +1 @@ +The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods using the fixed temporal resolution, the DiffRes-based method can achieve the equivalent or better classification accuracy with at least 25% computational cost reduction. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost. \ No newline at end of file diff --git a/data/2024/aaai/Learning Time Slot Preferences via Mobility Tree for Next POI Recommendation b/data/2024/aaai/Learning Time Slot Preferences via Mobility Tree for Next POI Recommendation new file mode 100644 index 0000000000..ae6e69336b --- /dev/null +++ b/data/2024/aaai/Learning Time Slot Preferences via Mobility Tree for Next POI Recommendation @@ -0,0 +1 @@ +The next Point-of-Interest (POI) recommendation task aims to provide a dynamic ranking of POIs based on users' current check-in trajectories. The recommendation performance of this task is contingent upon a comprehensive understanding of users' personalized behavioral patterns through Location-based Social Networks (LBSNs) data. While prior studies have adeptly captured sequential patterns and transitional relationships within users' check-in trajectories, a noticeable gap persists in devising a mechanism for discerning specialized behavioral patterns during distinct time slots, such as noon, afternoon, or evening. In this paper, we introduce an innovative data structure termed the ``Mobility Tree'', tailored for hierarchically describing users' check-in records. The Mobility Tree encompasses multi-granularity time slot nodes to learn user preferences across varying temporal periods. Meanwhile, we propose the Mobility Tree Network (MTNet), a multitask framework for personalized preference learning based on Mobility Trees.
We develop a four-step node interaction operation to propagate feature information from the leaf nodes to the root node. Additionally, we adopt a multitask training strategy to push the model towards learning a robust representation. The comprehensive experimental results demonstrate the superiority of MTNet over eleven state-of-the-art next POI recommendation models across three real-world LBSN datasets, substantiating the efficacy of time slot preference learning facilitated by the Mobility Tree. \ No newline at end of file diff --git a/data/2024/aaai/Learning Ultrametric Trees for Optimal Transport Regression b/data/2024/aaai/Learning Ultrametric Trees for Optimal Transport Regression new file mode 100644 index 0000000000..c4287a48c6 --- /dev/null +++ b/data/2024/aaai/Learning Ultrametric Trees for Optimal Transport Regression @@ -0,0 +1 @@ +Optimal transport provides a metric which quantifies the dissimilarity between probability measures. For measures supported in discrete metric spaces, finding the optimal transport distance has cubic time complexity in the size of the space. However, measures supported on trees admit a closed-form optimal transport that can be computed in linear time. In this paper, we aim to find an optimal tree structure for a given discrete metric space so that the tree-Wasserstein distance approximates the optimal transport distance in the original space. One of our key ideas is to cast the problem in ultrametric spaces. This helps us optimize over the space of ultrametric trees --- a mixed-discrete and continuous optimization problem --- via projected gradient descent over the space of ultrametric matrices. During optimization, we project the parameters to the ultrametric space via a hierarchical minimum spanning tree algorithm, equivalent to the closest projection to ultrametrics under the supremum norm. Experimental results on real datasets show that our approach outperforms previous approaches (e.g., Flowtree, Quadtree) in approximating optimal transport distances. Finally, experiments on synthetic data generated on ground truth trees show that our algorithm can accurately uncover the underlying trees. \ No newline at end of file diff --git a/data/2024/aaai/Learning Uncertainty-Aware Temporally-Extended Actions b/data/2024/aaai/Learning Uncertainty-Aware Temporally-Extended Actions new file mode 100644 index 0000000000..3c44fe2ebd --- /dev/null +++ b/data/2024/aaai/Learning Uncertainty-Aware Temporally-Extended Actions @@ -0,0 +1 @@ +In reinforcement learning, temporal abstraction in the action space, exemplified by action repetition, is a technique to facilitate policy learning through extended actions. However, a primary limitation in previous studies of action repetition is its potential to degrade performance, particularly when sub-optimal actions are repeated. This issue often negates the advantages of action repetition. To address this, we propose a novel algorithm named Uncertainty-aware Temporal Extension (UTE). UTE employs ensemble methods to accurately measure uncertainty during action extension. This feature allows policies to strategically choose between emphasizing exploration or adopting an uncertainty-averse approach, tailored to their specific needs. We demonstrate the effectiveness of UTE through experiments in Gridworld and Atari 2600 environments. Our findings show that UTE outperforms existing action repetition algorithms, effectively mitigating their inherent limitations and significantly enhancing policy learning efficiency.
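To make the uncertainty-aware extension concrete, here is a small sketch of how an ensemble might score candidate repetition lengths: the spread of the ensemble's return estimates serves as the uncertainty term, and the sign of the coefficient switches between exploration-seeking and uncertainty-averse behavior. The interface is invented for illustration and is not the authors' code.

```python
import numpy as np

def extension_score(q_ensemble, state, action, repeat, lam=1.0):
    """q_ensemble: list of callables q(state, action, repeat) -> estimated return
    of repeating `action` for `repeat` steps from `state`.
    lam > 0 adds an exploration bonus; lam < 0 is uncertainty-averse."""
    estimates = np.array([q(state, action, repeat) for q in q_ensemble])
    return estimates.mean() + lam * estimates.std()

def choose_repetition(q_ensemble, state, action, max_repeat=8, lam=-0.5):
    scores = [extension_score(q_ensemble, state, action, j, lam)
              for j in range(1, max_repeat + 1)]
    return int(np.argmax(scores)) + 1   # chosen repetition length

# toy ensemble: noisy estimators that prefer repeating roughly 3 steps
rng = np.random.default_rng(0)
ensemble = [lambda s, a, j: -(j - 3) ** 2 + rng.normal(scale=0.5) for _ in range(5)]
print(choose_repetition(ensemble, state=None, action=0))
```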
\ No newline at end of file diff --git a/data/2024/aaai/Learning Visual Abstract Reasoning through Dual-Stream Networks b/data/2024/aaai/Learning Visual Abstract Reasoning through Dual-Stream Networks new file mode 100644 index 0000000000..c447c89e14 --- /dev/null +++ b/data/2024/aaai/Learning Visual Abstract Reasoning through Dual-Stream Networks @@ -0,0 +1 @@ +Visual abstract reasoning tasks present challenges for deep neural networks, exposing limitations in their capabilities. In this work, we present a neural network model that addresses the challenges posed by Raven’s Progressive Matrices (RPM). Inspired by the two-stream hypothesis of visual processing, we introduce the Dual-stream Reasoning Network (DRNet), which utilizes two parallel branches to capture image features. On top of the two streams, a reasoning module first learns to merge the high-level features of the same image. Then, it employs a rule extractor to handle combinations involving the eight context images and each candidate image, extracting discrete abstract rules and utilizing a multilayer perceptron (MLP) to make predictions. Empirical results demonstrate that the proposed DRNet achieves state-of-the-art average performance across multiple RPM benchmarks. Furthermore, DRNet demonstrates robust generalization capabilities, even extending to various out-of-distribution scenarios. The dual streams within DRNet serve distinct functions by addressing local or spatial information. They are then integrated into the reasoning module, leveraging abstract rules to facilitate the execution of visual reasoning tasks. These findings indicate that the dual-stream architecture could play a crucial role in visual abstract reasoning. \ No newline at end of file diff --git a/data/2024/aaai/Learning from Ambiguous Demonstrations with Self-Explanation Guided Reinforcement Learning b/data/2024/aaai/Learning from Ambiguous Demonstrations with Self-Explanation Guided Reinforcement Learning new file mode 100644 index 0000000000..3091585232 --- /dev/null +++ b/data/2024/aaai/Learning from Ambiguous Demonstrations with Self-Explanation Guided Reinforcement Learning @@ -0,0 +1 @@ +Our work aims at efficiently leveraging ambiguous demonstrations for the training of a reinforcement learning (RL) agent. An ambiguous demonstration can usually be interpreted in multiple ways, which severely hinders the RL agent from learning stably and efficiently. Since an optimal demonstration may also suffer from being ambiguous, previous works that combine RL and learning from demonstration (RLfD works) may not work well. Inspired by how humans handle such situations, we propose to use self-explanation (an agent generates explanations for itself) to recognize valuable high-level relational features as an interpretation of why a successful trajectory is successful. This way, the agent can leverage the explained important relations as guidance for its RL learning. Our main contribution is to propose the Self-Explanation for RL from Demonstrations (SERLfD) framework, which can overcome the limitations of existing RLfD works. Our experimental results show that an RLfD model can be improved by using our SERLfD framework in terms of training stability and performance. To foster further research in self-explanation-guided robot learning, we have made our demonstrations and code publicly accessible at https://github.com/YantianZha/SERLfD.
For a deeper understanding of our work, interested readers can refer to our arXiv version at https://arxiv.org/pdf/2110.05286.pdf, including an accompanying appendix. \ No newline at end of file diff --git a/data/2024/aaai/Learning from Failure: Improving Meeting Summarization without Good Samples b/data/2024/aaai/Learning from Failure: Improving Meeting Summarization without Good Samples new file mode 100644 index 0000000000..3e895aa2ff --- /dev/null +++ b/data/2024/aaai/Learning from Failure: Improving Meeting Summarization without Good Samples @@ -0,0 +1 @@ +Existing methods for aligning language models with various human needs rely heavily on high-quality, task-specific data. However, the industrial deployment of task-specific language models often encounters challenges in the availability of appropriate training samples. Taking meeting summarization as an example, public datasets are scarce, and private corpora are also hard to obtain due to privacy issues or resource-demanding annotation. To improve meeting summarization in the absence of positively-rated (i.e., ``good'') samples, we propose Score Tuning, a cold-start tuning framework that leverages bad samples of distinguishable degrees to incrementally enhance the performance of summary generation without an initial presence of good samples. Our method utilizes asynchronous and numerical human feedback that measures the quality of generated summaries. Formulating data into triplets of (transcript, summary, score), our approach instructs a pre-trained model to learn the association between summary qualities and human-rated scores and hence to generate better summaries corresponding to higher scores. The experimental results show that our method is effective in improving meeting summarization on both English and Chinese corpora while requiring less annotated data and fewer training resources compared to existing alignment methods. Additionally, we also preliminarily explore the transferability of our approach to machine translation tasks and demonstrate its potential for future development and usage in other domains. \ No newline at end of file diff --git a/data/2024/aaai/Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration b/data/2024/aaai/Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration new file mode 100644 index 0000000000..bb9e0602f3 --- /dev/null +++ b/data/2024/aaai/Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration @@ -0,0 +1 @@ +Contrastive learning has emerged as a prevailing paradigm for high-level vision tasks, which, by introducing proper negative samples, has also been exploited for low-level vision tasks to achieve a compact optimization space to account for their ill-posed nature. However, existing methods rely on manually predefined and task-oriented negatives, which often exhibit pronounced task-specific biases. To address this challenge, our paper introduces an innovative method termed 'learning from history', which dynamically generates negative samples from the target model itself. Our approach, named Model Contrastive Learning for Image Restoration (MCLIR), rejuvenates latency models as negative models, making it compatible with diverse image restoration tasks. We propose the Self-Prior guided Negative loss (SPN) to enable it. This approach significantly enhances existing models when retrained with the proposed model contrastive paradigm.
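A hedged sketch of the "negatives from the model's own history" idea: alongside a standard reconstruction loss, the output of a frozen earlier checkpoint acts as the negative that the current prediction is pushed away from. The exact form below is an assumption for illustration, not the paper's SPN loss.

```python
import numpy as np

def model_contrastive_loss(pred, target, negative, alpha=0.1, eps=1e-8):
    """pred:     restoration from the current model
    target:   ground-truth clean image
    negative: restoration from a frozen historical (earlier) checkpoint
    The second term rewards being closer to the target than to the negative."""
    recon = np.abs(pred - target).mean()                                # L1 reconstruction
    contrast = np.abs(pred - target).mean() / (np.abs(pred - negative).mean() + eps)
    return recon + alpha * contrast

rng = np.random.default_rng(0)
gt = rng.random((8, 8))
old = gt + rng.normal(0, 0.3, gt.shape)   # older, worse restoration as the negative
new = gt + rng.normal(0, 0.1, gt.shape)   # current, better restoration
print(model_contrastive_loss(new, gt, old))
```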
The results show significant improvements in image restoration across various tasks and architectures. For example, models retrained with SPN outperform the original FFANet and DehazeFormer by 3.41 and 0.57 dB on the RESIDE indoor dataset for image dehazing. Similarly, they achieve notable improvements of 0.47 dB on SPA-Data over IDT for image deraining and 0.12 dB on Manga109 for a 4x scale super-resolution over lightweight SwinIR, respectively. Code and retrained models are available at https://github.com/Aitical/MCLIR. \ No newline at end of file diff --git a/data/2024/aaai/Learning from an Infant's Visual Experience b/data/2024/aaai/Learning from an Infant's Visual Experience new file mode 100644 index 0000000000..10be12afcd --- /dev/null +++ b/data/2024/aaai/Learning from an Infant's Visual Experience @@ -0,0 +1 @@ +Infants see a selective view of the world: they see some objects with high frequency and from a wide range of viewpoints (e.g., their toys during playing) while a much larger set of objects are seen much more rarely and from limited viewpoints (e.g., objects they see outdoors). Extensive, repeated visual experiences with a small number of objects during infancy plays a big role in the development of human visual skills. Internet-style datasets that are commonly used in computer vision research do not contain the regularities that result from such repeated, structured experiences with a few objects. This has led to a dearth of models that learn by exploiting these regularities. In my PhD dissertation, I use deep learning models to investigate how regularities in an infant's visual experience can be leveraged for visual representation learning. \ No newline at end of file diff --git a/data/2024/aaai/Learning in Online Principal-Agent Interactions: The Power of Menus b/data/2024/aaai/Learning in Online Principal-Agent Interactions: The Power of Menus new file mode 100644 index 0000000000..cc3734c249 --- /dev/null +++ b/data/2024/aaai/Learning in Online Principal-Agent Interactions: The Power of Menus @@ -0,0 +1 @@ +We study a ubiquitous learning challenge in online principal-agent problems during which the principal learns the agent's private information from the agent's revealed preferences in historical interactions. This paradigm includes important special cases such as pricing and contract design, which have been widely studied in recent literature. However, existing work considers the case where the principal can only choose a single strategy at every round to interact with the agent and then observe the agent's revealed preference through their actions. In this paper, we extend this line of study to allow the principal to offer a menu of strategies to the agent and learn additionally from observing the agent's selection from the menu. We provide a thorough investigation of several online principal-agent problem settings and characterize their sample complexities, accompanied by the corresponding algorithms we have developed. We instantiate this paradigm to several important design problems — including Stackelberg (security) games, contract design, and information design. Finally, we also explore the connection between our findings and existing results about online learning in Stackelberg games, and we offer a solution that can overcome a key hard instance of previous work. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise b/data/2024/aaai/Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise new file mode 100644 index 0000000000..7d84728937 --- /dev/null +++ b/data/2024/aaai/Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise @@ -0,0 +1 @@ +This paper considers learning the hidden causal network of a linear networked dynamical system (NDS) from the time series data at some of its nodes -- partial observability. The dynamics of the NDS are driven by colored noise that generates spurious associations across pairs of nodes, rendering the problem much harder. To address the challenge of noise correlation and partial observability, we assign to each pair of nodes a feature vector computed from the time series data of observed nodes. The feature embedding is engineered to yield structural consistency: there exists an affine hyperplane that consistently partitions the set of features, separating the feature vectors corresponding to connected pairs of nodes from those corresponding to disconnected pairs. The causal inference problem is thus addressed via clustering the designed features. We demonstrate with simple baseline supervised methods the competitive performance of the proposed causal inference mechanism under broad connectivity regimes and noise correlation levels, including a real world network. Further, we devise novel technical guarantees of structural consistency for linear NDS under the considered regime. \ No newline at end of file diff --git a/data/2024/aaai/Learning the Topology and Behavior of Discrete Dynamical Systems b/data/2024/aaai/Learning the Topology and Behavior of Discrete Dynamical Systems new file mode 100644 index 0000000000..315ad6b3db --- /dev/null +++ b/data/2024/aaai/Learning the Topology and Behavior of Discrete Dynamical Systems @@ -0,0 +1 @@ +Discrete dynamical systems are commonly used to model the spread of contagions on real-world networks. Under the PAC framework, existing research has studied the problem of learning the behavior of a system, assuming that the underlying network is known. In this work, we focus on a more challenging setting: to learn both the behavior and the underlying topology of a black-box system. We show that, in general, this learning problem is computationally intractable. On the positive side, we present efficient learning methods under the PAC model when the underlying graph of the dynamical system belongs to certain classes. Further, we examine a relaxed setting where the topology of an unknown system is partially observed. For this case, we develop an efficient PAC learner to infer the system and establish the sample complexity. Lastly, we present a formal analysis of the expressive power of the hypothesis class of dynamical systems where both the topology and behavior are unknown, using the well-known Natarajan dimension formalism. Our results provide a theoretical foundation for learning both the topology and behavior of discrete dynamical systems. 
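For readers unfamiliar with the setting, a tiny threshold-based discrete dynamical system on a known graph, together with a naive learner that infers each node's threshold from observed transitions; both the model class and the inference rule are simplified stand-ins for the PAC learners discussed above.

```python
import numpy as np

def step(adj, state, thresholds):
    """Synchronous update: node i becomes active iff its number of active
    neighbours reaches its threshold."""
    active_neighbours = adj @ state
    return (active_neighbours >= thresholds).astype(int)

def infer_thresholds(adj, transitions, n):
    """Naive learner: for each node, the smallest active-neighbour count
    that was ever followed by that node activating."""
    guess = np.full(n, np.inf)
    for before, after in transitions:
        counts = adj @ before
        for i in range(n):
            if after[i] == 1:
                guess[i] = min(guess[i], counts[i])
    return guess

# 4-node path graph with ground-truth thresholds
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
true_t = np.array([0, 1, 1, 1])
s = np.array([1, 0, 0, 0])
trace = []
for _ in range(3):
    nxt = step(adj, s, true_t)
    trace.append((s, nxt))
    s = nxt
print(infer_thresholds(adj, trace, 4))   # recovers [0. 1. 1. 1.] on this toy trace
```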
\ No newline at end of file diff --git a/data/2024/aaai/Learning to Approximate Adaptive Kernel Convolution on Graphs b/data/2024/aaai/Learning to Approximate Adaptive Kernel Convolution on Graphs new file mode 100644 index 0000000000..83375febf6 --- /dev/null +++ b/data/2024/aaai/Learning to Approximate Adaptive Kernel Convolution on Graphs @@ -0,0 +1 @@ +Various Graph Neural Networks (GNNs) have been successful in analyzing data in non-Euclidean spaces; however, they have limitations such as oversmoothing, i.e., information becomes excessively averaged as the number of hidden layers increases. The issue stems from the intrinsic formulation of conventional graph convolution, where the nodal features are aggregated from a direct neighborhood per layer across all nodes in the graph. As setting a different number of hidden layers per node is infeasible, recent works leverage a diffusion kernel to redefine the graph structure and incorporate information from farther nodes. Unfortunately, such approaches suffer from heavy diagonalization of a graph Laplacian or learning a large transform matrix. In this regard, we propose a diffusion learning framework where the range of feature aggregation is controlled by the scale of a diffusion kernel. For efficient computation, we derive closed-form derivatives of approximations of the graph convolution with respect to the scale, so that the node-wise range can be adaptively learned. With a downstream classifier, the entire framework is made trainable in an end-to-end manner. Our model is tested on various standard datasets for node-wise classification, achieving state-of-the-art performance, and it is also validated on real-world brain network data for graph classification to demonstrate its practicality for Alzheimer's disease classification. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Build Solutions in Stochastic Matching Problems Using Flows (Student Abstract) b/data/2024/aaai/Learning to Build Solutions in Stochastic Matching Problems Using Flows (Student Abstract) new file mode 100644 index 0000000000..18c0bbeb8a --- /dev/null +++ b/data/2024/aaai/Learning to Build Solutions in Stochastic Matching Problems Using Flows (Student Abstract) @@ -0,0 +1 @@ +Generative Flow Networks, known as GFlowNets, have been introduced in recent times, presenting an exciting possibility for neural networks to model distributions across various data structures. In this paper, we broaden their applicability to encompass scenarios where the data structures are optimal solutions of a combinatorial problem. Concretely, we propose the use of GFlowNets to learn the distribution of optimal solutions for kidney exchange problems (KEPs), a generalized form of matching problems involving cycles. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Learn Better Visual Prompts b/data/2024/aaai/Learning to Learn Better Visual Prompts new file mode 100644 index 0000000000..bc54e25c53 --- /dev/null +++ b/data/2024/aaai/Learning to Learn Better Visual Prompts @@ -0,0 +1 @@ +Prompt tuning provides a low-cost way of adapting vision-language models (VLMs) to various downstream vision tasks without requiring updates to the huge set of pre-trained parameters. Dispensing with the conventional manual crafting of prompts, the recent prompt tuning method of Context Optimization (CoOp) introduces adaptable vectors as text prompts.
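To ground the discussion of adaptable prompt vectors, a minimal CoOp-style sketch: shared learnable context vectors are prepended to each class-name embedding, encoded, and scored against the image feature by cosine similarity. The tiny "encoder" here is a random placeholder standing in for CLIP's, so only the data flow is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, n_cls = 16, 4, 3

ctx = rng.normal(scale=0.02, size=(n_ctx, d))    # learnable context vectors
class_tokens = rng.normal(size=(n_cls, 1, d))    # frozen class-name embeddings
W_text = rng.normal(size=(d, d)) / np.sqrt(d)    # stand-in for the text encoder

def encode_text(prompt):                          # prompt: (n_ctx + 1, d)
    return np.tanh(prompt.mean(axis=0) @ W_text)

def logits(image_feat):
    text_feats = np.stack([encode_text(np.concatenate([ctx, class_tokens[c]]))
                           for c in range(n_cls)])
    text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)
    img = image_feat / np.linalg.norm(image_feat)
    return 100.0 * text_feats @ img               # temperature-scaled cosine similarities

print(logits(rng.normal(size=d)))                 # only `ctx` would be trained
```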
Nevertheless, several previous works point out that CoOp-based approaches tend to overfit to the base classes and generalize poorly to novel classes. In this paper, we argue that prompt tuning works well only on the base classes because of the limited capacity of the adaptable vectors. The scale of the pre-trained model is hundreds of times that of the adaptable vector, so the learned vector has a very limited ability to absorb the knowledge of novel classes. To minimize this excessive overfitting of textual knowledge on the base classes, we view prompt tuning as learning to learn (LoL) and learn the prompt in the manner of meta-learning: the training strategy of dividing the base classes into many different subclasses can fully exert the limited capacity of prompt tuning and thus transfer its power to recognizing the novel classes. To be specific, we initially perform fine-tuning on the base classes based on the CoOp method for pre-trained CLIP. Subsequently, predicated on the fine-tuned CLIP model, we carry out further fine-tuning in an N-way K-shot manner from the perspective of meta-learning on the base classes. We finally apply the learned textual vector and VLM to unseen classes. Extensive experiments on benchmark datasets validate the efficacy of our meta-learning-informed prompt tuning, affirming its role as a robust optimization strategy for VLMs. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Learn in Interactive Constraint Acquisition b/data/2024/aaai/Learning to Learn in Interactive Constraint Acquisition new file mode 100644 index 0000000000..cd89ed7f6d --- /dev/null +++ b/data/2024/aaai/Learning to Learn in Interactive Constraint Acquisition @@ -0,0 +1,4 @@ +Constraint Programming (CP) has been successfully used to model and solve complex combinatorial problems. However, modeling is often not trivial and requires expertise, which is a bottleneck to wider adoption. In Constraint Acquisition (CA), the goal is to assist the user by automatically learning the model. +In (inter)active CA, this is done by interactively posting queries to the user, e.g., does this partial solution satisfy your (unspecified) constraints or not. +While interactive CA methods learn the constraints, the learning is related to symbolic concept learning, as the goal is to learn an exact representation. +However, a large number of queries is required to learn the model, which is a major limitation. In this paper, we aim to alleviate this limitation by tightening the connection between CA and Machine Learning (ML) by, for the first time in interactive CA, exploiting statistical ML methods. We propose to use probabilistic classification models to guide interactive CA queries to the most promising parts. We discuss how to train classifiers to predict whether a candidate expression from the bias is a constraint of the problem or not, using both relation-based and scope-based features. We then show how the predictions can be used in all layers of interactive CA: the query generation, the scope finding, and the lowest-level constraint finding. We experimentally evaluate our proposed methods using different classifiers and show that our methods greatly outperform the state of the art, decreasing the number of queries needed to converge by up to 72%.
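A small sketch of this kind of statistical guidance, assuming a generic scikit-learn classifier: candidate expressions from the bias are featurized with simple relation- and scope-based features, the classifier is fitted on expressions whose status is already known, and its probabilities rank which candidates the next queries should target. The features and names are illustrative, not the paper's exact design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(cand):
    """cand: (relation_id, scope) -- toy relation-based and scope-based features."""
    relation_id, scope = cand
    return [relation_id, len(scope), max(scope) - min(scope)]

# expressions whose constraint/non-constraint status is already known
known = [((0, (1, 2)), 1), ((0, (4, 5)), 1), ((1, (1, 7)), 0), ((1, (2, 9)), 0)]
X = np.array([featurize(c) for c, _ in known])
y = np.array([label for _, label in known])

clf = LogisticRegression().fit(X, y)

bias = [(0, (7, 8)), (1, (0, 6)), (0, (2, 3))]    # remaining candidate expressions
proba = clf.predict_proba(np.array([featurize(c) for c in bias]))[:, 1]
ranked = sorted(zip(bias, proba), key=lambda t: -t[1])
print(ranked)                                      # query the most promising candidates first
```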
\ No newline at end of file diff --git a/data/2024/aaai/Learning to Manipulate Artistic Images b/data/2024/aaai/Learning to Manipulate Artistic Images new file mode 100644 index 0000000000..6b5c40c2ce --- /dev/null +++ b/data/2024/aaai/Learning to Manipulate Artistic Images @@ -0,0 +1 @@ +Recent advancement in computer vision has significantly lowered the barriers to artistic creation. Exemplar-based image translation methods have attracted much attention due to flexibility and controllability. However, these methods hold assumptions regarding semantics or require semantic information as the input, while accurate semantics is not easy to obtain in artistic images. Besides, these methods suffer from cross-domain artifacts due to training data prior and generate imprecise structure due to feature compression in the spatial domain. In this paper, we propose an arbitrary Style Image Manipulation Network (SIM-Net), which leverages semantic-free information as guidance and a region transportation strategy in a self-supervised manner for image generation. Our method balances computational efficiency and high resolution to a certain extent. Moreover, our method facilitates zero-shot style image manipulation. Both qualitative and quantitative experiments demonstrate the superiority of our method over state-of-the-art methods.Code is available at https://github.com/SnailForce/SIM-Net. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Optimize Permutation Flow Shop Scheduling via Graph-Based Imitation Learning b/data/2024/aaai/Learning to Optimize Permutation Flow Shop Scheduling via Graph-Based Imitation Learning new file mode 100644 index 0000000000..ce119c42ac --- /dev/null +++ b/data/2024/aaai/Learning to Optimize Permutation Flow Shop Scheduling via Graph-Based Imitation Learning @@ -0,0 +1 @@ +The permutation flow shop scheduling (PFSS), aiming at finding the optimal permutation of jobs, is widely used in manufacturing systems. When solving large-scale PFSS problems, traditional optimization algorithms such as heuristics could hardly meet the demands of both solution accuracy and computational efficiency, thus learning-based methods have recently garnered more attention. Some work attempts to solve the problems by reinforcement learning methods, which suffer from slow convergence issues during training and are still not accurate enough regarding the solutions. To that end, we propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately. Moreover, in order to extract better feature representations of input jobs, we incorporate the graph structure as the encoder. The extensive experiments reveal that our proposed model obtains significant promotion and presents excellent generalizability in large-scale problems with up to 1000 jobs. Compared to the state-of-the-art reinforcement learning method, our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average. The code is available at: https://github.com/longkangli/PFSS-IL. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Pivot as a Smart Expert b/data/2024/aaai/Learning to Pivot as a Smart Expert new file mode 100644 index 0000000000..dc3c49d317 --- /dev/null +++ b/data/2024/aaai/Learning to Pivot as a Smart Expert @@ -0,0 +1 @@ +Linear programming has been practically solved mainly by simplex and interior point methods. 
Compared with the weakly polynomial complexity obtained by the interior point methods, the existence of strongly polynomial bounds for the length of the pivot path generated by the simplex methods remains a mystery. In this paper, we propose two novel pivot experts that leverage both global and local information of the linear programming instances for the primal simplex method and show their excellent performance numerically. The experts can be regarded as a benchmark to evaluate the performance of classical pivot rules, although they are hard to directly implement. To tackle this challenge, we employ a graph convolutional neural network model, trained via imitation learning, to mimic the behavior of the pivot expert. Our pivot rule, learned empirically, displays a significant advantage over conventional methods in various linear programming problems, as demonstrated through a series of rigorous experiments. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Prompt Knowledge Transfer for Open-World Continual Learning b/data/2024/aaai/Learning to Prompt Knowledge Transfer for Open-World Continual Learning new file mode 100644 index 0000000000..f383fab7dd --- /dev/null +++ b/data/2024/aaai/Learning to Prompt Knowledge Transfer for Open-World Continual Learning @@ -0,0 +1 @@ +This paper studies the problem of continual learning in an open-world scenario, referred to as Open-world Continual Learning (OwCL). OwCL is increasingly rising while it is highly challenging in two-fold: i) learning a sequence of tasks without forgetting knowns in the past, and ii) identifying unknowns (novel objects/classes) in the future. Existing OwCL methods suffer from the adaptability of task-aware boundaries between knowns and unknowns, and do not consider the mechanism of knowledge transfer. In this work, we propose Pro-KT, a novel prompt-enhanced knowledge transfer model for OwCL. Pro-KT includes two key components: (1) a prompt bank to encode and transfer both task-generic and task-specific knowledge, and (2) a task-aware open-set boundary to identify unknowns in the new tasks. Experimental results using two real-world datasets demonstrate that the proposed Pro-KT outperforms the state-of-the-art counterparts in both the detection of unknowns and the classification of knowns markedly. Code released at https://github.com/YujieLi42/Pro-KT. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Rank in Generative Retrieval b/data/2024/aaai/Learning to Rank in Generative Retrieval new file mode 100644 index 0000000000..11d46d2591 --- /dev/null +++ b/data/2024/aaai/Learning to Rank in Generative Retrieval @@ -0,0 +1 @@ +Generative retrieval stands out as a promising new paradigm in text retrieval that aims to generate identifier strings of relevant passages as the retrieval target. This generative paradigm taps into powerful generative language models, distinct from traditional sparse or dense retrieval methods. However, only learning to generate is insufficient for generative retrieval. Generative retrieval learns to generate identifiers of relevant passages as an intermediate goal and then converts predicted identifiers into the final passage rank list. The disconnect between the learning objective of autoregressive models and the desired passage ranking target leads to a learning gap. To bridge this gap, we propose a learning-to-rank framework for generative retrieval, dubbed LTRGR. 
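As a sketch of what a rank loss over generated passage identifiers can look like in such a framework, here is a margin-based ranking loss on autoregressive identifier scores for a relevant and an irrelevant passage; the scoring function and margin are placeholders rather than LTRGR's exact formulation.

```python
import numpy as np

def sequence_score(token_logprobs):
    """Autoregressive score of a passage identifier = sum of its token log-probs."""
    return float(np.sum(token_logprobs))

def margin_rank_loss(pos_logprobs, neg_logprobs, margin=1.0):
    """Penalize the model when the relevant identifier does not outscore the
    irrelevant one by at least `margin`."""
    s_pos = sequence_score(pos_logprobs)
    s_neg = sequence_score(neg_logprobs)
    return max(0.0, margin - (s_pos - s_neg))

pos = np.log([0.9, 0.8, 0.7])   # token probabilities of the relevant identifier
neg = np.log([0.6, 0.5, 0.4])   # token probabilities of an irrelevant identifier
print(margin_rank_loss(pos, neg, margin=2.0))
```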
LTRGR enables generative retrieval to learn to rank passages directly, optimizing the autoregressive model toward the final passage ranking target via a rank loss. This framework only requires an additional learning-to-rank training phase to enhance current generative retrieval systems and does not add any burden to the inference stage. We conducted experiments on three public benchmarks, and the results demonstrate that LTRGR achieves state-of-the-art performance among generative retrieval methods. The code and checkpoints are released at https://github.com/liyongqi67/LTRGR. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Reweight for Generalizable Graph Neural Network b/data/2024/aaai/Learning to Reweight for Generalizable Graph Neural Network new file mode 100644 index 0000000000..d1adb89583 --- /dev/null +++ b/data/2024/aaai/Learning to Reweight for Generalizable Graph Neural Network @@ -0,0 +1,8 @@ +Graph Neural Networks (GNNs) show promising results for graph tasks. However, existing GNNs' generalization ability will degrade when there exist distribution shifts between testing and training graph data. +The fundamental reason for the severe degeneration is that most GNNs are designed based on the I.I.D hypothesis. In such a setting, GNNs tend to exploit subtle statistical correlations existing in the training set for predictions, even though it is a spurious correlation. +In this paper, we study the problem of the generalization ability of GNNs on Out-Of-Distribution (OOD) settings. +To solve this problem, we propose the Learning to Reweight for Generalizable Graph Neural Network (L2R-GNN) to enhance the generalization ability for achieving satisfactory performance on unseen testing graphs that have different distributions with training graphs. +We propose a novel nonlinear graph decorrelation method, which can substantially improve the out-of-distribution generalization ability and compares favorably to previous methods in restraining the over-reduced sample size. +The variables of graph representation are clustered based on the stability of their correlations, and graph decorrelation method learns weights to remove correlations between the variables of different clusters rather than any two variables. +Besides, we introduce an effective stochastic algorithm based on bi-level optimization for the L2R-GNN framework, which enables simultaneously learning the optimal weights and GNN parameters, and avoids the over-fitting issue. +Experiments show that L2R-GNN greatly outperforms baselines on various graph prediction benchmarks under distribution shifts. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Stop Cut Generation for Efficient Mixed-Integer Linear Programming b/data/2024/aaai/Learning to Stop Cut Generation for Efficient Mixed-Integer Linear Programming new file mode 100644 index 0000000000..ee744b141a --- /dev/null +++ b/data/2024/aaai/Learning to Stop Cut Generation for Efficient Mixed-Integer Linear Programming @@ -0,0 +1 @@ +Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), as they significantly tighten the dual bounds and improve the solving performance. A key problem for cuts is when to stop cuts generation, which is important for the efficiency of solving MILPs. However, many modern MILP solvers employ hard-coded heuristics to tackle this problem, which tends to neglect underlying patterns among MILPs from certain applications. 
To address this challenge, we formulate the cuts generation stopping problem as a reinforcement learning problem and propose a novel hybrid graph representation model (HYGRO) to learn effective stopping strategies. An appealing feature of HYGRO is that it can effectively capture both the dynamic and static features of MILPs, enabling dynamic decision-making for the stopping strategies. To the best of our knowledge, HYGRO is the first data-driven method to tackle the cuts generation stopping problem. By integrating our approach with modern solvers, experiments demonstrate that HYGRO significantly improves the efficiency of solving MILPs compared to competitive baselines, achieving up to 31% improvement. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Unlearn: Instance-Wise Unlearning for Pre-trained Classifiers b/data/2024/aaai/Learning to Unlearn: Instance-Wise Unlearning for Pre-trained Classifiers new file mode 100644 index 0000000000..56a1926617 --- /dev/null +++ b/data/2024/aaai/Learning to Unlearn: Instance-Wise Unlearning for Pre-trained Classifiers @@ -0,0 +1 @@ +Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand in deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks towards adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining the predictive performance across remaining data. To this end, we consider instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods only require the pre-trained model and data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of remaining data while unlearning given instances in both single-task and continual unlearning scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Learning with Noisy Labels Using Hyperspherical Margin Weighting b/data/2024/aaai/Learning with Noisy Labels Using Hyperspherical Margin Weighting new file mode 100644 index 0000000000..2d03fa1f67 --- /dev/null +++ b/data/2024/aaai/Learning with Noisy Labels Using Hyperspherical Margin Weighting @@ -0,0 +1 @@ +Datasets often include noisy labels, but learning from them is difficult. Since mislabeled examples usually have larger loss values in training, the small-loss trick is regarded as a standard metric to identify the clean example from the training set for better performance. Nonetheless, this proposal ignores that some clean but hard-to-learn examples also generate large losses. They could be misidentified by this criterion. 
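For context, the small-loss criterion that this abstract argues against fits in a couple of lines: keep the fraction of samples with the smallest current loss as the presumed-clean set. The sketch below is that baseline, not the proposed IAM/HMW method.

```python
import numpy as np

def small_loss_selection(losses, keep_ratio=0.7):
    """Return indices of the `keep_ratio` fraction of samples with the smallest loss,
    treated as 'clean' -- clean-but-hard examples with large loss get discarded."""
    k = int(len(losses) * keep_ratio)
    return np.argsort(losses)[:k]

losses = np.array([0.2, 2.5, 0.4, 3.1, 0.9, 2.8])   # large losses: noisy OR hard
print(small_loss_selection(losses))
```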
In this paper, we propose a new metric called the Integrated Area Margin (IAM), which is superior to the traditional small-loss trick, particularly in recognizing the clean but hard-to-learn examples. According to the IAM, we further offer the Hyperspherical Margin Weighting (HMW) approach. It is a new sample weighting strategy that restructures the importance of each example. It should be highlighted that our approach is universal and can strengthen various methods in this field. Experiments on both benchmark and real-world datasets indicate that our HMW outperforms many state-of-the-art approaches in learning with noisy label tasks. Codes are available at https://github.com/Zhangshuojackpot/HMW. \ No newline at end of file diff --git a/data/2024/aaai/Learning-Augmented Online Algorithm for Two-Level Ski-Rental Problem b/data/2024/aaai/Learning-Augmented Online Algorithm for Two-Level Ski-Rental Problem new file mode 100644 index 0000000000..e50fc21a0a --- /dev/null +++ b/data/2024/aaai/Learning-Augmented Online Algorithm for Two-Level Ski-Rental Problem @@ -0,0 +1 @@ +In this paper, we study the two-level ski-rental problem, where a user needs to fulfill a sequence of demands for multiple items by choosing one of the three payment options: paying for the on-demand usage (i.e., rent), buying individual items (i.e., single purchase), and buying all the items (i.e., combo purchase). Without knowing future demands, the user aims to minimize the total cost (i.e., the sum of the rental, single purchase, and combo purchase costs) by balancing the trade-off between the expensive upfront costs (for purchase) and the potential future expenses (for rent). We first design a robust online algorithm (RDTSR) that offers a worst-case performance guarantee. While online algorithms are robust against the worst-case scenarios, they are often overly cautious and thus suffer a poor average performance in typical scenarios. On the other hand, Machine Learning (ML) algorithms typically show promising average performance in various applications but lack worst-case performance guarantees. To harness the benefits of both methods, we develop a learning-augmented algorithm (LADTSR) by integrating ML predictions into the robust online algorithm, which outperforms the robust online algorithm under accurate predictions while ensuring worst-case performance guarantees even when predictions are inaccurate. Finally, we conduct numerical experiments on both synthetic and real-world trace data to corroborate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Leaving the Nest: Going beyond Local Loss Functions for Predict-Then-Optimize b/data/2024/aaai/Leaving the Nest: Going beyond Local Loss Functions for Predict-Then-Optimize new file mode 100644 index 0000000000..09b2be03d2 --- /dev/null +++ b/data/2024/aaai/Leaving the Nest: Going beyond Local Loss Functions for Predict-Then-Optimize @@ -0,0 +1 @@ +Predict-then-Optimize is a framework for using machine learning to perform decision-making under uncertainty. The central research question it asks is, "How can we use the structure of a decision-making task to tailor ML models for that specific task?" To this end, recent work has proposed learning task-specific loss functions that capture this underlying structure. However, current approaches make restrictive assumptions about the form of these losses and their impact on ML model behavior. 
These assumptions both lead to approaches with high computational cost, and when they are violated in practice, poor performance. In this paper, we propose solutions to these issues, avoiding the aforementioned assumptions and utilizing the ML model's features to increase the sample efficiency of learning loss functions. We empirically show that our method achieves state-of-the-art results in four domains from the literature, often requiring an order of magnitude fewer samples than comparable methods from past work. Moreover, our approach outperforms the best existing method by nearly 200% when the localness assumption is broken. \ No newline at end of file diff --git a/data/2024/aaai/Less Is More: Label Recommendation for Weakly Supervised Point Cloud Semantic Segmentation b/data/2024/aaai/Less Is More: Label Recommendation for Weakly Supervised Point Cloud Semantic Segmentation new file mode 100644 index 0000000000..3b0d2fc8a6 --- /dev/null +++ b/data/2024/aaai/Less Is More: Label Recommendation for Weakly Supervised Point Cloud Semantic Segmentation @@ -0,0 +1 @@ +Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor scenes, we are one of the first to propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds. Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbones. Specifically, we leverage geometry patterns in outdoor scenes to have a heuristic pre-segmentation to reduce the manual labeling and jointly design the learning targets with the labeling process. In the learning step, we leverage prototype learning to get more descriptive point embeddings and use multi-scan distillation to exploit richer semantics from temporally aggregated point clouds to boost the performance of single-scan models. Evaluated on the SemanticKITTI and the nuScenes datasets, we show that our proposed method outperforms existing label-efficient methods. With extremely limited human annotations (e.g., 0.1% point labels), our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels. \ No newline at end of file diff --git a/data/2024/aaai/Let All Be Whitened: Multi-Teacher Distillation for Efficient Visual Retrieval b/data/2024/aaai/Let All Be Whitened: Multi-Teacher Distillation for Efficient Visual Retrieval new file mode 100644 index 0000000000..d8838b98af --- /dev/null +++ b/data/2024/aaai/Let All Be Whitened: Multi-Teacher Distillation for Efficient Visual Retrieval @@ -0,0 +1 @@ +Visual retrieval aims to search for the most relevant visual items, e.g., images and videos, from a candidate gallery with a given query item. Accuracy and efficiency are two competing objectives in retrieval tasks. Instead of crafting a new method pursuing further improvement on accuracy, in this paper we propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval. 
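The "Whiten" in Whiten-MTD refers to whitening each teacher's outputs before fusing them; as a generic reference point (not the paper's exact procedure), a ZCA-style whitening and fusion might look like this:

```python
import numpy as np

def whiten(x, eps=1e-5):
    """x: (n_samples, dim) outputs from one teacher. Returns whitened features
    with zero mean and (approximately) identity covariance."""
    x = x - x.mean(axis=0, keepdims=True)
    cov = (x.T @ x) / (len(x) - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    w = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T   # ZCA transform
    return x @ w

rng = np.random.default_rng(0)
teacher_a = rng.normal(size=(256, 8)) * 5.0    # teachers on very different scales ...
teacher_b = rng.normal(size=(256, 8)) * 0.1
fused = 0.5 * (whiten(teacher_a) + whiten(teacher_b))   # ... fused after whitening
print(np.cov(fused, rowvar=False).round(2))
```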
Furthermore, we discover that the similarities obtained by different retrieval models are diversified and incommensurable, which makes it challenging to jointly distill knowledge from multiple models. Therefore, we propose to whiten the output of teacher models before fusion, which enables effective multi-teacher distillation for retrieval models. Whiten-MTD is conceptually simple and practically effective. Extensive experiments on two landmark image retrieval datasets and one video retrieval dataset demonstrate the effectiveness of our proposed method, and its good balance of retrieval performance and efficiency. Our source code is released at https://github.com/Maryeon/whiten_mtd. \ No newline at end of file diff --git a/data/2024/aaai/Let There Be Sound: Reconstructing High Quality Speech from Silent Videos b/data/2024/aaai/Let There Be Sound: Reconstructing High Quality Speech from Silent Videos new file mode 100644 index 0000000000..6e81ffa426 --- /dev/null +++ b/data/2024/aaai/Let There Be Sound: Reconstructing High Quality Speech from Silent Videos @@ -0,0 +1 @@ +The goal of this work is to reconstruct high quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, resulting in a mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to better solve the aforementioned problem, we employ a flow based post-net which captures and refines the details of the generated speech. We perform extensive experiments on two datasets, and demonstrate that our method achieves the generation quality close to that of real human utterance, outperforming existing methods in terms of speech naturalness and intelligibility by a large margin. Synthesised samples are available at our demo page: https://mm.kaist.ac.kr/projects/LTBS. \ No newline at end of file diff --git a/data/2024/aaai/Levenshtein Distance Embedding with Poisson Regression for DNA Storage b/data/2024/aaai/Levenshtein Distance Embedding with Poisson Regression for DNA Storage new file mode 100644 index 0000000000..eed0e1a7cd --- /dev/null +++ b/data/2024/aaai/Levenshtein Distance Embedding with Poisson Regression for DNA Storage @@ -0,0 +1 @@ +Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming the Levenshtein distance between sequences of fixed length following a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. 
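The Poisson assumption yields a simple training objective, sketched below: the embedding distance between two sequences is treated as the rate of a Poisson distribution, and the loss is the Poisson negative log-likelihood of the observed Levenshtein distance (dropping the constant log k! term). The embedding map itself is left abstract; this is an illustration of the idea rather than the paper's exact objective.

```python
import numpy as np

def poisson_nll(rate, k, eps=1e-8):
    """Negative log-likelihood of observing edit distance k under Poisson(rate),
    dropping the constant log(k!) term."""
    rate = np.maximum(rate, eps)
    return rate - k * np.log(rate)

def pairwise_loss(emb_x, emb_y, lev_dist):
    rate = np.abs(emb_x - emb_y).sum()    # embedding distance used as the Poisson rate
    return poisson_nll(rate, lev_dist)

rng = np.random.default_rng(0)
print(pairwise_loss(rng.random(16), rng.random(16), lev_dist=4))
```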
Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Leverage the Explainability of Transformer Models to Improve the DNA 5-Methylcytosine Identification (Student Abstract) b/data/2024/aaai/Leverage the Explainability of Transformer Models to Improve the DNA 5-Methylcytosine Identification (Student Abstract) new file mode 100644 index 0000000000..113a7ef8f6 --- /dev/null +++ b/data/2024/aaai/Leverage the Explainability of Transformer Models to Improve the DNA 5-Methylcytosine Identification (Student Abstract) @@ -0,0 +1 @@ +DNA methylation is an epigenetic mechanism for regulating gene expression, and it plays an important role in many biological processes. While methylation sites can be identified using laboratory techniques, much work is being done on developing computational approaches using machine learning. Here, we present a deep-learning algorithm for determining the 5-methylcytosine status of a DNA sequence. We propose an ensemble framework that treats the self-attention score as an explicit feature that is added to the encoder layer generated by fine-tuned language models. We evaluate the performance of the model under different data distribution scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision b/data/2024/aaai/Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision new file mode 100644 index 0000000000..5545dd314e --- /dev/null +++ b/data/2024/aaai/Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision @@ -0,0 +1 @@ +Computer vision models have been known to encode harmful biases, leading to the potentially unfair treatment of historically marginalized groups, such as people of color. However, there remains a lack of datasets balanced along demographic traits that can be used to evaluate the downstream fairness of these models. In this work, we demonstrate that diffusion models can be leveraged to create such a dataset. We first use a diffusion model to generate a large set of images depicting various occupations. Subsequently, each image is edited using inpainting to generate multiple variants, where each variant refers to a different perceived race. Using this dataset, we benchmark several vision-language models on a multi-class occupation classification task. We find that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that several misclassifications are suggestive of racial biases. We measure a model’s downstream fairness by computing the standard deviation in the probability of predicting the true occupation label across the different identity groups. Using this fairness metric, we find significant disparities between the evaluated vision-and-language models. We hope that our work demonstrates the potential value of diffusion methods for fairness evaluations. 
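The downstream fairness metric described above is simple enough to state in code: for a given occupation, take the model's probability of the true label within each perceived identity group and report the standard deviation across the group means (group names below are placeholders).

```python
import numpy as np

def fairness_std(true_label_probs_by_group):
    """true_label_probs_by_group: dict group -> list of P(true occupation label)
    for that group's images. A lower std means more uniform treatment across groups."""
    group_means = [np.mean(p) for p in true_label_probs_by_group.values()]
    return float(np.std(group_means))

probs = {"group_a": [0.82, 0.79, 0.85],
         "group_b": [0.61, 0.58, 0.66],
         "group_c": [0.74, 0.71, 0.77]}
print(fairness_std(probs))
```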
\ No newline at end of file diff --git a/data/2024/aaai/Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection b/data/2024/aaai/Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection new file mode 100644 index 0000000000..0ea4e3b008 --- /dev/null +++ b/data/2024/aaai/Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection @@ -0,0 +1 @@ +Training high-accuracy 3D detectors necessitates massive labeled 3D annotations with 7 degree-of-freedom, which is laborious and time-consuming. Therefore, the form of point annotations is proposed to offer significant prospects for practical applications in 3D detection, which is not only more accessible and less expensive but also provides strong spatial information for object localization. In this paper, we empirically discover that it is non-trivial to merely adapt Point-DETR to its 3D form, encountering two main bottlenecks: 1) it fails to encode strong 3D prior into the model, and 2) it generates low-quality pseudo labels in distant regions due to the extreme sparsity of LiDAR points. To overcome these challenges, we introduce Point-DETR3D, a teacher-student framework for weakly semi-supervised 3D detection, designed to fully capitalize on point-wise supervision within a constrained instance-wise annotation budget. Different from Point-DETR which encodes 3D positional information solely through a point encoder, we propose an explicit positional query initialization strategy to enhance the positional prior. Considering the low quality of pseudo labels at distant regions produced by the teacher model, we enhance the detector's perception by incorporating dense imagery data through a novel Cross-Modal Deformable RoI Fusion (D-RoI). Moreover, an innovative point-guided self-supervised learning technique is proposed to allow for fully exploiting point priors, even in student models. Extensive experiments on representative nuScenes dataset demonstrate our Point-DETR3D obtains significant improvements compared to previous works. Notably, with only 5% of labeled data, Point-DETR3D achieves over 90% performance of its fully supervised counterpart. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Local Variance for Pseudo-Label Selection in Semi-supervised Learning b/data/2024/aaai/Leveraging Local Variance for Pseudo-Label Selection in Semi-supervised Learning new file mode 100644 index 0000000000..254065daee --- /dev/null +++ b/data/2024/aaai/Leveraging Local Variance for Pseudo-Label Selection in Semi-supervised Learning @@ -0,0 +1,2 @@ +Semi-supervised learning algorithms that use pseudo-labeling have become increasingly popular for improving model performance by utilizing both labeled and unlabeled data. +In this paper, we offer a fresh perspective on the selection of pseudo-labels, inspired by theoretical insights. We suggest that pseudo-labels with a high degree of local variance are more prone to inaccuracies. Based on this premise, we introduce the Local Variance Match (LVM) method, which aims to optimize the selection of pseudo-labels in semi-supervised learning (SSL) tasks. Our methodology is validated through a series of experiments on widely-used image classification datasets, such as CIFAR-10, CIFAR-100, and SVHN, spanning various labeled data quantity scenarios. 
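A rough sketch of the local-variance idea: for each unlabeled sample, measure how much the predicted class probabilities vary over its nearest neighbours in feature space, and filter out pseudo-labels whose neighbourhoods disagree strongly. The neighbourhood definition and threshold are illustrative choices, not the paper's exact criterion.

```python
import numpy as np

def local_variance(features, probs, k=5):
    """features: (n, d) embeddings; probs: (n, c) predicted class probabilities.
    Returns, per sample, the mean variance of class probabilities over its k nearest neighbours."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]
    return probs[knn].var(axis=1).mean(axis=-1)

def select_pseudo_labels(features, probs, k=5, var_threshold=0.06):
    keep = local_variance(features, probs, k) < var_threshold
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))
probs = rng.dirichlet(np.ones(3), size=50)
idx, labels = select_pseudo_labels(feats, probs)
print(len(idx), labels[:5])
```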
The empirical findings show that the LVM method substantially outpaces current SSL techniques, achieving state-of-the-art results in many of these scenarios. For instance, we observed an error rate of 5.41% on CIFAR-10 with a single label for each class, 35.87% on CIFAR-100 when using four labels per class, and 1.94% on SVHN with four labels for each class. Notably, the standout error rate of 5.41% is less than 1% shy of the performance in a fully-supervised learning environment. In experiments on ImageNet with 100k labeled data, the LVM also reached state-of-the-art outcomes. Additionally, the efficacy of the LVM method is further validated by its stellar performance in speech recognition experiments. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Normalization Layer in Adapters with Progressive Learning and Adaptive Distillation for Cross-Domain Few-Shot Learning b/data/2024/aaai/Leveraging Normalization Layer in Adapters with Progressive Learning and Adaptive Distillation for Cross-Domain Few-Shot Learning new file mode 100644 index 0000000000..1b6e1b0282 --- /dev/null +++ b/data/2024/aaai/Leveraging Normalization Layer in Adapters with Progressive Learning and Adaptive Distillation for Cross-Domain Few-Shot Learning @@ -0,0 +1 @@ +Cross-domain few-shot learning presents a formidable challenge, as models must be trained on base classes and then tested on novel classes from various domains with only a few samples at hand. While prior approaches have primarily focused on parameter-efficient methods of using adapters, they often overlook two critical issues: shifts in batch statistics and noisy sample statistics arising from domain discrepancy variations. In this paper, we introduce Leveraging Normalization Layer in Adapters with Progressive Learning and Adaptive Distillation (ProLAD), marking two principal contributions. First, our methodology utilizes two separate adapters: one devoid of a normalization layer, which is more effective for similar domains, and another embedded with a normalization layer, designed to leverage the batch statistics of the target domain, thus proving effective for dissimilar domains. Second, to address the pitfalls of noisy statistics, we deploy two strategies: a progressive training of the two adapters and an adaptive distillation technique derived from features determined by the model solely with the adapter devoid of a normalization layer. Through this adaptive distillation, our approach functions as a modulator, controlling the primary adapter for adaptation, based on each domain. Evaluations on standard cross-domain few-shot learning benchmarks confirm that our technique outperforms existing state-of-the-art methodologies. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Opposite Gender Interaction Ratio as a Path towards Fairness in Online Dating Recommendations Based on User Sexual Orientation b/data/2024/aaai/Leveraging Opposite Gender Interaction Ratio as a Path towards Fairness in Online Dating Recommendations Based on User Sexual Orientation new file mode 100644 index 0000000000..2572749234 --- /dev/null +++ b/data/2024/aaai/Leveraging Opposite Gender Interaction Ratio as a Path towards Fairness in Online Dating Recommendations Based on User Sexual Orientation @@ -0,0 +1 @@ +Online dating platforms have gained widespread popularity as a means for individuals to seek potential romantic relationships. 
While recommender systems have been designed to improve the user experience in dating platforms by providing personalized recommendations, increasing concerns about fairness have encouraged the development of fairness-aware recommender systems from various perspectives (e.g., gender and race). However, sexual orientation, which plays a significant role in finding a satisfying relationship, is under-investigated. To fill this crucial gap, we propose a novel metric, Opposite Gender Interaction Ratio (OGIR), as a way to investigate potential unfairness for users with varying preferences towards the opposite gender. We empirically analyze a real online dating dataset and observe existing recommender algorithms could suffer from group unfairness according to OGIR. We further investigate the potential causes for such gaps in recommendation quality, which lead to the challenges of group quantity imbalance and group calibration imbalance. Ultimately, we propose a fair recommender system based on re-weighting and re-ranking strategies to respectively mitigate these associated imbalance challenges. Experimental results demonstrate both strategies improve fairness while their combination achieves the best performance towards maintaining model utility while improving fairness. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Partial Symmetry for Multi-Agent Reinforcement Learning b/data/2024/aaai/Leveraging Partial Symmetry for Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..a8c4e425c4 --- /dev/null +++ b/data/2024/aaai/Leveraging Partial Symmetry for Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Incorporating symmetry as an inductive bias into multi-agent reinforcement learning (MARL) has led to improvements in generalization, data efficiency, and physical consistency. While prior research has succeeded in using perfect symmetry prior, the realm of partial symmetry in the multi-agent domain remains unexplored. To fill in this gap, we introduce the partially symmetric Markov game, a new subclass of the Markov game. We then theoretically show that the performance error introduced by utilizing symmetry in MARL is bounded, implying that the symmetry prior can still be useful in MARL even in partial symmetry situations. Motivated by this insight, we propose the Partial Symmetry Exploitation (PSE) framework that is able to adaptively incorporate symmetry prior in MARL under different symmetry-breaking conditions. Specifically, by adaptively adjusting the exploitation of symmetry, our framework is able to achieve superior sample efficiency and overall performance of MARL algorithms. Extensive experiments are conducted to demonstrate the superior performance of the proposed framework over baselines. Finally, we implement the proposed framework in real-world multi-robot testbed to show its superiority. 
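To make the OGIR idea from the online-dating abstract above concrete, here is a minimal sketch. The abstract does not give a formula, so we assume OGIR is the share of a user's past interactions that are with opposite-gender users, and we proxy group unfairness by the gap in mean recommendation quality across OGIR bins; the function names, binning, and quality metric are illustrative assumptions, not the paper's definitions.

```python
from collections import defaultdict

def ogir(user_gender, interacted_genders):
    """Assumed definition: share of a user's interactions that are with the opposite gender."""
    if not interacted_genders:
        return 0.0
    opposite = sum(1 for g in interacted_genders if g != user_gender)
    return opposite / len(interacted_genders)

def group_quality_gap(users, quality, n_bins=4):
    """Bin users by OGIR and report mean recommendation quality per bin.

    users:   dict user_id -> (gender, list of genders of interacted partners)
    quality: dict user_id -> recommendation quality for that user (e.g., recall@k)
    Returns per-bin mean quality and the max-min gap across bins (an unfairness proxy).
    """
    bins = defaultdict(list)
    for uid, (gender, partners) in users.items():
        r = ogir(gender, partners)
        b = min(int(r * n_bins), n_bins - 1)
        bins[b].append(quality[uid])
    means = {b: sum(v) / len(v) for b, v in bins.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap

# Toy example with three users.
users = {
    1: ("F", ["M", "M", "M"]),   # high OGIR
    2: ("M", ["M", "M", "F"]),   # low OGIR
    3: ("F", ["F", "M"]),        # mid OGIR
}
quality = {1: 0.40, 2: 0.15, 3: 0.30}
print(group_quality_gap(users, quality))
```

Re-weighting or re-ranking strategies of the kind mentioned above would then aim to shrink the reported gap while keeping overall quality high.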
\ No newline at end of file diff --git a/data/2024/aaai/Liberating Seen Classes: Boosting Few-Shot and Zero-Shot Text Classification via Anchor Generation and Classification Reframing b/data/2024/aaai/Liberating Seen Classes: Boosting Few-Shot and Zero-Shot Text Classification via Anchor Generation and Classification Reframing new file mode 100644 index 0000000000..90e8246152 --- /dev/null +++ b/data/2024/aaai/Liberating Seen Classes: Boosting Few-Shot and Zero-Shot Text Classification via Anchor Generation and Classification Reframing @@ -0,0 +1 @@ +Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance via transferring knowledge from seen classes to unseen classes, they are still limited by two issues: (1) inherent dissimilarities among classes make transferring features learned from seen classes to unseen classes both difficult and inefficient; (2) rare labeled novel samples usually cannot provide enough supervision signals to enable the model to adjust from the source distribution to the target distribution, especially for complicated scenarios. To alleviate the above issues, we propose a simple and effective strategy for few-shot and zero-shot text classification. We aim to liberate the model from the confines of seen classes, thereby enabling it to predict unseen categories without the necessity of training on seen classes. Specifically, to mine more relevant unseen-category knowledge, we utilize a large pre-trained language model to generate pseudo novel samples, and select the most representative ones as category anchors. After that, we convert the multi-class classification task into a binary classification task and use the similarities of query-anchor pairs for prediction to fully leverage the limited supervision signals. Extensive experiments on six widely used public datasets show that our proposed method can outperform other strong baselines significantly in few-shot and zero-shot tasks, even without using any seen class samples. \ No newline at end of file diff --git a/data/2024/aaai/Lifting by Image - Leveraging Image Cues for Accurate 3D Human Pose Estimation b/data/2024/aaai/Lifting by Image - Leveraging Image Cues for Accurate 3D Human Pose Estimation new file mode 100644 index 0000000000..2242b9efdc --- /dev/null +++ b/data/2024/aaai/Lifting by Image - Leveraging Image Cues for Accurate 3D Human Pose Estimation @@ -0,0 +1 @@ +The "lifting from 2D pose" method has been the dominant approach to 3D Human Pose Estimation (3DHPE) due to the powerful visual analysis ability of 2D pose estimators. As is widely known, there exists a depth ambiguity problem when estimating solely from 2D pose, where one 2D pose can be mapped to multiple 3D poses. Intuitively, the rich semantic and texture information in images can contribute to a more accurate "lifting" procedure. Yet, existing research encounters two primary challenges. Firstly, the distribution of image data in 3D motion capture datasets is too narrow because of the laboratory capture environment, which leads to poor generalization ability of methods trained with image information. Secondly, effective strategies for leveraging image information are lacking. In this paper, we give new insight into the cause of the poor generalization problem and the effectiveness of image features. Based on that, we propose an advanced framework. Specifically, the framework consists of two stages.
First, we enable the keypoints to query and select the beneficial features from all image patches. To reduce the keypoints' attention to inconsequential background features, we design a novel Pose-guided Transformer Layer, which adaptively limits the updates to unimportant image patches. Then, through a designed Adaptive Feature Selection Module, we prune less significant image patches from the feature map. In the second stage, we allow the keypoints to further emphasize the retained critical image features. This progressive learning approach prevents further training on insignificant image features. Experimental results show that our model achieves state-of-the-art performance on both the Human3.6M dataset and the MPI-INF-3DHP dataset. \ No newline at end of file diff --git a/data/2024/aaai/LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack b/data/2024/aaai/LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack new file mode 100644 index 0000000000..f344338ec0 --- /dev/null +++ b/data/2024/aaai/LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack @@ -0,0 +1 @@ +Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks adopt internal model information (gradients or confidence scores) to generate adversarial examples. However, this information is unavailable in the real world. Therefore, we focus on a more realistic and challenging setting, named hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then utilize complex heuristic algorithms to optimize the adversarial perturbation. These methods require many model queries, and the attack success rate is restricted by the adversary initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate word importance ranking, and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models and some defense methods, and results indicate that adversarial examples remain a significant threat to large language models. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training. \ No newline at end of file diff --git a/data/2024/aaai/Limitations of Face Image Generation b/data/2024/aaai/Limitations of Face Image Generation new file mode 100644 index 0000000000..df4b2457a7 --- /dev/null +++ b/data/2024/aaai/Limitations of Face Image Generation @@ -0,0 +1 @@ +Text-to-image diffusion models have achieved widespread popularity due to their unprecedented image generation capability. In particular, their ability to synthesize and modify human faces has spurred research into using generated face images in both training data augmentation and model performance assessments. In this paper, we study the efficacy and shortcomings of generative models in the context of face generation.
Utilizing a combination of qualitative and quantitative measures, including embedding-based metrics and user studies, we present a framework to audit the characteristics of generated faces conditioned on a set of social attributes. We apply our framework to faces generated through state-of-the-art text-to-image diffusion models. We identify several limitations of face image generation that include faithfulness to the text prompt, demographic disparities, and distributional shifts. Furthermore, we present an analytical model that provides insights into how training data selection contributes to the performance of generative models. Our survey data and analytics code can be found online at https://github.com/wi-pi/Limitations_of_Face_Generation \ No newline at end of file diff --git a/data/2024/aaai/Limited Query Graph Connectivity Test b/data/2024/aaai/Limited Query Graph Connectivity Test new file mode 100644 index 0000000000..18df13ddca --- /dev/null +++ b/data/2024/aaai/Limited Query Graph Connectivity Test @@ -0,0 +1,7 @@ +We propose a combinatorial optimisation model called Limited Query Graph Connectivity Test. We consider a graph whose edges have two possible states (On/Off). The edges' states are hidden initially. We can query an edge to reveal its state. Given a source s and a destination t, we aim to test s−t connectivity by identifying either a path (consisting of only On edges) or a cut (consisting of only Off edges). We are limited to B queries, after which we stop regardless of whether graph connectivity is established. We aim to design a query policy that minimizes the expected number of queries. + +Our model is mainly motivated by a cyber security use case where we need to establish whether attack paths exist in a given network, between a source (i.e., a compromised user node) and a destination (i.e., a high-privilege admin node). Edge query is resolved by manual effort from the IT admin, which is the motivation behind query minimization. + +Our model is highly related to Stochastic Boolean Function Evaluation (SBFE). There are two existing exact algorithms for SBFE that are prohibitively expensive. We propose a significantly more scalable exact algorithm. While previous exact algorithms only scale for trivial graphs (i.e., past works experimented on at most 20 edges), we empirically demonstrate that our algorithm is scalable for a wide range of much larger practical graphs (i.e., graphs representing Windows domain networks with tens of thousands of edges). + +We also propose three heuristics. Our best-performing heuristic limits the planning horizon of the exact algorithm. The other two are based on reinforcement learning (RL) and Monte Carlo tree search (MCTS). We also derive an algorithm for computing the performance lower bound. Experimentally, we show that all our heuristics are near optimal. The heuristic building on the exact algorithm outperforms all other heuristics, surpassing RL, MCTS and eight existing heuristics ported from SBFE and related literature.
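The problem statement above is easy to simulate. The sketch below implements the setting (hidden On/Off edge states, a query budget B, and certification by either an all-On path or an all-Off cut) together with a naive policy that repeatedly queries the first unknown edge on a candidate s-t path. This baseline policy is our own illustration and is not the exact algorithm or any of the heuristics proposed in the paper.

```python
from collections import deque

def bfs_path(adj, allowed, s, t):
    """Return a path from s to t using only edges whose state is in `allowed`, or None."""
    parent = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v, state in adj[u].items():
            if state in allowed and v not in parent:
                parent[v] = u
                q.append(v)
    return None

def greedy_path_policy(edges, true_state, s, t, budget):
    """Naive query policy for the limited-query s-t connectivity test.

    edges:      iterable of (u, v) pairs
    true_state: dict (u, v) -> 'on'/'off', hidden from the policy until queried
    Repeatedly pick a candidate s-t path through On/unknown edges and query its
    first unknown edge; stop once a live path or a cut is certified, or the
    budget is exhausted.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, {})[v] = 'unknown'
        adj.setdefault(v, {})[u] = 'unknown'
    adj.setdefault(s, {})
    adj.setdefault(t, {})
    queries = 0
    while True:
        if bfs_path(adj, {'on'}, s, t):
            return 'path', queries                  # certified connected via On edges
        candidate = bfs_path(adj, {'on', 'unknown'}, s, t)
        if candidate is None:
            return 'cut', queries                   # revealed Off edges already form a cut
        if queries == budget:
            return 'undecided', queries
        # Query the first unknown edge on the candidate path.
        for u, v in zip(candidate, candidate[1:]):
            if adj[u][v] == 'unknown':
                state = true_state[(u, v)] if (u, v) in true_state else true_state[(v, u)]
                adj[u][v] = adj[v][u] = state
                queries += 1
                break

# Toy instance: two parallel s-t routes, the upper one is broken.
edges = [('s', 'a'), ('a', 't'), ('s', 'b'), ('b', 't')]
truth = {('s', 'a'): 'on', ('a', 't'): 'off', ('s', 'b'): 'on', ('b', 't'): 'on'}
print(greedy_path_policy(edges, truth, 's', 't', budget=4))
```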
\ No newline at end of file diff --git a/data/2024/aaai/Limited-Supervised Multi-Label Learning with Dependency Noise b/data/2024/aaai/Limited-Supervised Multi-Label Learning with Dependency Noise new file mode 100644 index 0000000000..b264647ef9 --- /dev/null +++ b/data/2024/aaai/Limited-Supervised Multi-Label Learning with Dependency Noise @@ -0,0 +1 @@ +Limited-supervised multi-label learning (LML) leverages weak or noisy supervision for multi-label classification model training over data with label noise, which contain missing labels and/or redundant labels. Existing studies usually solve LML problems by assuming that label noise is independent of the input features and class labels, while ignoring the fact that noisy labels may depend on the input features (instance-dependent) and the classes (label-dependent) in many real-world applications. In this paper, we propose limited-supervised Multi-label Learning with Dependency Noise (MLDN) to simultaneously identify the instance-dependent and label-dependent label noise by factorizing the noise matrix as the outputs of a mapping from the feature and label representations. Meanwhile, we regularize the problem with a manifold constraint on the noise matrix to preserve local relationships and uncover the manifold structure. Theoretically, we bound the noise recovery error for the resulting problem. We solve the problem by using a first-order scheme based on the proximal operator, whose convergence rate is at least sub-linear. Extensive experiments conducted on various datasets demonstrate the superiority of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Linear-Time Algorithms for Front-Door Adjustment in Causal Graphs b/data/2024/aaai/Linear-Time Algorithms for Front-Door Adjustment in Causal Graphs new file mode 100644 index 0000000000..6cdc32a54e --- /dev/null +++ b/data/2024/aaai/Linear-Time Algorithms for Front-Door Adjustment in Causal Graphs @@ -0,0 +1 @@ +Causal effect estimation from observational data is a fundamental task in empirical sciences. It becomes particularly challenging when unobserved confounders are involved in a system. This paper focuses on front-door adjustment – a classic technique which, using observed mediators, allows identifying causal effects even in the presence of unobserved confounding. While the statistical properties of the front-door estimation are quite well understood, its algorithmic aspects remained unexplored for a long time. In 2022, Jeong, Tian, and Bareinboim presented the first polynomial-time algorithm for finding sets satisfying the front-door criterion in a given directed acyclic graph (DAG), with an O(n³(n+m)) run time, where n denotes the number of variables and m the number of edges of the causal graph. In our work, we give the first linear-time, i.e., O(n+m), algorithm for this task, which thus reaches the asymptotically optimal time complexity. This result implies an algorithm that enumerates all front-door adjustment sets with O(n(n+m)) delay, again improving previous work by a factor of n³. Moreover, we provide the first linear-time algorithm for finding a minimal front-door adjustment set. We offer implementations of our algorithms in multiple programming languages to facilitate practical usage and empirically validate their feasibility, even for large graphs.
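For context on what a front-door adjustment set is used for once it has been found, the classical front-door adjustment formula (due to Pearl), stated here for discrete variables, is the following; the notation is ours, and the algorithmic contribution above concerns finding the mediator set M, not evaluating this expression.

```latex
% Front-door adjustment: if M satisfies the front-door criterion relative to (X, Y),
% the causal effect of X on Y is identifiable from observational data as
\[
  P\big(y \mid \mathrm{do}(x)\big)
  \;=\; \sum_{m} P(m \mid x) \sum_{x'} P\big(y \mid x', m\big)\, P(x').
\]
```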
\ No newline at end of file diff --git a/data/2024/aaai/Linear-Time Verification of Data-Aware Processes Modulo Theories via Covers and Automata b/data/2024/aaai/Linear-Time Verification of Data-Aware Processes Modulo Theories via Covers and Automata new file mode 100644 index 0000000000..96ea57390a --- /dev/null +++ b/data/2024/aaai/Linear-Time Verification of Data-Aware Processes Modulo Theories via Covers and Automata @@ -0,0 +1 @@ +The need to model and analyse dynamic systems operating over complex data is ubiquitous in AI and neighboring areas, in particular business process management. Analysing such data-aware systems is a notoriously difficult problem, as they are intrinsically infinite-state. Existing approaches work for specific datatypes, and/or limit themselves to the verification of safety properties. In this paper, we lift both such limitations, studying for the first time linear-time verification for so-called data-aware processes modulo theories (DMTs), from the foundational and practical point of view. The DMT model is very general, as it supports processes operating over variables that can store arbitrary types of data, ranging over infinite domains and equipped with domain-specific predicates. Specifically, we provide four contributions. First, we devise a semi-decision procedure for linear-time verification of DMTs, which works for a very large class of datatypes obeying to mild model-theoretic assumptions. The procedure relies on a unique combination of automata-theoretic and cover computation techniques to respectively deal with linear-time properties and datatypes. Second, we identify an abstract, semantic property that guarantees the existence of a faithful finite-state abstraction of the original system, and show that our method becomes a decision procedure in this case. Third, we identify concrete, checkable classes of systems that satisfy this property, generalising several results in the literature. Finally, we present an implementation and an experimental evaluation over a benchmark of real-world data-aware business processes. \ No newline at end of file diff --git a/data/2024/aaai/Link Prediction in Multilayer Networks via Cross-Network Embedding b/data/2024/aaai/Link Prediction in Multilayer Networks via Cross-Network Embedding new file mode 100644 index 0000000000..3a2a7f2a30 --- /dev/null +++ b/data/2024/aaai/Link Prediction in Multilayer Networks via Cross-Network Embedding @@ -0,0 +1 @@ +Link prediction is a fundamental task in network analysis, with the objective of predicting missing or potential links. While existing studies have mainly concentrated on single networks, it is worth noting that numerous real-world networks exhibit interconnectedness. For example, individuals often register on various social media platforms to access diverse services, such as chatting, tweeting, blogging, and rating movies. These platforms share a subset of users and are termed multilayer networks. The interlayer links in such networks hold valuable information that provides more comprehensive insights into the network structure. To effectively exploit this complementary information and enhance link prediction in the target network, we propose a novel cross-network embedding method. This method aims to represent different networks in a shared latent space, preserving proximity within single networks as well as consistency across multilayer networks. Specifically, nodes can aggregate messages from aligned nodes in other layers. 
Extensive experiments conducted on real-world datasets demonstrate the superior performance of our proposed method for link prediction in multilayer networks. \ No newline at end of file diff --git a/data/2024/aaai/Live and Learn: Continual Action Clustering with Incremental Views b/data/2024/aaai/Live and Learn: Continual Action Clustering with Incremental Views new file mode 100644 index 0000000000..355e80493c --- /dev/null +++ b/data/2024/aaai/Live and Learn: Continual Action Clustering with Incremental Views @@ -0,0 +1 @@ +Multi-view action clustering leverages the complementary information from different camera views to enhance the clustering performance. Although existing approaches have achieved significant progress, they assume all camera views are available in advance, which is impractical when the camera view is incremental over time. Besides, learning the invariant information among multiple camera views is still a challenging issue, especially in continual learning scenario. Aiming at these problems, we propose a novel continual action clustering (CAC) method, which is capable of learning action categories in a continual learning manner. To be specific, we first devise a category memory library, which captures and stores the learned categories from historical views. Then, as a new camera view arrives, we only need to maintain a consensus partition matrix, which can be updated by leveraging the incoming new camera view rather than keeping all of them. Finally, a three-step alternate optimization is proposed, in which the category memory library and consensus partition matrix are optimized. The empirical experimental results on 6 realistic multi-view action collections demonstrate the excellent clustering performance and time/space efficiency of the CAC compared with 15 state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Local Consistency Guidance: Personalized Stylization Method of Face Video (Student Abstract) b/data/2024/aaai/Local Consistency Guidance: Personalized Stylization Method of Face Video (Student Abstract) new file mode 100644 index 0000000000..3de66a6895 --- /dev/null +++ b/data/2024/aaai/Local Consistency Guidance: Personalized Stylization Method of Face Video (Student Abstract) @@ -0,0 +1 @@ +Face video stylization aims to convert real face videos into specified reference styles. While one-shot methods perform well in single-image stylization, ensuring continuity between frames and retaining the original facial expressions present challenges in video stylization. To address these issues, our approach employs a personalized diffusion model with pixel-level control. We propose Local Consistency Guidance(LCG) strategy, composed of local-cross attention and local style transfer, to ensure temporal consistency. This framework enables the synthesis of high-quality stylized face videos with excellent temporal continuity. \ No newline at end of file diff --git a/data/2024/aaai/Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding b/data/2024/aaai/Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding new file mode 100644 index 0000000000..a932da4ee7 --- /dev/null +++ b/data/2024/aaai/Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding @@ -0,0 +1 @@ +This paper for the first time leverages multi-modal videos for weakly-supervised temporal video grounding. 
As labeling the video moment is labor-intensive and subjective, the weakly-supervised approaches have gained increasing attention in recent years. However, these approaches could inherently compromise performance due to inadequate supervision. Therefore, to tackle this challenge, we for the first time pay attention to exploiting complementary information extracted from multi-modal videos (e.g., RGB frames, optical flows), where richer supervision is naturally introduced in the weakly-supervised context. Our motivation is that by integrating different modalities of the videos, the model is learned from synergic supervision and thereby can attain superior generalization capability. However, addressing multiple modalities would also inevitably introduce additional computational overhead, and might become inapplicable if a particular modality is inaccessible. To solve this issue, we adopt a novel route: building a multi-modal distillation algorithm to capitalize on the multi-modal knowledge as supervision for model training, while still being able to work with only a single-modal input during inference. As such, we can utilize the benefits brought by the supplementary nature of multiple modalities, without compromising the applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a sophisticated teacher model to learn collaboratively from the multi-modal videos. Then we identify two sorts of knowledge from the teacher model, i.e., temporal boundaries and semantic activation maps. And we devise a local-global distillation algorithm to transfer this knowledge to a student model with single-modal input at both local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with/without multi-modal inputs. \ No newline at end of file diff --git a/data/2024/aaai/Locality Preserving Refinement for Shape Matching with Functional Maps b/data/2024/aaai/Locality Preserving Refinement for Shape Matching with Functional Maps new file mode 100644 index 0000000000..4bf8cb5266 --- /dev/null +++ b/data/2024/aaai/Locality Preserving Refinement for Shape Matching with Functional Maps @@ -0,0 +1 @@ +In this paper, we address nonrigid shape matching with outliers by a novel and effective pointwise map refinement method, termed Locality Preserving Refinement. For accurate pointwise conversion from a given functional map, our method formulates a two-step procedure. Firstly, starting with noisy point-to-point correspondences, we identify inliers by leveraging the neighborhood support, which yields a closed-form solution with linear time complexity. After obtaining reliable correspondences for the inliers, we refine the pointwise correspondences for outliers using local linear embedding, which operates in an adaptive spectral similarity space to further eliminate the ambiguities that are difficult to handle in the functional space. By refining pointwise correspondences with local consistency, thus embedding geometric constraints into functional spaces, our method achieves considerable improvement in accuracy with linearithmic time and space cost. Extensive experiments on public benchmarks demonstrate the superiority of our method over the state-of-the-art methods. Our code is publicly available at https://github.com/XiaYifan1999/LOPR.
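As background for the refinement method above, the sketch below shows the standard nearest-neighbour conversion from a functional map to a noisy point-to-point map, which is the kind of initial correspondence such refinement methods start from and clean up. The direction of the map C and the basis conventions follow one common choice and are assumptions on our part; the eigenbases in the toy usage are random stand-ins for real Laplace-Beltrami bases.

```python
import numpy as np
from scipy.spatial import cKDTree

def functional_map_to_p2p(C, evecs1, evecs2):
    """Baseline conversion of a functional map to pointwise correspondences.

    C:      (k, k) functional map taking spectral coefficients on shape 1 to shape 2
    evecs1: (n1, k) eigenbasis of shape 1
    evecs2: (n2, k) eigenbasis of shape 2
    Each vertex of shape 2 is matched to the shape-1 vertex whose transported
    spectral embedding is nearest (the usual noisy map that refinement improves).
    """
    transported = evecs1 @ C.T      # (n1, k): shape-1 embeddings mapped into shape-2's basis
    tree = cKDTree(transported)
    _, p2p = tree.query(evecs2)     # for each shape-2 vertex, nearest shape-1 vertex
    return p2p                      # (n2,) indices into shape 1

# Toy usage with random stand-ins for real eigenbases.
rng = np.random.default_rng(0)
k, n1, n2 = 20, 500, 480
C = np.eye(k) + 0.01 * rng.normal(size=(k, k))
phi1 = rng.normal(size=(n1, k))
phi2 = rng.normal(size=(n2, k))
print(functional_map_to_p2p(C, phi1, phi2)[:10])
```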
\ No newline at end of file diff --git a/data/2024/aaai/Locally Rainbow Paths b/data/2024/aaai/Locally Rainbow Paths new file mode 100644 index 0000000000..5f62b23915 --- /dev/null +++ b/data/2024/aaai/Locally Rainbow Paths @@ -0,0 +1 @@ +We introduce the algorithmic problem of finding a locally rainbow path of length l connecting two distinguished vertices s and t in a vertex-colored directed graph. Herein, a path is locally rainbow if between any two visits of equally colored vertices, the path traverses consecutively at least r differently colored vertices. This problem generalizes the well-known problem of finding a rainbow path. It finds natural applications whenever there are different types of resources that must be protected from overuse, such as crop sequence optimization or production process scheduling. We show that the problem is computationally intractable even if r=2 or if one looks for a locally rainbow path among the shortest paths. On the positive side, if one looks for a path that takes only a short detour (i.e., it is slightly longer than the shortest path) and if r is small, the problem can be solved efficiently. Indeed, the running time of the respective algorithm is near-optimal unless the ETH fails. \ No newline at end of file diff --git a/data/2024/aaai/LogoStyleFool: Vitiating Video Recognition Systems via Logo Style Transfer b/data/2024/aaai/LogoStyleFool: Vitiating Video Recognition Systems via Logo Style Transfer new file mode 100644 index 0000000000..07cef35645 --- /dev/null +++ b/data/2024/aaai/LogoStyleFool: Vitiating Video Recognition Systems via Logo Style Transfer @@ -0,0 +1 @@ +Video recognition systems are vulnerable to adversarial examples. Recent studies show that style transfer-based and patch-based unrestricted perturbations can effectively improve attack efficiency. These attacks, however, face two main challenges: 1) Adding large stylized perturbations to all pixels reduces the naturalness of the video and such perturbations can be easily detected. 2) Patch-based video attacks are not extensible to targeted attacks due to the limited search space of reinforcement learning that has been widely used in video attacks recently. In this paper, we focus on the video black-box setting and propose a novel attack framework named LogoStyleFool by adding a stylized logo to the clean video. We separate the attack into three stages: style reference selection, reinforcement-learning-based logo style transfer, and perturbation optimization. We solve the first challenge by scaling down the perturbation range to a regional logo, while the second challenge is addressed by complementing an optimization stage after reinforcement learning. Experimental results substantiate the overall superiority of LogoStyleFool over three state-of-the-art patch-based attacks in terms of attack performance and semantic preservation. Meanwhile, LogoStyleFool still maintains its performance against two existing patch-based defense methods. We believe that our research is beneficial in increasing the attention of the security community to such subregional style transfer attacks.
\ No newline at end of file diff --git a/data/2024/aaai/Long-Tailed Learning as Multi-Objective Optimization b/data/2024/aaai/Long-Tailed Learning as Multi-Objective Optimization new file mode 100644 index 0000000000..d2b0a250d6 --- /dev/null +++ b/data/2024/aaai/Long-Tailed Learning as Multi-Objective Optimization @@ -0,0 +1 @@ +Real-world data is extremely imbalanced and presents a long-tailed distribution, resulting in models biased towards classes with sufficient samples and performing poorly on rare classes. Recent methods propose to rebalance classes, but they face the seesaw dilemma (increasing performance on tail classes may decrease that of head classes, and vice versa). In this paper, we argue that the seesaw dilemma is derived from the gradient imbalance of different classes, in which the gradients of inappropriate classes are treated as important for updating, thus prone to overcompensation or undercompensation on tail classes. To achieve ideal compensation, we formulate long-tailed recognition as a multi-objective optimization problem, which fairly respects the contributions of head and tail classes simultaneously. For efficiency, we propose a Gradient-Balancing Grouping (GBG) strategy to gather the classes with similar gradient directions, thus approximately making every update follow a Pareto descent direction. Our GBG method drives classes with similar gradient directions to form a more representative gradient and provides ideal compensation to the tail classes. Moreover, we conduct extensive experiments on commonly used benchmarks in long-tailed learning and demonstrate the superiority of our method over existing SOTA methods. Our code is released at https://github.com/WickyLee1998/GBG_v1. \ No newline at end of file diff --git a/data/2024/aaai/Long-Tailed Partial Label Learning by Head Classifier and Tail Classifier Cooperation b/data/2024/aaai/Long-Tailed Partial Label Learning by Head Classifier and Tail Classifier Cooperation new file mode 100644 index 0000000000..d755ea0516 --- /dev/null +++ b/data/2024/aaai/Long-Tailed Partial Label Learning by Head Classifier and Tail Classifier Cooperation @@ -0,0 +1 @@ +In partial label learning (PLL), each instance is associated with a set of candidate labels, among which only one is correct. Traditional PLL methods almost all implicitly assume that the distribution of the classes is balanced. However, in real-world applications, the distribution of the classes is imbalanced or long-tailed, leading to the long-tailed partial label learning problem. Previous methods solve this problem mainly by improving the ability to learn the tail classes, which sacrifices the performance of the head classes, while keeping the performance of the head classes may in turn degrade that of the tail classes. Therefore, in this paper, we construct two classifiers, i.e., a head classifier for keeping the performance of dominant classes and a tail classifier for improving the performance of the tail classes. Then, we propose a classifier weight estimation module to automatically estimate the shot belongingness (head class or tail class) of the samples and allocate the weights for the head classifier and tail classifier when making predictions. This cooperation improves the prediction ability for both the head classes and the tail classes. Experiments on the benchmarks demonstrate that the proposed approach improves accuracy over the SOTA methods by a substantial margin.
Code and data are available at: https://github.com/pruirui/HTC-LTPLL. \ No newline at end of file diff --git a/data/2024/aaai/Long-Term Fair Decision Making through Deep Generative Models b/data/2024/aaai/Long-Term Fair Decision Making through Deep Generative Models new file mode 100644 index 0000000000..d2378f2983 --- /dev/null +++ b/data/2024/aaai/Long-Term Fair Decision Making through Deep Generative Models @@ -0,0 +1 @@ +This paper studies long-term fair machine learning which aims to mitigate group disparity over the long term in sequential decision-making systems. To define long-term fairness, we leverage the temporal causal graph and use the 1-Wasserstein distance between the interventional distributions of different demographic groups at a sufficiently large time step as the quantitative metric. Then, we propose a three-phase learning framework where the decision model is trained on high-fidelity data generated by a deep generative model. We formulate the optimization problem as a performative risk minimization and adopt the repeated gradient descent algorithm for learning. The empirical evaluation shows the efficacy of the proposed method using both synthetic and semi-synthetic datasets. \ No newline at end of file diff --git a/data/2024/aaai/Long-Term Safe Reinforcement Learning with Binary Feedback b/data/2024/aaai/Long-Term Safe Reinforcement Learning with Binary Feedback new file mode 100644 index 0000000000..1d27e95489 --- /dev/null +++ b/data/2024/aaai/Long-Term Safe Reinforcement Learning with Binary Feedback @@ -0,0 +1 @@ +Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although there has been a surge of safe RL algorithms proposed in recent years, most existing work typically 1) relies on receiving numeric safety feedback; 2) does not guarantee safety during the learning process; 3) limits the problem to a priori known, deterministic transition dynamics; and/or 4) assume the existence of a known safe policy for any states. Addressing the issues mentioned above, we thus propose Long-term Binary-feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize rewards while guaranteeing long-term safety that an agent executes only safe state-action pairs throughout each episode with high probability. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and conservatively takes only a safe action at every time step while inferring its effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL guarantees the long-term safety constraint, with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward. \ No newline at end of file diff --git a/data/2024/aaai/Lost Domain Generalization Is a Natural Consequence of Lack of Training Domains b/data/2024/aaai/Lost Domain Generalization Is a Natural Consequence of Lack of Training Domains new file mode 100644 index 0000000000..28a4a53580 --- /dev/null +++ b/data/2024/aaai/Lost Domain Generalization Is a Natural Consequence of Lack of Training Domains @@ -0,0 +1 @@ +We show a hardness result for the number of training domains required to achieve a small population error in the test domain. 
Although many domain generalization algorithms have been developed under various domain-invariance assumptions, there is significant evidence to indicate that out-of-distribution (o.o.d.) test accuracy of state-of-the-art o.o.d. algorithms is on par with empirical risk minimization and random guess on the domain generalization benchmarks such as DomainBed. In this work, we analyze its cause and attribute the lost domain generalization to the lack of training domains. We show that, in a minimax lower bound fashion, any learning algorithm that outputs a classifier with an ε excess error to the Bayes optimal classifier requires at least poly(1/ε) number of training domains, even though the number of training data sampled from each training domain is large. Experiments on the DomainBed benchmark demonstrate that o.o.d. test accuracy is monotonically increasing as the number of training domains increases. Our result sheds light on the intrinsic hardness of domain generalization and suggests benchmarking o.o.d. algorithms by the datasets with a sufficient number of training domains. \ No newline at end of file diff --git a/data/2024/aaai/Low Category Uncertainty and High Training Potential Instance Learning for Unsupervised Domain Adaptation b/data/2024/aaai/Low Category Uncertainty and High Training Potential Instance Learning for Unsupervised Domain Adaptation new file mode 100644 index 0000000000..5677b84cca --- /dev/null +++ b/data/2024/aaai/Low Category Uncertainty and High Training Potential Instance Learning for Unsupervised Domain Adaptation @@ -0,0 +1 @@ +Recently, instance contrastive learning achieves good results in unsupervised domain adaptation. It reduces the distances between positive samples and the anchor, increases the distances between negative samples and the anchor, and learns discriminative feature representations for target samples. However, most recent methods for identifying positive and negative samples are based on whether the pseudo-labels of samples and the pseudo-label of the anchor correspond to the same class. Due to the lack of target labels, many uncertain data are mistakenly labeled during the training process, and many low training potential data are also utilized. To address these problems, we propose Low Category Uncertainty and High Training Potential Instance Learning for Unsupervised Domain Adaptation (LUHP). We first propose a weight to measure the category uncertainty of the target sample. We can effectively filter the samples near the decision boundary through category uncertainty thresholds which are calculated by weights. Then we propose a new loss to focus on samples with high training potential. Finally, for anchors with low category uncertainty, we propose a sample reuse strategy to make the model more robust. We demonstrate the effectiveness of LUHP by showing the results of four datasets widely used in unsupervised domain adaptation. \ No newline at end of file diff --git a/data/2024/aaai/Low-Distortion Clustering with Ordinal and Limited Cardinal Information b/data/2024/aaai/Low-Distortion Clustering with Ordinal and Limited Cardinal Information new file mode 100644 index 0000000000..559ecdd291 --- /dev/null +++ b/data/2024/aaai/Low-Distortion Clustering with Ordinal and Limited Cardinal Information @@ -0,0 +1,3 @@ +Motivated by recent work in computational social choice, we extend the metric distortion framework to clustering problems. 
Given a set of n agents located in an underlying metric space, our goal is to partition them into k clusters, optimizing some social cost objective. The metric space is defined by a distance function d between the agent locations. Information about d is available only implicitly via n rankings, through which each agent ranks all other agents in terms of their distance from her. Still, even though no cardinal information (i.e., the exact distance values) is available, we would like to evaluate clustering algorithms in terms of social cost objectives that are defined using d. This is done using the notion of distortion, which measures how far from optimality a clustering can be, taking into account all underlying metrics that are consistent with the ordinal information available. + +Unfortunately, the most important clustering objectives (e.g., those used in the well-known k-median and k-center problems) do not admit algorithms with finite distortion. To sidestep this disappointing fact, we follow two alternative approaches: We first explore whether resource augmentation can be beneficial. We consider algorithms that use more than k clusters but compare their social cost to that of the optimal k-clusterings. We show that using exponentially (in terms of k) many clusters, we can get low (constant or logarithmic) distortion for the k-center and k-median objectives. Interestingly, such an exponential blowup is shown to be necessary. More importantly, we explore whether limited cardinal information can be used to obtain better results. Somewhat surprisingly, for k-median and k-center, we show that a number of queries that is polynomial in k and only logarithmic in n (i.e., only sublinear in the number of agents for the most relevant scenarios in practice) is enough to get constant distortion. \ No newline at end of file diff --git a/data/2024/aaai/Low-Latency Space-Time Supersampling for Real-Time Rendering b/data/2024/aaai/Low-Latency Space-Time Supersampling for Real-Time Rendering new file mode 100644 index 0000000000..7ded326559 --- /dev/null +++ b/data/2024/aaai/Low-Latency Space-Time Supersampling for Real-Time Rendering @@ -0,0 +1 @@ +With the rise of real-time rendering and the evolution of display devices, there is a growing demand for post-processing methods that offer high-resolution content at a high frame rate. Existing techniques often suffer from quality and latency issues due to the disjointed treatment of frame supersampling and extrapolation. In this paper, we recognize the shared context and mechanisms between frame supersampling and extrapolation, and present a novel framework, Space-time Supersampling (STSS). By integrating them into a unified framework, STSS can improve the overall quality with lower latency. To implement an efficient architecture, we treat aliasing and warping holes in a unified way as reshading regions and put forth two key components to compensate for these regions, namely Random Reshading Masking (RRM) and Efficient Reshading Module (ERM). Extensive experiments demonstrate that our approach achieves superior visual fidelity compared to state-of-the-art (SOTA) methods. Notably, the performance is achieved within only 4ms, saving up to 75% of time against the conventional two-stage pipeline that necessitates 17ms.
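For the metric-distortion clustering abstract above, the notion of distortion it relies on can be stated explicitly. The following is the standard worst-case formulation written in our own notation, since the abstract only describes it informally.

```latex
% Metric distortion, adapted to clustering: an algorithm A sees only the ordinal
% profile sigma induced by an unknown metric d, and its distortion is the
% worst-case ratio between the social cost of its output clustering and the
% optimal k-clustering cost, over all metrics consistent with sigma.
\[
  \mathrm{dist}(A) \;=\; \sup_{\sigma}\;\;\sup_{d \text{ consistent with } \sigma}\;
  \frac{\mathrm{cost}_{d}\big(A(\sigma)\big)}{\min_{\mathcal{C}} \mathrm{cost}_{d}(\mathcal{C})},
\]
% where the minimum ranges over all partitions C of the agents into k clusters.
```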
\ No newline at end of file diff --git a/data/2024/aaai/Low-Light Face Super-resolution via Illumination, Structure, and Texture Associated Representation b/data/2024/aaai/Low-Light Face Super-resolution via Illumination, Structure, and Texture Associated Representation new file mode 100644 index 0000000000..54609cf62a --- /dev/null +++ b/data/2024/aaai/Low-Light Face Super-resolution via Illumination, Structure, and Texture Associated Representation @@ -0,0 +1 @@ +Capturing human faces at night or in dimly lit environments has become common practice, accompanied by complex low-light and low-resolution degradations. However, the existing face super-resolution (FSR) technologies and derived cascaded schemes are inadequate to recover credible textures. In this paper, we propose a novel approach that decomposes the restoration task into face structural fidelity maintenance and texture consistency learning. The former aims to enhance the quality of face images while improving the structural fidelity, while the latter focuses on eliminating perturbations and artifacts caused by low-light degradation and reconstruction. Based on this, we develop a novel low-light low-resolution face super-resolution framework. Our method consists of two steps: an illumination correction face super-resolution network (IC-FSRNet) for relighting the face and recovering the structural information, and a detail enhancement model (DENet) for improving facial details, thus making them more visually appealing and easier to analyze. As the relighted regions could provide complementary information to boost face super-resolution and vice versa, we introduce mutual learning to harness the informative components from relighted regions and reconstruction, and achieve iterative refinement. In addition, DENet, equipped with a diffusion probabilistic model, is built to further improve face image visual quality. Experiments demonstrate that the proposed joint optimization framework achieves significant improvements in reconstruction quality and perceptual quality over existing two-stage sequential solutions. Code is available at https://github.com/wcy-cs/IC-FSRDENet. \ No newline at end of file diff --git a/data/2024/aaai/Low-Rank Kernel Tensor Learning for Incomplete Multi-View Clustering b/data/2024/aaai/Low-Rank Kernel Tensor Learning for Incomplete Multi-View Clustering new file mode 100644 index 0000000000..aed24d6eb6 --- /dev/null +++ b/data/2024/aaai/Low-Rank Kernel Tensor Learning for Incomplete Multi-View Clustering @@ -0,0 +1 @@ +Incomplete Multiple Kernel Clustering algorithms aim to learn a common latent representation from pre-constructed incomplete multiple kernels derived from the original data, followed by k-means for clustering. They have attracted intensive attention due to their high computational efficiency. However, our observation reveals that the imputation of these approaches for each kernel ignores the influence of other incomplete kernels. In light of this, we present a novel method called Low-Rank Kernel Tensor Learning for Incomplete Multiple Views Clustering (LRKT-IMVC) to address the above issue. Specifically, LRKT-IMVC first introduces the concept of kernel tensor to explore the inter-view correlations, and then the low-rank kernel tensor constraint is used to further capture the consistency information to impute missing kernel elements, thereby improving the quality of clustering.
Moreover, we carefully design an alternating optimization method with promising convergence to solve the resulting optimization problem. The proposed method is compared with recent advances in experiments with different missing ratios on seven well-known datasets, demonstrating its effectiveness and the advantages of the proposed interpolation method. \ No newline at end of file diff --git a/data/2024/aaai/Lyapunov-Stable Deep Equilibrium Models b/data/2024/aaai/Lyapunov-Stable Deep Equilibrium Models new file mode 100644 index 0000000000..bbbf6a725e --- /dev/null +++ b/data/2024/aaai/Lyapunov-Stable Deep Equilibrium Models @@ -0,0 +1 @@ +Deep equilibrium (DEQ) models have emerged as a promising class of implicit layer models, which abandon traditional depth by solving for the fixed points of a single nonlinear layer. Despite their success, the stability of the fixed points for these models remains poorly understood. By considering DEQ models as nonlinear dynamic systems, we propose a robust DEQ model named LyaDEQ with guaranteed provable stability via Lyapunov theory. The crux of our method is ensuring the Lyapunov stability of the DEQ model's fixed points, which enables the proposed model to resist minor initial perturbations. To avoid poor adversarial defense due to Lyapunov-stable fixed points being located near each other, we orthogonalize the layers after the Lyapunov stability module to separate different fixed points. We evaluate LyaDEQ models under well-known adversarial attacks, and experimental results demonstrate significant improvement in robustness. Furthermore, we show that the LyaDEQ model can be combined with other defense methods, such as adversarial training, to achieve even better adversarial robustness. \ No newline at end of file diff --git a/data/2024/aaai/M-BEV: Masked BEV Perception for Robust Autonomous Driving b/data/2024/aaai/M-BEV: Masked BEV Perception for Robust Autonomous Driving new file mode 100644 index 0000000000..ec809b4a13 --- /dev/null +++ b/data/2024/aaai/M-BEV: Masked BEV Perception for Robust Autonomous Driving @@ -0,0 +1 @@ +3D perception is a critical problem in autonomous driving. Recently, the Bird’s-Eye-View (BEV) approach has attracted extensive attention, due to low-cost deployment and desirable vision detection capacity. However, the existing models ignore a realistic scenario during the driving procedure, i.e., one or more view cameras may fail, which largely deteriorates their performance. To tackle this problem, we propose a generic Masked BEV (M-BEV) perception framework, which can effectively improve robustness to this challenging scenario, by random masking and reconstructing camera views in the end-to-end training. More specifically, we develop a novel Masked View Reconstruction (MVR) module in our M-BEV. It mimics various missing cases by randomly masking features of different camera views, then leverages the original features of these views as self-supervision and reconstructs the masked ones with the distinct spatio-temporal context across camera views. Via such a plug-and-play MVR, our M-BEV is capable of learning the missing views from the remaining ones, and thus well generalized for robust view recovery and accurate perception during testing.
We perform extensive experiments on the popular NuScenes benchmark, where our framework can significantly boost the 3D perception performance of state-of-the-art models in various missing-view cases, e.g., in the absence of the back view, our M-BEV improves the PETRv2 model by 10.3% mAP. \ No newline at end of file diff --git a/data/2024/aaai/M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis b/data/2024/aaai/M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis new file mode 100644 index 0000000000..2a983feaec --- /dev/null +++ b/data/2024/aaai/M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis @@ -0,0 +1 @@ +Document layout analysis is a crucial step for intelligent document understanding. However, many existing methods primarily focus on the visual aspects and overlook the textual features of documents. Although document pre-trained models utilize multi-modal features during the pre-training phase, they tend to operate as a unimodal pipeline when it comes to layout analysis tasks. Furthermore, current multi-modal methods perform worse than unimodal detectors on complex layout analysis datasets. To address these limitations, we propose an effective and pluggable multi-modal fusion approach named M2Doc, which fuses visual and textual features for better layout detection. M2Doc contains two pluggable multi-modal fusion modules, early-fusion and late-fusion, which align and fuse visual and textual features at the pixel level and block level. Benefitting from the concision and effectiveness of M2Doc, it can be easily applied to various detectors for better layout detection, including two-stage and end-to-end object detectors. Our experimental results demonstrate significant performance improvements in detectors equipped with M2Doc on datasets such as DocLayNet (+11.3 mAP) and M6Doc (+1.9 mAP). Furthermore, through the integration of the DINO detector with M2Doc, we achieve state-of-the-art results on DocLayNet (89.0 mAP), M6Doc (69.9 mAP), and PubLayNet (95.5 mAP). The code will be publicly released at https://github.com/johnning2333/M2Doc. \ No newline at end of file diff --git a/data/2024/aaai/M2SD: Multiple Mixing Self-Distillation for Few-Shot Class-Incremental Learning b/data/2024/aaai/M2SD: Multiple Mixing Self-Distillation for Few-Shot Class-Incremental Learning new file mode 100644 index 0000000000..a7a3e6beba --- /dev/null +++ b/data/2024/aaai/M2SD: Multiple Mixing Self-Distillation for Few-Shot Class-Incremental Learning @@ -0,0 +1 @@ +Few-shot Class-incremental learning (FSCIL) is a challenging task in machine learning that aims to recognize new classes from a limited number of instances while preserving the ability to classify previously learned classes without retraining the entire model. This presents challenges in updating the model with new classes using limited training data, particularly in balancing the acquisition of new knowledge with the retention of the old. We propose a novel method named Multiple Mixing Self-Distillation (M2SD) during the training phase to address these issues. Specifically, we propose a dual-branch structure that facilitates the expansion of the entire feature space to accommodate new classes. Furthermore, we introduce a feature enhancement component that can pass additional enhanced information back to the base network by self-distillation, resulting in improved classification performance upon adding new classes.
After training, we discard both structures, leaving only the primary network to classify new class instances. Extensive experiments demonstrate that our approach achieves superior performance over previous state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy b/data/2024/aaai/M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy new file mode 100644 index 0000000000..c577520660 --- /dev/null +++ b/data/2024/aaai/M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy @@ -0,0 +1 @@ +Training state-of-the-art (SOTA) deep models often requires extensive data, resulting in substantial training and storage costs. To address these challenges, dataset condensation has been developed to learn a small synthetic set that preserves essential information from the original large-scale dataset. Nowadays, optimization-oriented methods have been the primary approach in the field of dataset condensation for achieving SOTA results. However, the bi-level optimization process hinders the practical application of such methods to realistic and larger datasets. To enhance condensation efficiency, previous works proposed Distribution-Matching (DM) as an alternative, which significantly reduces the condensation cost. Nonetheless, current DM-based methods still yield results that lag behind those of SOTA optimization-oriented methods. In this paper, we argue that existing DM-based methods overlook the higher-order alignment of the distributions, which may lead to sub-optimal matching results. Inspired by this, we present a novel DM-based method named M3D for dataset condensation by Minimizing the Maximum Mean Discrepancy between feature representations of the synthetic and real images. By embedding their distributions in a reproducing kernel Hilbert space, we align all orders of moments of the distributions of real and synthetic images, resulting in a more generalized condensed set. Notably, our method even surpasses the SOTA optimization-oriented method IDC on the high-resolution ImageNet dataset. Extensive analysis is conducted to verify the effectiveness of the proposed method. Source codes are available at https://github.com/Hansong-Zhang/M3D. \ No newline at end of file diff --git a/data/2024/aaai/M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking b/data/2024/aaai/M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking new file mode 100644 index 0000000000..b1fab635bf --- /dev/null +++ b/data/2024/aaai/M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking @@ -0,0 +1 @@ +3D Single Object Tracking (SOT) stands as a forefront task of computer vision, proving essential for applications like autonomous driving. Sparse and occluded data in scene point clouds introduce variations in the appearance of tracked objects, adding complexity to the task. In this research, we unveil M3SOT, a novel 3D SOT framework, which synergizes multiple input frames (template sets), multiple receptive fields (continuous contexts), and multiple solution spaces (distinct tasks) in ONE model. Remarkably, M3SOT pioneers in modeling temporality, contexts, and tasks directly from point clouds, revisiting a perspective on the key factors influencing SOT. To this end, we design a transformer-based network centered on point cloud targets in the search area, aggregating diverse contextual representations and propagating target cues by employing historical frames.
As M3SOT spans varied processing perspectives, we've streamlined the network—trimming its depth and optimizing its structure—to ensure a lightweight and efficient deployment for SOT applications. We posit that, backed by practical construction, M3SOT sidesteps the need for complex frameworks and auxiliary components to deliver sterling results. Extensive experiments on benchmarks such as KITTI, nuScenes, and Waymo Open Dataset demonstrate that M3SOT achieves state-of-the-art performance at 38 FPS. Our code and models are available at https://github.com/ywu0912/TeamCode.git. \ No newline at end of file diff --git a/data/2024/aaai/MA-Net: Rethinking Neural Unit in the Light of Astrocytes b/data/2024/aaai/MA-Net: Rethinking Neural Unit in the Light of Astrocytes new file mode 100644 index 0000000000..382e13aa17 --- /dev/null +++ b/data/2024/aaai/MA-Net: Rethinking Neural Unit in the Light of Astrocytes @@ -0,0 +1 @@ +Networks based on the artificial neuron (N-N) model have accomplished extraordinary success on various vision tasks. However, as a simplification of the mammalian neuron model, their structure is locked during training, resulting in overfitting and over-parameterization. The astrocyte, newly explored by biologists, can adaptively modulate neuronal communication by inserting itself between neurons. The communication between the astrocyte and the neuron is bidirectional and shows the potential to alleviate issues raised by unidirectional communication in the N-N model. In this paper, we first elaborate on the artificial Multi-Astrocyte-Neuron (MA-N) model, which enriches the functionality of the artificial neuron model. Our MA-N model is formulated at both the astrocyte and neuron levels, mimicking the bidirectional communication with temporal and joint mechanisms. Then, we construct the MA-Net network with the MA-N model, whose neural connections can be continuously and adaptively modulated during training. Experiments show that our MA-Net sets a new state of the art on multiple tasks while significantly reducing its parameters by connection optimization. \ No newline at end of file diff --git a/data/2024/aaai/MANDREL: Modular Reinforcement Learning Pipelines for Material Discovery b/data/2024/aaai/MANDREL: Modular Reinforcement Learning Pipelines for Material Discovery new file mode 100644 index 0000000000..bbec9f6b61 --- /dev/null +++ b/data/2024/aaai/MANDREL: Modular Reinforcement Learning Pipelines for Material Discovery @@ -0,0 +1 @@ +AI-driven materials discovery is evolving rapidly with new approaches and pipelines for experimentation and design. However, the pipelines are often designed in isolation. We introduce a modular reinforcement learning framework for inter-operable experimentation and design of tailored, novel molecular species. The framework unifies reinforcement learning (RL) pipelines and allows the mixing and matching of choices for the underlying chemical action space, molecular representation, desired molecular properties, and RL algorithm. Our demo showcases the framework's capabilities applied to benchmark problems like quantitative estimate of drug-likeness and PLogP, as well as the design of novel small molecule solvents for carbon capture. 
\ No newline at end of file diff --git "a/data/2024/aaai/MAPTree: Beating \"Optimal\" Decision Trees with Bayesian Decision Trees" "b/data/2024/aaai/MAPTree: Beating \"Optimal\" Decision Trees with Bayesian Decision Trees" new file mode 100644 index 0000000000..1b5b1ad7ca --- /dev/null +++ "b/data/2024/aaai/MAPTree: Beating \"Optimal\" Decision Trees with Bayesian Decision Trees" @@ -0,0 +1 @@ +Decision trees remain one of the most popular machine learning models today, largely due to their out-of-the-box performance and interpretability. In this work, we present a Bayesian approach to decision tree induction via maximum a posteriori inference of a posterior distribution over trees. We first demonstrate a connection between maximum a posteriori inference of decision trees and AND/OR search. Using this connection, we propose an AND/OR search algorithm, dubbed MAPTree, which is able to recover the maximum a posteriori tree. Lastly, we demonstrate the empirical performance of the maximum a posteriori tree both on synthetic data and in real world settings. On 16 real world datasets, MAPTree either outperforms baselines or demonstrates comparable performance but with much smaller trees. On a synthetic dataset, MAPTree also demonstrates greater robustness to noise and better generalization than existing approaches. Finally, MAPTree recovers the maxiumum a posteriori tree faster than existing sampling approaches and, in contrast with those algorithms, is able to provide a certificate of optimality. The code for our experiments is available at https://github.com/ThrunGroup/maptree. \ No newline at end of file diff --git a/data/2024/aaai/MCA: Moment Channel Attention Networks b/data/2024/aaai/MCA: Moment Channel Attention Networks new file mode 100644 index 0000000000..e1c8af74f7 --- /dev/null +++ b/data/2024/aaai/MCA: Moment Channel Attention Networks @@ -0,0 +1 @@ +Channel attention mechanisms endeavor to recalibrate channel weights to enhance representation abilities of networks. However, mainstream methods often rely solely on global average pooling as the feature squeezer, which significantly limits the overall potential of models. In this paper, we investigate the statistical moments of feature maps within a neural network. Our findings highlight the critical role of high-order moments in enhancing model capacity. Consequently, we introduce a flexible and comprehensive mechanism termed Extensive Moment Aggregation (EMA) to capture the global spatial context. Building upon this mechanism, we propose the Moment Channel Attention (MCA) framework, which efficiently incorporates multiple levels of moment-based information while minimizing additional computation costs through our Cross Moment Convolution (CMC) module. The CMC module via channel-wise convolution layer to capture multiple order moment information as well as cross channel features. The MCA block is designed to be lightweight and easily integrated into a variety of neural network architectures. Experimental results on classical image classification, object detection, and instance segmentation tasks demonstrate that our proposed method achieves state-of-the-art results, outperforming existing channel attention methods. 
\ No newline at end of file diff --git a/data/2024/aaai/MCL-NER: Cross-Lingual Named Entity Recognition via Multi-View Contrastive Learning b/data/2024/aaai/MCL-NER: Cross-Lingual Named Entity Recognition via Multi-View Contrastive Learning new file mode 100644 index 0000000000..c7124d23e2 --- /dev/null +++ b/data/2024/aaai/MCL-NER: Cross-Lingual Named Entity Recognition via Multi-View Contrastive Learning @@ -0,0 +1,11 @@ +Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora, especially for non-English +data. While prior efforts mainly focus on data-driven transfer methods, a significant aspect that has not been fully explored is aligning both semantic and token-level representations across diverse languages. In this paper, we propose Multi-view Contrastive Learning for Cross-lingual Named +Entity Recognition (MCL-NER). Specifically, we reframe the CrossNER task into a problem of recognizing relationships between pairs of tokens. This approach taps into the +inherent contextual nuances of token-to-token connections within entities, allowing us to align representations across +different languages. A multi-view contrastive learning framework is introduced to encompass semantic contrasts between +source, codeswitched, and target sentences, as well as contrasts among token-to-token relations. By enforcing agreement within both semantic and relational spaces, we minimize the gap between source sentences and their counterparts of both codeswitched and target sentences. This alignment +extends to the relationships between diverse tokens, enhancing the projection of entities across languages. We further +augment CrossNER by combining self-training with labeled source data and unlabeled target data. Our experiments on +the XTREME benchmark, spanning 40 languages, demonstrate the superiority of MCL-NER over prior data-driven +and model-based approaches. It achieves a substantial increase of nearly +2.0 F1 scores across a broad spectrum and +establishes itself as the new state-of-the-art performer. \ No newline at end of file diff --git a/data/2024/aaai/MCSSME: Multi-Task Contrastive Learning for Semi-supervised Singing Melody Extraction from Polyphonic Music b/data/2024/aaai/MCSSME: Multi-Task Contrastive Learning for Semi-supervised Singing Melody Extraction from Polyphonic Music new file mode 100644 index 0000000000..78c9145694 --- /dev/null +++ b/data/2024/aaai/MCSSME: Multi-Task Contrastive Learning for Semi-supervised Singing Melody Extraction from Polyphonic Music @@ -0,0 +1,2 @@ +Singing melody extraction is an important task in the field of music information retrieval (MIR). The development of data-driven models for this task has achieved great success. However, the existing models have two major limitations: firstly, most of the existing singing melody extraction models have formulated this task as a pixel-level prediction task, and the lack of labeled data has limited further improvements. Secondly, the generalization of existing models is prone to be disturbed by music genre. To address the issues mentioned above, in this paper, we propose a multi-task contrastive learning framework for semi-supervised singing melody extraction, termed MCSSME. +Specifically, to deal with the data scarcity limitation, we propose a self-consistency regularization (SCR) method to train the model on the unlabeled data. 
Transformations are applied to the raw signal of polyphonic music, which drives the network to improve its representation capability by recognizing the transformations. We further propose a novel multi-task learning (MTL) approach to jointly learn singing melody extraction and classification of transformed data. To deal with the generalization limitation, we also propose contrastive embedding learning, which strengthens the intra-class compactness and inter-class separability. To improve the generalization on different music genres, we also propose a domain classification method to learn task-dependent features by mapping data from different music genres to a shared subspace. MCSSME is evaluated on a set of well-known public melody extraction datasets and achieves promising performance. The experimental results demonstrate the effectiveness of the MCSSME framework for singing melody extraction from polyphonic music in scenarios with very limited labeled data. \ No newline at end of file diff --git a/data/2024/aaai/MDFL: Multi-Domain Diffusion-Driven Feature Learning b/data/2024/aaai/MDFL: Multi-Domain Diffusion-Driven Feature Learning new file mode 100644 index 0000000000..5551b05961 --- /dev/null +++ b/data/2024/aaai/MDFL: Multi-Domain Diffusion-Driven Feature Learning @@ -0,0 +1 @@ +High-dimensional images, known for their rich semantic information, are widely applied in remote sensing and other fields. The spatial information in these images reflects the object's texture features, while the spectral information reveals the potential spectral representations across different bands. Currently, the understanding of high-dimensional images remains limited to a single-domain perspective with performance degradation. Motivated by the masking texture effect observed in the human visual system, we present a multi-domain diffusion-driven feature learning network (MDFL), a scheme to redefine the effective information domain that the model really focuses on. This method employs diffusion-based posterior sampling to explicitly consider joint information interactions between the high-dimensional manifold structures in the spectral, spatial, and frequency domains, thereby eliminating the influence of masking texture effects in visual models. Additionally, we introduce a feature reuse mechanism to gather deep and raw features of high-dimensional data. We demonstrate that MDFL significantly improves the feature extraction performance of high-dimensional data, thereby providing a powerful aid for revealing the intrinsic patterns and structures of such data. The experimental results on three multi-modal remote sensing datasets show that MDFL reaches an average overall accuracy of 98.25%, outperforming various state-of-the-art baseline schemes. Code available at https://github.com/LDXDU/MDFL-AAAI-24. 
\ No newline at end of file diff --git a/data/2024/aaai/MDGNN: Multi-Relational Dynamic Graph Neural Network for Comprehensive and Dynamic Stock Investment Prediction b/data/2024/aaai/MDGNN: Multi-Relational Dynamic Graph Neural Network for Comprehensive and Dynamic Stock Investment Prediction new file mode 100644 index 0000000000..69e7ab0e91 --- /dev/null +++ b/data/2024/aaai/MDGNN: Multi-Relational Dynamic Graph Neural Network for Comprehensive and Dynamic Stock Investment Prediction @@ -0,0 +1 @@ +The stock market is a crucial component of the financial system, but predicting the movement of stock prices is challenging due to the dynamic and intricate relations arising from various aspects such as economic indicators, financial reports, global news, and investor sentiment. Traditional sequential methods and graph-based models have been applied in stock movement prediction, but they have limitations in capturing the multifaceted and temporal influences in stock price movements. To address these challenges, the Multi-relational Dynamic Graph Neural Network (MDGNN) framework is proposed, which utilizes a discrete dynamic graph to comprehensively capture multifaceted relations among stocks and their evolution over time. The representation generated from the graph offers a complete perspective on the interrelationships among stocks and associated entities. Additionally, the power of the Transformer structure is leveraged to encode the temporal evolution of multiplex relations, providing a dynamic and effective approach to predicting stock investment. Further, our proposed MDGNN framework achieves the best performance on public datasets compared with state-of-the-art stock investment methods. \ No newline at end of file diff --git a/data/2024/aaai/MEPSI: An MDL-Based Ensemble Pruning Approach with Structural Information b/data/2024/aaai/MEPSI: An MDL-Based Ensemble Pruning Approach with Structural Information new file mode 100644 index 0000000000..dd2e8efdee --- /dev/null +++ b/data/2024/aaai/MEPSI: An MDL-Based Ensemble Pruning Approach with Structural Information @@ -0,0 +1 @@ +Ensemble pruning, which combines a subset of individual learners generated in parallel to make predictions, is an important topic in ensemble learning. Over the past decades, many pruning algorithms have been developed that focus on the external behavior of learners on samples, which may lead to over-fitting. In this paper, we conjecture that the generalization performance of an ensemble is not only related to its external behavior on samples but also dependent on the internal structure of individual learners. We propose the general MEPSI approach based on Kolmogorov complexity and the Minimum Description Length (MDL) principle, which formulates the ensemble pruning task as a two-objective optimization problem that comprises the empirical error and structural information among individual learners. We also provide a concrete implementation of MEPSI on decision trees. The theoretical results provide generalization bounds for both the general MEPSI approach and the tree-based implementation. The comparative experiments conducted on multiple real-world data sets demonstrate the effectiveness of our proposed method. 
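The two-objective view taken by MEPSI above (empirical error plus structural information of the selected learners) can be illustrated with a toy greedy pruner. In the sketch below the structural term is approximated by the total node count of the chosen trees and the trade-off weight lam is arbitrary; both are assumptions for illustration and differ from the paper's actual MDL-based measure and optimization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)

def objective(members, lam=1e-4):
    """Empirical error of the majority vote plus a crude structural (size) penalty."""
    votes = np.mean([t.predict(X) for t in members], axis=0) >= 0.5
    error = np.mean(votes.astype(int) != y)
    structure = sum(t.tree_.node_count for t in members)
    return error + lam * structure

selected, remaining, best = [], list(forest.estimators_), np.inf
while remaining:
    score, tree = min(((objective(selected + [t]), t) for t in remaining), key=lambda st: st[0])
    if score >= best:   # stop once adding any tree no longer lowers the objective
        break
    best = score
    selected.append(tree)
    remaining.remove(tree)

print(f"kept {len(selected)} of {len(forest.estimators_)} trees; objective = {best:.4f}")
```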
\ No newline at end of file diff --git a/data/2024/aaai/MERGE: Fast Private Text Generation b/data/2024/aaai/MERGE: Fast Private Text Generation new file mode 100644 index 0000000000..c047f8018f --- /dev/null +++ b/data/2024/aaai/MERGE: Fast Private Text Generation @@ -0,0 +1 @@ +The drastic increase in language models' parameters has led to a new trend of deploying models in cloud servers, raising growing concerns about private inference for Transformer-based models. Existing two-party privacy-preserving techniques, however, only take into account natural language understanding (NLU) scenarios. Private inference in natural language generation (NLG), crucial for applications like translation and code completion, remains underexplored. In addition, previous privacy-preserving techniques suffer from convergence issues during model training and exhibit poor inference speed when used with NLG models due to the neglect of time-consuming operations in auto-regressive generations. To address these issues, we propose a fast private text generation framework for Transformer-based language models, namely MERGE. MERGE reuses the output hidden state as the word embedding to bypass the embedding computation and reorganizes the linear operations in the Transformer module to accelerate the forward procedure. Extensive experiments show that MERGE achieves a 26.5x speedup over the vanilla encrypted model at sequence length 512, reduces communication cost by 80%, and offers up to a 10x speedup over state-of-the-art approximated models. \ No newline at end of file diff --git a/data/2024/aaai/MESED: A Multi-Modal Entity Set Expansion Dataset with Fine-Grained Semantic Classes and Hard Negative Entities b/data/2024/aaai/MESED: A Multi-Modal Entity Set Expansion Dataset with Fine-Grained Semantic Classes and Hard Negative Entities new file mode 100644 index 0000000000..8b8f3e342a --- /dev/null +++ b/data/2024/aaai/MESED: A Multi-Modal Entity Set Expansion Dataset with Fine-Grained Semantic Classes and Hard Negative Entities @@ -0,0 +1 @@ +The Entity Set Expansion (ESE) task aims to expand a handful of seed entities with new entities belonging to the same semantic class. Conventional ESE methods are based on mono-modality (i.e., the literal modality) and struggle to deal with complex entities in the real world, such as (1) negative entities with fine-grained semantic differences, (2) synonymous entities, (3) polysemous entities, and (4) long-tailed entities. These challenges prompt us to propose novel Multi-modal Entity Set Expansion (MESE), where models integrate information from multiple modalities to represent entities. Intuitively, the benefits of multi-modal information for ESE are threefold: (1) different modalities can provide complementary information; (2) multi-modal information provides a unified signal via common visual properties for the same semantic class or entity; (3) multi-modal information offers robust alignment signals for synonymous entities. To assess model performance in MESE, we constructed the MESED dataset, the first multi-modal dataset for ESE with large-scale and elaborate manual calibration. A powerful multi-modal model, MultiExpan, is proposed, which is pre-trained on four multimodal pre-training tasks. The extensive experiments and analyses on MESED demonstrate the high quality of the dataset and the effectiveness of our MultiExpan, as well as pointing out directions for future research. The benchmark and code are public at https://github.com/THUKElab/MESED. 
\ No newline at end of file diff --git a/data/2024/aaai/MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks b/data/2024/aaai/MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks new file mode 100644 index 0000000000..1aac76a58c --- /dev/null +++ b/data/2024/aaai/MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks @@ -0,0 +1 @@ +To better understand the output of deep neural networks (DNN), attribution-based methods have been an important approach for model interpretability, which assign a score for each input dimension to indicate its importance towards the model outcome. Notably, the attribution methods use the axioms of sensitivity and implementation invariance to ensure the validity and reliability of attribution results. Yet, the existing attribution methods present challenges for effective interpretation and efficient computation. In this work, we introduce MFABA, an attribution algorithm that adheres to axioms, as a novel method for interpreting DNN. Additionally, we provide the theoretical proof and in-depth analysis for the MFABA algorithm, and conduct a large-scale experiment. The results demonstrate its superiority by achieving over 101.5142 times faster speed than the state-of-the-art attribution algorithms. The effectiveness of MFABA is thoroughly evaluated through the statistical analysis in comparison to other methods, and the full implementation package is open-source at: https://github.com/LMBTough/MFABA. \ No newline at end of file diff --git a/data/2024/aaai/MFOS: Model-Free & One-Shot Object Pose Estimation b/data/2024/aaai/MFOS: Model-Free & One-Shot Object Pose Estimation new file mode 100644 index 0000000000..600bfeba3b --- /dev/null +++ b/data/2024/aaai/MFOS: Model-Free & One-Shot Object Pose Estimation @@ -0,0 +1 @@ +Existing learning-based methods for object pose estimation in RGB images are mostly model-specific or category based. They lack the capability to generalize to new object categories at test time, hence severely hindering their practicability and scalability. Notably, recent attempts have been made to solve this issue, but they still require accurate 3D data of the object surface at both train and test time. In this paper, we introduce a novel approach that can estimate in a single forward pass the pose of objects never seen during training, given minimum input. In contrast to existing state-of-the-art approaches, which rely on task-specific modules, our proposed model is entirely based on a transformer architecture, which can benefit from recently proposed 3D-geometry general pretraining. We conduct extensive experiments and report state-of-the-art one-shot performance on the challenging LINEMOD benchmark. Finally, extensive ablations allow us to determine good practices with this relatively new type of architecture in the field. 
\ No newline at end of file diff --git a/data/2024/aaai/MFTN: Multi-Level Feature Transfer Network Based on MRI-Transformer for MR Image Super-resolution b/data/2024/aaai/MFTN: Multi-Level Feature Transfer Network Based on MRI-Transformer for MR Image Super-resolution new file mode 100644 index 0000000000..e37f7d2453 --- /dev/null +++ b/data/2024/aaai/MFTN: Multi-Level Feature Transfer Network Based on MRI-Transformer for MR Image Super-resolution @@ -0,0 +1 @@ +Due to the unique environment and inherent properties of magnetic resonance imaging (MRI) instruments, MR images typically have lower resolution. Therefore, improving the resolution of MR images is beneficial for assisting doctors in diagnosis. Currently, the existing MR image super-resolution (SR) methods still have the problem of insufficient detail reconstruction. To overcome this issue, this paper proposes a multi-level feature transfer network (MFTN) based on MRI-Transformer to realize SR of low-resolution MRI data. MFTN consists of a multi-scale feature reconstruction network (MFRN) and a multi-level feature extraction branch (MFEB). MFRN is constructed as a pyramid structure to gradually reconstruct image features at different scales by integrating the features obtained from MFEB, and MFEB is constructed to provide detail information at different scales for low-resolution MR image SR reconstruction by constructing multiple MRI-Transformer modules. Each MRI-Transformer module is designed to learn the transfer features from the reference image by establishing feature correlations between the reference image and the low-resolution MR image. In addition, a contrastive learning constraint term is added to the loss function to enhance the texture details of the SR image. A large number of experiments show that our network can effectively reconstruct high-quality MR images and achieves better performance compared to some state-of-the-art methods. The source code of this work will be released on GitHub. \ No newline at end of file diff --git a/data/2024/aaai/MGNet: Learning Correspondences via Multiple Graphs b/data/2024/aaai/MGNet: Learning Correspondences via Multiple Graphs new file mode 100644 index 0000000000..2956645ac1 --- /dev/null +++ b/data/2024/aaai/MGNet: Learning Correspondences via Multiple Graphs @@ -0,0 +1 @@ +Learning correspondences aims to find correct correspondences (inliers) from the initial correspondence set with an uneven correspondence distribution and a low inlier rate, which can be regarded as graph data. Recent advances usually use graph neural networks (GNNs) to build a single type of graph or simply stack local graphs into the global one to complete the task. But they ignore the complementary relationship between different types of graphs, which can effectively capture potential relationships among sparse correspondences. To address this problem, we propose MGNet to effectively combine multiple complementary graphs. To obtain information integrating implicit and explicit local graphs, we construct local graphs from implicit and explicit aspects and combine them effectively, which is used to build a global graph. Moreover, we propose Graph Soft Degree Attention (GSDA) to make full use of all sparse correspondence information at once in the global graph, which can capture and amplify discriminative features. Extensive experiments demonstrate that MGNet outperforms state-of-the-art methods in different visual tasks. The code is provided at https://github.com/DAILUANYUAN/MGNet-2024AAAI. 
\ No newline at end of file diff --git a/data/2024/aaai/MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization b/data/2024/aaai/MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization new file mode 100644 index 0000000000..b7f661805f --- /dev/null +++ b/data/2024/aaai/MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization @@ -0,0 +1 @@ +Deep learning-based models have made great progress in image tampering localization, which aims to distinguish between manipulated and authentic regions. However, these models suffer from inefficient training. This is because they use ground-truth mask labels mainly through the cross-entropy loss, which prioritizes per-pixel precision but disregards the spatial location and shape details of manipulated regions. To address this problem, we propose a Mask-Guided Query-based Transformer Framework (MGQFormer), which uses ground-truth masks to guide the learnable query token (LQT) in identifying the forged regions. Specifically, we extract feature embeddings of ground-truth masks as the guiding query token (GQT) and feed GQT and LQT into MGQFormer to estimate fake regions, respectively. Then we make MGQFormer learn the position and shape information in ground-truth mask labels by proposing a mask-guided loss to reduce the feature distance between GQT and LQT. We also observe that such mask-guided training strategy has a significant impact on the convergence speed of MGQFormer training. Extensive experiments on multiple benchmarks show that our method significantly improves over state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment b/data/2024/aaai/MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment new file mode 100644 index 0000000000..7558f8937d --- /dev/null +++ b/data/2024/aaai/MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment @@ -0,0 +1 @@ +Black-box deep learning approaches have showcased significant potential in the realm of medical image analysis. However, the stringent trustworthiness requirements intrinsic to the medical field have catalyzed research into the utilization of Explainable Artificial Intelligence (XAI), with a particular focus on concept-based methods. Existing concept-based methods predominantly apply concept annotations from a single perspective (e.g., global level), neglecting the nuanced semantic relationships between sub-regions and concepts embedded within medical images. This leads to underutilization of the valuable medical information and may cause models to fall short in harmoniously balancing interpretability and performance when employing inherently interpretable architectures such as Concept Bottlenecks. To mitigate these shortcomings, we propose a multi-modal explainable disease diagnosis framework that meticulously aligns medical images and clinical-related concepts semantically at multiple strata, encompassing the image level, token level, and concept level. Moreover, our method allows for model intervention and offers both textual and visual explanations in terms of human-interpretable concepts. Experimental results on three skin image datasets demonstrate that our method, while preserving model interpretability, attains high performance and label efficiency for concept detection and disease diagnosis. 
The code is available at https://github.com/Tommy-Bie/MICA. \ No newline at end of file diff --git a/data/2024/aaai/MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways b/data/2024/aaai/MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways new file mode 100644 index 0000000000..9b096229cc --- /dev/null +++ b/data/2024/aaai/MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways @@ -0,0 +1 @@ +We present MIDDAG, an intuitive, interactive system that visualizes the information propagation paths on social media triggered by COVID-19-related news articles, accompanied by comprehensive insights including user/community susceptibility levels, as well as events and popular opinions raised by the crowd while propagating the information. Besides discovering information flow patterns among users, we construct communities among users and develop the propagation forecasting capability, enabling tracing and understanding of how information is disseminated at a higher level. A demo video and more are available at https://info-pathways.github.io. \ No newline at end of file diff --git a/data/2024/aaai/MIND: Multi-Task Incremental Network Distillation b/data/2024/aaai/MIND: Multi-Task Incremental Network Distillation new file mode 100644 index 0000000000..63034b96c3 --- /dev/null +++ b/data/2024/aaai/MIND: Multi-Task Incremental Network Distillation @@ -0,0 +1 @@ +The recent surge of pervasive devices that generate dynamic data streams has underscored the necessity for learning systems to adapt continually to data distributional shifts. To tackle this challenge, the research community has put forth a spectrum of methodologies, including the demanding pursuit of class-incremental learning without replay data. In this study, we present MIND, a parameter isolation method that aims to significantly enhance the performance of replay-free solutions and achieve state-of-the-art results on several widely studied datasets. Our approach introduces two main contributions: two alternative distillation procedures that significantly improve the efficiency of MIND, increasing the accumulated knowledge of each sub-network, and the optimization of the BatchNorm layers across tasks inside the sub-networks. Overall, MIND outperforms all the state-of-the-art methods for rehearsal-free Class-Incremental learning (with an increment in classification accuracy of approx. +6% on CIFAR-100/10 and +10% on TinyImageNet/10), reaching up to approx. +40% accuracy in Domain-Incremental scenarios. Moreover, we ablated each contribution to demonstrate its impact on performance improvement. Our results showcase the superior performance of MIND, indicating its potential for addressing the challenges posed by Class-incremental and Domain-Incremental learning in resource-constrained environments. \ No newline at end of file diff --git a/data/2024/aaai/MINES: Message Intercommunication for Inductive Relation Reasoning over Neighbor-Enhanced Subgraphs b/data/2024/aaai/MINES: Message Intercommunication for Inductive Relation Reasoning over Neighbor-Enhanced Subgraphs new file mode 100644 index 0000000000..85d6723321 --- /dev/null +++ b/data/2024/aaai/MINES: Message Intercommunication for Inductive Relation Reasoning over Neighbor-Enhanced Subgraphs @@ -0,0 +1 @@ +GraIL and its variants have shown their promising capacities for inductive relation reasoning on knowledge graphs. 
However, the uni-directional message-passing mechanism hinders such models from exploiting hidden mutual relations between entities in directed graphs. Besides, the enclosing subgraph extraction in most GraIL-based models restricts the model from extracting enough discriminative information for reasoning. Consequently, the expressive ability of these models is limited. To address the problems, we propose a novel GraIL-based framework, termed MINES, by introducing a Message Intercommunication mechanism on the Neighbor-Enhanced Subgraph. Concretely, the message intercommunication mechanism is designed to capture the omitted hidden mutual information. It introduces bi-directed information interactions between connected entities by inserting an undirected/bi-directed GCN layer between uni-directed RGCN layers. Moreover, inspired by the success of involving more neighbors in other graph-based tasks, we extend the neighborhood area beyond the enclosing subgraph to enhance the information collection for inductive relation reasoning. Extensive experiments demonstrate the promising capability of the proposed MINES from various aspects, especially its superiority, effectiveness, and transferability. \ No newline at end of file diff --git a/data/2024/aaai/MKG-FENN: A Multimodal Knowledge Graph Fused End-to-End Neural Network for Accurate Drug-Drug Interaction Prediction b/data/2024/aaai/MKG-FENN: A Multimodal Knowledge Graph Fused End-to-End Neural Network for Accurate Drug-Drug Interaction Prediction new file mode 100644 index 0000000000..acb4a7cd28 --- /dev/null +++ b/data/2024/aaai/MKG-FENN: A Multimodal Knowledge Graph Fused End-to-End Neural Network for Accurate Drug-Drug Interaction Prediction @@ -0,0 +1 @@ +Taking multiple incompatible drugs together may cause adverse interactions and side effects on the body. Accurate prediction of drug-drug interaction (DDI) events is essential for avoiding this issue. Recently, various artificial intelligence-based approaches have been proposed for predicting DDI events. However, DDI events are associated with complex relationships and mechanisms among drugs, targets, enzymes, transporters, molecular structures, etc. Existing approaches either partially or loosely consider these relationships and mechanisms by a non-end-to-end learning framework, resulting in sub-optimal feature extractions and fusions for prediction. Different from them, this paper proposes a Multimodal Knowledge Graph Fused End-to-end Neural Network (MKGFENN) that consists of two main parts: multimodal knowledge graph (MKG) and fused end-to-end neural network (FENN). First, MKG is constructed by comprehensively exploiting DDI event-associated relationships and mechanisms from four knowledge graphs of drugs-chemical entities, drug-substructures, drugs-drugs, and molecular structures. Correspondingly, a four-channel graph neural network is designed to extract high-order and semantic features from MKG. Second, FENN designs a multi-layer perceptron to fuse the extracted features by end-to-end learning. With such designs, the feature extractions and fusions of DDI events are guaranteed to be comprehensive and optimal for prediction. Through extensive experiments on real drug datasets, we demonstrate that MKG-FENN exhibits high accuracy and significantly outperforms state-of-the-art models in predicting DDI events. The source code and supplementary file of this article are available at: https://github.com/wudi1989/MKG-FENN. 
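At the architectural level, the MKG-FENN abstract above describes one graph encoder per knowledge-graph view followed by an MLP that fuses drug-pair features end to end. The sketch below captures only that wiring; the single mean-aggregation layer per channel, the layer sizes, and the number of DDI event classes are illustrative assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class MeanAggChannel(nn.Module):
    """One graph 'channel': a single mean-aggregation layer over a row-normalized adjacency."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):            # x: (N, in_dim), adj_norm: (N, N)
        return torch.relu(self.lin(adj_norm @ x))

class FusedDDIPredictor(nn.Module):
    def __init__(self, in_dim=64, hid=64, n_events=65, n_channels=4):
        super().__init__()
        self.channels = nn.ModuleList(MeanAggChannel(in_dim, hid) for _ in range(n_channels))
        self.fuse = nn.Sequential(nn.Linear(2 * n_channels * hid, 128), nn.ReLU(),
                                  nn.Linear(128, n_events))

    def forward(self, feats, adjs, pairs):
        # feats/adjs: one (N, in_dim) feature matrix and one (N, N) adjacency per channel
        embs = [ch(x, a) for ch, x, a in zip(self.channels, feats, adjs)]
        drug = torch.cat(embs, dim=1)                                    # per-drug embedding
        pair = torch.cat([drug[pairs[:, 0]], drug[pairs[:, 1]]], dim=1)  # drug-pair feature
        return self.fuse(pair)                                           # logits over DDI events

model = FusedDDIPredictor()
feats = [torch.randn(10, 64) for _ in range(4)]
adjs = [torch.softmax(torch.randn(10, 10), dim=1) for _ in range(4)]    # toy row-stochastic graphs
print(model(feats, adjs, torch.tensor([[0, 1], [2, 3]])).shape)         # torch.Size([2, 65])
```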
\ No newline at end of file diff --git a/data/2024/aaai/MLNet: Mutual Learning Network with Neighborhood Invariance for Universal Domain Adaptation b/data/2024/aaai/MLNet: Mutual Learning Network with Neighborhood Invariance for Universal Domain Adaptation new file mode 100644 index 0000000000..6ad3048229 --- /dev/null +++ b/data/2024/aaai/MLNet: Mutual Learning Network with Neighborhood Invariance for Universal Domain Adaptation @@ -0,0 +1 @@ +Universal domain adaptation (UniDA) is a practical but challenging problem, in which information about the relation between the source and the target domains is not given for knowledge transfer. Existing UniDA methods may suffer from the problems of overlooking intra-domain variations in the target domain and difficulty in separating similar known and unknown classes. To address these issues, we propose a novel Mutual Learning Network (MLNet) with neighborhood invariance for UniDA. In our method, confidence-guided invariant feature learning with self-adaptive neighbor selection is designed to reduce the intra-domain variations for more generalizable feature representation. By using the cross-domain mixup scheme for better unknown-class identification, the proposed method compensates for the misidentified known-class errors by mutual learning between the closed-set and open-set classifiers. Extensive experiments on three publicly available benchmarks demonstrate that our method achieves the best results compared to state-of-the-art methods in most cases and significantly outperforms the baseline across all four settings in UniDA. Code is available at https://github.com/YanzuoLu/MLNet. \ No newline at end of file diff --git a/data/2024/aaai/MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding b/data/2024/aaai/MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding new file mode 100644 index 0000000000..1e3166a97f --- /dev/null +++ b/data/2024/aaai/MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding @@ -0,0 +1 @@ +In perception, multiple sources of sensory information are integrated to map visual information from 2D views onto 3D objects, which is beneficial for understanding in 3D environments. However, any single rendered 2D view provides only limited partial information. The richness and value of multi-view 2D information can provide superior self-supervised signals for 3D objects. In this paper, we propose a novel self-supervised point cloud representation learning method, MM-Point, which is driven by intra-modal and inter-modal similarity objectives. The core of MM-Point lies in the multi-modal interaction and transmission between 3D objects and multiple 2D views at the same time. To more effectively enforce the consistent cross-modal objective over 2D multi-view information based on contrastive learning, we further propose Multi-MLP and Multi-level Augmentation strategies. Through carefully designed transformation strategies, we further learn multi-level invariance across 2D multi-views. MM-Point demonstrates state-of-the-art (SOTA) performance in various downstream tasks. For instance, it achieves a peak accuracy of 92.4% on the synthetic dataset ModelNet40, and a top accuracy of 87.8% on the real-world dataset ScanObjectNN, comparable to fully supervised methods. 
Additionally, we demonstrate its effectiveness in tasks such as few-shot classification, 3D part segmentation, and 3D semantic segmentation. \ No newline at end of file diff --git a/data/2024/aaai/MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis b/data/2024/aaai/MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis new file mode 100644 index 0000000000..dbbb2399bf --- /dev/null +++ b/data/2024/aaai/MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis @@ -0,0 +1 @@ +The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are either based on fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal and style controllable TTS framework named MM-TTS. It can utilize any modality as the prompt in a unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech within a single system. The challenges of modeling such a multi-modal style controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space to enable the input of arbitrary modality as the style prompt in a single system, and 2) efficiently transferring the unified style representation into the given text content, thereby enabling the generation of speech in the prompted style. To address these problems, we propose an aligned multi-modal prompt encoder that embeds different modalities into a unified style space, supporting style transfer for different modalities. Additionally, we present a new adaptive style transfer method named Style Adaptive Convolutions (SAConv) to achieve a better style representation. Furthermore, we design a Rectified Flow based Refiner to solve the problem of over-smoothed Mel-spectrograms and generate audio of higher fidelity. Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking heads. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multi-modal prompts. The audio samples and constructed dataset are available at https://multimodal-tts.github.io. \ No newline at end of file diff --git a/data/2024/aaai/MRMLREC: A Two-Stage Approach for Addressing Data Sparsity in MOOC Video Recommendation (Student Abstract) b/data/2024/aaai/MRMLREC: A Two-Stage Approach for Addressing Data Sparsity in MOOC Video Recommendation (Student Abstract) new file mode 100644 index 0000000000..dfb49c73fd --- /dev/null +++ b/data/2024/aaai/MRMLREC: A Two-Stage Approach for Addressing Data Sparsity in MOOC Video Recommendation (Student Abstract) @@ -0,0 +1 @@ +With the abundance of learning resources available on massive open online course (MOOC) platforms, the issue of interactive data sparsity has emerged as a significant challenge. This paper introduces MRMLREC, an efficient MOOC video recommendation approach that consists of two main stages: multi-relational representation and multi-level recommendation, aiming to solve the problem of data sparsity. 
In the multi-relational representation stage, MRMLREC adopts a tripartite approach, constructing relational graphs based on temporal sequences, courses-videos relation, and knowledge concepts-video relation. These graphs are processed by a Graph Convolution Network (GCN) and two variant Graph Attention Networks (GAT) to derive representations. A variant of the Long Short-Term Memory Network (LSTM) then integrates these multi-dimensional data to enhance the overall representation. The multi-level recommendation stage introduces three prediction tasks at varying levels—courses, knowledge concepts, and videos—to mitigate data sparsity and improve the interpretability of video recommendations. Beam search (BS) is employed to identify top-β items at each level, refining the subsequent level's search space and enhancing recommendation efficiency. Additionally, an optional layer offers both personalization and diversification modes, ensuring variety in recommended videos and maintaining learner engagement. Comprehensive experiments demonstrate the effectiveness of MRMLREC on two real-world instances from Xuetang X. \ No newline at end of file diff --git a/data/2024/aaai/MSGNet: Learning Multi-Scale Inter-series Correlations for Multivariate Time Series Forecasting b/data/2024/aaai/MSGNet: Learning Multi-Scale Inter-series Correlations for Multivariate Time Series Forecasting new file mode 100644 index 0000000000..77de260e5d --- /dev/null +++ b/data/2024/aaai/MSGNet: Learning Multi-Scale Inter-series Correlations for Multivariate Time Series Forecasting @@ -0,0 +1 @@ +Multivariate time series forecasting poses an ongoing challenge across various disciplines. Time series data often exhibit diverse intra-series and inter-series correlations, contributing to intricate and interwoven dependencies that have been the focus of numerous studies. Nevertheless, a significant research gap remains in comprehending the varying inter-series correlations across different time scales among multiple time series, an area that has received limited attention in the literature. To bridge this gap, this paper introduces MSGNet, an advanced deep learning model designed to capture the varying inter-series correlations across multiple time scales using frequency domain analysis and adaptive graph convolution. By leveraging frequency domain analysis, MSGNet effectively extracts salient periodic patterns and decomposes the time series into distinct time scales. The model incorporates a self-attention mechanism to capture intra-series dependencies, while introducing an adaptive mixhop graph convolution layer to autonomously learn diverse inter-series correlations within each time scale. Extensive experiments are conducted on several real-world datasets to showcase the effectiveness of MSGNet. Furthermore, MSGNet possesses the ability to automatically learn explainable multi-scale inter-series correlations, exhibiting strong generalization capabilities even when applied to out-of-distribution samples. 
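The frequency-domain step that the MSGNet abstract above relies on, extracting salient periodic patterns before scale-specific modeling, can be illustrated by reading off dominant periods from the FFT amplitude spectrum. The snippet below is a generic sketch of that idea, not MSGNet's implementation; the value of k and the averaging over variables are assumptions.

```python
import numpy as np

def dominant_periods(x: np.ndarray, k: int = 3):
    """x: (T, C) multivariate series. Return the k periods with the largest FFT amplitude."""
    t = x.shape[0]
    amp = np.abs(np.fft.rfft(x, axis=0)).mean(axis=1)  # average amplitude over variables
    amp[0] = 0.0                                        # ignore the zero-frequency (DC) term
    top = np.argsort(amp)[-k:][::-1]                    # indices of the k strongest frequencies
    return [t // f for f in top]                        # convert frequency index to period length

# toy series with daily (24-step) and weekly (168-step) cycles plus noise
steps = np.arange(24 * 7 * 8)
x = np.stack([np.sin(2 * np.pi * steps / 24), np.sin(2 * np.pi * steps / 168)], axis=1)
x = x + 0.1 * np.random.randn(*x.shape)
print(dominant_periods(x))  # expected to include periods close to 24 and 168
```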
\ No newline at end of file diff --git a/data/2024/aaai/MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving b/data/2024/aaai/MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving new file mode 100644 index 0000000000..d8d11eecd9 --- /dev/null +++ b/data/2024/aaai/MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving @@ -0,0 +1 @@ +Instance segmentation is a fundamental research topic in computer vision, especially in autonomous driving. However, manual mask annotation for instance segmentation is quite time-consuming and costly. To address this problem, some prior works attempt to adopt a weakly supervised manner by exploring 2D or 3D boxes. However, no one has ever successfully segmented 2D and 3D instances simultaneously by only using 2D box annotations, which could further reduce the annotation cost by an order of magnitude. Thus, we propose a novel framework called Multimodal Weakly Supervised Instance Segmentation (MWSIS), which incorporates various fine-grained label correction modules for both 2D and 3D modalities, along with a new multimodal cross-supervision approach. In the 2D pseudo label generation branch, the Instance-based Pseudo Mask Generation (IPG) module utilizes predictions for self-supervised correction. Similarly, in the 3D pseudo label generation branch, the Spatial-based Pseudo Label Generation (SPG) module generates pseudo labels by incorporating the spatial prior information of the point cloud. To further refine the generated pseudo labels, the Point-based Voting Label Correction (PVC) module utilizes historical predictions for correction. Additionally, a Ring Segment-based Label Correction (RSC) module is proposed to refine the predictions by leveraging the depth prior information from the point cloud. Finally, the Consistency Sparse Cross-modal Supervision (CSCS) module reduces the inconsistency of multimodal predictions by response distillation. Particularly, transferring the 3D backbone to downstream tasks not only improves the performance of the 3D detectors, but also outperforms fully supervised instance segmentation with only 5% fully supervised annotations. On the Waymo dataset, the proposed framework demonstrates significant improvements over the baseline, especially achieving 2.59% mAP and 12.75% mAP increases for 2D and 3D instance segmentation tasks, respectively. The code is available at https://github.com/jiangxb98/mwsis-plugin. \ No newline at end of file diff --git a/data/2024/aaai/Machine Learning-Powered Combinatorial Clock Auction b/data/2024/aaai/Machine Learning-Powered Combinatorial Clock Auction new file mode 100644 index 0000000000..f914c054b8 --- /dev/null +++ b/data/2024/aaai/Machine Learning-Powered Combinatorial Clock Auction @@ -0,0 +1 @@ +We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most important information from bidders. However, from a practical point of view, the main shortcoming of this prior work is that those designs elicit bidders' preferences via value queries (i.e., “What is your value for the bundle {A, B}?''). 
In most real-world ICA domains, value queries are considered impractical, since they impose an unrealistically high cognitive burden on bidders, which is why they are not used in practice. In this paper, we address this shortcoming by designing an ML-powered combinatorial clock auction that elicits information from the bidders only via demand queries (i.e., “At prices p, what is your most preferred bundle of items?''). We make two key technical contributions: First, we present a novel method for training an ML model on demand queries. Second, based on those trained ML models, we introduce an efficient method for determining the demand query with the highest clearing potential, for which we also provide a theoretical foundation. We experimentally evaluate our ML-based demand query mechanism in several spectrum auction domains and compare it against the most established real-world ICA: the combinatorial clock auction (CCA). Our mechanism significantly outperforms the CCA in terms of efficiency in all domains, it achieves higher efficiency in a significantly reduced number of rounds, and, using linear prices, it exhibits vastly higher clearing potential. Thus, with this paper we bridge the gap between research and practice and propose the first practical ML-powered ICA. \ No newline at end of file diff --git a/data/2024/aaai/Machine-Created Universal Language for Cross-Lingual Transfer b/data/2024/aaai/Machine-Created Universal Language for Cross-Lingual Transfer new file mode 100644 index 0000000000..19c091adb6 --- /dev/null +++ b/data/2024/aaai/Machine-Created Universal Language for Cross-Lingual Transfer @@ -0,0 +1 @@ +There are two primary approaches to addressing cross-lingual transfer: multilingual pre-training, which implicitly aligns the hidden representations of various languages, and translate-test, which explicitly translates different languages into an intermediate language, such as English. Translate-test offers better interpretability compared to multilingual pre-training. However, it has lower performance than multilingual pre-training and struggles with word-level tasks due to translation altering word order. As a result, we propose a new Machine-created Universal Language (MUL) as an alternative intermediate language. MUL comprises a set of discrete symbols forming a universal vocabulary and a natural language to MUL translator for converting multiple natural languages to MUL. MUL unifies shared concepts from various languages into a single universal word, enhancing cross-language transfer. Additionally, MUL retains language-specific words and word order, allowing the model to be easily applied to word-level tasks. Our experiments demonstrate that translating into MUL yields improved performance compared to multilingual pre-training, and our analysis indicates that MUL possesses strong interpretability. The code is at: https://github.com/microsoft/Unicoder/tree/master/MCUL. \ No newline at end of file diff --git a/data/2024/aaai/MagiCapture: High-Resolution Multi-Concept Portrait Customization b/data/2024/aaai/MagiCapture: High-Resolution Multi-Concept Portrait Customization new file mode 100644 index 0000000000..e28ab52c9a --- /dev/null +++ b/data/2024/aaai/MagiCapture: High-Resolution Multi-Concept Portrait Customization @@ -0,0 +1 @@ +Large-scale text-to-image models including Stable Diffusion are capable of generating high-fidelity photorealistic portrait images. 
There is an active research area dedicated to personalizing these models, aiming to synthesize specific subjects or styles using provided sets of reference images. However, despite the plausible results from these personalization methods, they tend to produce images that often fall short of realism and are not yet on a commercially viable level. This is particularly noticeable in portrait image generation, where any unnatural artifact in human faces is easily discernible due to our inherent human bias. To address this, we introduce MagiCapture, a personalization method for integrating subject and style concepts to generate high-resolution portrait images using just a few subject and style references. For instance, given a handful of random selfies, our fine-tuned model can generate high-quality portrait images in specific styles, such as passport or profile photos. The main challenge with this task is the absence of ground truth for the composed concepts, leading to a reduction in the quality of the final output and an identity shift of the source subject. To address these issues, we present a novel Attention Refocusing loss coupled with auxiliary priors, both of which facilitate robust learning within this weakly supervised learning setting. Our pipeline also includes additional post-processing steps to ensure the creation of highly realistic outputs. MagiCapture outperforms other baselines in both quantitative and qualitative evaluations and can also be generalized to other non-human objects. \ No newline at end of file diff --git a/data/2024/aaai/Make Lossy Compression Meaningful for Low-Light Images b/data/2024/aaai/Make Lossy Compression Meaningful for Low-Light Images new file mode 100644 index 0000000000..df931db2f9 --- /dev/null +++ b/data/2024/aaai/Make Lossy Compression Meaningful for Low-Light Images @@ -0,0 +1 @@ +Low-light images frequently occur due to unavoidable environmental influences or technical limitations, such as insufficient lighting or limited exposure time. To achieve better visibility for visual perception, low-light image enhancement is usually adopted. Besides, lossy image compression is vital for meeting the requirements of storage and transmission in computer vision applications. To address these two practical demands, current solutions can be categorized into two sequential manners: ``Compress before Enhance (CbE)'' or ``Enhance before Compress (EbC)''. However, neither of them is suitable since: (1) Error accumulation in the individual models plagues sequential solutions. Especially, once low-light images are compressed by existing general lossy image compression approaches, useful information (e.g., texture details) would be lost, resulting in a dramatic performance decrease in low-light image enhancement. (2) Due to the intermediate process, the sequential solution introduces an additional burden, resulting in low efficiency. We propose a novel joint solution to simultaneously achieve a high compression rate and good enhancement performance for low-light images with much lower computational cost and fewer model parameters. We design an end-to-end trainable architecture, which includes the main enhancement branch and the signal-to-noise ratio (SNR) aware branch. 
Experimental results show that our proposed joint solution achieves a significant improvement over different combinations of existing state-of-the-art sequential ``Compress before Enhance'' or ``Enhance before Compress'' solutions for low-light images, which would make lossy low-light image compression more meaningful. The project is publicly available at: https://github.com/CaiShilv/Joint-IC-LL. \ No newline at end of file diff --git a/data/2024/aaai/Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior b/data/2024/aaai/Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior new file mode 100644 index 0000000000..eade4ce4ad --- /dev/null +++ b/data/2024/aaai/Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior @@ -0,0 +1 @@ +Recent vision-language pre-trained (VLP) models have become the backbone for many downstream tasks, but they are utilized as frozen models without further learning. Prompt learning is a method to improve the pre-trained VLP model by adding a learnable context vector to the inputs of the text encoder. In a few-shot learning scenario of the downstream task, MLE training can lead the context vector to over-fit dominant image features in the training data. This overfitting can potentially harm the generalization ability, especially in the presence of a distribution shift between the training and test dataset. This paper presents a Bayesian-based framework of prompt tuning, which could alleviate the over-fitting issues in few-shot learning applications and increase the adaptability of prompts on unobserved instances. Specifically, modeling a data-dependent prior enhances the adaptability of text features for both seen and unseen image features without a trade-off in performance between them. Based on the Bayesian framework, we utilize the Wasserstein gradient flow in the estimation of our target posterior distribution, which enables our prompt to be flexible in capturing the complex modes of image features. We demonstrate the effectiveness of our method in several experiments on benchmark datasets by showing statistically significant improvements in performance compared to existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Make RepVGG Greater Again: A Quantization-Aware Approach b/data/2024/aaai/Make RepVGG Greater Again: A Quantization-Aware Approach new file mode 100644 index 0000000000..c0af268392 --- /dev/null +++ b/data/2024/aaai/Make RepVGG Greater Again: A Quantization-Aware Approach @@ -0,0 +1 @@ +The tradeoff between performance and inference speed is critical for practical applications. Architecture reparameterization obtains better tradeoffs and is becoming an increasingly popular ingredient in modern convolutional neural networks. Nonetheless, its quantization performance is usually too poor to deploy (e.g. more than 20% top-1 accuracy drop on ImageNet) when INT8 inference is desired. In this paper, we dive into the underlying mechanism of this failure, where the original design inevitably enlarges quantization error. We propose a simple, robust, and effective remedy to have a quantization-friendly structure that also enjoys reparameterization benefits. Our method greatly bridges the gap between INT8 and FP32 accuracy for RepVGG. Without bells and whistles, the top-1 accuracy drop on ImageNet is reduced to within 2% by standard post-training quantization. 
Extensive experiments on detection and semantic segmentation tasks verify its generalization. \ No newline at end of file diff --git a/data/2024/aaai/Making AI Policies Transparent to Humans through Demonstrations b/data/2024/aaai/Making AI Policies Transparent to Humans through Demonstrations new file mode 100644 index 0000000000..5c31da8b79 --- /dev/null +++ b/data/2024/aaai/Making AI Policies Transparent to Humans through Demonstrations @@ -0,0 +1 @@ +Demonstrations are a powerful way of increasing the transparency of AI policies to humans. Though we can approximately model human learning from demonstrations as inverse reinforcement learning, we note that human learning can differ from algorithmic learning in key ways, e.g. humans are computationally limited and may sometimes struggle to understand all of the nuances of a demonstration. Unlike related work that provide demonstrations to humans that simply maximize information gain, I leverage concepts from the human education literature, such as the zone of proximal development and scaffolding, to show demonstrations that balance informativeness and difficulty of understanding to maximize human learning. \ No newline at end of file diff --git a/data/2024/aaai/Making Natural Language Reasoning Explainable and Faithful b/data/2024/aaai/Making Natural Language Reasoning Explainable and Faithful new file mode 100644 index 0000000000..3c47826f96 --- /dev/null +++ b/data/2024/aaai/Making Natural Language Reasoning Explainable and Faithful @@ -0,0 +1 @@ +Neural models, including large language models (LLMs), achieve superior performance on logical reasoning tasks such as question answering. To elicit reasoning capabilities from LLMs, recent works propose using the chain-of-thought (CoT) mechanism to generate both the reasoning chain and the answer, which enhances the model’s capabilities in conducting reasoning. However, due to LLM’s uninterpretable nature and the extreme flexibility of free-form explanations, several challenges remain: such as struggling with inaccurate reasoning, hallucinations, and not aligning with human preferences. In this talk, we will focus on (1) our design of leveraging structured information (that is grounded to the context), for the explainable complex question answering and reasoning; (2) our multi-module interpretable framework for inductive reasoning, which conducts step-wise faithful reasoning with iterative feedback. \ No newline at end of file diff --git a/data/2024/aaai/Manifold Constraints for Imperceptible Adversarial Attacks on Point Clouds b/data/2024/aaai/Manifold Constraints for Imperceptible Adversarial Attacks on Point Clouds new file mode 100644 index 0000000000..500e6b8ffe --- /dev/null +++ b/data/2024/aaai/Manifold Constraints for Imperceptible Adversarial Attacks on Point Clouds @@ -0,0 +1 @@ +Adversarial attacks on 3D point clouds often exhibit unsatisfactory imperceptibility, which primarily stems from the disregard for manifold-aware distortion, i.e., distortion of the underlying 2-manifold surfaces. In this paper, we develop novel manifold constraints to reduce such distortion, aiming to enhance the imperceptibility of adversarial attacks on 3D point clouds. Specifically, we construct a bijective manifold mapping between point clouds and a simple parameter shape using an invertible auto-encoder. Consequently, manifold-aware distortion during attacks can be captured within the parameter space. 
By enforcing manifold constraints that preserve local properties of the parameter shape, manifold-aware distortion is effectively mitigated, ultimately leading to enhanced imperceptibility. Extensive experiments demonstrate that integrating manifold constraints into conventional adversarial attack solutions yields superior imperceptibility, outperforming the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Manifold-Based Verbalizer Space Re-embedding for Tuning-Free Prompt-Based Classification b/data/2024/aaai/Manifold-Based Verbalizer Space Re-embedding for Tuning-Free Prompt-Based Classification new file mode 100644 index 0000000000..8e0586cf24 --- /dev/null +++ b/data/2024/aaai/Manifold-Based Verbalizer Space Re-embedding for Tuning-Free Prompt-Based Classification @@ -0,0 +1 @@ +Prompt-based classification adapts tasks to a cloze question format utilizing the [MASK] token and the filled tokens are then mapped to labels through pre-defined verbalizers. Recent studies have explored the use of verbalizer embeddings to reduce labor in this process. However, all existing studies require a tuning process for either the pre-trained models or additional trainable embeddings. Meanwhile, the distance between high-dimensional verbalizer embeddings should not be measured by Euclidean distance due to the potential for non-linear manifolds in the representation space. In this study, we propose a tuning-free manifold-based space re-embedding method called Locally Linear Embedding with Intra-class Neighborhood Constraint (LLE-INC) for verbalizer embeddings, which preserves local properties within the same class as guidance for classification. Experimental results indicate that even without tuning any parameters, our LLE-INC is on par with automated verbalizers with parameter tuning. And with the parameter updating, our approach further enhances prompt-based tuning by up to 3.2%. Furthermore, experiments with the LLaMA-7B&13B indicate that LLE-INC is an efficient tuning-free classification approach for the hyper-scale language models. \ No newline at end of file diff --git a/data/2024/aaai/Manipulation-Robust Selection of Citizens' Assemblies b/data/2024/aaai/Manipulation-Robust Selection of Citizens' Assemblies new file mode 100644 index 0000000000..38040f6bc6 --- /dev/null +++ b/data/2024/aaai/Manipulation-Robust Selection of Citizens' Assemblies @@ -0,0 +1 @@ +Among the recent work on designing algorithms for selecting citizens' assembly participants, one key property of these algorithms has not yet been studied: their manipulability. Strategic manipulation is a concern because these algorithms must satisfy representation constraints according to volunteers' self-reported features; misreporting these features could thereby increase a volunteer's chance of being selected, decrease someone else's chance, and/or increase the expected number of seats given to their group. Strikingly, we show that Leximin — an algorithm that is widely used for its fairness — is highly manipulable in this way. We then introduce a new class of selection algorithms that use Lp norms as objective functions. We show that the manipulability of the Lp-based algorithm decreases in O(1/n^(1-1/p)) as the number of volunteers n grows, approaching the optimal rate of O(1/n) as p approaches infinity. These theoretical results are confirmed via experiments in eight real-world datasets. 
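The following short Python sketch (illustrative only; it is not taken from the paper and simply evaluates the asymptotic bound quoted above) shows how the manipulability rate n^-(1 - 1/p) of the Lp-based algorithm approaches the optimal n^-1 rate as p grows:

def manipulability_bound(n: int, p: float) -> float:
    # Rate quoted in the abstract: manipulability decreases as n^-(1 - 1/p).
    return n ** -(1.0 - 1.0 / p)

for p in (2.0, 5.0, 10.0, float("inf")):  # p = inf recovers the optimal 1/n rate
    print(p, [round(manipulability_bound(n, p), 6) for n in (100, 1_000, 10_000)])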
\ No newline at end of file diff --git a/data/2024/aaai/MapLE: Matching Molecular Analogues Promptly with Low Computational Resources by Multi-Metrics Evaluation (Student Abstract) b/data/2024/aaai/MapLE: Matching Molecular Analogues Promptly with Low Computational Resources by Multi-Metrics Evaluation (Student Abstract) new file mode 100644 index 0000000000..296847b024 --- /dev/null +++ b/data/2024/aaai/MapLE: Matching Molecular Analogues Promptly with Low Computational Resources by Multi-Metrics Evaluation (Student Abstract) @@ -0,0 +1 @@ +Matching molecular analogues is a computational chemistry and bioinformatics research issue which is used to identify molecules that are structurally or functionally similar to a target molecule. Recent studies on matching analogous molecules have predominantly concentrated on enhancing effectiveness, often sidelining computational efficiency, particularly in contexts of low computational resources. This oversight poses challenges in many real applications (e.g., drug discovery, catalyst generation and so forth). To tackle this issue, we propose a general strategy named MapLE, aiming to promptly match analogous molecules with low computational resources by multi-metrics evaluation. Experimental evaluation conducted on a public biomolecular dataset validates the excellent and efficient performance of the proposed strategy. \ No newline at end of file diff --git a/data/2024/aaai/Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation b/data/2024/aaai/Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation new file mode 100644 index 0000000000..18bbe9c66b --- /dev/null +++ b/data/2024/aaai/Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation @@ -0,0 +1,3 @@ +Homography estimation is a fundamental problem in computer vision. Previous works mainly focus on estimating either a single homography, or multiple homographies based on mesh grid division of the image. In practical scenarios, single homography is inadequate and often leads to a compromised result for multiple planes; while mesh grid multi-homography damages the plane distribution of the scene, and does not fully address the restriction to use homography. + +In this work, we propose a novel semantics guided multi-homography estimation framework, Mask-Homo, to provide an explicit solution to the multi-plane depth disparity problem. First, a pseudo plane mask generation module is designed to obtain multiple correlated regions that follow the plane distribution of the scene. Then, multiple local homography transformations, each of which aligns a correlated region precisely, are predicted and corresponding warped images are fused to obtain the final result. Furthermore, a new metric, Mask-PSNR, is proposed for more comprehensive evaluation of alignment. Extensive experiments are conducted to verify the effectiveness of the proposed method. Our code is available at https://github.com/SAITPublic/MaskHomo. 
\ No newline at end of file diff --git a/data/2024/aaai/MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation b/data/2024/aaai/MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation new file mode 100644 index 0000000000..a5242c3f28 --- /dev/null +++ b/data/2024/aaai/MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation @@ -0,0 +1 @@ +Few-shot instance segmentation extends the few-shot learning paradigm to the instance segmentation task, which tries to segment instance objects from a query image with a few annotated examples of novel categories. Conventional approaches have attempted to address the task via prototype learning, known as point estimation. However, this mechanism depends on prototypes (e.g. mean of K-shot) for prediction, leading to performance instability. To overcome the disadvantage of the point estimation mechanism, we propose a novel approach, dubbed MaskDiff, which models the underlying conditional distribution of a binary mask, which is conditioned on an object region and K-shot information. Inspired by augmentation approaches that perturb data with Gaussian noise for populating low data density regions, we model the mask distribution with a diffusion probabilistic model. We also propose to utilize classifier-free guided mask sampling to integrate category information into the binary mask generation process. Without bells and whistles, our proposed method consistently outperforms state-of-the-art methods on both base and novel classes of the COCO dataset while simultaneously being more stable than existing methods. The source code is available at: https://github.com/minhquanlecs/MaskDiff. \ No newline at end of file diff --git a/data/2024/aaai/Mastering Context-to-Label Representation Transformation for Event Causality Identification with Diffusion Models b/data/2024/aaai/Mastering Context-to-Label Representation Transformation for Event Causality Identification with Diffusion Models new file mode 100644 index 0000000000..1d7a788e82 --- /dev/null +++ b/data/2024/aaai/Mastering Context-to-Label Representation Transformation for Event Causality Identification with Diffusion Models @@ -0,0 +1 @@ +To understand event structures of documents, event causality identification (ECI) emerges as a crucial task, aiming to discern causal relationships among event mentions. The latest approach for ECI has introduced advanced deep learning models where transformer-based encoding models, complemented by enriching components, are typically leveraged to learn effective event context representations for causality prediction. As such, an important step for ECI models is to transform the event context representations into causal label representations to perform logits score computation for training and inference purposes. Within this framework, event context representations might encapsulate numerous complicated and noisy structures due to the potential long context between the input events while causal label representations are intended to capture pure information about the causal relations to facilitate score estimation. Nonetheless, a notable drawback of existing ECI models stems from their reliance on simple feed-forward networks to handle the complex context-to-label representation transformation process, which might require drastic changes in the representations to hinder the learning process. 
To overcome this issue, our work introduces a novel method for ECI where, instead of abrupt transformations, event context representations are gradually updated to achieve effective label representations. This process will be done incrementally to allow filtering of irrelevant structures at varying levels of granularity for causal relations. To realize this, we present a diffusion model to learn gradual representation transition processes between context and causal labels. It operates through a forward pass for causal label representation noising and a reverse pass for reconstructing label representations from random noise. Our experiments on different datasets across multiple languages demonstrate the advantages of the diffusion model with state-of-the-art performance for ECI. \ No newline at end of file diff --git a/data/2024/aaai/MatchDet: A Collaborative Framework for Image Matching and Object Detection b/data/2024/aaai/MatchDet: A Collaborative Framework for Image Matching and Object Detection new file mode 100644 index 0000000000..c5577434fe --- /dev/null +++ b/data/2024/aaai/MatchDet: A Collaborative Framework for Image Matching and Object Detection @@ -0,0 +1 @@ +Image matching and object detection are two fundamental and challenging tasks, while many related applications consider them as two individual tasks (i.e. task-individual). In this paper, a collaborative framework called MatchDet (i.e. task-collaborative) is proposed for image matching and object detection to obtain mutual improvements. To achieve the collaborative learning of the two tasks, we propose three novel modules, including a Weighted Spatial Attention Module (WSAM) for the Detector, and a Weighted Attention Module (WAM) and Box Filter for the Matcher. Specifically, the WSAM highlights the foreground regions of the target image to benefit the subsequent detector, the WAM enhances the connection between the foreground regions of the paired images to ensure high-quality matches, and the Box Filter mitigates the impact of false matches. We evaluate the approaches on a new benchmark with two datasets called Warp-COCO and miniScanNet. Experimental results show our approaches are effective and achieve competitive improvements. \ No newline at end of file diff --git a/data/2024/aaai/MathAttack: Attacking Large Language Models towards Math Solving Ability b/data/2024/aaai/MathAttack: Attacking Large Language Models towards Math Solving Ability new file mode 100644 index 0000000000..ce6db335b0 --- /dev/null +++ b/data/2024/aaai/MathAttack: Attacking Large Language Models towards Math Solving Ability @@ -0,0 +1 @@ +With the boom of Large Language Models (LLMs), research on solving Math Word Problems (MWPs) has recently made great progress. However, there are few studies examining the robustness of LLMs in math solving ability. Instead of attacking prompts in the use of LLMs, we propose MathAttack, a model that attacks MWP samples, which is closer to the essence of robustness in solving math problems. Compared to traditional text adversarial attacks, it is essential to preserve the mathematical logic of the original MWPs during the attack. To this end, we propose logical entity recognition to identify logical entries, which are then frozen. Subsequently, the remaining text is attacked by adopting a word-level attacker. Furthermore, we propose a new dataset, RobustMath, to evaluate the robustness of LLMs in math solving ability.
Extensive experiments on our RobustMath and two other math benchmark datasets, GSM8K and MultiArith, show that MathAttack can effectively attack the math solving ability of LLMs. In the experiments, we observe that (1) our adversarial samples from higher-accuracy LLMs are also effective for attacking LLMs with lower accuracy (e.g., transfer from larger to smaller-size LLMs, or from few-shot to zero-shot prompts); (2) complex MWPs (such as more solving steps, longer text, more numbers) are more vulnerable to attack; (3) we can improve the robustness of LLMs by using our adversarial samples in few-shot prompts. Finally, we hope our practice and observations can serve as an important attempt towards enhancing the robustness of LLMs in math solving ability. The code and dataset are available at: https://github.com/zhouzihao501/MathAttack. \ No newline at end of file diff --git a/data/2024/aaai/MaxEnt Loss: Calibrating Graph Neural Networks under Out-of-Distribution Shift (Student Abstract) b/data/2024/aaai/MaxEnt Loss: Calibrating Graph Neural Networks under Out-of-Distribution Shift (Student Abstract) new file mode 100644 index 0000000000..e31c01feb1 --- /dev/null +++ b/data/2024/aaai/MaxEnt Loss: Calibrating Graph Neural Networks under Out-of-Distribution Shift (Student Abstract) @@ -0,0 +1 @@ +We present a new, simple and effective loss function for calibrating graph neural networks (GNNs). Miscalibration is the problem whereby a model's probabilities do not reflect its correctness, making it difficult and possibly dangerous for real-world deployment. We compare our method against other baselines on a novel ID and OOD graph form of the Celeb-A faces dataset. Our findings show that our method improves calibration for GNNs, which are not immune to miscalibration in-distribution (ID) and out-of-distribution (OOD). Our code is available for review at https://github.com/dexterdley/CS6208/tree/main/Project. \ No newline at end of file diff --git a/data/2024/aaai/MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift b/data/2024/aaai/MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift new file mode 100644 index 0000000000..56670af44f --- /dev/null +++ b/data/2024/aaai/MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift @@ -0,0 +1 @@ +We present a new loss function that addresses the out-of-distribution (OOD) network calibration problem. While many objective functions have been proposed to effectively calibrate models in-distribution, our findings show that they do not always fare well OOD. Based on the Principle of Maximum Entropy, we incorporate helpful statistical constraints observed during training, delivering better model calibration without sacrificing accuracy. We provide theoretical analysis and show empirically that our method works well in practice, achieving state-of-the-art calibration on both synthetic and real-world benchmarks. Our code is available at https://github.com/dexterdley/MaxEnt-Loss. \ No newline at end of file diff --git a/data/2024/aaai/Maxileximin Envy Allocations and Connected Goods b/data/2024/aaai/Maxileximin Envy Allocations and Connected Goods new file mode 100644 index 0000000000..36a3360c71 --- /dev/null +++ b/data/2024/aaai/Maxileximin Envy Allocations and Connected Goods @@ -0,0 +1,3 @@ +Fair allocation of indivisible goods presents intriguing challenges from both a social choice perspective and an algorithmic standpoint.
Due to the indivisibility of goods, it is common for one agent to envy the bundle of goods assigned to another agent and, indeed, envy-free solutions do not exist in general. In line with the classical game-theoretic concept of Nucleolus in coalitional games, we propose that a fair allocation should minimize the agents’ dissatisfaction profile in a lexicographic manner, where the dissatisfaction of an agent is defined as her maximum envy towards other agents. Therefore, we seek allocations that minimize the maximum envy. In cases where multiple solutions have an equal maximum value, we minimize the second-worst value, and so on. Additionally, as is customary in fair division problems, we also consider an efficiency requirement: among the allocations with the best agents’ dissatisfaction profile, we prioritize those that maximize the sum of agents’ utilities, known as maximum social welfare. Such allocations, referred to as maxileximin allocations, always exist. +In this study, we analyze the computational properties of maxileximin allocations in the context of fair allocation problems with constraints. Specifically, we focus on the Connected Fair Division problem, where goods correspond to the nodes of a graph, and a bundle of goods is allowed if the subgraph formed by those goods is connected. We demonstrate that the problem is F∆P2 -complete, even for instances with simple graphical structures such as path and star graphs. +However, we identify islands of tractability for instances with more intricate graphs, such as those having bounded treewidth, provided that the number of agents is bounded by a fixed number and utility functions use small values. \ No newline at end of file diff --git a/data/2024/aaai/Maximizing the Success Probability of Policy Allocations in Online Systems b/data/2024/aaai/Maximizing the Success Probability of Policy Allocations in Online Systems new file mode 100644 index 0000000000..cc988e2a8a --- /dev/null +++ b/data/2024/aaai/Maximizing the Success Probability of Policy Allocations in Online Systems @@ -0,0 +1 @@ +The effectiveness of advertising in e-commerce largely depends on the ability of merchants to bid on and win impressions for their targeted users. The bidding procedure is highly complex due to various factors such as market competition, user behavior, and the diverse objectives of advertisers. In this paper we consider the problem at the level of user timelines instead of individual bid requests, manipulating full policies (i.e. pre-defined bidding strategies) and not bid values. In order to optimally allocate policies to users, typical multiple treatments allocation methods solve knapsack-like problems which aim at maximizing an expected value under constraints. In the specific context of online advertising, we argue that optimizing for the probability of success is a more suited objective than expected value maximization, and we introduce the SuccessProbaMax algorithm that aims at finding the policy allocation which is the most likely to outperform a fixed reference policy. Finally, we conduct comprehensive experiments both on synthetic and real-world data to evaluate its performance. The results demonstrate that our proposed algorithm outperforms conventional expected-value maximization algorithms in terms of success rate. 
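To make the contrast between the two objectives concrete, here is a small Python sketch (a toy illustration under assumed Gaussian outcomes, not the paper's SuccessProbaMax algorithm): with noisy per-user outcomes, the candidate policy with the highest expected value is not necessarily the one most likely to beat a fixed reference policy.

import random

random.seed(0)
reference = [random.gauss(1.0, 0.2) for _ in range(10_000)]  # outcomes under the reference policy

candidates = {
    "risky":  [random.gauss(1.10, 2.0) for _ in range(10_000)],  # higher mean, high variance
    "steady": [random.gauss(1.05, 0.1) for _ in range(10_000)],  # lower mean, low variance
}

for name, outcomes in candidates.items():
    expected_value = sum(outcomes) / len(outcomes)
    success_prob = sum(o > r for o, r in zip(outcomes, reference)) / len(outcomes)
    print(f"{name}: expected value = {expected_value:.3f}, P(beat reference) = {success_prob:.3f}")
# With this setup, "risky" has the higher expected value while "steady" is more likely to outperform the reference.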
\ No newline at end of file diff --git a/data/2024/aaai/MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance b/data/2024/aaai/MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance new file mode 100644 index 0000000000..8ab881c030 --- /dev/null +++ b/data/2024/aaai/MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance @@ -0,0 +1 @@ +This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observation-space scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning or test-time optimization of the Diffusion Models. Through extensive qualitative, quantitative, and subjective experiments on various benchmarks, the study demonstrates the effectiveness and superiority of the proposed approach. Our project page can be found at https://medm2023.github.io \ No newline at end of file diff --git a/data/2024/aaai/Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework b/data/2024/aaai/Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework new file mode 100644 index 0000000000..9947886204 --- /dev/null +++ b/data/2024/aaai/Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework @@ -0,0 +1 @@ +Unsupervised domain adaptation object detection (UDAOD) research on the Detection Transformer (DETR) mainly focuses on feature alignment, and existing methods can be divided into two kinds, each of which has its unresolved issues. One-stage feature alignment methods can easily lead to performance fluctuation and training stagnation. Two-stage feature alignment methods based on mean teacher comprise a pretraining stage followed by a self-training stage, each facing problems in obtaining a reliable pretrained model and achieving consistent performance gains. The methods mentioned above have not yet explored how to utilize a third related domain, such as a target-like domain, to assist adaptation. To address these issues, we propose a two-stage framework named MTM, i.e. Mean Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we utilize labeled target-like images produced by image style transfer to avoid performance fluctuation. In the self-training stage, we leverage unlabeled target images via pseudo labels based on the mean teacher and propose a module called Object Queries Knowledge Transfer (OQKT) to ensure consistent performance gains of the student model.
Most importantly, we propose masked feature alignment methods, including Masked Domain Query-based Feature Alignment (MDQFA) and Masked Token-wise Feature Alignment (MTWFA), to alleviate domain shift in a more robust way, which not only prevent training stagnation and lead to a robust pretrained model in the pretraining stage, but also enhance the model's target performance in the self-training stage. Experiments on three challenging scenarios and a theoretical analysis verify the effectiveness of MTM. \ No newline at end of file diff --git a/data/2024/aaai/Measuring Self-Supervised Representation Quality for Downstream Classification Using Discriminative Features b/data/2024/aaai/Measuring Self-Supervised Representation Quality for Downstream Classification Using Discriminative Features new file mode 100644 index 0000000000..a5ac41477b --- /dev/null +++ b/data/2024/aaai/Measuring Self-Supervised Representation Quality for Downstream Classification Using Discriminative Features @@ -0,0 +1 @@ +Self-supervised learning (SSL) has shown impressive results in downstream classification tasks. However, there is limited work in understanding their failure modes and interpreting their learned representations. In this paper, we study the representation space of state-of-the-art self-supervised models including SimCLR, SwaV, MoCo, BYOL, DINO, SimSiam, VICReg and Barlow Twins. Without the use of class label information, we discover discriminative features that correspond to unique physical attributes in images, present mostly in correctly-classified representations. Using these features, we can compress the representation space by up to 40% without significantly affecting linear classification performance. We then propose the Self-Supervised Representation Quality Score (or Q-Score), an unsupervised score that can reliably predict if a given sample is likely to be mis-classified during linear evaluation, achieving an AUPRC of 91.45 on ImageNet-100 and 78.78 on ImageNet-1K. Q-Score can also be used as a regularization term on pre-trained encoders to remedy low-quality representations. Fine-tuning with Q-Score regularization can boost the linear probing accuracy of SSL models by up to 5.8% on ImageNet-100 and 3.7% on ImageNet-1K compared to their baselines. Finally, using gradient heatmaps and Salient ImageNet masks, we define a metric to quantify the interpretability of each representation. We show that discriminative features are strongly correlated to core attributes, and enhancing these features through Q-Score regularization makes SSL representations more interpretable.
In light of this, we introduce the measure task consistency to quantify the similarity between graph pre-training and downstream tasks. This measure assesses the extent to which downstream tasks can benefit from specific pre-training tasks. Moreover, a novel fine-tuning strategy, Bridge-Tune, is proposed to further diminish the impact of the difference between pre-training and downstream tasks. The key innovation in Bridge-Tune is an intermediate step that bridges pre-training and downstream tasks. This step takes into account the task differences and further refines the pre-trained model. The superiority of the presented fine-tuning strategy is validated via numerous experiments with different pre-trained models and downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records b/data/2024/aaai/MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records new file mode 100644 index 0000000000..c73f404e50 --- /dev/null +++ b/data/2024/aaai/MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records @@ -0,0 +1 @@ +The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences. \ No newline at end of file diff --git a/data/2024/aaai/MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models b/data/2024/aaai/MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models new file mode 100644 index 0000000000..53b3673e7d --- /dev/null +++ b/data/2024/aaai/MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models @@ -0,0 +1 @@ +The emergence of various medical large language models (LLMs) in the medical domain has highlighted the need for unified evaluation standards, as manual evaluation of LLMs proves to be time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain, comprising 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. 
In particular, this benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experiences of doctors in Mainland China, thereby establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities in medical language learning models. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. These findings elucidate both the capabilities and limitations of LLMs within the context of MedBench, with the ultimate goal of aiding the medical research community. \ No newline at end of file diff --git a/data/2024/aaai/MedSegDiff-V2: Diffusion-Based Medical Image Segmentation with Transformer b/data/2024/aaai/MedSegDiff-V2: Diffusion-Based Medical Image Segmentation with Transformer new file mode 100644 index 0000000000..91b654927a --- /dev/null +++ b/data/2024/aaai/MedSegDiff-V2: Diffusion-Based Medical Image Segmentation with Transformer @@ -0,0 +1 @@ +The Diffusion Probabilistic Model (DPM) has recently gained popularity in the field of computer vision, thanks to its image generation applications, such as Imagen, Latent Diffusion Models, and Stable Diffusion, which have demonstrated impressive capabilities and sparked much discussion within the community. Recent investigations have further unveiled the utility of DPM in the domain of medical image analysis, as underscored by the commendable performance exhibited by medical image segmentation models across various tasks. Although these models were originally underpinned by a UNet architecture, there exists a potential avenue for enhancing their performance through the integration of vision transformer mechanisms. However, we discovered that simply combining these two models resulted in subpar performance. To effectively integrate these two cutting-edge techniques for medical image segmentation, we propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2. We verify its effectiveness on 20 medical image segmentation tasks with different image modalities. Through comprehensive evaluation, our approach demonstrates superiority over prior state-of-the-art (SOTA) methodologies. Code is released at https://github.com/KidsWithTokens/MedSegDiff. \ No newline at end of file diff --git a/data/2024/aaai/Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games b/data/2024/aaai/Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games new file mode 100644 index 0000000000..efb9cf45d8 --- /dev/null +++ b/data/2024/aaai/Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games @@ -0,0 +1 @@ +Learning in games considers how multiple agents maximize their own rewards through repeated games.
Memory, the ability of an agent to change its action depending on the history of actions in previous games, is often introduced into learning to explore more clever strategies and discuss the decision-making of real agents like humans. However, such games with memory are hard to analyze because they exhibit complex phenomena like chaotic dynamics or divergence from Nash equilibrium. In particular, how asymmetry in memory capacities between agents affects learning in games is still unclear. In response, this study formulates a gradient ascent algorithm in games with asymmetric memory capacities. To obtain theoretical insights into learning dynamics, we first consider a simple case of zero-sum games. We observe complex behavior, where learning dynamics draw a heteroclinic connection from unstable fixed points to stable ones. Despite this complexity, we analyze the learning dynamics and prove local convergence to these stable fixed points, i.e., the Nash equilibria. We identify the mechanism driving this convergence: an agent with a longer memory learns to exploit the other, which in turn endows the other's utility function with strict concavity. We further numerically observe such convergence for various initial strategies, numbers of actions, and memory lengths. This study reveals a novel phenomenon due to memory asymmetry, providing fundamental strides in learning in games and new insights into computing equilibria. \ No newline at end of file diff --git a/data/2024/aaai/Memory-Augmenting Decoder-Only Language Models through Encoders (Student Abstract) b/data/2024/aaai/Memory-Augmenting Decoder-Only Language Models through Encoders (Student Abstract) new file mode 100644 index 0000000000..9009c7df45 --- /dev/null +++ b/data/2024/aaai/Memory-Augmenting Decoder-Only Language Models through Encoders (Student Abstract) @@ -0,0 +1 @@ +The Transformer architecture has attracted a lot of attention in recent years, thanks in part to its ability to scale well and allow massive parallelism during training. This has made possible the development of Language Models (LMs) of increasing size and the discovery of latent abilities that completely outclass traditional methods, e.g. rule-based systems. However, these models also introduced new issues, like their inability to retain the history of previous interactions due to their stateless nature or the difficulty in controlling their generation. Different attempts have been made to address these issues; e.g., a `brute force' approach to solving the memory issue is to include the full conversation history in the context window, a solution that is limited by the quadratic complexity of Transformers. In this work, we explore computationally practical solutions to the memory problem. We propose to augment the decoder-only architecture of (most) Large LMs with a (relatively small) memory encoder. Its output is prepended to the decoder's input in a similar fashion to recent works on Adapters and the original Transformer architecture. Initial experiments show promising results; however, future work is needed to compare with State-of-the-Art methods.
\ No newline at end of file diff --git a/data/2024/aaai/Memory-Efficient Prompt Tuning for Incremental Histopathology Classification b/data/2024/aaai/Memory-Efficient Prompt Tuning for Incremental Histopathology Classification new file mode 100644 index 0000000000..98819f1d4f --- /dev/null +++ b/data/2024/aaai/Memory-Efficient Prompt Tuning for Incremental Histopathology Classification @@ -0,0 +1 @@ +Recent studies have made remarkable progress in histopathology classification. Based on current successes, contemporary works have proposed to further upgrade the model towards a more generalizable and robust direction by incrementally learning from sequentially delivered domains. Unlike previous parameter isolation based approaches that usually demand massive computation resources during model updating, we present a memory-efficient prompt tuning framework to cultivate model generalization potential at economical memory cost. For each incoming domain, we reuse the existing parameters of the initial classification model and attach lightweight trainable prompts into it for customized tuning. Considering the domain heterogeneity, we perform decoupled prompt tuning, where we adopt a domain-specific prompt for each domain to independently investigate its distinctive characteristics, and one domain-invariant prompt shared across all domains to continually explore the common content embedding throughout time. All domain-specific prompts will be appended to the prompt bank and isolated from further changes to prevent forgetting the distinctive features of early-seen domains. Meanwhile, the domain-invariant prompt will be passed on and iteratively evolved via style-augmented prompt refining to improve model generalization capability over time. Specifically, we construct a graph with existing prompts and build a style-augmented graph attention network to guide the domain-invariant prompt in exploring the overlapped latent embedding among all delivered domains for more domain-generic representations. We have extensively evaluated our framework with two histopathology tasks, i.e., breast cancer metastasis classification and epithelium-stroma tissue classification, where our approach yielded superior performance and memory efficiency over the competing methods. \ No newline at end of file diff --git a/data/2024/aaai/Memory-Efficient Reversible Spiking Neural Networks b/data/2024/aaai/Memory-Efficient Reversible Spiking Neural Networks new file mode 100644 index 0000000000..675e9d53b6 --- /dev/null +++ b/data/2024/aaai/Memory-Efficient Reversible Spiking Neural Networks @@ -0,0 +1 @@ +Spiking neural networks (SNNs) are potential competitors to artificial neural networks (ANNs) due to their high energy-efficiency on neuromorphic hardware. However, SNNs are unfolded over simulation time steps during the training process. Thus, SNNs require much more memory than ANNs, which impedes the training of deeper SNN models. In this paper, we propose the reversible spiking neural network to reduce the memory cost of intermediate activations and membrane potentials during training. Firstly, we extend the reversible architecture along the temporal dimension and propose the reversible spiking block, which can reconstruct the computational graph and recompute all intermediate variables in the forward pass with a reverse process. On this basis, we adapt state-of-the-art SNN models into their reversible variants, namely the reversible spiking ResNet (RevSResNet) and the reversible spiking transformer (RevSFormer).
Through experiments on static and neuromorphic datasets, we demonstrate that the memory cost per image of our reversible SNNs does not increase with the network depth. On the CIFAR10 and CIFAR100 datasets, our RevSResNet37 and RevSFormer-4-384 achieve comparable accuracies and consume 3.79x and 3.00x less GPU memory per image than their counterparts with roughly identical model complexity and parameters. We believe that this work can lift the memory constraints in SNN training and pave the way for training extremely large and deep SNNs. \ No newline at end of file diff --git a/data/2024/aaai/MemoryBank: Enhancing Large Language Models with Long-Term Memory b/data/2024/aaai/MemoryBank: Enhancing Large Language Models with Long-Term Memory new file mode 100644 index 0000000000..5fb68d0508 --- /dev/null +++ b/data/2024/aaai/MemoryBank: Enhancing Large Language Models with Long-Term Memory @@ -0,0 +1 @@ +Large Language Models (LLMs) have drastically reshaped our interactions with artificial intelligence (AI) systems, showcasing impressive performance across an extensive array of tasks. Despite this, a notable hindrance remains: the deficiency of a long-term memory mechanism within these models. This shortfall becomes increasingly evident in situations demanding sustained interaction, such as personal companion systems, psychological counseling, and secretarial assistance. Recognizing the necessity for long-term memory, we propose MemoryBank, a novel memory mechanism tailored for LLMs. MemoryBank enables the models to summon relevant memories, continually evolve through continuous memory updates, and comprehend and adapt to a user's personality over time by synthesizing information from previous interactions. To mimic anthropomorphic behaviors and selectively preserve memory, MemoryBank incorporates a memory updating mechanism, inspired by the Ebbinghaus Forgetting Curve theory. This mechanism permits the AI to forget and reinforce memory based on time elapsed and the relative significance of the memory, thereby offering a more human-like memory mechanism and an enriched user experience. MemoryBank is versatile in accommodating both closed-source models like ChatGPT and open-source models such as ChatGLM. To validate MemoryBank's effectiveness, we exemplify its application through the creation of an LLM-based chatbot named SiliconFriend in a long-term AI Companion scenario. Further tuned with psychological dialog data, SiliconFriend displays heightened empathy and discernment in its interactions. Our experiments involve both qualitative analysis with real-world user dialogs and quantitative analysis with simulated dialogs. In the latter, ChatGPT acts as multiple users with diverse characteristics and generates long-term dialog contexts covering a wide array of topics. The results of our analysis reveal that SiliconFriend, equipped with MemoryBank, exhibits a strong capability for long-term companionship as it can provide empathetic responses, recall relevant memories, and understand user personality.
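As a side note on the Ebbinghaus-style updating that the MemoryBank abstract references, the Python sketch below is a minimal, hypothetical illustration of the underlying forgetting-curve idea (retention R = exp(-t / S)); it is not the paper's actual update rule, and the names and threshold are assumptions for illustration only.

import math

def retention(hours_elapsed: float, strength: float) -> float:
    # Ebbinghaus forgetting curve: R = exp(-t / S), where S is the memory strength.
    return math.exp(-hours_elapsed / strength)

def update_memory(memory: dict, hours_elapsed: float, recalled: bool) -> dict:
    # Reinforce a memory when it is recalled; otherwise let it decay and
    # drop it once retention falls below a (hypothetical) threshold.
    if recalled:
        memory["strength"] *= 2.0
        memory["retention"] = 1.0
    else:
        memory["retention"] = retention(hours_elapsed, memory["strength"])
    memory["keep"] = memory["retention"] >= 0.05
    return memory

memory = {"text": "User enjoys hiking on weekends", "strength": 24.0, "retention": 1.0}
print(update_memory(memory, hours_elapsed=72.0, recalled=False))  # retention ~= 0.0498 < 0.05, so keep is False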
\ No newline at end of file diff --git a/data/2024/aaai/Merging AI Incidents Research with Political Misinformation Research: Introducing the Political Deepfakes Incidents Database b/data/2024/aaai/Merging AI Incidents Research with Political Misinformation Research: Introducing the Political Deepfakes Incidents Database new file mode 100644 index 0000000000..8a6de66c06 --- /dev/null +++ b/data/2024/aaai/Merging AI Incidents Research with Political Misinformation Research: Introducing the Political Deepfakes Incidents Database @@ -0,0 +1 @@ +This article presents the Political Deepfakes Incidents Database (PDID), a collection of politically-salient deepfakes, encompassing synthetically-created videos, images, and less-sophisticated `cheapfakes.' The project is driven by the rise of generative AI in politics, ongoing policy efforts to address harms, and the need to connect AI incidents and political communication research. The database contains political deepfake content, metadata, and researcher-coded descriptors drawn from political science, public policy, communication, and misinformation studies. It aims to help reveal the prevalence, trends, and impact of political deepfakes, such as those featuring major political figures or events. The PDID can benefit policymakers, researchers, journalists, fact-checkers, and the public by providing insights into deepfake usage, aiding in regulation, enabling in-depth analyses, supporting fact-checking and trust-building efforts, and raising awareness of political deepfakes. It is suitable for research and application on media effects, political discourse, AI ethics, technology governance, media literacy, and countermeasures. \ No newline at end of file diff --git a/data/2024/aaai/Meta-Crafting: Improved Detection of Out-of-Distributed Texts via Crafting Metadata Space (Student Abstract) b/data/2024/aaai/Meta-Crafting: Improved Detection of Out-of-Distributed Texts via Crafting Metadata Space (Student Abstract) new file mode 100644 index 0000000000..0853fa1c0d --- /dev/null +++ b/data/2024/aaai/Meta-Crafting: Improved Detection of Out-of-Distributed Texts via Crafting Metadata Space (Student Abstract) @@ -0,0 +1 @@ +Detecting out-of-distribution (OOD) samples is crucial for robust NLP models. Recent works observe two OOD types: background shifts (style change) and semantic shifts (content change), but existing detection methods vary in effectiveness for each type. To this end, we propose Meta-Crafting, a unified OOD detection method by constructing a new discriminative feature space utilizing 7 model-driven metadata chosen empirically that well detects both types of shifts. Our experimental results demonstrate state-of-the-art robustness to both shifts and significantly improved detection on stress datasets. \ No newline at end of file diff --git a/data/2024/aaai/Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables b/data/2024/aaai/Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables new file mode 100644 index 0000000000..2cb5994be3 --- /dev/null +++ b/data/2024/aaai/Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables @@ -0,0 +1 @@ +Designing suitable reward functions for numerous interacting intelligent agents is challenging in real-world applications. Inverse reinforcement learning (IRL) in mean field games (MFGs) offers a practical framework to infer reward functions from expert demonstrations. 
While promising, the assumption of agent homogeneity limits the capability of existing methods to handle demonstrations with heterogeneous and unknown objectives, which are common in practice. To this end, we propose a deep latent variable MFG model and an associated IRL method. Critically, our method can infer rewards from different yet structurally similar tasks without prior knowledge about underlying contexts or modifying the MFG model itself. Our experiments, conducted on simulated scenarios and a real-world spatial taxi-ride pricing problem, demonstrate the superiority of our approach over state-of-the-art IRL methods in MFGs. \ No newline at end of file diff --git a/data/2024/aaai/Meta-Learning-Based Adaptive Stability Certificates for Dynamical Systems b/data/2024/aaai/Meta-Learning-Based Adaptive Stability Certificates for Dynamical Systems new file mode 100644 index 0000000000..d037e496af --- /dev/null +++ b/data/2024/aaai/Meta-Learning-Based Adaptive Stability Certificates for Dynamical Systems @@ -0,0 +1 @@ +This paper addresses the problem of Neural Network (NN) based adaptive stability certification in a dynamical system. The state-of-the-art methods, such as Neural Lyapunov Functions (NLFs), use NN-based formulations to assess the stability of a non-linear dynamical system and compute a Region of Attraction (ROA) in the state space. However, under parametric uncertainty, if the values of system parameters vary over time, the NLF methods fail to adapt to such changes and may lead to conservative stability assessment performance. We circumvent this issue by integrating Model Agnostic Meta-learning (MAML) with NLFs and propose meta-NLFs. In this process, we train a meta-function that adapts to any parametric shifts and updates into an NLF for the system with new test-time parameter values. We demonstrate the stability assessment performance of meta-NLFs on some standard benchmark autonomous dynamical systems. \ No newline at end of file diff --git a/data/2024/aaai/Meta-Reinforcement Learning via Exploratory Task Clustering b/data/2024/aaai/Meta-Reinforcement Learning via Exploratory Task Clustering new file mode 100644 index 0000000000..4033bb77e5 --- /dev/null +++ b/data/2024/aaai/Meta-Reinforcement Learning via Exploratory Task Clustering @@ -0,0 +1 @@ +Meta-reinforcement learning (meta-RL) aims to quickly solve new RL tasks by leveraging knowledge from prior tasks. Previous studies often assume a single-mode homogeneous task distribution, ignoring possible structured heterogeneity among tasks. Such an oversight can hamper effective exploration and adaptation, especially with limited samples. In this work, we harness the structured heterogeneity among tasks via clustering to improve meta-RL, which facilitates knowledge sharing at the cluster level. To facilitate exploration, we also develop a dedicated cluster-level exploratory policy to discover task clusters via divide-and-conquer. The knowledge from the discovered clusters helps to narrow the search space of task-specific policy learning, leading to more sample-efficient policy adaptation. We evaluate the proposed method on environments with parametric clusters (e.g., rewards and state dynamics in the MuJoCo suite) and non-parametric clusters (e.g., control skills in the Meta-World suite). The results demonstrate strong advantages of our solution against a set of representative meta-RL methods. 
\ No newline at end of file diff --git a/data/2024/aaai/MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components b/data/2024/aaai/MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components new file mode 100644 index 0000000000..dcbc9ce15e --- /dev/null +++ b/data/2024/aaai/MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components @@ -0,0 +1 @@ +Meta-Reinforcement Learning (Meta-RL) aims to reveal shared characteristics in dynamics and reward functions across diverse training tasks. This objective is achieved by meta-learning a policy that is conditioned on task representations with encoded trajectory data or context, thus allowing rapid adaptation to new tasks from a known task distribution. However, since the trajectory data generated by the policy may be biased, the task inference module tends to form spurious correlations between trajectory data and specific tasks, thereby leading to poor adaptation to new tasks. To address this issue, we propose Meta-RL with task unCertAinty feedback through decoupled context-aware Reward and Dynamics components (MetaCARD). MetaCARD distinctly decouples the dynamics and rewards when inferring tasks and integrates task uncertainty feedback from policy evaluation into the task inference module. This design effectively reduces uncertainty in tasks with changes in dynamics and/or reward functions, thereby enabling accurate task identification and adaptation. The experimental results on both Meta-World and classical MuJoCo benchmarks show that MetaCARD significantly outperforms prevailing Meta-RL baselines, demonstrating its remarkable adaptation ability in sophisticated environments that involve changes in both reward functions and dynamics. \ No newline at end of file diff --git a/data/2024/aaai/MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning b/data/2024/aaai/MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning new file mode 100644 index 0000000000..eff4d882bf --- /dev/null +++ b/data/2024/aaai/MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning @@ -0,0 +1 @@ +Equipping a deep model with the ability of few-shot learning (FSL) is a core challenge for artificial intelligence. Gradient-based meta-learning effectively addresses the challenge by learning how to learn novel tasks. Its key idea is learning a deep model in a bi-level optimization manner, where the outer-loop process learns a shared gradient descent algorithm (called the meta-optimizer), while the inner-loop process leverages it to optimize a task-specific base learner with few examples. Although these methods have shown superior performance on FSL, the outer-loop process requires calculating second-order derivatives along the inner-loop path, which imposes considerable memory burdens and the risk of vanishing gradients. This degrades meta-learning performance. Inspired by recent diffusion models, we find that the inner-loop gradient descent process can be viewed as a reverse process (i.e., denoising) of diffusion where the target of denoising is the weight of the base learner rather than the original data.
Based on this fact, we propose to model the gradient descent algorithm as a diffusion model and then present a novel conditional diffusion-based meta-learning, called MetaDiff, that effectively models the optimization process of base learner weights from Gaussian initialization to target weights in a denoising manner. Thanks to the training efficiency of diffusion models, our MetaDiff does not need to differentiate through the inner-loop path, such that the memory burdens and the risk of vanishing gradients can be effectively alleviated for improving FSL. Experimental results show that our MetaDiff outperforms state-of-the-art gradient-based meta-learning methods on FSL tasks. \ No newline at end of file diff --git a/data/2024/aaai/MetaMix: Meta-State Precision Searcher for Mixed-Precision Activation Quantization b/data/2024/aaai/MetaMix: Meta-State Precision Searcher for Mixed-Precision Activation Quantization new file mode 100644 index 0000000000..1b93d61835 --- /dev/null +++ b/data/2024/aaai/MetaMix: Meta-State Precision Searcher for Mixed-Precision Activation Quantization @@ -0,0 +1 @@ +Mixed-precision quantization of efficient networks often suffers from activation instability encountered in the exploration of bit selections. To address this problem, we propose a novel method called MetaMix, which consists of bit selection and weight training phases. The bit selection phase iterates two steps, (1) the mixed-precision-aware weight update, and (2) the bit-search training with the fixed mixed-precision-aware weights, both of which combined reduce activation instability in mixed-precision quantization and contribute to fast and high-quality bit selection. The weight training phase exploits the weights and step sizes trained in the bit selection phase and fine-tunes them, thereby offering fast training. Our experiments with efficient and hard-to-quantize networks, i.e., MobileNet v2 and v3, and ResNet-18 on ImageNet show that our proposed method pushes the boundary of mixed-precision quantization, in terms of accuracy vs. operations, by outperforming both mixed- and single-precision SOTA methods. \ No newline at end of file diff --git a/data/2024/aaai/MetaRLEC: Meta-Reinforcement Learning for Discovery of Brain Effective Connectivity b/data/2024/aaai/MetaRLEC: Meta-Reinforcement Learning for Discovery of Brain Effective Connectivity new file mode 100644 index 0000000000..61bf46d0e3 --- /dev/null +++ b/data/2024/aaai/MetaRLEC: Meta-Reinforcement Learning for Discovery of Brain Effective Connectivity @@ -0,0 +1 @@ +In recent years, the discovery of brain effective connectivity (EC) networks through computational analysis of functional magnetic resonance imaging (fMRI) data has gained prominence in neuroscience and neuroimaging. However, owing to the influence of diverse factors during data collection and processing, fMRI data typically exhibits high noise and limited sample characteristics, consequently leading to suboptimal performance of current methods. In this paper, we propose a novel brain effective connectivity discovery method based on meta-reinforcement learning, called MetaRLEC. The method mainly consists of three modules: actor, critic, and meta-critic. MetaRLEC first employs an encoder-decoder framework: the encoder, utilizing a Transformer, converts noisy fMRI data into a state embedding; the decoder, employing a bidirectional LSTM, discovers brain region dependencies from the state and generates actions (EC networks).
Then a critic network evaluates these actions, incentivizing the actor to learn higher-reward actions amidst the high-noise setting. Finally, a meta-critic framework facilitates online learning of historical state-action pairs, integrating an action-value neural network and supplementary training losses to enhance the model's adaptability to small-sample fMRI data. We conduct comprehensive experiments on both simulated and real-world data to demonstrate the efficacy of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation b/data/2024/aaai/Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation new file mode 100644 index 0000000000..88f810f176 --- /dev/null +++ b/data/2024/aaai/Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation @@ -0,0 +1 @@ +Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. However, existing works primarily focus on achieving precise lip synchronization while neglecting to model the subject-specific speaking style, often resulting in unrealistic facial animations. To the best of our knowledge, this work makes the first attempt to explore the coupled information between the speaking style and the semantic content in facial motions. Specifically, we introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding and leads to a more realistic synthesis of speech-driven facial animations. Subsequently, we propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions by building two latent spaces for style and content, respectively. Moreover, to facilitate disentangled representation learning, we introduce four well-designed constraints: an auxiliary style classifier, an auxiliary inverse classifier, a content contrastive loss, and a pair of latent cycle losses, which can effectively contribute to the construction of the identity-related style space and semantic-related content space. Extensive qualitative and quantitative experiments conducted on three publicly available datasets demonstrate that our approach outperforms state-of-the-art methods and is capable of capturing diverse speaking styles for speech-driven 3D facial animation. The source code and supplementary video are publicly available at: https://zeqing-wang.github.io/Mimic/ \ No newline at end of file diff --git a/data/2024/aaai/Mimicking the Maestro: Exploring the Efficacy of a Virtual AI Teacher in Fine Motor Skill Acquisition b/data/2024/aaai/Mimicking the Maestro: Exploring the Efficacy of a Virtual AI Teacher in Fine Motor Skill Acquisition new file mode 100644 index 0000000000..0d1ea94b72 --- /dev/null +++ b/data/2024/aaai/Mimicking the Maestro: Exploring the Efficacy of a Virtual AI Teacher in Fine Motor Skill Acquisition @@ -0,0 +1 @@ +Motor skills, especially fine motor skills like handwriting, play an essential role in academic pursuits and everyday life. Traditional methods to teach these skills, although effective, can be time-consuming and inconsistent. With the rise of advanced technologies like robotics and artificial intelligence, there is increasing interest in automating such teaching processes. 
In this study, we examine the potential of a virtual AI teacher in emulating the techniques of human educators for motor skill acquisition. We introduce an AI teacher model that captures the distinct characteristics of human instructors. Using a reinforcement learning environment tailored to mimic teacher-learner interactions, we tested our AI model against four guiding hypotheses, emphasizing improved learner performance, enhanced rate of skill acquisition, and reduced variability in learning outcomes. Our findings, validated on synthetic learners, revealed significant improvements across all tested hypotheses. Notably, our model showcased robustness across different learners and settings and demonstrated adaptability to handwriting. This research underscores the potential of integrating Imitation and Reinforcement Learning models with robotics in revolutionizing the teaching of critical motor skills. \ No newline at end of file diff --git a/data/2024/aaai/MindMap: Constructing Evidence Chains for Multi-Step Reasoning in Large Language Models b/data/2024/aaai/MindMap: Constructing Evidence Chains for Multi-Step Reasoning in Large Language Models new file mode 100644 index 0000000000..1cc0100ad7 --- /dev/null +++ b/data/2024/aaai/MindMap: Constructing Evidence Chains for Multi-Step Reasoning in Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, they still face significant challenges in automated reasoning, particularly in scenarios involving multi-step reasoning. In this paper, we focus on the logical reasoning problem. The main task is to answer a question based on a set of available facts and rules. Much prior work has focused on guiding LLMs to think logically by generating reasoning paths, ignoring the structure among the available facts. In this paper, we propose a simple approach, MindMap, which supports reasoning by introducing evidence chains. An evidence chain refers to a set of facts that involve the same subject. In this way, we can organize related facts together to avoid missing important information. MindMap can be integrated with existing reasoning frameworks, such as Chain-of-Thought (CoT) and Selection-Inference (SI), by letting the model select relevant evidence chains instead of independent facts. The experimental results on the bAbI and ProofWriterOWA datasets demonstrate the effectiveness of MindMap. It can significantly improve CoT and SI, especially in multi-step reasoning tasks. \ No newline at end of file diff --git a/data/2024/aaai/MineObserver 2.0: A Deep Learning & In-Game Framework for Assessing Natural Language Descriptions of Minecraft Imagery b/data/2024/aaai/MineObserver 2.0: A Deep Learning & In-Game Framework for Assessing Natural Language Descriptions of Minecraft Imagery new file mode 100644 index 0000000000..d64e0396fb --- /dev/null +++ b/data/2024/aaai/MineObserver 2.0: A Deep Learning & In-Game Framework for Assessing Natural Language Descriptions of Minecraft Imagery @@ -0,0 +1 @@ +MineObserver 2.0 is an AI framework that uses Computer Vision and Natural Language Processing for assessing the accuracy of learner-generated descriptions of Minecraft images that include some scientifically relevant content. The system automatically assesses the accuracy of participant observations, written in natural language, made during science learning activities that take place in Minecraft.
We demonstrate our system working in real time and describe a teacher dashboard to showcase observations, both of which advance our previous work. We present the results of a study showing that MineObserver 2.0 improves over its predecessor both in the perceived accuracy of the system's generated descriptions and in the usefulness of the system's feedback. In future work, we intend to improve system-generated descriptions to give teachers more control and to shift the system toward continuous learning so that it can respond more rapidly to novel observations made by learners. \ No newline at end of file diff --git a/data/2024/aaai/Minibatch Stochastic Three Points Method for Unconstrained Smooth Minimization b/data/2024/aaai/Minibatch Stochastic Three Points Method for Unconstrained Smooth Minimization new file mode 100644 index 0000000000..4fcdf84d1c --- /dev/null +++ b/data/2024/aaai/Minibatch Stochastic Three Points Method for Unconstrained Smooth Minimization @@ -0,0 +1 @@ +We present a new zero-order optimization method called Minibatch Stochastic Three Points (MiSTP), specifically designed to solve stochastic unconstrained minimization problems when only an approximate evaluation of the objective function is possible. MiSTP is an extension of the Stochastic Three Point Method (STP). The key innovation of MiSTP is that it selects the next point solely based on the objective function approximation, without relying on its exact evaluation. At each iteration, MiSTP generates a random search direction and compares the approximations of the objective function at the current point, at the point along the randomly generated direction, and at the point along its opposite. The best of these three points is chosen as the next iterate. We analyze the worst-case complexity of MiSTP in the convex and non-convex cases and demonstrate that it matches the most accurate complexity bounds known in the literature for zero-order optimization methods. We perform extensive numerical evaluations to assess the computational efficiency of MiSTP and compare its performance to other state-of-the-art methods by testing it on several machine learning tasks. The results show that MiSTP outperforms or matches state-of-the-art methods, indicating its potential for a wide range of practical applications. \ No newline at end of file diff --git a/data/2024/aaai/Minimal Macro-Based Rewritings of Formal Languages: Theory and Applications in Ontology Engineering (and Beyond) b/data/2024/aaai/Minimal Macro-Based Rewritings of Formal Languages: Theory and Applications in Ontology Engineering (and Beyond) new file mode 100644 index 0000000000..4dbd332a24 --- /dev/null +++ b/data/2024/aaai/Minimal Macro-Based Rewritings of Formal Languages: Theory and Applications in Ontology Engineering (and Beyond) @@ -0,0 +1 @@ +In this paper, we introduce the problem of rewriting finite formal languages using syntactic macros such that the rewriting is minimal in size. We present polynomial-time algorithms to solve variants of this problem and show their correctness. To demonstrate the practical relevance of the proposed problems and the feasibility and effectiveness of our algorithms in practice, we apply these to biomedical ontologies authored in OWL. We find that such rewritings can significantly reduce the size of ontologies by capturing repeated expressions with macros.
This approach not only offers valuable assistance in enhancing ontology quality and comprehension but can also be seen as a general methodology for evaluating features of rewriting systems (including syntactic macros, templates, or other forms of rewriting rules), which can be analyzed in terms of their influence on computational problems. \ No newline at end of file diff --git a/data/2024/aaai/Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents b/data/2024/aaai/Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents new file mode 100644 index 0000000000..c30da6d043 --- /dev/null +++ b/data/2024/aaai/Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents @@ -0,0 +1 @@ +Robustly cooperating with unseen agents and human partners presents significant challenges due to the diverse cooperative conventions these partners may adopt. Existing Ad Hoc Teamwork (AHT) methods address this challenge by training an agent with a population of diverse teammate policies obtained through maximizing specific diversity metrics. However, prior heuristic-based diversity metrics do not always maximize the agent's robustness in all cooperative problems. In this work, we first propose that maximizing an AHT agent's robustness requires it to emulate policies in the minimum coverage set (MCS), the set of best-response policies to any partner policies in the environment. We then introduce the L-BRDiv algorithm that generates a set of teammate policies that, when used for AHT training, encourage agents to emulate policies from the MCS. L-BRDiv works by solving a constrained optimization problem to jointly train teammate policies for AHT training and approximating AHT agent policies that are members of the MCS. We empirically demonstrate that L-BRDiv produces more robust AHT agents than state-of-the-art methods in a broader range of two-player cooperative problems without the need for extensive hyperparameter tuning for its objectives. Our study shows that L-BRDiv outperforms the baseline methods by prioritizing discovering distinct members of the MCS instead of repeatedly finding redundant policies. \ No newline at end of file diff --git a/data/2024/aaai/Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training b/data/2024/aaai/Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training new file mode 100644 index 0000000000..e44c0f29e7 --- /dev/null +++ b/data/2024/aaai/Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training @@ -0,0 +1 @@ +Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis on the CLIP latent space which leads to two findings. Firstly, we observe that the CLIP's visual feature of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired image-text can be empirically modeled as a zero-mean Gaussian distribution. 
Motivated by the findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap. \ No newline at end of file diff --git a/data/2024/aaai/Mining Gaze for Contrastive Learning toward Computer-Assisted Diagnosis b/data/2024/aaai/Mining Gaze for Contrastive Learning toward Computer-Assisted Diagnosis new file mode 100644 index 0000000000..f00308af1f --- /dev/null +++ b/data/2024/aaai/Mining Gaze for Contrastive Learning toward Computer-Assisted Diagnosis @@ -0,0 +1 @@ +Obtaining large-scale radiology reports can be difficult for medical images due to ethical concerns, limiting the effectiveness of contrastive pre-training in the medical image domain and underscoring the need for alternative methods. In this paper, we propose eye-tracking as an alternative to text reports, as it allows for the passive collection of gaze signals without ethical issues. By tracking the gaze of radiologists as they read and diagnose medical images, we can understand their visual attention and clinical reasoning. When a radiologist has similar gazes for two medical images, it may indicate semantic similarity for diagnosis, and these images should be treated as positive pairs when pre-training a computer-assisted diagnosis (CAD) network through contrastive learning. Accordingly, we introduce the Medical contrastive Gaze Image Pre-training (McGIP) as a plug-and-play module for contrastive learning frameworks. McGIP uses radiologist gaze to guide contrastive pre-training. We evaluate our method using two representative types of medical images and two common types of gaze data. The experimental results demonstrate the practicality of McGIP, indicating its high potential for various clinical scenarios and applications. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating Idiom Inconsistency: A Multi-Semantic Contrastive Learning Method for Chinese Idiom Reading Comprehension b/data/2024/aaai/Mitigating Idiom Inconsistency: A Multi-Semantic Contrastive Learning Method for Chinese Idiom Reading Comprehension new file mode 100644 index 0000000000..4a1fdda2ad --- /dev/null +++ b/data/2024/aaai/Mitigating Idiom Inconsistency: A Multi-Semantic Contrastive Learning Method for Chinese Idiom Reading Comprehension @@ -0,0 +1 @@ +Chinese idioms pose a significant challenge for machine reading comprehension due to their metaphorical meanings often diverging from their literal counterparts, leading to metaphorical inconsistency. Furthermore, the same idiom can have different meanings in different contexts, resulting in contextual inconsistency. Although deep learning-based methods have achieved some success in idioms reading comprehension, existing approaches still struggle to accurately capture idiom representations due to metaphorical inconsistency and contextual inconsistency of idioms. 
To address these challenges, we propose a novel model, Multi-Semantic Contrastive Learning Method (MSCLM), which simultaneously addresses metaphorical inconsistency and contextual inconsistency of idioms. To mitigate metaphorical inconsistency, we propose a metaphor contrastive learning module based on the prompt method, bridging the semantic gap between literal and metaphorical meanings of idioms. To mitigate contextual inconsistency, we propose a multi-semantic cross-attention module to explore semantic features between different metaphors of the same idiom in various contexts. Our model has been compared with multiple state-of-the-art models (including GPT-3.5) on multiple Chinese idiom reading comprehension datasets, and the experimental results demonstrate that MSCLM outperforms state-of-the-art models. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating Label Bias in Machine Learning: Fairness through Confident Learning b/data/2024/aaai/Mitigating Label Bias in Machine Learning: Fairness through Confident Learning new file mode 100644 index 0000000000..20bd59d581 --- /dev/null +++ b/data/2024/aaai/Mitigating Label Bias in Machine Learning: Fairness through Confident Learning @@ -0,0 +1 @@ +Discrimination can occur when the underlying unbiased labels are overwritten by an agent with potential bias, resulting in biased datasets that unfairly harm specific groups and cause classifiers to inherit these biases. In this paper, we demonstrate that despite only having access to the biased labels, it is possible to eliminate bias by filtering the fairest instances within the framework of confident learning. In the context of confident learning, low self-confidence usually indicates potential label errors; however, this is not always the case. Instances, particularly those from underrepresented groups, might exhibit low confidence scores for reasons other than labeling errors. To address this limitation, our approach employs truncation of the confidence score and extends the confidence interval of the probabilistic threshold. Additionally, we incorporate the co-teaching paradigm to provide a more robust and reliable selection of fair instances and to effectively mitigate the adverse effects of biased labels. Through extensive experimentation and evaluation on various datasets, we demonstrate the efficacy of our approach in promoting fairness and reducing the impact of label bias in machine learning models. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating Label Noise through Data Ambiguation b/data/2024/aaai/Mitigating Label Noise through Data Ambiguation new file mode 100644 index 0000000000..a4396bf735 --- /dev/null +++ b/data/2024/aaai/Mitigating Label Noise through Data Ambiguation @@ -0,0 +1 @@ +Label noise poses an important challenge in machine learning, especially in deep learning, in which large models with high expressive power dominate the field. Models of that kind are prone to memorizing incorrect labels, thereby harming generalization performance. Many methods have been proposed to address this problem, including robust loss functions and more complex label correction approaches. Robust loss functions are appealing due to their simplicity, but typically lack flexibility, while label correction usually adds substantial complexity to the training setup.
In this paper, we suggest addressing the shortcomings of both methodologies by "ambiguating" the target information, adding additional, complementary candidate labels in case the learner is not sufficiently convinced of the observed training label. More precisely, we leverage the framework of so-called superset learning to construct set-valued targets based on a confidence threshold, which deliver imprecise yet more reliable beliefs about the ground truth, effectively helping the learner to suppress the memorization effect. In an extensive empirical evaluation, our method demonstrates favorable learning behavior on synthetic and real-world noise, confirming its effectiveness in detecting and correcting erroneous training labels. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting b/data/2024/aaai/Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting new file mode 100644 index 0000000000..519fb3f5d8 --- /dev/null +++ b/data/2024/aaai/Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting @@ -0,0 +1 @@ +Incorporating factual knowledge from knowledge graphs is regarded as a promising approach for mitigating the hallucination of large language models (LLMs). Existing methods usually only use the user's input to query the knowledge graph, thus failing to address the factual hallucination generated by LLMs during their reasoning process. To address this problem, this paper proposes Knowledge Graph-based Retrofitting (KGR), a new framework that incorporates LLMs with KGs to mitigate factual hallucination during the reasoning process by retrofitting the initial draft responses of LLMs based on the factual knowledge stored in KGs. Specifically, KGR leverages LLMs to extract, select, validate, and retrofit factual statements within the model-generated responses, which enables an autonomous knowledge verifying and refining procedure without any additional manual effort. Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks, especially when complex reasoning processes are involved, which demonstrates the necessity and effectiveness of KGR in mitigating hallucination and enhancing the reliability of LLMs. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating the Impact of False Negative in Dense Retrieval with Contrastive Confidence Regularization b/data/2024/aaai/Mitigating the Impact of False Negative in Dense Retrieval with Contrastive Confidence Regularization new file mode 100644 index 0000000000..0f71831e89 --- /dev/null +++ b/data/2024/aaai/Mitigating the Impact of False Negative in Dense Retrieval with Contrastive Confidence Regularization @@ -0,0 +1 @@ +In open-domain Question Answering (QA), dense text retrieval is crucial for finding relevant passages to generate answers. Typically, contrastive learning is used to train a retrieval model, which maps passages and queries to the same semantic space, making similar ones closer and dissimilar ones further apart. However, training such a system is challenging due to the false negative problem, where relevant passages may be missed during data annotation. Hard negative sampling, commonly used to improve contrastive learning, can introduce more noise in training. This is because hard negatives are those close to a given query, and thus more likely to be false negatives.
To address this, we propose a novel contrastive confidence regularizer for the Noise Contrastive Estimation (NCE) loss, a commonly used contrastive loss. Our analysis shows that the regularizer helps make the dense retrieval model more robust against false negatives with a theoretical guarantee. Additionally, we propose a model-agnostic method to filter out noisy negative passages in the dataset, improving any downstream dense retrieval models. Through experiments on three datasets, we demonstrate that our method achieves better retrieval performance in comparison to existing state-of-the-art dense retrieval systems. \ No newline at end of file diff --git a/data/2024/aaai/Mixed Geometry Message and Trainable Convolutional Attention Network for Knowledge Graph Completion b/data/2024/aaai/Mixed Geometry Message and Trainable Convolutional Attention Network for Knowledge Graph Completion new file mode 100644 index 0000000000..e23e127cc0 --- /dev/null +++ b/data/2024/aaai/Mixed Geometry Message and Trainable Convolutional Attention Network for Knowledge Graph Completion @@ -0,0 +1 @@ +Knowledge graph completion (KGC) aims to study the embedding representation to solve the incompleteness of knowledge graphs (KGs). Recently, graph convolutional networks (GCNs) and graph attention networks (GATs) have been widely used in KGC tasks by capturing neighbor information of entities. However, both GCN-based and GAT-based KGC models have their limitations, and choosing the better of the two requires analyzing the neighbors of each entity (pre-validating), a process that is prohibitively expensive. Furthermore, the representation quality of the embeddings can affect the aggregation of neighbor information (message passing). To address the above limitations, we propose a novel knowledge graph completion model with mixed geometry message and trainable convolutional attention network, named MGTCA. Concretely, the mixed geometry message function generates rich neighbor messages by jointly integrating spatial information from the hyperbolic space, hypersphere space and Euclidean space. To complete the autonomous switching of graph neural networks (GNNs) and eliminate the necessity of pre-validating the local structure of KGs, a trainable convolutional attention network is proposed, which combines three types of GNNs in one trainable formulation. Furthermore, a mixed geometry scoring function is proposed, which calculates scores of triples by a novel prediction function and a similarity function based on different geometric spaces. Extensive experiments on three standard datasets confirm the effectiveness of our innovations, and the performance of MGTCA is significantly improved compared to the state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Mixed-Effects Contextual Bandits b/data/2024/aaai/Mixed-Effects Contextual Bandits new file mode 100644 index 0000000000..1e0e24a4e5 --- /dev/null +++ b/data/2024/aaai/Mixed-Effects Contextual Bandits @@ -0,0 +1 @@ +We study a novel variant of a contextual bandit problem with multi-dimensional reward feedback formulated as a mixed-effects model, where the correlations between multiple feedback signals are induced by sharing stochastic coefficients called random effects. We propose a novel algorithm, Mixed-Effects Contextual UCB (ME-CUCB), achieving an $\tilde{O}(d\sqrt{mT})$ regret bound after T rounds, where d is the dimension of contexts and m is the dimension of outcomes, with either known or unknown covariance structure.
This is a tighter regret bound than that of the naive canonical linear bandit algorithm that ignores the correlations among rewards. We prove a lower bound of $\Omega(d\sqrt{mT})$ matching the upper bound up to logarithmic factors. To our knowledge, this is the first work providing a regret analysis for mixed-effects models and algorithms involving weighted least-squares estimators. Our theoretical analysis faces a significant technical challenge in that the error terms do not constitute martingales since the weights depend on the rewards. We overcome this challenge by using covering numbers, which is of theoretical interest in its own right. We provide numerical experiments demonstrating the advantage of our proposed algorithm, supporting the theoretical claims. \ No newline at end of file diff --git a/data/2024/aaai/Mixup-Induced Domain Extrapolation for Domain Generalization b/data/2024/aaai/Mixup-Induced Domain Extrapolation for Domain Generalization new file mode 100644 index 0000000000..6576cd6833 --- /dev/null +++ b/data/2024/aaai/Mixup-Induced Domain Extrapolation for Domain Generalization @@ -0,0 +1 @@ +Domain generalization aims to learn a well-performing classifier on multiple source domains for unseen target domains under domain shift. Domain-invariant representation (DIR) is an intuitive approach and has attracted great interest. In practice, since the targets are varied and unknown, a few sources are not sufficient to reflect the entire domain population, leading to biased DIR. Derived from the PAC-Bayes framework, we provide a novel generalization bound involving the number of domains sampled from the environment (N) and the radius of the Wasserstein ball centred on the target (r), which have rarely been considered before. Herein, we can obtain two natural and significant findings: when N increases, 1) the gap between the source and target sampling environments can be gradually mitigated; 2) the target can be better approximated within the Wasserstein ball. These findings prompt us to collect adequate domains against domain shift. To this end, we design a novel yet simple Extrapolation Domain strategy induced by the Mixup scheme, namely EDM. Through a reverse Mixup scheme to generate the extrapolated domains, combined with the interpolated domains, we expand the interpolation space spanned by the sources, providing more abundant domains to increase sampling intersections and thereby shorten r. Moreover, EDM is easy to implement and can be used in a plug-and-play manner. In experiments, EDM has been plugged into several methods in both closed and open set settings, achieving up to 5.73% improvement. \ No newline at end of file diff --git a/data/2024/aaai/MobileInst: Video Instance Segmentation on the Mobile b/data/2024/aaai/MobileInst: Video Instance Segmentation on the Mobile new file mode 100644 index 0000000000..58c2292eb0 --- /dev/null +++ b/data/2024/aaai/MobileInst: Video Instance Segmentation on the Mobile @@ -0,0 +1 @@ +Video instance segmentation on mobile devices is an important yet very challenging edge AI problem. It mainly suffers from (1) heavy computation and memory costs for frame-by-frame pixel-level instance perception and (2) complicated heuristics for tracking objects. To address these issues, we present MobileInst, a lightweight and mobile-friendly framework for video instance segmentation on mobile devices.
Firstly, MobileInst adopts a mobile vision transformer to extract multi-level semantic features and presents an efficient query-based dual-transformer instance decoder for mask kernels and a semantic-enhanced mask decoder to generate instance segmentation per frame. Secondly, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects for video instance segmentation. Further, we propose temporal query passing to enhance the tracking ability for kernels. We conduct experiments on COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst and evaluate the inference latency on a single CPU core of the Snapdragon 778G Mobile Platform, without other acceleration methods. On the COCO dataset, MobileInst achieves 31.2 mask AP and 433 ms on the mobile CPU, which reduces the latency by 50% compared to the previous SOTA. For video instance segmentation, MobileInst achieves 35.0 AP and 30.1 AP on YouTube-VIS 2019 & 2021. \ No newline at end of file diff --git a/data/2024/aaai/ModWaveMLP: MLP-Based Mode Decomposition and Wavelet Denoising Model to Defeat Complex Structures in Traffic Forecasting b/data/2024/aaai/ModWaveMLP: MLP-Based Mode Decomposition and Wavelet Denoising Model to Defeat Complex Structures in Traffic Forecasting new file mode 100644 index 0000000000..b8bf78c54d --- /dev/null +++ b/data/2024/aaai/ModWaveMLP: MLP-Based Mode Decomposition and Wavelet Denoising Model to Defeat Complex Structures in Traffic Forecasting @@ -0,0 +1 @@ +Traffic prediction is the core issue of Intelligent Transportation Systems. Recently, researchers have tended to use complex structures, such as transformer-based structures, for tasks such as traffic prediction. Notably, traffic data is simpler to process compared to text and images, which raises questions about the necessity of these structures. Additionally, when handling traffic data, researchers tend to manually design the model structure based on the data features, which makes the structure of traffic prediction redundant and the model generalizability limited. To address the above, we introduce ‘ModWaveMLP’, a multilayer perceptron (MLP) based model designed according to mode decomposition and wavelet noise reduction information learning concepts. The model is based on a simple MLP structure, which achieves the separation and prediction of different traffic modes and does not depend on additionally introduced features such as the topology of the traffic network. By performing experiments on the real-world datasets METR-LA and PEMS-BAY, our model achieves SOTA, outperforms GNN and transformer-based models, and outperforms those that introduce additional feature data while offering better generalizability, and we further demonstrate the effectiveness of the various parts of the model through ablation experiments. This offers new insights to subsequent researchers involved in traffic model design. The code is available at: https://github.com/Kqingzheng/ModWaveMLP. \ No newline at end of file diff --git a/data/2024/aaai/Model AI Assignments 2024 b/data/2024/aaai/Model AI Assignments 2024 new file mode 100644 index 0000000000..6b19d0034e --- /dev/null +++ b/data/2024/aaai/Model AI Assignments 2024 @@ -0,0 +1 @@ +The Model AI Assignments session seeks to gather and disseminate the best assignment designs of the Artificial Intelligence (AI) Education community.
Recognizing that assignments form the core of student learning experience, we here present abstracts of five AI assignments from the 2024 session that are easily adoptable, playfully engaging, and flexible for a variety of instructor needs. Assignment specifications and supporting resources may be found at http://modelai.gettysburg.edu. \ No newline at end of file diff --git a/data/2024/aaai/Model Counting and Sampling via Semiring Extensions b/data/2024/aaai/Model Counting and Sampling via Semiring Extensions new file mode 100644 index 0000000000..74c2310906 --- /dev/null +++ b/data/2024/aaai/Model Counting and Sampling via Semiring Extensions @@ -0,0 +1 @@ +Many decision and optimization problems have natural extensions as counting problems. The best known example is the Boolean satisfiability problem (SAT), where we want to count the satisfying assignments of truth values to the variables, which is known as the #SAT problem. Likewise, for discrete optimization problems, we want to count the states on which the objective function attains the optimal value. Both SAT and discrete optimization can be formulated as selective marginalize a product function (MPF) queries. Here, we show how general selective MPF queries can be extended for model counting. MPF queries are encoded as tensor hypernetworks over suitable semirings that can be solved by generic tensor hypernetwork contraction algorithms. Our model counting extension is again an MPF query, on an extended semiring, that can be solved by the same contraction algorithms. Model counting is required for uniform model sampling. We show how the counting extension can be further extended for model sampling by constructing yet another semiring. We have implemented the model counting and sampling extensions. Experiments show that our generic approach is competitive with the state of the art in model counting and model sampling. \ No newline at end of file diff --git a/data/2024/aaai/Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning b/data/2024/aaai/Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning new file mode 100644 index 0000000000..65c584939d --- /dev/null +++ b/data/2024/aaai/Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning @@ -0,0 +1 @@ +In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance task-specific models and can even learn general task-agnostic representations for efficient finetuning to downstream tasks. However, deep learning in resource-limited domains still faces multiple challenges including (i) limited data, (ii) constrained model development cost, and (iii) lack of adequate pre-trained models for effective finetuning. This paper provides an overview of model reprogramming to bridge this gap. Model reprogramming enables resource-efficient cross-domain machine learning by repurposing and reusing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning, where the source and target domains can be vastly different. In many applications, model reprogramming outperforms transfer learning and training from scratch. This paper elucidates the methodology of model reprogramming, summarizes existing use cases, provides a theoretical explanation of the success of model reprogramming, and concludes with a discussion on open-ended research questions and opportunities.
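The reprogramming recipe summarized above is often realized as a trainable transformation wrapped around a frozen source model plus an output label mapping. The sketch below is a rough illustration under that assumption; the padding scheme, sizes, and class names are ours, not code from the paper or a specific library.

```python
# Minimal sketch of input-level model reprogramming around a frozen source classifier.
# Assumptions: the target input (e.g., 3x64x64) is embedded inside a larger source-sized
# canvas, a trainable additive "program" perturbs the input, and a linear layer maps
# source labels to target labels. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reprogrammer(nn.Module):
    def __init__(self, frozen_model, target_size=64, source_size=224,
                 source_classes=1000, target_classes=10):
        super().__init__()
        self.frozen_model = frozen_model.eval()
        for p in self.frozen_model.parameters():
            p.requires_grad_(False)            # the source model is never finetuned
        self.pad = (source_size - target_size) // 2
        # Trainable input "program": an additive perturbation on the padded canvas.
        self.delta = nn.Parameter(torch.zeros(1, 3, source_size, source_size))
        # Many-to-one output mapping from source labels to target labels.
        self.label_map = nn.Linear(source_classes, target_classes, bias=False)

    def forward(self, x_target):               # x_target: (B, 3, 64, 64)
        x = F.pad(x_target, [self.pad] * 4)    # place the target input in the center
        source_logits = self.frozen_model(x + self.delta)
        return self.label_map(source_logits)   # repurposed target-domain logits
```

Only `delta` and `label_map` are trained, which is what makes this style of approach attractive when target-domain data and compute are limited.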
\ No newline at end of file diff --git a/data/2024/aaai/Modeling Adaptive Inter-Task Feature Interactions via Sentiment-Aware Contrastive Learning for Joint Aspect-Sentiment Prediction b/data/2024/aaai/Modeling Adaptive Inter-Task Feature Interactions via Sentiment-Aware Contrastive Learning for Joint Aspect-Sentiment Prediction new file mode 100644 index 0000000000..e65ff3da7b --- /dev/null +++ b/data/2024/aaai/Modeling Adaptive Inter-Task Feature Interactions via Sentiment-Aware Contrastive Learning for Joint Aspect-Sentiment Prediction @@ -0,0 +1 @@ +Aspect prediction (AP) and sentiment prediction (SP) are representative applications in fine-grained sentiment anal- ysis. They can be considered as sequential tasks, where AP identifies mentioned aspects in a sentence, and SP infers fine-grained sentiments for these aspects. Recent models perform the aspect-sentiment prediction in a joint man-ner, but heavily rely on the feature interactions of aspect and sentiment. One drawback is that they ignore correlation strength varies between aspect features and sentiment fea- tures across different sentences, and employ a fixed feature interaction strategy may limit effective knowledge transfer across tasks. To tackle this issue, in this paper, we propose an Adaptive Inter-task Feature Interaction framework, AIFI, for joint aspect-sentiment prediction. Specifically, we introduce a novel contrast-based alignment method based on contrastive learning. Our approach considers the AP-specific and SP-specific representations of a given sentence as a positive pair, while representation of another random sentence serves as a negative example. Moreover, we propose an inter-task feature correlation network to predict the contrast strength, which is determined by the temperature coefficient in the InfoNCE loss. This dynamic correlation adjustment enhances model’s ability to capture proper feature interactions more efficiently. Experimental results on three datasets validate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Modeling Knowledge Graphs with Composite Reasoning b/data/2024/aaai/Modeling Knowledge Graphs with Composite Reasoning new file mode 100644 index 0000000000..c39231331e --- /dev/null +++ b/data/2024/aaai/Modeling Knowledge Graphs with Composite Reasoning @@ -0,0 +1,3 @@ +The ability to combine multiple pieces of existing knowledge to infer new knowledge is both crucial and challenging. In this paper, we explore how facts of various entities are combined in the context of knowledge graph completion (KGC). We use composite reasoning to unify the views from different KGC models, including translational models, tensor factorization (TF)-based models, instance-based learning models, and KGC regularizers. + +Moreover, our comprehensive examination of composite reasoning revealed an unexpected phenomenon: certain TF-based models learn embeddings with erroneous composite reasoning, which ultimately violates their fundamental collaborative filtering assumption and reduces their effects. This motivates us to reduce their composition error. Empirical evaluations demonstrate that mitigating the composition risk not only enhances the performance of TF-based models across all tested settings, but also surpass or is competitive with the state-of-the-art performance on two out of four benchmarks. 
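To make the idea of combining facts concrete, the short numpy sketch below shows two-hop composition under a TransE-style translational model, where holding (h, r1, m) and (m, r2, t) implies that the composed relation r1 + r2 should link h directly to t. This is one common reading of relation composition in KGC and is only meant as an illustration; it is not the paper's exact formulation of composite reasoning.

```python
# Two-hop composition under a TransE-style model: t should be reachable from h
# by the composed relation r1 + r2 if the two individual facts hold. Illustrative only.
import numpy as np

h  = np.array([0.1, 0.2, 0.0, 0.3])    # head entity embedding
r1 = np.array([0.2, -0.1, 0.1, 0.0])   # relation 1
r2 = np.array([0.0, 0.3, -0.2, 0.1])   # relation 2

m = h + r1          # (h, r1, m) holds under TransE when m is approximately h + r1
t = m + r2          # (m, r2, t) holds, so t is approximately h + (r1 + r2)

# A model whose learned embeddings violate this kind of consistency exhibits what
# the abstract calls composition error (in our reading of the term).
composed_distance = np.linalg.norm((h + (r1 + r2)) - t)
print(composed_distance)   # ~0 here, i.e. the composed relation scores highly
```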
\ No newline at end of file diff --git a/data/2024/aaai/Modeling Stereo-Confidence out of the End-to-End Stereo-Matching Network via Disparity Plane Sweep b/data/2024/aaai/Modeling Stereo-Confidence out of the End-to-End Stereo-Matching Network via Disparity Plane Sweep new file mode 100644 index 0000000000..2b54a3f687 --- /dev/null +++ b/data/2024/aaai/Modeling Stereo-Confidence out of the End-to-End Stereo-Matching Network via Disparity Plane Sweep @@ -0,0 +1,7 @@ +We propose a novel stereo-confidence that can be measured externally to various stereo-matching networks, offering an alternative input modality to the cost volume for learning-based approaches, especially in safety-critical systems. +Grounded in the foundational concepts of disparity definition and the disparity plane sweep, the proposed stereo-confidence method is built upon the idea that any shift applied to a stereo-image pair should translate into a corresponding shift in the disparity map. +Based on this idea, the proposed stereo-confidence method can be summarized in three steps. +1) Using the disparity plane sweep, multiple disparity maps can be obtained and treated as a 3-D volume (predicted disparity volume), in the same way the cost volume is constructed. +2) One of these disparity maps serves as an anchor, allowing us to define a desirable (or ideal) disparity profile at every spatial point. +3) By comparing the desirable and predicted disparity profiles, we can quantify the level of matching ambiguity between the left and right images for confidence measurement. +Extensive experimental results using various stereo-matching networks and datasets demonstrate that the proposed stereo-confidence method not only shows competitive performance on its own but also yields consistent performance improvements when it is used as an input modality for learning-based stereo-confidence methods. \ No newline at end of file diff --git a/data/2024/aaai/Moderate Message Passing Improves Calibration: A Universal Way to Mitigate Confidence Bias in Graph Neural Networks b/data/2024/aaai/Moderate Message Passing Improves Calibration: A Universal Way to Mitigate Confidence Bias in Graph Neural Networks new file mode 100644 index 0000000000..6ba2c53000 --- /dev/null +++ b/data/2024/aaai/Moderate Message Passing Improves Calibration: A Universal Way to Mitigate Confidence Bias in Graph Neural Networks @@ -0,0 +1 @@ +Confidence calibration in Graph Neural Networks (GNNs) aims to align a model's predicted confidence with its actual accuracy. Recent studies have indicated that GNNs exhibit an under-confidence bias, which contrasts with the over-confidence bias commonly observed in deep neural networks. However, our deeper investigation into this topic reveals that not all GNNs exhibit this behavior. Upon closer examination of message passing in GNNs, we found a clear link between message aggregation and confidence levels. Specifically, GNNs with extensive message aggregation, often seen in deep architectures or when leveraging large amounts of labeled data, tend to exhibit overconfidence. This overconfidence can be attributed to factors like over-learning and over-smoothing. Conversely, GNNs with fewer layers, known for their balanced message passing and superior node representation, may exhibit under-confidence. To counter these confidence biases, we introduce the Adaptive Unified Label Smoothing (AU-LS) technique. Our experiments show that AU-LS outperforms existing methods, addressing both over- and under-confidence in various GNN scenarios.
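One way to picture an adaptive label-smoothing remedy for the confidence biases described above is to smooth targets more when a batch looks over-confident and less when it looks under-confident. The specific rule, names, and hyperparameters below are illustrative assumptions, not the AU-LS formulation.

```python
# Confidence-adaptive label smoothing for node classification (illustrative sketch).
import torch
import torch.nn.functional as F

def adaptive_label_smoothing_loss(logits, labels, num_classes, base_eps=0.1):
    probs = F.softmax(logits, dim=-1)
    conf = probs.max(dim=-1).values                        # predicted confidence
    acc = (probs.argmax(dim=-1) == labels).float().mean()  # rough accuracy proxy
    # Over-confident batch (confidence above accuracy): smooth more.
    # Under-confident batch (confidence below accuracy): smooth less.
    eps = float(torch.clamp(base_eps + (conf.mean() - acc), 0.0, 0.5))
    smooth = torch.full_like(probs, eps / (num_classes - 1))
    smooth.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return -(smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Usage with any GNN's output logits, e.g.:
# loss = adaptive_label_smoothing_loss(gnn(x, edge_index), y, num_classes)
```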
\ No newline at end of file diff --git a/data/2024/aaai/MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts b/data/2024/aaai/MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts new file mode 100644 index 0000000000..5bbc2eea99 --- /dev/null +++ b/data/2024/aaai/MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts @@ -0,0 +1 @@ +Deep learning is now widely used in drug discovery, providing significant acceleration and cost reduction. As the most fundamental building block, molecular representation is essential for predicting molecular properties to enable various downstream applications. Most existing methods attempt to incorporate more information to learn better representations. However, not all features are equally important for a specific task. Ignoring this would potentially compromise the training efficiency and predictive accuracy. To address this issue, we propose a novel approach, which treats language models as an agent and molecular pretraining models as a knowledge base. The agent accentuates task-relevant features in the molecular representation by understanding the natural language description of the task, just as a tailor customizes clothes for clients. Thus, we call this approach MolTailor. Evaluations demonstrate MolTailor's superior performance over baselines, validating the efficacy of enhancing relevance for molecular representation learning. This illustrates the potential of language model guided optimization to better exploit and unleash the capabilities of existing powerful molecular representation methods. Our code and appendix are available at https://github.com/SCIR-HI/MolTailor. \ No newline at end of file diff --git a/data/2024/aaai/Molecular Optimization Model with Patentability Constraint b/data/2024/aaai/Molecular Optimization Model with Patentability Constraint new file mode 100644 index 0000000000..27f8c27827 --- /dev/null +++ b/data/2024/aaai/Molecular Optimization Model with Patentability Constraint @@ -0,0 +1,5 @@ +In drug development, molecular optimization is a crucial challenge that involves generating novel molecules given a lead molecule as input. The task requires maintaining molecular similarity to the original molecule while simultaneously optimizing multiple chemical attributes. To aid in this process, numerous generative models have been proposed. +However, in practical applications, it is crucial for these models not only to generate novel molecules with the above constraints but also to generate molecules that significantly differ from any existing patented compounds. +In this work, we present a multi-optimization molecular framework to address this challenge. +Our framework trains a model to prioritize both enhanced properties and substantial dissimilarity from patented compounds. By jointly learning continuous representations of optimized and patentable molecules, we ensure that the generated molecules are significantly distant from any patented compounds while improving chemical properties. +Through empirical evaluation, we demonstrate the superior performance of our approach compared to state-of-the-art molecular optimization methods both in chemical property optimization and patentability. 
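The patentability requirement in the last abstract can be approximated at generation time by a simple dissimilarity filter: reject candidates whose maximum Tanimoto similarity to any patented compound exceeds a threshold. The RDKit-based sketch below shows only this filter; the paper enforces dissimilarity during training, and the threshold and SMILES strings here are illustrative placeholders.

```python
# Filter generated molecules by maximum Tanimoto similarity to a patented set.
# Illustrative sketch; threshold and example SMILES are placeholders.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

patented_fps = [morgan_fp(s) for s in ["CCO", "c1ccccc1O"]]   # toy patent set

def max_patent_similarity(candidate_smiles):
    fp = morgan_fp(candidate_smiles)
    return max(DataStructs.TanimotoSimilarity(fp, p) for p in patented_fps)

def is_patentable(candidate_smiles, threshold=0.4):
    # Keep candidates that stay sufficiently far from every patented compound.
    return max_patent_similarity(candidate_smiles) < threshold

print(is_patentable("CCN"), is_patentable("CCO"))  # second is False: exact patent match
```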
\ No newline at end of file diff --git a/data/2024/aaai/Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-Based Fault Detection and Identification (Abstract Reprint) b/data/2024/aaai/Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-Based Fault Detection and Identification (Abstract Reprint) new file mode 100644 index 0000000000..63b15ebc4c --- /dev/null +++ b/data/2024/aaai/Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-Based Fault Detection and Identification (Abstract Reprint) @@ -0,0 +1 @@ +This paper investigates runtime monitoring of perception systems. Perception is a critical component of high-integrity applications of robotics and autonomous systems, such as self-driving cars. In these applications, failure of perception systems may put human life at risk, and a broad adoption of these technologies requires the development of methodologies to guarantee and monitor safe operation. Despite the paramount importance of perception, currently there is no formal approach for system-level perception monitoring. In this paper, we formalize the problem of runtime fault detection and identification in perception systems and present a framework to model diagnostic information using a diagnostic graph. We then provide a set of deterministic, probabilistic, and learning-based algorithms that use diagnostic graphs to perform fault detection and identification. Moreover, we investigate fundamental limits and provide deterministic and probabilistic guarantees on the fault detection and identification results. We conclude the paper with an extensive experimental evaluation, which recreates several realistic failure modes in the LGSVL open-source autonomous driving simulator, and applies the proposed system monitors to a state-of-the-art autonomous driving software stack (Baidu's Apollo Auto). The results show that the proposed system monitors outperform baselines, have the potential of preventing accidents in realistic autonomous driving scenarios, and incur a negligible computational overhead. \ No newline at end of file diff --git a/data/2024/aaai/Mono3DVG: 3D Visual Grounding in Monocular Images b/data/2024/aaai/Mono3DVG: 3D Visual Grounding in Monocular Images new file mode 100644 index 0000000000..d3c2f5bd39 --- /dev/null +++ b/data/2024/aaai/Mono3DVG: 3D Visual Grounding in Monocular Images @@ -0,0 +1 @@ +We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. Depth predictor is designed to explicitly learn geometry features. The dual text-guided adapter is proposed to refine multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and some insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be released. 
\ No newline at end of file diff --git a/data/2024/aaai/Monocular 3D Hand Mesh Recovery via Dual Noise Estimation b/data/2024/aaai/Monocular 3D Hand Mesh Recovery via Dual Noise Estimation new file mode 100644 index 0000000000..9e28917c4b --- /dev/null +++ b/data/2024/aaai/Monocular 3D Hand Mesh Recovery via Dual Noise Estimation @@ -0,0 +1 @@ +Current parametric models have made notable progress in 3D hand pose and shape estimation. However, due to the fixed hand topology and complex hand poses, it is hard for current models to generate meshes that align well with the image. To tackle this issue, we introduce a dual noise estimation method in this paper. Given a single-view image as input, we first adopt a baseline parametric regressor to obtain coarse hand meshes. We assume the mesh vertices and their image-plane projections are noisy, and can be associated in a unified probabilistic model. We then learn the distributions of noise to refine mesh vertices and their projections. The refined vertices are further utilized to refine camera parameters in a closed-form manner. Consequently, our method obtains well-aligned and high-quality 3D hand meshes. Extensive experiments on the large-scale Interhand2.6M dataset demonstrate that the proposed method not only improves the performance of its baseline by more than 10% but also achieves state-of-the-art performance. Project page: https://github.com/hanhuili/DNE4Hand. \ No newline at end of file diff --git a/data/2024/aaai/Monte Carlo Tree Search in the Presence of Transition Uncertainty b/data/2024/aaai/Monte Carlo Tree Search in the Presence of Transition Uncertainty new file mode 100644 index 0000000000..0ceb7dbbf3 --- /dev/null +++ b/data/2024/aaai/Monte Carlo Tree Search in the Presence of Transition Uncertainty @@ -0,0 +1 @@ +Monte Carlo Tree Search (MCTS) is an immensely popular search-based framework used for decision making. It is traditionally applied to domains where a perfect simulation model of the environment is available. We study and improve MCTS in the context where the environment model is given but imperfect. We show that the discrepancy between the model and the actual environment can lead to significant performance degradation with standard MCTS. We therefore develop Uncertainty Adapted MCTS (UA-MCTS), a more robust algorithm within the MCTS framework. We estimate the transition uncertainty in the given model, and direct the search towards more certain transitions in the state space. We modify all four MCTS phases to improve the search behavior by considering these estimates. We prove, in the corrupted bandit case, that adding uncertainty information to adapt UCB leads to a tighter regret bound than standard UCB. Empirically, we evaluate UA-MCTS and its individual components on the deterministic domains from the MinAtar test suite. Our results demonstrate that UA-MCTS strongly improves MCTS in the presence of model transition errors. \ No newline at end of file diff --git a/data/2024/aaai/Moral Uncertainty and the Problem of Fanaticism b/data/2024/aaai/Moral Uncertainty and the Problem of Fanaticism new file mode 100644 index 0000000000..0d3d76bbe4 --- /dev/null +++ b/data/2024/aaai/Moral Uncertainty and the Problem of Fanaticism @@ -0,0 +1 @@ +While there is universal agreement that agents ought to act ethically, there is no agreement as to what constitutes ethical behaviour.
To address this problem, recent philosophical approaches to `moral uncertainty' propose aggregation of multiple ethical theories to guide agent behaviour. However, one of the foundational proposals for aggregation - Maximising Expected Choiceworthiness (MEC) - has been criticised as being vulnerable to fanaticism; the problem of an ethical theory dominating agent behaviour despite low credence (confidence) in said theory. Fanaticism thus undermines the `democratic' motivation for accommodating multiple ethical perspectives. The problem of fanaticism has not yet been mathematically defined. Representing moral uncertainty as an instance of social welfare aggregation, this paper contributes to the field of moral uncertainty by 1) formalising the problem of fanaticism as a property of social welfare functionals and 2) providing non-fanatical alternatives to MEC, i.e. Highest k-trimmed Mean and Highest Median. \ No newline at end of file diff --git a/data/2024/aaai/MorphVAE: Advancing Morphological Design of Voxel-Based Soft Robots with Variational Autoencoders b/data/2024/aaai/MorphVAE: Advancing Morphological Design of Voxel-Based Soft Robots with Variational Autoencoders new file mode 100644 index 0000000000..97181b85b3 --- /dev/null +++ b/data/2024/aaai/MorphVAE: Advancing Morphological Design of Voxel-Based Soft Robots with Variational Autoencoders @@ -0,0 +1 @@ +Soft robot design is an intricate field with unique challenges due to its complex and vast search space. In the past literature, evolutionary computation algorithms, including novel probabilistic generative models (PGMs), have shown potential in this realm. However, these methods are sample inefficient and predominantly focus on rigid robots in locomotion tasks, which limit their performance and application in robot design automation. In this work, we propose MorphVAE, an innovative PGM that incorporates a multi-task training scheme and a meticulously crafted sampling technique termed ``continuous natural selection'', aimed at bolstering sample efficiency. This method empowers us to gain insights from assessed samples across diverse tasks and temporal evolutionary stages, while simultaneously maintaining a delicate balance between optimization efficiency and biodiversity. Through extensive experiments in various locomotion and manipulation tasks, we substantiate the efficiency of MorphVAE in generating high-performing and diverse designs, surpassing the performance of competitive baselines. \ No newline at end of file diff --git a/data/2024/aaai/Motion Deblurring via Spatial-Temporal Collaboration of Frames and Events b/data/2024/aaai/Motion Deblurring via Spatial-Temporal Collaboration of Frames and Events new file mode 100644 index 0000000000..4e7ed10997 --- /dev/null +++ b/data/2024/aaai/Motion Deblurring via Spatial-Temporal Collaboration of Frames and Events @@ -0,0 +1 @@ +Motion deblurring can be advanced by exploiting informative features from supplementary sensors such as event cameras, which can capture rich motion information asynchronously with high temporal resolution. Existing event-based motion deblurring methods neither consider the modality redundancy in spatial fusion nor temporal cooperation between events and frames. To tackle these limitations, a novel spatial-temporal collaboration network (STCNet) is proposed for event-based motion deblurring. 
Firstly, we propose a differential-modality based cross-modal calibration strategy to suppress redundancy for complementarity enhancement, and then bimodal spatial fusion is achieved with an elaborate cross-modal co-attention mechanism to weight the contributions of them for importance balance. Besides, we present a frame-event mutual spatio-temporal attention scheme to alleviate the errors of relying only on frames to compute cross-temporal similarities when the motion blur is significant, and then the spatio-temporal features from both frames and events are aggregated with the custom cross-temporal coordinate attention. Extensive experiments on both synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance. Project website: https://github.com/wyang-vis/STCNet. \ No newline at end of file diff --git a/data/2024/aaai/MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators b/data/2024/aaai/MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators new file mode 100644 index 0000000000..d1b56cc611 --- /dev/null +++ b/data/2024/aaai/MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators @@ -0,0 +1 @@ +Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Visit our webpage at https://qiqiapink.github.io/MotionGPT/. \ No newline at end of file diff --git a/data/2024/aaai/MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation b/data/2024/aaai/MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation new file mode 100644 index 0000000000..9a342b440e --- /dev/null +++ b/data/2024/aaai/MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation @@ -0,0 +1 @@ +Controllable generation of 3D human motions becomes an important topic as the world embraces digital transformation. Existing works, though making promising progress with the advent of diffusion models, heavily rely on meticulously captured and annotated (e.g., text) high-quality motion corpus, a resource-intensive endeavor in the real world. This motivates our proposed MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences. 
Specifically, we separate the denoising objectives of a diffusion model into two stages: obtaining conditional rough motion approximations in the initial T-T* steps by learning the noisy annotated motions, followed by the unconditional refinement of these preliminary motions during the last T* steps using unannotated motions. Notably, though learning from two sources of imperfect data, our model does not compromise motion generation quality compared to fully supervised approaches that access gold data. Extensive experiments on several benchmarks demonstrate that our MotionMix, as a versatile framework, consistently achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks. \ No newline at end of file diff --git a/data/2024/aaai/MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling b/data/2024/aaai/MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling new file mode 100644 index 0000000000..5bb05232e8 --- /dev/null +++ b/data/2024/aaai/MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling @@ -0,0 +1,2 @@ +Video-and-language understanding has a variety of applications in the industry, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which incur high computational costs. In particular, they have difficulty dealing with dense video frames or long text prevalent in industrial applications. +This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and rapid adaptation to downstream tasks. Specifically, we design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules to sample long sequences and fuse multi-modal features, which reduces the computational costs and addresses performance degradation caused by previous samplers. Therefore, MuLTI can handle longer sequences with limited computational costs. Then, to further enhance the model's performance and address the lack of pretraining tasks for video question answering, we propose a new pretraining task named Multiple Choice Modeling. This task bridges the gap between pretraining and downstream tasks and improves the model's ability to align video and text features. Benefiting from the efficient feature fusion module and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released. \ No newline at end of file diff --git a/data/2024/aaai/MuST: Robust Image Watermarking for Multi-Source Tracing b/data/2024/aaai/MuST: Robust Image Watermarking for Multi-Source Tracing new file mode 100644 index 0000000000..d6abeab222 --- /dev/null +++ b/data/2024/aaai/MuST: Robust Image Watermarking for Multi-Source Tracing @@ -0,0 +1,2 @@ +In recent years, with the popularity of social media applications, massive numbers of digital images are available online, which brings great convenience to image recreation. However, the use of unauthorized image materials in multi-source composite images is still inadequately regulated, which may cause significant loss and discouragement to the copyright owners of the source image materials.
Ideally, deep watermarking techniques could provide a solution for protecting these copyrights based on their encoder-noise-decoder training strategy. Yet existing image watermarking schemes, which are mostly designed for single images, cannot well address the copyright protection requirements in this scenario, since the multi-source image composing process commonly includes distortions that are not well investigated in previous methods, e.g., the extreme downsizing. +To meet such demands, we propose MuST, a multi-source tracing robust watermarking scheme, whose architecture includes a multi-source image detector and minimum external rectangle operation for multiple watermark resynchronization and extraction. Furthermore, we constructed an image material dataset covering common image categories and designed the simulation model of the multi-source image composing process as the noise layer. Experiments demonstrate the excellent performance of MuST in tracing sources of image materials from the composite images compared with SOTA watermarking methods, which could maintain the extraction accuracy above 98% to trace the sources of at least 3 different image materials while keeping the average PSNR of watermarked image materials higher than 42.51 dB. We released our code on https://github.com/MrCrims/MuST \ No newline at end of file diff --git a/data/2024/aaai/Multi-Architecture Multi-Expert Diffusion Models b/data/2024/aaai/Multi-Architecture Multi-Expert Diffusion Models new file mode 100644 index 0000000000..64ccd9c10c --- /dev/null +++ b/data/2024/aaai/Multi-Architecture Multi-Expert Diffusion Models @@ -0,0 +1 @@ +In this paper, we address the performance degradation of efficient diffusion models by introducing Multi-architecturE Multi-Expert diffusion models (MEME). We identify the need for tailored operations at different time-steps in diffusion processes and leverage this insight to create compact yet high-performing models. MEME assigns distinct architectures to different time-step intervals, balancing convolution and self-attention operations based on observed frequency characteristics. We also introduce a soft interval assignment strategy for comprehensive training. Empirically, MEME operates 3.3 times faster than baselines while improving image generation quality (FID scores) by 0.62 (FFHQ) and 0.37 (CelebA). Though we validate the effectiveness of assigning more optimal architecture per time-step, where efficient models outperform the larger models, we argue that MEME opens a new design choice for diffusion models that can be easily applied in other scenarios, such as large multi-expert models. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Class Support Vector Machine with Maximizing Minimum Margin b/data/2024/aaai/Multi-Class Support Vector Machine with Maximizing Minimum Margin new file mode 100644 index 0000000000..020f210c1c --- /dev/null +++ b/data/2024/aaai/Multi-Class Support Vector Machine with Maximizing Minimum Margin @@ -0,0 +1,3 @@ +Support Vector Machine (SVM) stands out as a prominent machine learning technique widely applied in practical pattern recognition tasks. It achieves binary classification by maximizing the "margin", which represents the minimum distance between instances and the decision boundary. Although many efforts have been dedicated to expanding SVM for multi-class case through strategies such as one versus one and one versus the rest, satisfactory solutions remain to be developed. 
In this paper, we propose a novel method for multi-class SVM that incorporates pairwise class loss considerations and maximizes the minimum margin. Adhering to this concept, we embrace a new formulation that imparts heightened flexibility to multi-class SVM. +Furthermore, the correlations between the proposed method and multiple forms of multi-class SVM are analyzed. The proposed regularizer, akin to the concept of "margin", can serve as a seamless enhancement over the softmax in deep learning, providing guidance for network parameter learning. Empirical evaluations demonstrate the effectiveness and superiority of our proposed +method over existing multi-classification methods. The complete version is available at https://arxiv.org/pdf/2312.06578.pdf. Code is available at https://github.com/zz-haooo/M3SVM. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Constellation-Inspired Single-Shot Global LiDAR Localization b/data/2024/aaai/Multi-Constellation-Inspired Single-Shot Global LiDAR Localization new file mode 100644 index 0000000000..2423f57ce4 --- /dev/null +++ b/data/2024/aaai/Multi-Constellation-Inspired Single-Shot Global LiDAR Localization @@ -0,0 +1 @@ +Global localization is a challenging task for intelligent robots, as its accuracy directly contributes to the performance of downstream navigation and planning tasks. However, the existing literature focuses more on place retrieval and the success rate of localization, with limited attention given to the metrics of position estimation. In this paper, a single-shot global LiDAR localization method is proposed with the ultimate goal of achieving high position accuracy, inspired by the positioning approach of multi-constellation localization systems. Initially, we perform coarse localization using global descriptors and select observation points along with their corresponding coordinates based on the obtained coarse localization results. Coordinates can be acquired from a pre-built map, GNSS, or other devices. Then, a lightweight LiDAR odometry method is designed to estimate the distance between the retrieved data and the observation points. Ultimately, the localization problem is transformed into an optimization problem of solving a system of multiple sphere equations. The experimental results on the KITTI dataset and the self-collected dataset demonstrate that our method achieves an average localization error (including errors in the z-axis) of 0.89 meters. In addition, it achieves a retrieval efficiency of 0.357 s per frame on the former dataset and 0.214 s per frame on the latter one. Code and data are available at https://github.com/jlurobot/multi-constellation-localization. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Cross Sampling and Frequency-Division Reconstruction for Image Compressed Sensing b/data/2024/aaai/Multi-Cross Sampling and Frequency-Division Reconstruction for Image Compressed Sensing new file mode 100644 index 0000000000..43791c9286 --- /dev/null +++ b/data/2024/aaai/Multi-Cross Sampling and Frequency-Division Reconstruction for Image Compressed Sensing @@ -0,0 +1 @@ +Deep Compressed Sensing (DCS) has attracted considerable interest due to its superior quality and speed compared to traditional CS algorithms. However, current approaches employ simplistic convolutional downsampling to acquire measurements, making it difficult to retain high-level features of the original signal for better image reconstruction.
Furthermore, these approaches often overlook the presence of both high- and low-frequency information within the network, despite their critical role in achieving high-quality reconstruction. To address these challenges, we propose a novel Multi-Cross Sampling and Frequency Division Network (MCFD-Net) for image CS. The Dynamic Multi-Cross Sampling (DMCS) module, a sampling network of MCFD-Net, incorporates pyramid cross convolution and dual-branch sampling with multi-level pooling. Additionally, it introduces an attention mechanism between perception blocks to enhance adaptive learning effects. In the second deep reconstruction stage, we design a Frequency Division Reconstruction Module (FDRM). This module employs a discrete wavelet transform to extract high- and low-frequency information from images. It then applies multi-scale convolution and self-similarity attention compensation separately to both types of information before merging the output reconstruction results. The MCFD-Net integrates the DMCS and FDRM to construct an end-to-end learning network. Extensive CS experiments conducted on multiple benchmark datasets demonstrate that our MCFD-Net outperforms state-of-the-art approaches, while also exhibiting superior noise robustness. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Dimensional Fair Federated Learning b/data/2024/aaai/Multi-Dimensional Fair Federated Learning new file mode 100644 index 0000000000..5f2900771a --- /dev/null +++ b/data/2024/aaai/Multi-Dimensional Fair Federated Learning @@ -0,0 +1 @@ +Federated learning (FL) has emerged as a promising collaborative and secure paradigm for training a model from decentralized data without compromising privacy. Group fairness and client fairness are two dimensions of fairness that are important for FL. Standard FL can result in disproportionate disadvantages for certain clients, and it still faces the challenge of treating different groups equitably in a population. The problem of privately training fair FL models without compromising the generalization capability of disadvantaged clients remains open. In this paper, we propose a method, called mFairFL, to address this problem and achieve group fairness and client fairness simultaneously. mFairFL leverages differential multipliers to construct an optimization objective for empirical risk minimization with fairness constraints. Before aggregating locally trained models, it first detects conflicts among their gradients, and then iteratively curates the direction and magnitude of gradients to mitigate these conflicts. Theoretical analysis proves mFairFL facilitates the fairness in model development. The experimental evaluations based on three benchmark datasets show significant advantages of mFairFL compared to seven state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Domain Deep Learning from a Multi-View Perspective for Cross-Border E-commerce Search b/data/2024/aaai/Multi-Domain Deep Learning from a Multi-View Perspective for Cross-Border E-commerce Search new file mode 100644 index 0000000000..da5421fec5 --- /dev/null +++ b/data/2024/aaai/Multi-Domain Deep Learning from a Multi-View Perspective for Cross-Border E-commerce Search @@ -0,0 +1 @@ +Building click-through rate (CTR) and conversion rate (CVR) prediction models for cross-border e-commerce search requires modeling the correlations among multi-domains. 
Existing multi-domain methods suffer severely from poor scalability and low efficiency when the number of domains increases. To this end, we propose a Domain-Aware Multi-view mOdel (DAMO), which is domain-number-invariant, to effectively leverage cross-domain relations from a multi-view perspective. Specifically, instead of working in the original feature space defined by different domains, DAMO maps everything to a new low-rank multi-view space. To achieve this, DAMO first extracts multi-domain features in an explicit feature-interactive manner. These features are passed to a multi-view extractor to obtain view-invariant and view-specific features. Then a multi-view predictor inputs these two sets of features and outputs view-based predictions. To enforce view-awareness in the predictor, we further propose a lightweight view-attention estimator to dynamically learn the optimal view-specific weights w.r.t. a view-guided loss. Extensive experiments on public and industrial datasets show that compared with state-of-the-art models, our DAMO achieves better performance with lower storage and computational costs. In addition, deploying DAMO to a large-scale cross-border e-commerce platform leads to 1.21%, 1.76%, and 1.66% improvements over the existing CGC-based model in the online AB-testing experiment in terms of CTR, CVR, and Gross Merchandise Value, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Domain Incremental Learning for Face Presentation Attack Detection b/data/2024/aaai/Multi-Domain Incremental Learning for Face Presentation Attack Detection new file mode 100644 index 0000000000..0a44dd82ad --- /dev/null +++ b/data/2024/aaai/Multi-Domain Incremental Learning for Face Presentation Attack Detection @@ -0,0 +1 @@ +Previous face Presentation Attack Detection (PAD) methods aim to improve the effectiveness of cross-domain tasks. However, in real-world scenarios, the original training data of the pre-trained model is not available due to data privacy or other reasons. Under these constraints, general methods for fine-tuning single-target domain data may lose previously learned knowledge, leading to a catastrophic forgetting problem. To address these issues, we propose a multi-domain incremental learning (MDIL) method for PAD, which not only learns knowledge well from the new domain but also maintains the performance of previous domains stably. Specifically, we propose an adaptive domain-specific experts (ADE) framework based on the vision transformer to preserve the discriminability of previous domains. Furthermore, an asymmetric classifier is designed to keep the output distribution of different classifiers consistent, thereby improving the generalization ability. Extensive experiments show that our proposed method achieves state-of-the-art performance compared to prior methods of incremental learning. Excitingly, under more stringent setting conditions, our method approximates or even outperforms the DA/DG-based methods. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Domain Multi-Scale Diffusion Model for Low-Light Image Enhancement b/data/2024/aaai/Multi-Domain Multi-Scale Diffusion Model for Low-Light Image Enhancement new file mode 100644 index 0000000000..0e545a8d81 --- /dev/null +++ b/data/2024/aaai/Multi-Domain Multi-Scale Diffusion Model for Low-Light Image Enhancement @@ -0,0 +1 @@ +Diffusion models have achieved remarkable progress in low-light image enhancement.
However, there remain two practical limitations: (1) existing methods mainly focus on the spatial domain for the diffusion process, while neglecting the essential features in the frequency domain; (2) the conventional patch-based sampling strategy inevitably leads to severe checkerboard artifacts due to uneven overlapping. To address these limitations in one go, we propose a Multi-Domain Multi-Scale (MDMS) diffusion model for low-light image enhancement. In particular, we introduce a spatial-frequency fusion module to seamlessly integrate spatial and frequency information. By leveraging the Multi-Domain Learning (MDL) paradigm, our proposed model is endowed with the capability to adaptively facilitate noise distribution learning, thereby enhancing the quality of the generated images. Meanwhile, we propose a Multi-Scale Sampling (MSS) strategy that follows a divide-ensemble manner by merging the restored patches under different resolutions. Such a multi-scale learning paradigm explicitly derives patch information from different granularities, thus leading to smoother boundaries. Furthermore, we empirically adopt the Bright Channel Prior (BCP), which indicates natural statistical regularity, as additional restoration guidance. Experimental results on the LOL and LOLv2 datasets demonstrate that our method achieves state-of-the-art performance for the low-light image enhancement task. Codes are available at https://github.com/Oliiveralien/MDMS. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Domain Recommendation to Attract Users via Domain Preference Modeling b/data/2024/aaai/Multi-Domain Recommendation to Attract Users via Domain Preference Modeling new file mode 100644 index 0000000000..4e64b85795 --- /dev/null +++ b/data/2024/aaai/Multi-Domain Recommendation to Attract Users via Domain Preference Modeling @@ -0,0 +1 @@ +Recently, web platforms have been operating various service domains simultaneously. Targeting a platform that operates multiple service domains, we introduce a new task, Multi-Domain Recommendation to Attract Users (MDRAU), which recommends items from multiple ``unseen'' domains with which each user has not interacted yet, by using knowledge from the user's ``seen'' domains. In this paper, we point out two challenges of the MDRAU task. First, there are numerous possible combinations of mappings from seen to unseen domains because users have usually interacted with a different subset of service domains. Second, a user might have different preferences for each of the target unseen domains, which requires recommendations to reflect users' preferences on domains as well as items. To tackle these challenges, we propose the DRIP framework that models users' preferences at two levels (i.e., domain and item) and learns various seen-unseen domain mappings in a unified way with masked domain modeling. Our extensive experiments demonstrate the effectiveness of DRIP in the MDRAU task and its ability to capture users' domain-level preferences.
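A minimal sketch of what masked domain modeling could look like (the encoder choice, masking scheme, and loss below are assumptions for illustration, not the DRIP implementation): randomly hide one of a user's seen-domain representations and train the model to reconstruct it from the remaining domains.

import torch
import torch.nn as nn

class MaskedDomainEncoder(nn.Module):
    """Hypothetical sketch of masked domain modeling: mask one seen-domain
    representation per user and predict it from the other domains."""
    def __init__(self, num_domains, dim=64):
        super().__init__()
        self.domain_emb = nn.Embedding(num_domains, dim)      # domain-id embeddings
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, user_domain_feats):
        # user_domain_feats: (batch, num_domains, dim), one vector per seen domain
        b, d, _ = user_domain_feats.shape
        masked = user_domain_feats.clone()
        target_idx = torch.randint(0, d, (b,))                  # one masked domain per user
        masked[torch.arange(b), target_idx] = self.mask_token
        hidden = self.encoder(masked + self.domain_emb.weight)  # add domain-id embeddings
        pred = hidden[torch.arange(b), target_idx]               # prediction at the masked slot
        target = user_domain_feats[torch.arange(b), target_idx]
        return nn.functional.mse_loss(pred, target)              # reconstruction loss

Training over many random maskings forces the encoder to express every domain's preference in terms of the others, which is the property a seen-to-unseen mapping needs.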
\ No newline at end of file diff --git a/data/2024/aaai/Multi-Energy Guided Image Translation with Stochastic Differential Equations for Near-Infrared Facial Expression Recognition b/data/2024/aaai/Multi-Energy Guided Image Translation with Stochastic Differential Equations for Near-Infrared Facial Expression Recognition new file mode 100644 index 0000000000..07930d179e --- /dev/null +++ b/data/2024/aaai/Multi-Energy Guided Image Translation with Stochastic Differential Equations for Near-Infrared Facial Expression Recognition @@ -0,0 +1 @@ +Illumination variation has been a long-term challenge in real-world facial expression recognition (FER). Under uncontrolled or non-visible light conditions, near-infrared (NIR) can provide a simple alternative solution to obtain high-quality images and supplement the geometric and texture details that are missing in the visible (VIS) domain. Due to the lack of large-scale NIR facial expression datasets, directly extending VIS FER methods to the NIR spectrum may be ineffective. Additionally, previous heterogeneous image synthesis methods are restricted by low controllability without prior task knowledge. To tackle these issues, we present the first approach, called NIR-FER Stochastic Differential Equations (NFER-SDE), which transforms facial expression appearance between heterogeneous modalities to tackle the overfitting problem on small-scale NIR data. NFER-SDE can take the whole VIS source image as input and, together with domain-specific knowledge, guide the preservation of modality-invariant information in the high-frequency content of the image. Extensive experiments and ablation studies show that NFER-SDE significantly improves the performance of NIR FER and achieves state-of-the-art results on the only two available NIR FER datasets, Oulu-CASIA and Large-HFE. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Expert Distillation for Few-Shot Coordination (Student Abstract) b/data/2024/aaai/Multi-Expert Distillation for Few-Shot Coordination (Student Abstract) new file mode 100644 index 0000000000..2a0b96e1d4 --- /dev/null +++ b/data/2024/aaai/Multi-Expert Distillation for Few-Shot Coordination (Student Abstract) @@ -0,0 +1 @@ +Ad hoc teamwork is a crucial challenge that aims to design an agent capable of effective collaboration with teammates employing diverse strategies without prior coordination. However, current Population-Based Training (PBT) approaches train the ad hoc agent through interaction with diverse teammates from scratch, which suffer from low efficiency. We introduce Multi-Expert Distillation (MED), a novel approach that directly distills diverse strategies through modeling across-episodic sequences. Experiments show that our algorithm achieves more efficient and stable training and has the ability to improve its behavior using historical contexts. Our code is available at https://github.com/LAMDA-RL/MED. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Label Supervised Contrastive Learning b/data/2024/aaai/Multi-Label Supervised Contrastive Learning new file mode 100644 index 0000000000..2ce0f04433 --- /dev/null +++ b/data/2024/aaai/Multi-Label Supervised Contrastive Learning @@ -0,0 +1,7 @@ +Multi-label classification is an arduous problem given the complication in label correlation. Whilst sharing a common goal with contrastive learning in utilizing correlations for representation learning, how to better leverage label information remains challenging.
+Previous endeavors include extracting label-level representations or mapping labels to an embedding space, overlooking the correlation between multiple labels. +There is great ambiguity in determining positive samples when samples share different extents of label overlap, and in integrating such relations into loss functions. +In our work, we propose Multi-Label Supervised Contrastive learning (MulSupCon) with a novel contrastive loss function to adjust weights based on how much overlap one sample shares with the anchor. +By analyzing gradients, we explain why our method performs better under multi-label circumstances. +To evaluate, we conduct direct classification and transfer learning on several multi-label datasets, including widely-used image datasets such as MS-COCO and NUS-WIDE. +Validation indicates that our method outperforms the traditional multi-label classification method and shows competitive performance when compared to other existing approaches. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Level Cross-Modal Alignment for Image Clustering b/data/2024/aaai/Multi-Level Cross-Modal Alignment for Image Clustering new file mode 100644 index 0000000000..265f7aafd6 --- /dev/null +++ b/data/2024/aaai/Multi-Level Cross-Modal Alignment for Image Clustering @@ -0,0 +1 @@ +Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pretraining model could produce poor-quality pseudo labels and degrade clustering performance. To solve the aforementioned issue, we propose a novel Multi-level Cross-modal Alignment method to improve the alignments in a cross-modal pretraining model for downstream tasks, by building a smaller but better semantic space and aligning the images and texts at three levels, i.e., instance-level, prototype-level, and semantic-level. Theoretical results show that our proposed method converges, and suggest effective means to reduce the expected clustering risk of our method. Experimental results on five benchmark datasets clearly show the superiority of our new method. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Modal Disordered Representation Learning Network for Description-Based Person Search b/data/2024/aaai/Multi-Modal Disordered Representation Learning Network for Description-Based Person Search new file mode 100644 index 0000000000..ac04babe38 --- /dev/null +++ b/data/2024/aaai/Multi-Modal Disordered Representation Learning Network for Description-Based Person Search @@ -0,0 +1 @@ +Description-based person search aims to retrieve images of the target identity via textual descriptions. One of the challenges for this task is to extract discriminative representations from images and descriptions. Most existing methods apply the part-based split method or external models to explore the fine-grained details of local features, which ignore the global relationship between partial information and cause network instability. To overcome these issues, we propose a Multi-modal Disordered Representation Learning Network (MDRL) for description-based person search to fully extract the visual and textual representations. Specifically, we design a Cross-modality Global Feature Learning Architecture to learn the global features from the two modalities and meet the demand of the task.
Based on our global network, we introduce a Disorder Local Learning Module to explore local features by a disordered reorganization strategy from both visual and textual aspects and enhance the robustness of the whole network. Besides, we introduce a Cross-modality Interaction Module to guide the two streams to extract visual or textual representations considering the correlation between modalities. Extensive experiments are conducted on two public datasets, and the results show that our method outperforms state-of-the-art methods on the CUHK-PEDES and ICFG-PEDES datasets. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models b/data/2024/aaai/Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models new file mode 100644 index 0000000000..8873c59463 --- /dev/null +++ b/data/2024/aaai/Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models @@ -0,0 +1 @@ +Chain-of-thought (CoT) reasoning has exhibited impressive performance in language models for solving complex tasks and answering questions. However, many real-world questions require multi-modal information, such as text and images. Previous research on multi-modal CoT has primarily focused on extracting fixed image features from off-the-shelf vision models and then fusing them with text using attention mechanisms. This approach has limitations because these vision models were not designed for complex reasoning tasks and do not align well with language thoughts. To overcome this limitation, we introduce a novel approach for multi-modal CoT reasoning that utilizes latent space learning via diffusion processes to generate effective image features that align with language thoughts. Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT. We demonstrate the efficacy of our proposed method on multi-modal ScienceQA and machine translation benchmarks, achieving state-of-the-art performance on ScienceQA. Overall, our approach offers a more robust and effective solution for multi-modal reasoning in language models, enhancing their ability to tackle complex real-world problems. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection b/data/2024/aaai/Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection new file mode 100644 index 0000000000..445f453c87 --- /dev/null +++ b/data/2024/aaai/Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection @@ -0,0 +1 @@ +Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progress in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models like CLIP, has shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address this challenge, we propose a multi-modal prompting method that adapts CLIP well to open-vocabulary video visual relationship detection by prompt-tuning on both visual representation and language input.
Specifically, we enhance the image encoder of CLIP by using spatio-temporal visual prompting to capture spatio-temporal contexts, thereby making it suitable for object-level relationship representation in videos. Furthermore, we propose visual-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating recognizing novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, especially achieving a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Modality Affinity Inference for Weakly Supervised 3D Semantic Segmentation b/data/2024/aaai/Multi-Modality Affinity Inference for Weakly Supervised 3D Semantic Segmentation new file mode 100644 index 0000000000..7cbdd243db --- /dev/null +++ b/data/2024/aaai/Multi-Modality Affinity Inference for Weakly Supervised 3D Semantic Segmentation @@ -0,0 +1 @@ +3D point cloud semantic segmentation has a wide range of applications. Recently, weakly supervised point cloud segmentation methods have been proposed, aiming to alleviate the expensive and laborious manual annotation process by leveraging scene-level labels. However, these methods have not effectively exploited the rich geometric information (such as shape and scale) and appearance information (such as color and texture) present in RGB-D scans. Furthermore, current approaches fail to fully leverage the point affinity that can be inferred from the feature extraction network, which is crucial for learning from weak scene-level labels. Additionally, previous work overlooks the detrimental effects of the long-tailed distribution of point cloud data in weakly supervised 3D semantic segmentation. To this end, this paper proposes a simple yet effective scene-level weakly supervised point cloud segmentation method with a newly introduced multi-modality point affinity inference module. The point affinity proposed in this paper is characterized by features from multiple modalities (e.g., point cloud and RGB), and is further refined by normalizing the classifier weights to alleviate the detrimental effects of long-tailed distribution without the need of the prior of category distribution. Extensive experiments on the ScanNet and S3DIS benchmarks verify the effectiveness of our proposed method, which outperforms the state-of-the-art by ~4% to ~ 6% mIoU. Codes are released at https://github.com/Sunny599/AAAI24-3DWSSG-MMA. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Objective Bayesian Optimization with Active Preference Learning b/data/2024/aaai/Multi-Objective Bayesian Optimization with Active Preference Learning new file mode 100644 index 0000000000..7effa94a4a --- /dev/null +++ b/data/2024/aaai/Multi-Objective Bayesian Optimization with Active Preference Learning @@ -0,0 +1 @@ +There are a lot of real-world black-box optimization problems that need to optimize multiple criteria simultaneously. However, in a multi-objective optimization (MOO) problem, identifying the whole Pareto front requires the prohibitive search cost, while in many practical scenarios, the decision maker (DM) only needs a specific solution among the set of the Pareto optimal solutions. 
We propose a Bayesian optimization (BO) approach to identifying the most preferred solution in the MOO with expensive objective functions, in which a Bayesian preference model of the DM is adaptively estimated in an interactive manner based on two types of supervision, called pairwise preference and improvement request. To explore the most preferred solution, we define an acquisition function in which the uncertainty in both the objective function and the DM preference is incorporated. Further, to minimize the interaction cost with the DM, we also propose an active learning strategy for the preference estimation. We empirically demonstrate the effectiveness of our proposed method through benchmark function optimization and hyper-parameter optimization problems for machine learning models. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification b/data/2024/aaai/Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification new file mode 100644 index 0000000000..5a404e997e --- /dev/null +++ b/data/2024/aaai/Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification @@ -0,0 +1,2 @@ +Fine-grained attribute descriptions can significantly supplement valuable semantic information for person images, which is vital to the success of the person re-identification (ReID) +task. However, current ReID algorithms typically fail to effectively leverage the rich contextual information available, primarily due to their reliance on simplistic and coarse utilization of image attributes. Recent advances in artificial intelligence generated content have made it possible to automatically generate plentiful fine-grained attribute descriptions and make full use of them. Therefore, this paper explores the potential of using the generated multiple person attributes as prompts in ReID tasks with off-the-shelf (large) models for more accurate retrieval results. To this end, we present a new framework called Multi-Prompts ReID (MP-ReID), based on prompt learning and language models, to fully exploit fine-grained attributes to assist the ReID task. Specifically, MP-ReID first learns to hallucinate diverse, informative, and promptable sentences for describing the query images. This procedure includes (i) explicit prompts of which attributes a person has and furthermore (ii) implicit learnable prompts for adjusting/conditioning the criteria used for this person identity matching. Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models. Moreover, an alignment module is designed to fuse multi-prompts (i.e., explicit and implicit ones) progressively and mitigate the cross-modal gap. Extensive experiments on the existing attribute-involved ReID datasets, namely, Market1501 and DukeMTMC-reID, demonstrate the effectiveness and rationality of the proposed MP-ReID solution.
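A small sketch of how explicit and implicit prompts might be combined (the module names, fusion choice, and similarity score are illustrative assumptions, not the MP-ReID code): embeddings of generated attribute sentences are concatenated with a bank of learnable prompt vectors and fused before matching against the image feature.

import torch
import torch.nn as nn

class MultiPromptHead(nn.Module):
    """Hypothetical sketch: fuse explicit prompt embeddings (e.g., from generated
    attribute sentences) with learnable implicit prompts, then score against images."""
    def __init__(self, dim=512, num_implicit=4):
        super().__init__()
        self.implicit_prompts = nn.Parameter(torch.randn(num_implicit, dim) * 0.02)
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, explicit_prompt_emb, image_emb):
        # explicit_prompt_emb: (batch, num_prompts, dim); image_emb: (batch, dim)
        b = explicit_prompt_emb.size(0)
        implicit = self.implicit_prompts.unsqueeze(0).expand(b, -1, -1)
        prompts = torch.cat([explicit_prompt_emb, implicit], dim=1)
        fused, _ = self.fuse(prompts, prompts, prompts)   # simplified one-shot fusion
        text_emb = fused.mean(dim=1)
        return nn.functional.cosine_similarity(text_emb, image_emb, dim=-1)

Here the sentence embeddings carry the explicit attribute content, while the learnable vectors play the role of the implicit prompts that condition the matching criteria.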
\ No newline at end of file diff --git a/data/2024/aaai/Multi-Prototype Space Learning for Commonsense-Based Scene Graph Generation b/data/2024/aaai/Multi-Prototype Space Learning for Commonsense-Based Scene Graph Generation new file mode 100644 index 0000000000..5f04832288 --- /dev/null +++ b/data/2024/aaai/Multi-Prototype Space Learning for Commonsense-Based Scene Graph Generation @@ -0,0 +1 @@ +In the domain of scene graph generation, modeling commonsense as a single-prototype representation has been typically employed to facilitate the recognition of infrequent predicates. However, a fundamental challenge lies in the large intra-class variations of the visual appearance of predicates, resulting in subclasses within a predicate class. Such a challenge typically leads to the problem of misclassifying diverse predicates due to the rough predicate space clustering. In this paper, inspired by cognitive science, we maintain multi-prototype representations for each predicate class, which can accurately find the multiple class centers of the predicate space. Technically, we propose a novel multi-prototype learning framework consisting of three main steps: prototype-predicate matching, prototype updating, and prototype space optimization. We first design a triple-level optimal transport to match each predicate feature within the same class to a specific prototype. In addition, the prototypes are updated using momentum updating to find the class centers according to the matching results. Finally, we enhance the inter-class separability of the prototype space through iterations of the inter-class separability loss and intra-class compactness loss. Extensive evaluations demonstrate that our approach significantly outperforms state-of-the-art methods on the Visual Genome dataset. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Region Text-Driven Manipulation of Diffusion Imagery b/data/2024/aaai/Multi-Region Text-Driven Manipulation of Diffusion Imagery new file mode 100644 index 0000000000..5632f6e162 --- /dev/null +++ b/data/2024/aaai/Multi-Region Text-Driven Manipulation of Diffusion Imagery @@ -0,0 +1 @@ +Text-guided image manipulation has attracted significant attention recently. Prevailing techniques concentrate on image attribute editing for individual objects, however, encountering challenges when it comes to multi-object editing. The main reason is the lack of consistency constraints on the spatial layout. This work presents a multi-region guided image manipulation framework, enabling manipulation through region-level textual prompts. With MultiDiffusion as a baseline, we are dedicated to the automatic generation of a rational multi-object spatial distribution, where disparate regions are fused as a unified entity. To mitigate interference from regional fusion, we employ an off-the-shelf model (CLIP) to impose region-aware spatial guidance on multi-object manipulation. Moreover, when applied to the StableDiffusion, the presence of quality-related yet object-agnostic lengthy words hampers the manipulation. To ensure focus on meaningful object-specific words for efficient guidance and generation, we introduce a keyword selection method. Furthermore, we demonstrate a downstream application of our method for multi-region inversion, which is tailored for manipulating multiple objects in real images. 
Our approach, compatible with variants of Stable Diffusion models, is readily applicable for manipulating diverse objects in extensive images with high-quality generation, showing superb image control capabilities. Code is available at https://github.com/liyiming09/multi-region-guided-diffusion. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Scale Dynamic Graph Learning for Time Series Anomaly Detection (Student Abstract) b/data/2024/aaai/Multi-Scale Dynamic Graph Learning for Time Series Anomaly Detection (Student Abstract) new file mode 100644 index 0000000000..4639fd41eb --- /dev/null +++ b/data/2024/aaai/Multi-Scale Dynamic Graph Learning for Time Series Anomaly Detection (Student Abstract) @@ -0,0 +1 @@ +The success of graph neural networks (GNNs) has spurred numerous new works leveraging GNNs for multivariate time series anomaly detection. Despite their achieved performance improvements, most of them only consider a static graph to describe the spatial-temporal dependencies between time series. Moreover, existing works neglect the time- and scale-changing structures of time series. In this work, we propose MDGAD, a novel multi-scale dynamic graph structure learning approach for time series anomaly detection. We design a multi-scale graph structure learning module that captures the complex correlations among time series, constructing an evolving graph at each scale. Meanwhile, an anomaly detector is used to combine bilateral prediction errors to detect abnormal data. Experiments conducted on two time series datasets demonstrate the effectiveness of MDGAD. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Scene Generalized Trajectory Global Graph Solver with Composite Nodes for Multiple Object Tracking b/data/2024/aaai/Multi-Scene Generalized Trajectory Global Graph Solver with Composite Nodes for Multiple Object Tracking new file mode 100644 index 0000000000..e4e9094928 --- /dev/null +++ b/data/2024/aaai/Multi-Scene Generalized Trajectory Global Graph Solver with Composite Nodes for Multiple Object Tracking @@ -0,0 +1 @@ +The global multi-object tracking (MOT) system can consider interaction, occlusion, and other ``visual blur'' scenarios to ensure effective object tracking in long videos. Among them, graph-based tracking-by-detection paradigms achieve surprising performance. However, their fully-connected nature poses storage space requirements that challenge algorithms handling long videos. Currently, commonly used methods still generate trajectories by building one-forward associations across frames. Such matches produced under the guidance of first-order similarity information may not be optimal from a longer-time perspective. Moreover, they often lack an end-to-end scheme for correcting mismatches. This paper proposes the Composite Node Message Passing Network (CoNo-Link), a multi-scene generalized framework for modeling ultra-long frame information for association. CoNo-Link's solution is a low-storage-overhead method for building constrained connected graphs. In addition to the previous method of treating objects as nodes, the network innovatively treats object trajectories as nodes for information interaction, improving the graph neural network's feature representation capability. Specifically, we formulate the graph-building problem as a top-k selection task for some reliable objects or trajectories. Our model can learn better predictions on longer-time scales by adding composite nodes.
As a result, our method outperforms the state-of-the-art in several commonly used datasets. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Source Collaborative Gradient Discrepancy Minimization for Federated Domain Generalization b/data/2024/aaai/Multi-Source Collaborative Gradient Discrepancy Minimization for Federated Domain Generalization new file mode 100644 index 0000000000..555ecfc6f0 --- /dev/null +++ b/data/2024/aaai/Multi-Source Collaborative Gradient Discrepancy Minimization for Federated Domain Generalization @@ -0,0 +1 @@ +Federated Domain Generalization aims to learn a domain-invariant model from multiple decentralized source domains for deployment on unseen target domain. Due to privacy concerns, the data from different source domains are kept isolated, which poses challenges in bridging the domain gap. To address this issue, we propose a Multi-source Collaborative Gradient Discrepancy Minimization (MCGDM) method for federated domain generalization. Specifically, we propose intra-domain gradient matching between the original images and augmented images to avoid overfitting the domain-specific information within isolated domains. Additionally, we propose inter-domain gradient matching with the collaboration of other domains, which can further reduce the domain shift across decentralized domains. Combining intra-domain and inter-domain gradient matching, our method enables the learned model to generalize well on unseen domains. Furthermore, our method can be extended to the federated domain adaptation task by fine-tuning the target model on the pseudo-labeled target domain. The extensive experiments on federated domain generalization and adaptation indicate that our method outperforms the state-of-the-art methods significantly. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Stage Prompting for Next Best Agent Recommendations in Adaptive Workflows b/data/2024/aaai/Multi-Stage Prompting for Next Best Agent Recommendations in Adaptive Workflows new file mode 100644 index 0000000000..03556a9b10 --- /dev/null +++ b/data/2024/aaai/Multi-Stage Prompting for Next Best Agent Recommendations in Adaptive Workflows @@ -0,0 +1 @@ +Traditional business processes such as loan processing, order processing, or procurement have a series of steps that are pre-defined at design and executed by enterprise systems. Recent advancements in new-age businesses, however, focus on having adaptive and ad-hoc processes by stitching together a set of functions or steps enabled through autonomous agents. Further, to enable business users to execute a flexible set of steps, there have been works on providing a conversational interface to interact and execute automation. Often, it is necessary to guide the user through the set of possible steps in the process (or workflow). Existing work on recommending the next agent to run relies on historical data. However, with changing workflows and new automation constantly getting added, it is important to provide recommendations without historical data. Additionally, hand-crafted recommendation rules do not scale. The adaptive workflow being a combination of structured and unstructured information, makes it harder to mine. Hence, in this work, we leverage Large Language Models (LLMs) to combine process knowledge with the meta-data of agents to discover NBAs specifically at cold-start. 
We propose a multi-stage approach that uses existing process knowledge and agent meta-data information to prompt an LLM and recommend a meaningful next best agent (NBA) based on user utterances. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models b/data/2024/aaai/Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models new file mode 100644 index 0000000000..762182e3bd --- /dev/null +++ b/data/2024/aaai/Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models @@ -0,0 +1 @@ +Denoising Diffusion Probabilistic Models (DDPMs) have achieved significant success in generation tasks. Nevertheless, the exposure bias issue, i.e., the natural discrepancy between training (where the output of each step is computed from a given input) and inference (where the output of each step is computed from input obtained iteratively from the model), harms the performance of DDPMs. To our knowledge, few works have tried to tackle this issue by modifying the training process for DDPMs, but they still perform unsatisfactorily due to 1) partially modeling the discrepancy and 2) ignoring the prediction error accumulation. To address the above issues, in this paper, we propose a multi-step denoising scheduled sampling (MDSS) strategy to alleviate the exposure bias for DDPMs. Analyzing the formulations of the training and inference of DDPMs, MDSS 1) comprehensively considers the discrepancy influence of prediction errors on the output of the model (the Gaussian noise) and the output of the step (the calculated input signal of the next step), and 2) efficiently models the prediction error accumulation by using multiple iterations of a mathematical formulation initialized from the one-step prediction error obtained from the model. The experimental results, compared with previous works, demonstrate that our approach is more effective in mitigating exposure bias in DDPM, DDIM, and DPM-solver. In particular, MDSS achieves an FID score of 3.86 with 100 sampling steps of DDIM on the CIFAR-10 dataset, whereas the second best obtains 4.78. The code will be available on GitHub. \ No newline at end of file diff --git a/data/2024/aaai/Multi-View Dynamic Reflection Prior for Video Glass Surface Detection b/data/2024/aaai/Multi-View Dynamic Reflection Prior for Video Glass Surface Detection new file mode 100644 index 0000000000..7de060c1d5 --- /dev/null +++ b/data/2024/aaai/Multi-View Dynamic Reflection Prior for Video Glass Surface Detection @@ -0,0 +1 @@ +Recent research has shown significant interest in image-based glass surface detection (GSD). However, detecting glass surfaces in dynamic scenes remains largely unexplored due to the lack of a high-quality dataset and an effective video glass surface detection (VGSD) method. In this paper, we propose the first VGSD approach. Our key observation is that reflections frequently appear on glass surfaces, but they change dynamically as the camera moves. Based on this observation, we propose to offset the excessive dependence on a single uncertain reflection via joint modeling of temporal and spatial reflection cues. To this end, we propose the VGSD-Net with two novel modules: a Location-aware Reflection Extraction (LRE) module and a Context-enhanced Reflection Integration (CRI) module, for position-aware reflection feature extraction and spatial-temporal reflection cue integration, respectively.
We have also created the first large-scale video glass surface dataset (VGSD-D), consisting of 19,166 image frames with accurately-annotated glass masks extracted from 297 videos. Extensive experiments demonstrate that VGSD-Net outperforms state-of-the-art approaches adapted from related fields. Code and dataset will be available at https://github.com/fawnliu/VGSD. \ No newline at end of file diff --git a/data/2024/aaai/Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting b/data/2024/aaai/Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting new file mode 100644 index 0000000000..5bbe5c1faf --- /dev/null +++ b/data/2024/aaai/Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting @@ -0,0 +1 @@ +Recent deep learning-based multi-view people detection (MVD) methods have shown promising results on existing datasets. However, current methods are mainly trained and evaluated on small, single scenes with a limited number of multi-view frames and fixed camera views. As a result, these methods may not be practical for detecting people in larger, more complex scenes with severe occlusions and camera calibration errors. This paper focuses on improving multi-view people detection by developing a supervised view-wise contribution weighting approach that better fuses multi-camera information under large scenes. Besides, a large synthetic dataset is adopted to enhance the model's generalization ability and enable more practical evaluation and comparison. The model's performance on new testing scenes is further improved with a simple domain adaptation technique. Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance. \ No newline at end of file diff --git a/data/2024/aaai/Multi-View Randomized Kernel Classification via Nonconvex Optimization b/data/2024/aaai/Multi-View Randomized Kernel Classification via Nonconvex Optimization new file mode 100644 index 0000000000..90bebfe769 --- /dev/null +++ b/data/2024/aaai/Multi-View Randomized Kernel Classification via Nonconvex Optimization @@ -0,0 +1,8 @@ +Multi kernel learning (MKL) is a representative supervised multi-view learning method widely applied in multi-modal and multi-view applications. +MKL aims to classify data by integrating complementary information from predefined kernels. +Although existing MKL methods achieve promising performance, they fail to consider the tradeoff between diversity and classification accuracy of kernels, preventing further improvement of classification performance. +In this paper, we tackle this problem by generating a number of high-quality base learning kernels and selecting a kernel subset with maximum pairwise diversity and minimum generalization errors. +We first formulate this idea as a nonconvex quadratic integer programming problem. +Then we transform this nonconvex problem into a convex optimization problem and prove it is equivalent to a semidefinite relaxation problem, which a semidefinite-based branch-and-bound algorithm can quickly solve. +Experimental results on the real-world datasets demonstrate the superiority of the proposed method. +The results also show that our method works for the support vector machine (SVM) classifier and other state-of-the-art kernel classifiers. 
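To make the diversity-versus-error tradeoff above concrete, here is a minimal sketch, under assumptions, of the kernel-subset selection idea: it is not the paper's semidefinite-relaxation branch-and-bound solver, and the names D (pairwise diversity matrix), err (per-kernel error estimates), k, and lam are illustrative.

import itertools
import numpy as np

def select_kernel_subset(D, err, k, lam=1.0):
    """Pick k base kernels maximizing summed pairwise diversity minus lam * summed error.
    D[i, j] is the diversity between kernels i and j; err[i] is an error estimate for kernel i."""
    n = len(err)
    best_score, best_subset = -np.inf, None
    for subset in itertools.combinations(range(n), k):  # exhaustive search; fine for small n
        idx = np.array(subset)
        diversity = D[np.ix_(idx, idx)].sum() / 2.0      # each unordered pair counted once
        score = diversity - lam * err[idx].sum()
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score

# Toy usage with random diversities and error estimates.
rng = np.random.default_rng(0)
n = 8
A = rng.random((n, n))
D = (A + A.T) / 2.0
np.fill_diagonal(D, 0.0)
err = rng.random(n) * 0.3
print(select_kernel_subset(D, err, k=3))

The exhaustive loop makes the objective explicit but scales combinatorially; the paper instead solves a convex relaxation of this selection problem, which is what makes the approach practical for larger kernel pools.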
\ No newline at end of file diff --git a/data/2024/aaai/Multi-world Model in Continual Reinforcement Learning b/data/2024/aaai/Multi-world Model in Continual Reinforcement Learning new file mode 100644 index 0000000000..ca422ba088 --- /dev/null +++ b/data/2024/aaai/Multi-world Model in Continual Reinforcement Learning @@ -0,0 +1 @@ +World Models are made of generative networks that can predict future states of the single environment they were trained on. This research proposes a Multi-world Model, a foundational model built from World Models for the field of continual reinforcement learning that is trained on many different environments, enabling it to generalize state sequence predictions even for unseen settings. \ No newline at end of file diff --git a/data/2024/aaai/MultiSum: A Multi-Facet Approach for Extractive Social Summarization Utilizing Semantic and Sociological Relationships b/data/2024/aaai/MultiSum: A Multi-Facet Approach for Extractive Social Summarization Utilizing Semantic and Sociological Relationships new file mode 100644 index 0000000000..676146beb7 --- /dev/null +++ b/data/2024/aaai/MultiSum: A Multi-Facet Approach for Extractive Social Summarization Utilizing Semantic and Sociological Relationships @@ -0,0 +1 @@ +Social summarization aims to provide summaries for a large number of social texts (called posts) about a single topic. To extract a summary, both the post representation and the summary selection method are crucial. Previous methods introduce social relations to enhance post embeddings, mitigating the sparse representations caused by posts' brief and informal expression. However, they ignore that there are multiple relations between posts. Besides, existing graph-based centrality calculation approaches tend to select posts from one aspect. This leads to facet bias, especially when there are multiple viewpoints. In this paper, we propose a model named MultiSum to improve social summarization. Specifically, 1) We use graph convolutional networks to fuse text content with social and semantic relations to improve post representation; 2) The similarity between the summary and all aspects is incorporated into the centrality score during the selection phase, encouraging the model to pay attention to different facets. Experimental results on English and Chinese corpora support the effectiveness of this model. Furthermore, external evaluations by human experts and large language models demonstrate the validity of MultiSum in facet coverage and redundancy reduction. \ No newline at end of file diff --git a/data/2024/aaai/Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces b/data/2024/aaai/Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces new file mode 100644 index 0000000000..729e659c17 --- /dev/null +++ b/data/2024/aaai/Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces @@ -0,0 +1 @@ +AlphaZero and MuZero have achieved state-of-the-art (SOTA) performance in a wide range of domains, including board games and robotics, with discrete and continuous action spaces. However, to obtain an improved policy, they often require an excessively large number of simulations, especially for domains with large action spaces. As the simulation budget decreases, their performance drops significantly. In addition, many important real-world applications have combinatorial (or exponential) action spaces, making it infeasible to search directly over all possible actions.
In this paper, we extend AlphaZero and MuZero to learn and plan in more complex multiagent (MA) Markov decision processes, where the action spaces increase exponentially with the number of agents. Our new algorithms, MA Gumbel AlphaZero and MA Gumbel MuZero, respectively without and with model learning, achieve superior performance on cooperative multiagent control problems, while reducing the number of environmental interactions by up to an order of magnitude compared to model-free approaches. In particular, we significantly improve prior performance when planning with much fewer simulation budgets. The code and appendix are available at https://github.com/tjuHaoXiaotian/MA-MuZero. \ No newline at end of file diff --git a/data/2024/aaai/Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation b/data/2024/aaai/Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation new file mode 100644 index 0000000000..e2ece75f0f --- /dev/null +++ b/data/2024/aaai/Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation @@ -0,0 +1 @@ +Self-supervised speech pre-training methods have developed rapidly in recent years, which show to be very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing is suffering from the scarcity of labeled multichannel data and complex ambient noises. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose the multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. First, we propose a multi-path structure to process multi-channel audio streams and a visual stream in parallel, with intra-, and inter-channel contrastive as training targets to fully exploit the rich information in multi-channel speech data. Second, based on contrastive learning, we use additional single-channel audio data, which is trained jointly to improve the performance of multichannel multi-modal representation. Finally, we use a Chinese multichannel multi-modal dataset in real scenarios to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks. \ No newline at end of file diff --git a/data/2024/aaai/Multilevel Attention Network with Semi-supervised Domain Adaptation for Drug-Target Prediction b/data/2024/aaai/Multilevel Attention Network with Semi-supervised Domain Adaptation for Drug-Target Prediction new file mode 100644 index 0000000000..0d352567fd --- /dev/null +++ b/data/2024/aaai/Multilevel Attention Network with Semi-supervised Domain Adaptation for Drug-Target Prediction @@ -0,0 +1 @@ +Prediction of drug-target interactions (DTIs) is a crucial step in drug discovery, and deep learning methods have shown great promise on various DTI datasets. However, existing approaches still face several challenges, including limited labeled data, hidden bias issue, and a lack of generalization ability to out-of-domain data. These challenges hinder the model's capacity to learn truly informative interaction features, leading to shortcut learning and inferior predictive performance on novel drug-target pairs. 
To address these issues, we propose MlanDTI, a semi-supervised domain adaptive multilevel attention network (Mlan) for DTI prediction. We utilize two pre-trained BERT models to acquire bidirectional representations enriched with information from unlabeled data. Then, we introduce a multilevel attention mechanism, enabling the model to learn domain-invariant DTIs at different hierarchical levels. Moreover, we present a simple yet effective semi-supervised pseudo-labeling method to further enhance our model's predictive ability in cross-domain scenarios. Experiments on four datasets show that MlanDTI achieves state-of-the-art performances over other methods under intra-domain settings and outperforms all other approaches under cross-domain settings. The source code is available at https://github.com/CMACH508/MlanDTI. \ No newline at end of file diff --git a/data/2024/aaai/Multilingual Medical Language Models: A Path to Improving Lay Health Worker Effectiveness (Student Abstract) b/data/2024/aaai/Multilingual Medical Language Models: A Path to Improving Lay Health Worker Effectiveness (Student Abstract) new file mode 100644 index 0000000000..820ade4b2b --- /dev/null +++ b/data/2024/aaai/Multilingual Medical Language Models: A Path to Improving Lay Health Worker Effectiveness (Student Abstract) @@ -0,0 +1,3 @@ +The COVID-19 pandemic has exacerbated the challenges faced by healthcare delivery in developing nations, placing additional strain on already fragile infrastructure and healthcare systems. This has prompted an increased reliance on lay healthcare workers (LHWs) to meet the surging demand for services. Due to limited formal training, many LHWs have resorted to using unreliable sources, such as internet searches, to access medical information. + +Large language models (LLMs) offer a promising opportunity to support LHWs by providing accurate, context-sensitive information for improving healthcare delivery, provided they are appropriately fine-tuned on domain-specific multilingual data. This paper delves into critical issues and presents potential solutions for developing LLM-powered virtual assistants tailored to LHWs serving Telugu and Hindi-speaking populations. Key focal points include the customization of language and content to suit local contexts, the integration of feedback mechanisms to continuously enhance assistance quality, and the delicate balance between automation and human oversight. \ No newline at end of file diff --git a/data/2024/aaai/Multimodal Ensembling for Zero-Shot Image Classification b/data/2024/aaai/Multimodal Ensembling for Zero-Shot Image Classification new file mode 100644 index 0000000000..3d8c144af1 --- /dev/null +++ b/data/2024/aaai/Multimodal Ensembling for Zero-Shot Image Classification @@ -0,0 +1 @@ +Artificial intelligence has made significant progress in image classification, an essential task for machine perception to achieve human-level image understanding. Despite recent advances in vision-language fields, multimodal image classification is still challenging, particularly for the following two reasons. First, models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Second, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. Here, we utilize ensemble learning to reduce the impact of these issues on pre-trained models. 
We aim to create a meta-model that combines the predictions of multiple open-vocabulary multimodal models trained on different data to create more robust and accurate predictions. By utilizing ensemble learning and multimodal machine learning, we will achieve higher prediction accuracies without any additional training or fine-tuning, meaning that this method is completely zero-shot. \ No newline at end of file diff --git a/data/2024/aaai/Multimodal Event Causality Reasoning with Scene Graph Enhanced Interaction Network b/data/2024/aaai/Multimodal Event Causality Reasoning with Scene Graph Enhanced Interaction Network new file mode 100644 index 0000000000..a6afdb7011 --- /dev/null +++ b/data/2024/aaai/Multimodal Event Causality Reasoning with Scene Graph Enhanced Interaction Network @@ -0,0 +1 @@ +Multimodal event causality reasoning aims to recognize the causal relations based on the given events and accompanying image pairs, requiring the model to have a comprehensive grasp of visual and textual information. However, existing studies fail to effectively model the relations of the objects within the image and capture the object interactions across the image pair, resulting in an insufficient understanding of visual information by the model. To address these issues, we propose a Scene Graph Enhanced Interaction Network (SEIN) in this paper, which can leverage the interactions of the generated scene graph for multimodal event causality reasoning. Specifically, the proposed method adopts a graph convolutional network to model the objects and their relations derived from the scene graph structure, empowering the model to exploit the rich structural and semantic information in the image adequately. To capture the object interactions between the two images, we design an optimal transport-based alignment strategy to match the objects across the images, which could help the model recognize changes in visual information and facilitate causality reasoning. In addition, we introduce a cross-modal fusion module to combine textual and visual features for causality prediction. Experimental results indicate that the proposed SEIN outperforms state-of-the-art methods on the Vis-Causal dataset. \ No newline at end of file diff --git a/data/2024/aaai/Multimodal Graph Neural Architecture Search under Distribution Shifts b/data/2024/aaai/Multimodal Graph Neural Architecture Search under Distribution Shifts new file mode 100644 index 0000000000..d13cc85672 --- /dev/null +++ b/data/2024/aaai/Multimodal Graph Neural Architecture Search under Distribution Shifts @@ -0,0 +1 @@ +Multimodal graph neural architecture search (MGNAS) has shown great success for automatically designing the optimal multimodal graph neural network (MGNN) architecture by leveraging multimodal representation, crossmodal information and graph structure in one unified framework. However, existing MGNAS fails to handle distribution shifts that naturally exist in multimodal graph data, since the searched architectures inevitably capture spurious statistical correlations under distribution shifts. To solve this problem, we propose a novel Out-of-distribution Generalized Multimodal Graph Neural Architecture Search (OMG-NAS) method which optimizes the MGNN architecture with respect to its performance on decorrelated OOD data. 
Specifically, we propose a multimodal graph representation decorrelation strategy, which encourages the searched MGNN model to output representations that eliminate spurious correlations through iteratively optimizing the feature weights and controller. In addition, we propose a global sample weight estimator that facilitates the sharing of optimal sample weights learned from existing architectures. This design promotes the effective estimation of the sample weights for candidate MGNN architectures to generate decorrelated multimodal graph representations, concentrating more on the truly predictive relations between invariant features and ground-truth labels. Extensive experiments on real-world multimodal graph datasets demonstrate the superiority of our proposed method over SOTA baselines. \ No newline at end of file diff --git a/data/2024/aaai/Multiobjective Lipschitz Bandits under Lexicographic Ordering b/data/2024/aaai/Multiobjective Lipschitz Bandits under Lexicographic Ordering new file mode 100644 index 0000000000..9005fcba22 --- /dev/null +++ b/data/2024/aaai/Multiobjective Lipschitz Bandits under Lexicographic Ordering @@ -0,0 +1 @@ +This paper studies the multiobjective bandit problem under lexicographic ordering, wherein the learner aims to simultaneously maximize m objectives hierarchically. The only existing algorithm for this problem considers the multi-armed bandit model, and its regret bound is O((KT)^(2/3)) under a metric called priority-based regret. However, this bound is suboptimal, as the lower bound for single objective multi-armed bandits is Omega(K log T). Moreover, this bound becomes vacuous when the arm number K is infinite. To address these limitations, we investigate the multiobjective Lipschitz bandit model, which allows for an infinite arm set. Utilizing a newly designed multi-stage decision-making strategy, we develop an improved algorithm that achieves a general regret bound of O(T^((d_z^i+1)/(d_z^i+2))) for the i-th objective, where d_z^i is the zooming dimension for the i-th objective, with i in {1,2,...,m}. This bound matches the lower bound of the single objective Lipschitz bandit problem in terms of T, indicating that our algorithm is almost optimal. Numerical experiments confirm the effectiveness of our algorithm. \ No newline at end of file diff --git a/data/2024/aaai/Multipartite Entity Resolution: Motivating a K-Tuple Perspective (Student Abstract) b/data/2024/aaai/Multipartite Entity Resolution: Motivating a K-Tuple Perspective (Student Abstract) new file mode 100644 index 0000000000..9af6f951e9 --- /dev/null +++ b/data/2024/aaai/Multipartite Entity Resolution: Motivating a K-Tuple Perspective (Student Abstract) @@ -0,0 +1 @@ +Entity Resolution (ER) is the problem of algorithmically matching records, mentions, or entries that refer to the same underlying real-world entity. Traditionally, the problem assumes (at most) two datasets, between which records need to be matched. There is considerably less research in ER when k > 2 datasets are involved. The evaluation of such multipartite ER (M-ER) is especially complex, since the usual ER metrics assume (whether implicitly or explicitly) k < 3. This paper takes the first step towards motivating a k-tuple approach for evaluating M-ER.
Using standard algorithms and k-tuple versions of metrics like precision and recall, our preliminary results suggest a significant difference compared to aggregated pairwise evaluation, which would first decompose the M-ER problem into independent bipartite problems and then aggregate their metrics. Hence, M-ER may be more challenging and warrant more novel approaches than current decomposition-based pairwise approaches would suggest. \ No newline at end of file diff --git a/data/2024/aaai/Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions b/data/2024/aaai/Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions new file mode 100644 index 0000000000..e356e4d6bf --- /dev/null +++ b/data/2024/aaai/Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions @@ -0,0 +1,3 @@ +In many real-world applications, from robotics to pedestrian trajectory prediction, there is a need to predict multiple real-valued outputs to represent several potential scenarios. Current deep learning techniques to address multiple-output problems are based on two main methodologies: (1) mixture density networks, which suffer from poor stability at high dimensions, or (2) multiple choice learning (MCL), an approach that uses M single-output functions, each only producing a point estimate hypothesis. This paper presents a Mixture of Multiple-Output functions (MoM) approach using a novel variant of dropout, Multiple Hypothesis Dropout. Unlike traditional MCL-based approaches, each multiple-output function not only estimates the mean but also the variance for its hypothesis. This is achieved through a novel stochastic winner-take-all loss which allows each multiple-output function to estimate variance through the spread of its subnetwork predictions. +Experiments on supervised learning problems illustrate that our approach outperforms existing solutions for reconstructing multimodal output distributions. +Additional studies on unsupervised learning problems show that estimating the parameters of latent posterior distributions within a discrete autoencoder significantly improves codebook efficiency, sample quality, precision and recall. \ No newline at end of file diff --git a/data/2024/aaai/Multiple-Source Localization from a Single-Snapshot Observation Using Graph Bayesian Optimization b/data/2024/aaai/Multiple-Source Localization from a Single-Snapshot Observation Using Graph Bayesian Optimization new file mode 100644 index 0000000000..f7de1ed3eb --- /dev/null +++ b/data/2024/aaai/Multiple-Source Localization from a Single-Snapshot Observation Using Graph Bayesian Optimization @@ -0,0 +1,2 @@ +Due to the significance of its various applications, source localization has garnered considerable attention as one of the most important means to confront diffusion hazards. Multi-source localization from a single-snapshot observation is especially relevant due to its prevalence. However, the inherent complexities of this problem, such as limited information, interactions among sources, and dependence on diffusion models, pose challenges to resolution. Current methods typically utilize heuristics and greedy selection, and they are usually bonded with one diffusion model. Consequently, their effectiveness is constrained. +To address these limitations, we propose a simulation-based method termed BOSouL. Bayesian optimization (BO) is adopted to approximate the results for its sample efficiency. 
A surrogate function models uncertainty from the limited information. It takes sets of nodes as the input instead of individual nodes. BOSouL can incorporate any diffusion model in the data acquisition process through simulations. Empirical studies demonstrate that its performance is robust across graph structures and diffusion models. The code is available at https://github.com/XGraph-Team/BOSouL. \ No newline at end of file diff --git a/data/2024/aaai/Multiscale Attention Wavelet Neural Operator for Capturing Steep Trajectories in Biochemical Systems b/data/2024/aaai/Multiscale Attention Wavelet Neural Operator for Capturing Steep Trajectories in Biochemical Systems new file mode 100644 index 0000000000..cbb9b66e29 --- /dev/null +++ b/data/2024/aaai/Multiscale Attention Wavelet Neural Operator for Capturing Steep Trajectories in Biochemical Systems @@ -0,0 +1 @@ +In biochemical modeling, some foundational systems can exhibit sudden and profound behavioral shifts, such as the cellular signaling pathway models, in which the physiological responses promptly react to environmental changes, resulting in steep changes in their dynamic model trajectories. These steep changes are one of the major challenges in biochemical modeling governed by nonlinear differential equations. One promising way to tackle this challenge is converting the input data from the time domain to the frequency domain through Fourier Neural Operators, which enhances the ability to analyze data periodicity and regularity. However, the effectiveness of these Fourier-based methods diminishes in scenarios with complex abrupt switches. To address this limitation, an innovative Multiscale Attention Wavelet Neural Operator (MAWNO) method is proposed in this paper, which comprehensively combines the attention mechanism with the versatile wavelet transforms to effectively capture these abrupt switches. Specifically, the wavelet transform scrutinizes data across multiple scales to extract the characteristics of abrupt signals into wavelet coefficients, while the self-attention mechanism is adeptly introduced to enhance the wavelet coefficients in high-frequency signals that can better characterize the abrupt switches. Experimental results substantiate MAWNO’s supremacy in terms of accuracy on three classical biochemical models featuring periodic and steep trajectories. https://github.com/SUDERS/MAWNO. \ No newline at end of file diff --git a/data/2024/aaai/Multiscale Low-Frequency Memory Network for Improved Feature Extraction in Convolutional Neural Networks b/data/2024/aaai/Multiscale Low-Frequency Memory Network for Improved Feature Extraction in Convolutional Neural Networks new file mode 100644 index 0000000000..011248eed7 --- /dev/null +++ b/data/2024/aaai/Multiscale Low-Frequency Memory Network for Improved Feature Extraction in Convolutional Neural Networks @@ -0,0 +1 @@ +Deep learning and Convolutional Neural Networks (CNNs) have driven major transformations in diverse research areas. However, their limitations in handling low-frequency information present obstacles in certain tasks like interpreting global structures or managing smooth transition images. Despite the promising performance of transformer structures in numerous tasks, their intricate optimization complexities highlight the persistent need for refined CNN enhancements using limited resources.
Responding to these complexities, we introduce a novel framework, the Multiscale Low-Frequency Memory (MLFM) Network, with the goal to harness the full potential of CNNs while keeping their complexity unchanged. The MLFM efficiently preserves low-frequency information, enhancing performance in targeted computer vision tasks. Central to our MLFM is the Low-Frequency Memory Unit (LFMU), which stores various low-frequency data and forms a parallel channel to the core network. A key advantage of MLFM is its seamless compatibility with various prevalent networks, requiring no alterations to their original core structure. Testing on ImageNet demonstrated substantial accuracy improvements in multiple 2D CNNs, including ResNet, MobileNet, EfficientNet, and ConvNeXt. Furthermore, we showcase MLFM's versatility beyond traditional image classification by successfully integrating it into image-to-image translation tasks, specifically in semantic segmentation networks like FCN and U-Net. In conclusion, our work signifies a pivotal stride in the journey of optimizing the efficacy and efficiency of CNNs with limited resources. This research builds upon the existing CNN foundations and paves the way for future advancements in computer vision. Our codes are available at https://github.com/AlphaWuSeu/MLFM. \ No newline at end of file diff --git a/data/2024/aaai/Multivariate Time-Series Imagification with Time Embedding in Constrained Environments (Student Abstract) b/data/2024/aaai/Multivariate Time-Series Imagification with Time Embedding in Constrained Environments (Student Abstract) new file mode 100644 index 0000000000..37f0938605 --- /dev/null +++ b/data/2024/aaai/Multivariate Time-Series Imagification with Time Embedding in Constrained Environments (Student Abstract) @@ -0,0 +1 @@ +We present an imagification approach for multivariate time-series data tailored to constrained NN-based forecasting model training environments. Our imagification process consists of two key steps: Re-stacking and time embedding. In the Re-stacking stage, time-series data are arranged based on high correlation, forming the first image channel using a sliding window technique. The time embedding stage adds two additional image channels by incorporating real-time information. We evaluate our method by comparing it with three benchmark imagification techniques using a simple CNN-based model. Additionally, we conduct a comparison with LSTM, a conventional time-series forecasting model. Experimental results demonstrate that our proposed approach achieves three times faster model training termination while maintaining forecasting accuracy. \ No newline at end of file diff --git a/data/2024/aaai/MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion b/data/2024/aaai/MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion new file mode 100644 index 0000000000..7d14db1b13 --- /dev/null +++ b/data/2024/aaai/MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion @@ -0,0 +1 @@ +Generating music with emotion is an important task in automatic music generation, in which emotion is evoked through a variety of musical elements (such as pitch and duration) that change over time and collaborate with each other.
However, prior research on deep learning-based emotional music generation has rarely explored the contribution of different musical elements to emotions, let alone the deliberate manipulation of these elements to alter the emotion of music, which is not conducive to fine-grained element-level control over emotions. To address this gap, we present a novel approach employing musical element-based regularization in the latent space to disentangle distinct elements, investigate their roles in distinguishing emotions, and further manipulate elements to alter musical emotions. Specifically, we propose a novel VQ-VAE-based model named MusER. MusER incorporates a regularization loss to enforce the correspondence between the musical element sequences and the specific dimensions of latent variable sequences, providing a new solution for disentangling discrete sequences. Taking advantage of the disentangled latent vectors, a two-level decoding strategy that includes multiple decoders attending to latent vectors with different semantics is devised to better predict the elements. By visualizing latent space, we conclude that MusER yields a disentangled and interpretable latent space and gain insights into the contribution of distinct elements to the emotional dimensions (i.e., arousal and valence). Experimental results demonstrate that MusER outperforms the state-of-the-art models for generating emotional music in both objective and subjective evaluation. Besides, we rearrange music through element transfer and attempt to alter the emotion of music by transferring emotion-distinguishable elements. \ No newline at end of file diff --git a/data/2024/aaai/Music Style Transfer with Time-Varying Inversion of Diffusion Models b/data/2024/aaai/Music Style Transfer with Time-Varying Inversion of Diffusion Models new file mode 100644 index 0000000000..f0de97cb4a --- /dev/null +++ b/data/2024/aaai/Music Style Transfer with Time-Varying Inversion of Diffusion Models @@ -0,0 +1 @@ +With the development of diffusion models, text-guided image style transfer has demonstrated great controllable and high-quality results. However, the utilization of text for diverse music style transfer poses significant challenges, primarily due to the limited availability of matched audio-text datasets. Music, being an abstract and complex art form, exhibits variations and intricacies even within the same genre, thereby making accurate textual descriptions challenging. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we utilize a bias-reduced stylization technique to get stable results. Experimental results demonstrate that our method can transfer the style of specific instruments, as well as incorporate natural sounds to compose melodies. Samples and code are available at https://lsfhuihuiff.github.io/MusicTI/. \ No newline at end of file diff --git a/data/2024/aaai/Mutual-Modality Adversarial Attack with Semantic Perturbation b/data/2024/aaai/Mutual-Modality Adversarial Attack with Semantic Perturbation new file mode 100644 index 0000000000..03a4bb60f8 --- /dev/null +++ b/data/2024/aaai/Mutual-Modality Adversarial Attack with Semantic Perturbation @@ -0,0 +1,6 @@ +Adversarial attacks constitute a notable threat to machine learning systems, given their potential to induce erroneous predictions and classifications. 
However, within real-world contexts, the essential specifics of the deployed model are frequently treated as a black box, consequently mitigating the vulnerability to such attacks. +Thus, enhancing the transferability of the adversarial samples has become a crucial area of research, which heavily relies on selecting appropriate surrogate models. +To address this challenge, we propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme. Our approach is accomplished by leveraging the pre-trained CLIP model. Firstly, we conduct a visual attack on the clean image that causes semantic perturbations in the embedding space aligned with the textual modality. +Then, we apply the corresponding defense on the textual modality by updating the prompts, which forces re-matching in the perturbed embedding space. +Finally, to enhance the attack transferability, we utilize an iterative training strategy for the visual attack and the textual defense, where each process is optimized based on the other. +We evaluate our approach on several benchmark datasets and demonstrate that our mutual-modal attack strategy can effectively produce highly transferable attacks, which are stable regardless of the target networks. Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution. \ No newline at end of file diff --git a/data/2024/aaai/N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding b/data/2024/aaai/N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding new file mode 100644 index 0000000000..168b3253a7 --- /dev/null +++ b/data/2024/aaai/N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding @@ -0,0 +1,5 @@ +The first step in applying deep learning techniques to symbolic music understanding is to transform musical pieces (mainly in MIDI format) into sequences of predefined tokens like note pitch, note velocity, and chords. Subsequently, the sequences are fed into a neural sequence model to accomplish specific tasks. +Music sequences exhibit strong correlations between adjacent elements, making them prime candidates for N-gram techniques from Natural Language Processing (NLP). Consider classical piano music: specific melodies might recur throughout a piece, with subtle variations each time. +In this paper, we propose a novel method, NG-Midiformer, for understanding symbolic music sequences that leverages the N-gram approach. Our method involves first processing music pieces into word-like sequences with our proposed unsupervised compoundation, followed by using our N-gram Transformer encoder, which can effectively incorporate N-gram information to enhance the primary encoder part for better understanding of music sequences. +The pre-training process on large-scale music datasets enables the model to thoroughly learn the N-gram information contained within music sequences, and subsequently apply this information for making inferences during the fine-tuning stage. +Experiments on various datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance on a series of music understanding downstream tasks. The code and model weights will be released at https://github.com/CinqueOrigin/NG-Midiformer.
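As a rough sketch of the "unsupervised compoundation" idea above (the abstract does not spell out the exact NG-Midiformer procedure, so the merge rule and the token strings below are assumptions), one can repeatedly merge the most frequent adjacent token pair in the tokenized music sequences into a compound word-like unit, in the spirit of byte-pair encoding:

from collections import Counter

def compound_sequences(sequences, num_merges=10, min_count=2):
    """Greedily merge the most frequent adjacent token pair into a compound token."""
    seqs = [list(s) for s in sequences]
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < min_count:
            break
        merged = a + "+" + b  # compound token, e.g. "pitch_60+dur_8"
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return seqs

# Toy usage on a short tokenized fragment.
toks = [["pitch_60", "dur_8", "pitch_62", "dur_8", "pitch_60", "dur_8"]]
print(compound_sequences(toks, num_merges=3, min_count=2))

The resulting compound units are what a downstream N-gram-aware encoder could consume; the actual paper pairs such word-like sequences with an N-gram Transformer encoder rather than a plain tokenizer.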
\ No newline at end of file diff --git a/data/2024/aaai/ND-MRM: Neuronal Diversity Inspired Multisensory Recognition Model b/data/2024/aaai/ND-MRM: Neuronal Diversity Inspired Multisensory Recognition Model new file mode 100644 index 0000000000..6a1c313572 --- /dev/null +++ b/data/2024/aaai/ND-MRM: Neuronal Diversity Inspired Multisensory Recognition Model @@ -0,0 +1 @@ +Cross-sensory interaction is a key aspect for multisensory recognition. Without cross-sensory interaction, artificial neural networks show inferior performance in multisensory recognition. On the contrary, the human brain has an inherently remarkable ability in multisensory recognition, which stems from the diverse neurons that exhibit distinct responses to sensory inputs, especially the multisensory neurons with multisensory responses hence enabling cross-sensory interaction. Based on this neuronal diversity, we propose a Neuronal Diversity inspired Multisensory Recognition Model (ND-MRM), which, similar to the brain, comprises unisensory neurons and multisensory neurons. To reflect the different responses characteristics of diverse neurons in the brain, special connection constraints are innovatively designed to regulate the features transmission in the ND-MRM. Leveraging this novel concept of neuronal diversity, our model is biologically plausible, enabling more effective recognition of multisensory information. To validate the performance of the proposed ND-MRM, we employ a multisensory emotion recognition task as a case study. The results demonstrate that our model surpasses state-of-the-art brain-inspired baselines on two datasets, proving the potential of brain-inspired methods for advancing multisensory interaction and recognition. \ No newline at end of file diff --git a/data/2024/aaai/NESTER: An Adaptive Neurosymbolic Method for Causal Effect Estimation b/data/2024/aaai/NESTER: An Adaptive Neurosymbolic Method for Causal Effect Estimation new file mode 100644 index 0000000000..36033c4238 --- /dev/null +++ b/data/2024/aaai/NESTER: An Adaptive Neurosymbolic Method for Causal Effect Estimation @@ -0,0 +1 @@ +Causal effect estimation from observational data is a central problem in causal inference. Methods based on potential outcomes framework solve this problem by exploiting inductive biases and heuristics from causal inference. Each of these methods addresses a specific aspect of causal effect estimation, such as controlling propensity score, enforcing randomization, etc., by designing neural network (NN) architectures and regularizers. In this paper, we propose an adaptive method called Neurosymbolic Causal Effect Estimator (NESTER), a generalized method for causal effect estimation. NESTER integrates the ideas used in existing methods based on multi-head NNs for causal effect estimation into one framework. We design a Domain Specific Language (DSL) tailored for causal effect estimation based on causal inductive biases used in literature. We conduct a theoretical analysis to investigate NESTER's efficacy in estimating causal effects. Our comprehensive empirical results show that NESTER performs better than state-of-the-art methods on benchmark datasets. 
\ No newline at end of file diff --git a/data/2024/aaai/NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement b/data/2024/aaai/NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement new file mode 100644 index 0000000000..f7bbf184df --- /dev/null +++ b/data/2024/aaai/NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement @@ -0,0 +1 @@ +3D lookup tables (3D LUTs) are a key component for image enhancement. Modern image signal processors (ISPs) have dedicated support for these as part of the camera rendering pipeline. Cameras typically provide multiple options for picture styles, where each style is usually obtained by applying a unique handcrafted 3D LUT. Current approaches for learning and applying 3D LUTs are notably fast, yet not so memory-efficient, as storing multiple 3D LUTs is required. For this reason and other implementation limitations, their use on mobile devices is less popular. In this work, we propose a Neural Implicit LUT (NILUT), an implicitly defined continuous 3D color transformation parameterized by a neural network. We show that NILUTs are capable of accurately emulating real 3D LUTs. Moreover, a NILUT can be extended to incorporate multiple styles into a single network with the ability to blend styles implicitly. Our novel approach is memory-efficient, controllable and can complement previous methods, including learned ISPs. Code at https://github.com/mv-lab/nilut \ No newline at end of file diff --git a/data/2024/aaai/NN-Steiner: A Mixed Neural-Algorithmic Approach for the Rectilinear Steiner Minimum Tree Problem b/data/2024/aaai/NN-Steiner: A Mixed Neural-Algorithmic Approach for the Rectilinear Steiner Minimum Tree Problem new file mode 100644 index 0000000000..8e4b57c006 --- /dev/null +++ b/data/2024/aaai/NN-Steiner: A Mixed Neural-Algorithmic Approach for the Rectilinear Steiner Minimum Tree Problem @@ -0,0 +1 @@ +Recent years have witnessed rapid advances in the use of neural networks to solve combinatorial optimization problems. Nevertheless, designing the "right" neural model that can effectively handle a given optimization problem can be challenging, and often there is no theoretical understanding or justification of the resulting neural model. In this paper, we focus on the rectilinear Steiner minimum tree (RSMT) problem, which is of critical importance in IC layout design and as a result has attracted numerous heuristic approaches in the VLSI literature. Our contributions are two-fold. On the methodology front, we propose NN-Steiner which is a novel mixed neural-algorithmic framework for computing RSMTs that leverages the celebrated PTAS algorithmic framework of Arora to solve this problem (and other geometric optimization problems). Our NN-Steiner replaces key algorithmic components within Arora's PTAS by suitable neural components. In particular, NN-Steiner only needs four neural network (NN) components that are called repeatedly within an algorithmic framework. Crucially, each of the four NN components is only of bounded size independent of input size, and thus easy to train. Furthermore, as the NN component is learning a generic algorithmic step, once learned, the resulting mixed neural-algorithmic framework generalizes to much larger instances not seen in training. Our NN-Steiner, to our best knowledge, is the first neural architecture of bounded size that has capacity to approximately solve RSMT (and variants). 
On the empirical front, we show how NN-Steiner can be implemented and demonstrate the effectiveness of our resulting approach, especially in terms of generalization, by comparing with state-of-the-art methods (both neural and non-neural based). \ No newline at end of file diff --git a/data/2024/aaai/NaMa: Neighbor-Aware Multi-Modal Adaptive Learning for Prostate Tumor Segmentation on Anisotropic MR Images b/data/2024/aaai/NaMa: Neighbor-Aware Multi-Modal Adaptive Learning for Prostate Tumor Segmentation on Anisotropic MR Images new file mode 100644 index 0000000000..cfa3dd1c74 --- /dev/null +++ b/data/2024/aaai/NaMa: Neighbor-Aware Multi-Modal Adaptive Learning for Prostate Tumor Segmentation on Anisotropic MR Images @@ -0,0 +1 @@ +Accurate segmentation of prostate tumors from multi-modal magnetic resonance (MR) images is crucial for diagnosis and treatment of prostate cancer. However, the robustness of existing segmentation methods is limited, mainly because these methods 1) fail to adaptively assess subject-specific information of each MR modality for accurate tumor delineation, and 2) lack effective utilization of inter-slice information across thick slices in MR images to segment tumor as a whole 3D volume. In this work, we propose a two-stage neighbor-aware multi-modal adaptive learning network (NaMa) for accurate prostate tumor segmentation from multi-modal anisotropic MR images. In particular, in the first stage, we apply subject-specific multi-modal fusion in each slice by developing a novel modality-informativeness adaptive learning (MIAL) module for selecting and adaptively fusing informative representation of each modality based on inter-modality correlations. In the second stage, we exploit inter-slice feature correlations to derive volumetric tumor segmentation. Specifically, we first use a Unet variant with sequence layers to coarsely capture slice relationship at a global scale, and further generate an activation map for each slice. Then, we introduce an activation mapping guidance (AMG) module to refine slice-wise representation (via information from adjacent slices) for consistent tumor segmentation across neighboring slices. Besides, during the network training, we further apply a random mask strategy to each MR modality to improve feature representation efficiency. Experiments on both in-house and public (PICAI) multi-modal prostate tumor datasets show that our proposed NaMa performs better than state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/NaRuto: Automatically Acquiring Planning Models from Narrative Texts b/data/2024/aaai/NaRuto: Automatically Acquiring Planning Models from Narrative Texts new file mode 100644 index 0000000000..530c20c3ce --- /dev/null +++ b/data/2024/aaai/NaRuto: Automatically Acquiring Planning Models from Narrative Texts @@ -0,0 +1 @@ +Domain model acquisition has been identified as a bottleneck in the application of planning technology, especially within narrative planning. Learning action models from narrative texts in an automated way is essential to overcome this barrier, but challenging because of the inherent complexities of such texts. We present an evaluation of planning domain models derived from narrative texts using our fully automated, unsupervised system, NaRuto. Our system combines structured event extraction, predictions of commonsense event relations, and textual contradictions and similarities. 
Evaluation results show that NaRuto generates domain models of significantly better quality than existing fully automated methods, and sometimes even on par with those created by semi-automated methods with human assistance. \ No newline at end of file diff --git a/data/2024/aaai/NarrativePlay: An Automated System for Crafting Visual Worlds in Novels for Role-Playing b/data/2024/aaai/NarrativePlay: An Automated System for Crafting Visual Worlds in Novels for Role-Playing new file mode 100644 index 0000000000..364c900ff4 --- /dev/null +++ b/data/2024/aaai/NarrativePlay: An Automated System for Crafting Visual Worlds in Novels for Role-Playing @@ -0,0 +1 @@ +In this demo, we present NarrativePlay -- an innovative system enabling users to role-play a fictional character and interact with dynamically generated narrative environments. Unlike existing predefined sandbox approaches, NarrativePlay centres around the main storyline events extracted from the narrative, allowing users to experience the story from the perspective of a character they choose. To design versatile AI agents for diverse scenarios, we employ a framework built on Large Language Models (LLMs) to extract detailed character traits from text. We also incorporate automatically generated visual displays of narrative settings, character portraits, and character speech, greatly enhancing the overall user experience. \ No newline at end of file diff --git a/data/2024/aaai/Natural Strategic Ability in Stochastic Multi-Agent Systems b/data/2024/aaai/Natural Strategic Ability in Stochastic Multi-Agent Systems new file mode 100644 index 0000000000..3f43fb8e93 --- /dev/null +++ b/data/2024/aaai/Natural Strategic Ability in Stochastic Multi-Agent Systems @@ -0,0 +1 @@ +Strategies synthesized using formal methods can be complex and often require infinite memory, which does not correspond to the expected behavior when trying to model Multi-Agent Systems (MAS). To capture such behaviors, natural strategies are a recently proposed framework striking a balance between the ability of agents to strategize with memory and the complexity of the model-checking problem, which until now has been restricted to fully deterministic settings. For the first time, we consider the probabilistic temporal logics PATL and PATL∗ under natural strategies (NatPATL and NatPATL∗). As our main result, we show that, in stochastic MAS, NatPATL model-checking is NP-complete when the active coalition is restricted to deterministic strategies. We also give a 2NEXPTIME complexity result for NatPATL∗ with the same restriction. In the unrestricted case, we give an EXPSPACE complexity result for NatPATL and a 3EXPSPACE complexity result for NatPATL*. \ No newline at end of file diff --git a/data/2024/aaai/NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models b/data/2024/aaai/NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models new file mode 100644 index 0000000000..0493d148d3 --- /dev/null +++ b/data/2024/aaai/NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models @@ -0,0 +1 @@ +Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. Such a trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent.
In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status, and makes the decision to approach the target. Through comprehensive experiments, we demonstrate NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing accurate top-down metric trajectories given the agent's navigation history. Although the performance of NavGPT on zero-shot R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models. Code is available at: https://github.com/GengzeZhou/NavGPT. \ No newline at end of file diff --git a/data/2024/aaai/Navigating Open Set Scenarios for Skeleton-Based Action Recognition b/data/2024/aaai/Navigating Open Set Scenarios for Skeleton-Based Action Recognition new file mode 100644 index 0000000000..1f638da71e --- /dev/null +++ b/data/2024/aaai/Navigating Open Set Scenarios for Skeleton-Based Action Recognition @@ -0,0 +1 @@ +In real-world scenarios, human actions often fall outside the distribution of training data, making it crucial for models to recognize known actions and reject unknown ones. However, using pure skeleton data in such open-set conditions poses challenges due to the lack of visual background cues and the distinct sparse structure of body pose sequences. In this paper, we tackle the unexplored Open-Set Skeleton-based Action Recognition (OS-SAR) task and formalize the benchmark on three skeleton-based datasets. We assess the performance of seven established open-set approaches on our task and identify their limits and critical generalization issues when dealing with skeleton information. To address these challenges, we propose a distance-based cross-modality ensemble method that leverages the cross-modal alignment of skeleton joints, bones, and velocities to achieve superior open-set recognition performance. We refer to the key idea as CrossMax - an approach that utilizes a novel cross-modality mean max discrepancy suppression mechanism to align latent spaces during training and a cross-modality distance-based logits refinement method during testing. CrossMax outperforms existing approaches and consistently yields state-of-the-art results across all datasets and backbones. We will release the benchmark, code, and models to the community.
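To convey the flavor of a distance-based cross-modality ensemble for open-set rejection, here is a simplified sketch: it is not the exact CrossMax discrepancy suppression or logits refinement, and the class-prototype representation, modality names, and threshold are assumptions made purely for illustration. A test sample is scored by its distance to the nearest class prototype in each skeleton modality and rejected as an unknown action when the averaged distance is too large.

import numpy as np

def open_set_predict(embeddings, prototypes, threshold):
    """embeddings: dict modality -> (d,) test embedding.
    prototypes: dict modality -> (num_classes, d) class prototypes.
    Returns (predicted class index, or -1 for unknown; averaged min distance)."""
    per_modality_dists, votes = [], []
    for m, z in embeddings.items():
        d = np.linalg.norm(prototypes[m] - z, axis=1)  # distance to every class prototype
        per_modality_dists.append(d.min())
        votes.append(int(d.argmin()))
    avg_min_dist = float(np.mean(per_modality_dists))
    if avg_min_dist > threshold:
        return -1, avg_min_dist                        # reject as an unknown action
    pred = max(set(votes), key=votes.count)            # majority vote across modalities
    return pred, avg_min_dist

# Toy usage: a sample close to the prototypes of class 2 in all three modalities.
rng = np.random.default_rng(1)
prototypes = {m: rng.normal(size=(5, 16)) for m in ("joint", "bone", "velocity")}
test = {m: prototypes[m][2] + 0.05 * rng.normal(size=16) for m in prototypes}
print(open_set_predict(test, prototypes, threshold=1.0))

The point of averaging across joints, bones, and velocities is that a sample must look familiar to all three skeleton views before it is accepted as a known action, which is the same intuition the CrossMax ensemble formalizes.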
\ No newline at end of file diff --git a/data/2024/aaai/Navigating Real-World Partial Label Learning: Unveiling Fine-Grained Images with Attributes b/data/2024/aaai/Navigating Real-World Partial Label Learning: Unveiling Fine-Grained Images with Attributes new file mode 100644 index 0000000000..bc1e206d66 --- /dev/null +++ b/data/2024/aaai/Navigating Real-World Partial Label Learning: Unveiling Fine-Grained Images with Attributes @@ -0,0 +1 @@ +Partial label learning (PLL), a significant research area, addresses the challenge of annotating each sample with a candidate label set containing the true label when obtaining accurate labels is infeasible. However, existing PLL methods often rely on generic datasets like CIFAR, where annotators can readily differentiate candidate labels and are unlikely to confuse them, making such datasets less realistic for real-world partial label applications. In response, our research focuses on a rarely studied problem, PLL on fine-grained images with attributes, and we propose a novel framework called Shared to Learn, Distinct to Disambiguate (SoDisam). Within the candidate label set, the categories may exhibit numerous shared attribute features, posing a challenge in accurately distinguishing them. Rather than perceiving it as an impediment, we capitalize on these shared attributes as definitive sources of supervision. This insight guides us to learn attribute space visual representation to focus on the information from these shared attributes. Moreover, we introduce an attribute attention mechanism tailored to harness the remaining distinct attributes. This mechanism directs the originally holistic feature towards specific regions, capturing corresponding discriminative features. In addition, a dynamic disambiguation module is introduced, continuously adjusting the two aforementioned mechanisms to achieve the final disambiguation. Extensive experiments demonstrate the effectiveness of our approach on fine-grained partial label datasets. The proposed SoDisam framework not only addresses the challenges associated with fine-grained partial label learning but also provides a more realistic representation of real-world partial label scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Navigating Uncertainty in Epidemic Contexts with Reinforcement Learning b/data/2024/aaai/Navigating Uncertainty in Epidemic Contexts with Reinforcement Learning new file mode 100644 index 0000000000..6a5c31767a --- /dev/null +++ b/data/2024/aaai/Navigating Uncertainty in Epidemic Contexts with Reinforcement Learning @@ -0,0 +1 @@ +My research integrates stochastic epidemic models with reinforcement learning to develop effective strategies or policies to inform operational decisions. The objective is to refine policies that are attuned to diverse outbreak dynamics and to offer a tool for informed planning in real-world settings. \ No newline at end of file diff --git a/data/2024/aaai/NeBLa: Neural Beer-Lambert for 3D Reconstruction of Oral Structures from Panoramic Radiographs b/data/2024/aaai/NeBLa: Neural Beer-Lambert for 3D Reconstruction of Oral Structures from Panoramic Radiographs new file mode 100644 index 0000000000..a6a8d54324 --- /dev/null +++ b/data/2024/aaai/NeBLa: Neural Beer-Lambert for 3D Reconstruction of Oral Structures from Panoramic Radiographs @@ -0,0 +1 @@ +Panoramic radiography (Panoramic X-ray, PX) is a widely used imaging modality for dental examination. However, PX only provides a flattened 2D image, lacking a 3D view of the oral structure.
In this paper, we propose NeBLa (Neural Beer-Lambert) to estimate 3D oral structures from real-world PX. NeBLa tackles full 3D reconstruction for varying subjects (patients) where each reconstruction is based only on a single panoramic image. We create an intermediate representation called simulated PX (SimPX) from 3D Cone-beam computed tomography (CBCT) data based on the Beer-Lambert law of X-ray rendering and rotational principles of PX imaging. SimPX aims not only to faithfully simulate PX but also to facilitate the reverting process back to 3D data. We propose a novel neural model based on ray tracing which exploits both global and local input features to convert SimPX to 3D output. At inference, a real PX image is translated to a SimPX-style image with semantic regularization, and the translated image is processed by a generation module to produce high-quality outputs. Experiments show that NeBLa outperforms prior state-of-the-art methods in reconstruction tasks both quantitatively and qualitatively. Unlike prior methods, NeBLa does not require any prior information such as the shape of dental arches, nor a matched PX-CBCT dataset for training, which is difficult to obtain in clinical practice. Our code is available at https://github.com/sihwa-park/nebla. \ No newline at end of file diff --git a/data/2024/aaai/NeRF-LiDAR: Generating Realistic LiDAR Point Clouds with Neural Radiance Fields b/data/2024/aaai/NeRF-LiDAR: Generating Realistic LiDAR Point Clouds with Neural Radiance Fields new file mode 100644 index 0000000000..678879b9f0 --- /dev/null +++ b/data/2024/aaai/NeRF-LiDAR: Generating Realistic LiDAR Point Clouds with Neural Radiance Fields @@ -0,0 +1,2 @@ +Labelling LiDAR point clouds for training autonomous driving systems is extremely expensive and difficult. LiDAR simulation aims at generating realistic LiDAR data with labels for training and verifying self-driving algorithms more efficiently. Recently, Neural Radiance Fields (NeRF) have been proposed for novel view synthesis using implicit reconstruction of 3D scenes. Inspired by this, we present NeRF-LiDAR, a novel LiDAR simulation method that leverages real-world information to generate realistic LiDAR point clouds. Different from existing LiDAR simulators, we use real images and point cloud data collected by self-driving cars to learn the 3D scene representation, point cloud generation and label rendering. We verify the effectiveness of our NeRF-LiDAR by training different 3D segmentation models on the generated LiDAR point clouds. +It reveals that the trained models are able to achieve similar accuracy when compared with the same model trained on the real LiDAR data. Besides, the generated data is capable of boosting the accuracy through pre-training, which helps reduce the requirements of the real labeled data. Code is available at https://github.com/fudan-zvg/NeRF-LiDAR \ No newline at end of file diff --git a/data/2024/aaai/NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning b/data/2024/aaai/NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning new file mode 100644 index 0000000000..21389a881d --- /dev/null +++ b/data/2024/aaai/NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have garnered remarkable success in novel view synthesis. Nonetheless, the task of generating high-quality images for novel views persists as a critical challenge.
While the existing efforts have exhibited commendable progress, capturing intricate details, enhancing textures, and achieving superior Peak Signal-to-Noise Ratio (PSNR) metrics warrant further focused attention and advancement. In this work, we propose NeRF-VPT, an innovative method for novel view synthesis to address these challenges. Our proposed NeRF-VPT employs a cascading view prompt tuning paradigm, wherein RGB information gained from preceding rendering outcomes serves as instructive visual prompts for subsequent rendering stages, with the aspiration that the prior knowledge embedded in the prompts can facilitate the gradual enhancement of rendered image quality. NeRF-VPT only requires sampling RGB data from previous stage renderings as priors at each training stage, without relying on extra guidance or complex techniques. Thus, our NeRF-VPT is plug-and-play and can be readily integrated into existing methods. By conducting comparative analyses of our NeRF-VPT against several NeRF-based approaches on demanding real-scene benchmarks, such as Realistic Synthetic 360, Real Forward-Facing, the Replica dataset, and a user-captured dataset, we substantiate that our NeRF-VPT significantly elevates baseline performance and proficiently generates more high-quality novel view images than all the compared state-of-the-art methods. Furthermore, the cascading learning of NeRF-VPT introduces adaptability to scenarios with sparse inputs, resulting in a significant enhancement of accuracy for sparse-view novel view synthesis. The source code and dataset are available at https://github.com/Freedomcls/NeRF-VPT. \ No newline at end of file diff --git a/data/2024/aaai/NeRFail: Neural Radiance Fields-Based Multiview Adversarial Attack b/data/2024/aaai/NeRFail: Neural Radiance Fields-Based Multiview Adversarial Attack new file mode 100644 index 0000000000..3136aa62ef --- /dev/null +++ b/data/2024/aaai/NeRFail: Neural Radiance Fields-Based Multiview Adversarial Attack @@ -0,0 +1 @@ +Adversarial attacks, i.e., generating adversarial perturbations with a small magnitude to deceive deep neural networks, are important for investigating and improving model trustworthiness. Traditionally, the topic was scoped within 2D images without considering 3D multiview information. Benefiting from Neural Radiance Fields (NeRF), one can easily reconstruct a 3D scene with a Multi-Layer Perceptron (MLP) from given 2D views and synthesize photo-realistic renderings of novel vantage points. This opens up a door to attacking a multiview NeRF network and its downstream tasks from different rendering angles, which we denote Neural Radiance Fields-based multiview adversarial Attack (NeRFail). The goal is, given one scene and a subset of views, to deceive the recognition results at unknown view angles as well as at the given views. To do so, we propose a transformation mapping from pixels to 3D points such that our attack generates multiview adversarial perturbations by attacking a subset of images with different views, intending to prevent the downstream classifier from correctly predicting images rendered by NeRF from other views. Experiments show that our multiview adversarial perturbations successfully obfuscate the downstream classifier at both known and unknown views. Notably, when retraining another NeRF on the perturbed training data, we show that the perturbation can be inherited and reproduced. The code can be found at https://github.com/jiang-wenxiang/NeRFail.
\ No newline at end of file diff --git a/data/2024/aaai/NeSyFOLD: A Framework for Interpretable Image Classification b/data/2024/aaai/NeSyFOLD: A Framework for Interpretable Image Classification new file mode 100644 index 0000000000..7a2361e097 --- /dev/null +++ b/data/2024/aaai/NeSyFOLD: A Framework for Interpretable Image Classification @@ -0,0 +1,26 @@ +Deep learning models such as CNNs have surpassed human performance in computer vision tasks such as image classification. However, despite their sophistication, these models lack interpretability, which can lead to biased outcomes reflecting existing prejudices in the data. We aim to make predictions made by a CNN interpretable. Hence, we present a novel framework called NeSyFOLD to create a neurosymbolic (NeSy) model for image classification tasks. The model is a CNN with all layers following the last convolutional layer replaced by a stratified answer set program (ASP) derived from the last-layer kernels. The answer set program can be viewed as a rule-set, wherein the truth value of each predicate depends on the activation of the corresponding kernel in the CNN. The rule-set serves as a global explanation for the model and is interpretable. We also use our NeSyFOLD framework with a CNN that is trained using a sparse kernel learning technique called Elite BackProp (EBP). This leads to a significant reduction in rule-set size without compromising accuracy or fidelity, thus improving the scalability of the NeSy model and the interpretability of its rule-set. Evaluation is done on datasets with varied complexity and sizes. We also propose a novel algorithm for labelling the predicates in the rule-set with meaningful semantic concept(s) learnt by the CNN. We evaluate the performance of our “semantic labelling algorithm” to quantify the efficacy of the semantic labelling for both the NeSy model and the NeSy-EBP model. \ No newline at end of file diff --git a/data/2024/aaai/Near-Optimal Resilient Aggregation Rules for Distributed Learning Using 1-Center and 1-Mean Clustering with Outliers b/data/2024/aaai/Near-Optimal Resilient Aggregation Rules for Distributed Learning Using 1-Center and 1-Mean Clustering with Outliers new file mode 100644 index 0000000000..580db2cca1 --- /dev/null +++ b/data/2024/aaai/Near-Optimal Resilient Aggregation Rules for Distributed Learning Using 1-Center and 1-Mean Clustering with Outliers @@ -0,0 +1 @@ +Byzantine machine learning has garnered considerable attention in light of the unpredictable faults that can occur in large-scale distributed learning systems. The key to securing resilience against Byzantine machines in distributed learning is resilient aggregation mechanisms. Although abundant resilient aggregation rules have been proposed, they are designed in ad-hoc manners, imposing extra barriers on comparing, analyzing, and improving the rules across performance criteria. This paper studies near-optimal aggregation rules using clustering in the presence of outliers. Our outlier-robust clustering approach utilizes geometric properties of the update vectors provided by workers. Our analysis shows that constant approximations to the 1-center and 1-mean clustering problems with outliers provide near-optimal resilient aggregators for metric-based criteria, which have been proven to be crucial in the homogeneous and heterogeneous cases respectively.
In addition, we discuss two conflicting types of attacks under which no single aggregation rule is guaranteed to improve upon the naive average. Based on the discussion, we propose a two-phase resilient aggregation framework. We run experiments for image classification using a non-convex loss function. The proposed algorithms outperform previously known aggregation rules by a large margin with both homogeneous and heterogeneous data distributions among non-faulty workers. Code and appendix are available at https://github.com/jerry907/AAAI24-RASHB. \ No newline at end of file diff --git a/data/2024/aaai/Nearly Equitable Allocations beyond Additivity and Monotonicity b/data/2024/aaai/Nearly Equitable Allocations beyond Additivity and Monotonicity new file mode 100644 index 0000000000..e525f8420d --- /dev/null +++ b/data/2024/aaai/Nearly Equitable Allocations beyond Additivity and Monotonicity @@ -0,0 +1,3 @@ +Equitability (EQ) in fair division requires that items be allocated such that all agents value the bundle they receive equally. With indivisible items, an equitable allocation may not exist, and hence we instead consider a meaningful analog, EQx, that requires equitability up to any item. EQx allocations exist for monotone, additive valuations. However, if (1) the agents' valuations are not additive or (2) the set of indivisible items includes both goods and chores (positively and negatively valued items), then prior to the current work it was not known whether EQx allocations exist or not. + +We study both the existence and efficient computation of EQx allocations. (1) For monotone valuations (not necessarily additive), we show that EQx allocations always exist. Also, for the large class of weakly well-layered valuations, EQx allocations can be found in polynomial time. Further, we prove that approximately EQx allocations can be computed efficiently under general monotone valuations. (2) For non-monotone valuations, we show that an EQx allocation may not exist, even for two agents with additive valuations. Under some special cases, however, we show existence and efficient computability of EQx allocations. This includes the case of two agents with additive valuations where each item is either a good or a chore, and there are no mixed items. \ No newline at end of file diff --git a/data/2024/aaai/NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-world Video Super-Resolution b/data/2024/aaai/NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-world Video Super-Resolution new file mode 100644 index 0000000000..5b2a9b7b75 --- /dev/null +++ b/data/2024/aaai/NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-world Video Super-Resolution @@ -0,0 +1 @@ +The capability of video super-resolution (VSR) to synthesize high-resolution (HR) video from ideal datasets has been demonstrated in many works. However, applying the VSR model to real-world video with unknown and complex degradation remains a challenging task. First, existing degradation metrics in most VSR methods are not able to effectively simulate real-world noise and blur. Instead, simple combinations of classical degradations are used for real-world noise modeling, which often leaves the VSR model vulnerable to out-of-distribution noise. Second, many SR models focus on noise simulation and transfer. Nevertheless, the sampled noise is monotonous and limited.
To address the aforementioned problems, we propose a Negatives augmentation strategy for generalized noise modeling in the Video Super-Resolution (NegVSR) task. Specifically, we first propose sequential noise generation toward real-world data to extract practical noise sequences. Then, the degeneration domain is widely expanded by negative augmentation to build up various yet challenging real-world noise sets. We further propose the augmented negative guidance loss to learn robust features among augmented negatives effectively. Extensive experiments on real-world datasets (e.g., VideoLQ and FLIR) show that our method outperforms state-of-the-art methods with clear margins, especially in visual quality. Project page is available at: https://negvsr.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/Negative Pre-aware for Noisy Cross-Modal Matching b/data/2024/aaai/Negative Pre-aware for Noisy Cross-Modal Matching new file mode 100644 index 0000000000..f90d1bb22e --- /dev/null +++ b/data/2024/aaai/Negative Pre-aware for Noisy Cross-Modal Matching @@ -0,0 +1 @@ +Cross-modal noise-robust learning is a challenging task since noisy correspondence is hard to recognize and rectify. Due to the cumulative and unavoidable negative impact of unresolved noise, existing methods cannot maintain a stable performance when the noise increases. In this paper, we present a novel Negative Pre-aware Cross-modal (NPC) matching solution for large visual-language model fine-tuning on noisy downstream tasks. It features two key aspects: (1) For noise recognition and resistance, previous methods usually directly filter out a noise subset, whereas we propose to estimate the negative impact of each sample. It does not need additional correction mechanisms that may predict unreliable correction results, leading to self-reinforcing errors. We assign a confidence weight to each sample according to its negative impact in the training process. This adaptively adjusts the contribution of each sample to avoid noise accumulation. (2) For maintaining stable performance with increasing noise, we utilize the memorization effect of DNNs by maintaining a memory bank. Specifically, we apply GMM to select high-confidence clean samples as memory entries, where the memory entries are used to estimate the negative impact of each sample. Since clean samples are more easily distinguished by GMM with increasing noise, the memory bank can still maintain high quality at a high noise ratio. Compared to correction mechanisms focusing on noise samples, memory bank-based estimation is more robust, which makes the model performance stable on noisy datasets. Extensive experiments demonstrate that our method significantly improves matching accuracy and performance stability at increasing noise ratios. Our approach also surpasses the state-of-the-art methods by a large margin. The code is available at: https://github.com/ZhangXu0963/NPC.
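As a rough illustration of the GMM-based memory-bank construction described in the NPC abstract above, the sketch below fits a two-component Gaussian mixture to per-sample matching losses and keeps high-confidence clean samples; the loss input and the threshold are assumptions for illustration, not the authors' exact procedure.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def select_clean_samples(per_sample_loss, threshold=0.9):
        """per_sample_loss: 1-D array of matching losses, one per image-text pair."""
        losses = np.asarray(per_sample_loss, dtype=np.float64).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
        clean_component = int(np.argmin(gmm.means_.ravel()))      # low-loss mode = clean
        p_clean = gmm.predict_proba(losses)[:, clean_component]
        memory_bank_idx = np.where(p_clean > threshold)[0]        # high-confidence clean entries
        return memory_bank_idx, p_clean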
\ No newline at end of file diff --git a/data/2024/aaai/Neighborhood-Enhanced 3D Human Pose Estimation with Monocular LiDAR in Long-Range Outdoor Scenes b/data/2024/aaai/Neighborhood-Enhanced 3D Human Pose Estimation with Monocular LiDAR in Long-Range Outdoor Scenes new file mode 100644 index 0000000000..0a7e4084d1 --- /dev/null +++ b/data/2024/aaai/Neighborhood-Enhanced 3D Human Pose Estimation with Monocular LiDAR in Long-Range Outdoor Scenes @@ -0,0 +1 @@ +3D human pose estimation (3HPE) in large-scale outdoor scenes using commercial LiDAR has attracted significant attention due to its potential for real-life applications. However, existing LiDAR-based methods for 3HPE primarily rely on recovering 3D human poses from individual point clouds, and the coherence cues present in the neighborhood are not sufficiently harnessed. In this work, we explore spatial and contextual coherence cues contained in the neighborhood that lead to great performance improvements in 3HPE. Specifically, firstly, we deeply investigate the 3D neighbor in the background (3BN), which serves as a spatial coherence cue for inferring reliable motion, since it provides physical laws to limit motion targets. Secondly, we introduce a novel 3D scanning neighbor (3SN) generated during the data collection, and 3SN implies structural edge coherence cues. We use 3SN to overcome the degradation of performance and data quality caused by the sparsity-varying properties of LiDAR point clouds. In order to effectively model the complementarity between these distinct cues and build consistent temporal relationships across human motions, we propose a new transformer-based module called the CoherenceFuse module. Extensive experiments conducted on publicly available datasets, namely LidarHuman26M, CIMI4D, SLOPER4D and Waymo Open Dataset v2.0, showcase the superiority and effectiveness of our proposed method. In particular, when compared with LidarCap on the LidarHuman26M dataset, our method demonstrates a reduction of 7.08mm in the average MPJPE metric, along with a decrease of 16.55mm in the MPJPE metric for distances exceeding 25 meters. The code and models are available at https://github.com/jingyi-zhang/Neighborhood-enhanced-LidarCap. \ No newline at end of file diff --git a/data/2024/aaai/NeuSurf: On-Surface Priors for Neural Surface Reconstruction from Sparse Input Views b/data/2024/aaai/NeuSurf: On-Surface Priors for Neural Surface Reconstruction from Sparse Input Views new file mode 100644 index 0000000000..6584b934f5 --- /dev/null +++ b/data/2024/aaai/NeuSurf: On-Surface Priors for Neural Surface Reconstruction from Sparse Input Views @@ -0,0 +1 @@ +Recently, neural implicit functions have demonstrated remarkable results in the field of multi-view reconstruction. However, most existing methods are tailored for dense views and exhibit unsatisfactory performance when dealing with sparse views. Several recent methods have been proposed for generalizing implicit reconstruction to address the sparse view reconstruction task, but they still suffer from high training costs and are merely valid under carefully selected perspectives. In this paper, we propose a novel sparse view reconstruction framework that leverages on-surface priors to achieve highly faithful surface reconstruction. Specifically, we design several constraints on global geometry alignment and local geometry refinement for jointly optimizing coarse shapes and fine details.
To achieve this, we train a neural network to learn a global implicit field from the on-surface points obtained from SfM and then leverage it as a coarse geometric constraint. To exploit local geometric consistency, we project on-surface points onto seen and unseen views, treating the consistency loss of projected features as a fine geometric constraint. The experimental results on the DTU and BlendedMVS datasets in two prevalent sparse settings demonstrate significant improvements over the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Neural Amortized Inference for Nested Multi-Agent Reasoning b/data/2024/aaai/Neural Amortized Inference for Nested Multi-Agent Reasoning new file mode 100644 index 0000000000..e12b158494 --- /dev/null +++ b/data/2024/aaai/Neural Amortized Inference for Nested Multi-Agent Reasoning @@ -0,0 +1 @@ +Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. In contrast, humans effortlessly perform complex social inferences as part of their daily lives. To bridge the gap between human-like inference capabilities and computational limitations, we propose a novel approach: leveraging neural networks to amortize high-order social inference, thereby expediting nested multi-agent reasoning. We evaluate our method in two challenging multi-agent interaction domains. The experimental results demonstrate that our method is computationally efficient while exhibiting minimal degradation in accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Neural Bookmarks: Information Retrieval with Deep Learning and EEG Data b/data/2024/aaai/Neural Bookmarks: Information Retrieval with Deep Learning and EEG Data new file mode 100644 index 0000000000..9543d54ef9 --- /dev/null +++ b/data/2024/aaai/Neural Bookmarks: Information Retrieval with Deep Learning and EEG Data @@ -0,0 +1 @@ +In neural memory decoding, a concept being mentally recalled is identified using brain data. Recently, the feasibility of neural memory decoding with EEG data has been demonstrated. Here we propose a new application – neural information retrieval – that uses neural memory decoding to allow a document to be retrieved merely by thinking about it. In this paper we describe neural memory decoding, define the application of neural information retrieval, present experimental results related to the practicality of the application, and discuss issues of deployment and data privacy. \ No newline at end of file diff --git a/data/2024/aaai/Neural Causal Abstractions b/data/2024/aaai/Neural Causal Abstractions new file mode 100644 index 0000000000..d418ef7474 --- /dev/null +++ b/data/2024/aaai/Neural Causal Abstractions @@ -0,0 +1 @@ +The ability of humans to understand the world in terms of cause and effect relationships, as well as their ability to compress information into abstract concepts, are two hallmark features of human intelligence. These two topics have been studied in tandem under the theory of causal abstractions, but it is an open problem how to best leverage abstraction theory in real-world causal inference tasks, where the true model is not known, and limited data is available in most practical settings.
In this paper, we focus on a family of causal abstractions constructed by clustering variables and their domains, redefining abstractions to be amenable to individual causal distributions. We show that such abstractions can be learned in practice using Neural Causal Models, allowing us to utilize the deep learning toolkit to solve causal tasks (identification, estimation, sampling) at different levels of abstraction granularity. Finally, we show how representation learning can be used to learn abstractions, which we apply in our experiments to scale causal inferences to high-dimensional settings such as image data. \ No newline at end of file diff --git a/data/2024/aaai/Neural Closure Certificates b/data/2024/aaai/Neural Closure Certificates new file mode 100644 index 0000000000..08a1e92483 --- /dev/null +++ b/data/2024/aaai/Neural Closure Certificates @@ -0,0 +1,10 @@ +Notions of transition invariants and closure certificates have seen recent use in the formal verification of controlled dynamical systems against \omega-regular properties. +Unfortunately, existing approaches face limitations in two directions. +First, they require a closed-form mathematical expression representing the model of the system. +Such an expression may be difficult to find, too complex to be of any use, or unavailable due to security or privacy constraints. +Second, finding such invariants typically relies on optimization techniques such as sum-of-squares (SOS) or satisfiability modulo theory (SMT) solvers. +This restricts the classes of systems that can be formally verified. +To address these drawbacks, we introduce a notion of neural closure certificates. +We present a data-driven algorithm that trains a neural network to represent a closure certificate. +Our approach is formally correct under some mild assumptions, i.e., one is able to formally show that the unknown system satisfies the \omega-regular property of interest if a neural closure certificate can be computed. +Finally, we demonstrate the efficacy of our approach with relevant case studies. \ No newline at end of file diff --git a/data/2024/aaai/Neural Gaussian Similarity Modeling for Differential Graph Structure Learning b/data/2024/aaai/Neural Gaussian Similarity Modeling for Differential Graph Structure Learning new file mode 100644 index 0000000000..60de54faa5 --- /dev/null +++ b/data/2024/aaai/Neural Gaussian Similarity Modeling for Differential Graph Structure Learning @@ -0,0 +1 @@ +Graph Structure Learning (GSL) has demonstrated considerable potential in the analysis of graph-unknown non-Euclidean data across a wide range of domains. However, constructing an end-to-end graph structure learning model poses a challenge due to the impediment of gradient flow caused by the nearest neighbor sampling strategy. In this paper, we construct a differential graph structure learning model by replacing the non-differentiable nearest neighbor sampling with differentiable sampling using the reparameterization trick. Under this framework, we argue that the act of sampling nearest neighbors may not invariably be essential, particularly in instances where node features exhibit a significant degree of similarity. To alleviate this issue, the bell-shaped Gaussian Similarity (GauSim) modeling is proposed to sample non-nearest neighbors. To adaptively model the similarity, we further propose Neural Gaussian Similarity (NeuralGauSim) with learnable parameters, featuring flexible sampling behaviors.
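The following sketch illustrates the kind of bell-shaped, learnable Gaussian similarity with reparameterized (Gumbel-Softmax) edge sampling that the NeuralGauSim description above refers to; the parameterization and hyperparameters are assumptions for illustration, not the paper's exact model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GaussianSimilaritySampler(nn.Module):
        """Bell-shaped similarity over pairwise distances with a learnable centre and
        width, followed by relaxed Bernoulli edge sampling (illustrative only)."""
        def __init__(self, init_mu=1.0, init_sigma=1.0, tau=0.5):
            super().__init__()
            self.mu = nn.Parameter(torch.tensor(float(init_mu)))
            self.log_sigma = nn.Parameter(torch.tensor(float(init_sigma)).log())
            self.tau = tau

        def forward(self, x):                        # x: (n, d) node features
            dist = torch.cdist(x, x)                 # pairwise Euclidean distances
            sigma = self.log_sigma.exp()
            sim = torch.exp(-((dist - self.mu) ** 2) / (2 * sigma ** 2))  # bell-shaped similarity
            sim = sim.clamp(1e-6, 1 - 1e-6)
            logits = torch.stack([sim.log(), (1 - sim).log()], dim=-1)    # Bernoulli logits per edge
            edges = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 0]
            return edges                             # (n, n) differentiable, near-binary adjacency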
In addition, we develop a scalable method by transferring the large-scale graph to the transition graph to significantly reduce the complexity. Experimental results demonstrate the effectiveness of the proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/Neural Network Approximation for Pessimistic Offline Reinforcement Learning b/data/2024/aaai/Neural Network Approximation for Pessimistic Offline Reinforcement Learning new file mode 100644 index 0000000000..b259377094 --- /dev/null +++ b/data/2024/aaai/Neural Network Approximation for Pessimistic Offline Reinforcement Learning @@ -0,0 +1 @@ +Deep reinforcement learning (RL) has shown remarkable success in specific offline decision-making scenarios, yet its theoretical guarantees are still under development. Existing works on offline RL theory primarily emphasize a few trivial settings, such as linear MDP or general function approximation with strong assumptions and independent data, which lack guidance for practical use. The coupling of deep learning and Bellman residuals makes this problem challenging, in addition to the difficulty of data dependence. In this paper, we establish a non-asymptotic estimation error of pessimistic offline RL using general neural network approximation with C-mixing data regarding the structure of networks, the dimension of datasets, and the concentrability of data coverage, under mild assumptions. Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate on the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight. This result demonstrates the explicit efficiency of deep adversarial offline RL frameworks. We utilize the empirical process tool for C-mixing sequences and the neural network approximation theory for the Holder class to achieve this. We also develop methods to bound the Bellman estimation error caused by function approximation with empirical Bellman constraint perturbations. Additionally, we present a result that lessens the curse of dimensionality using data with low intrinsic dimensionality and function classes with low complexity. Our estimation provides valuable insights into the development of deep offline RL and guidance for algorithm model design. \ No newline at end of file diff --git a/data/2024/aaai/Neural Network Approximators for Marginal MAP in Probabilistic Circuits b/data/2024/aaai/Neural Network Approximators for Marginal MAP in Probabilistic Circuits new file mode 100644 index 0000000000..8bc3601182 --- /dev/null +++ b/data/2024/aaai/Neural Network Approximators for Marginal MAP in Probabilistic Circuits @@ -0,0 +1 @@ +Probabilistic circuits (PCs) such as sum-product networks efficiently represent large multi-variate probability distributions. They are preferred in practice over other probabilistic representations, such as Bayesian and Markov networks, because PCs can solve marginal inference (MAR) tasks in time that scales linearly in the size of the network. Unfortunately, the most probable explanation (MPE) task and its generalization, the marginal maximum-a-posteriori (MMAP) inference task remain NP-hard in these models. Inspired by the recent work on using neural networks for generating near-optimal solutions to optimization problems such as integer linear programming, we propose an approach that uses neural networks to approximate MMAP inference in PCs. 
The key idea in our approach is to approximate the cost of an assignment to the query variables using a continuous multilinear function and then use the latter as a loss function. The two main benefits of our new method are that it is self-supervised, and after the neural network is learned, it requires only linear time to output a solution. We evaluate our new approach on several benchmark datasets and show that it outperforms three competing linear time approximations: max-product inference, max-marginal inference, and sequential estimation, which are used in practice to solve MMAP tasks in PCs. \ No newline at end of file diff --git a/data/2024/aaai/Neural Oscillators for Generalization of Physics-Informed Machine Learning b/data/2024/aaai/Neural Oscillators for Generalization of Physics-Informed Machine Learning new file mode 100644 index 0000000000..c9770eca92 --- /dev/null +++ b/data/2024/aaai/Neural Oscillators for Generalization of Physics-Informed Machine Learning @@ -0,0 +1 @@ +A primary challenge of physics-informed machine learning (PIML) is its generalization beyond the training domain, especially when dealing with complex physical problems represented by partial differential equations (PDEs). This paper aims to enhance the generalization capabilities of PIML, facilitating practical, real-world applications where accurate predictions in unexplored regions are crucial. We leverage the inherent causality and temporal sequential characteristics of PDE solutions to fuse PIML models with recurrent neural architectures based on systems of ordinary differential equations, referred to as neural oscillators. Through effectively capturing long-time dependencies and mitigating the exploding and vanishing gradient problem, neural oscillators foster improved generalization in PIML tasks. Extensive experimentation involving time-dependent nonlinear PDEs and biharmonic beam equations demonstrates the efficacy of the proposed approach. Incorporating neural oscillators outperforms existing state-of-the-art methods on benchmark problems across various metrics. Consequently, the proposed method improves the generalization capabilities of PIML, providing accurate solutions for extrapolation and prediction beyond the training data. \ No newline at end of file diff --git a/data/2024/aaai/Neural Physical Simulation with Multi-Resolution Hash Grid Encoding b/data/2024/aaai/Neural Physical Simulation with Multi-Resolution Hash Grid Encoding new file mode 100644 index 0000000000..7f08e1118f --- /dev/null +++ b/data/2024/aaai/Neural Physical Simulation with Multi-Resolution Hash Grid Encoding @@ -0,0 +1 @@ +We explore the generalization of the implicit representation in the physical simulation task. Traditional solvers for time-dependent partial differential equations (PDEs) in physical simulation often adopt a grid or mesh for spatial discretization, which is memory-consuming at high resolution and lacks adaptivity. Many implicit representations, such as local extreme learning machines or SIREN, have been proposed, but they are still too compact, suffering from limited accuracy in handling local details and slow convergence. We contribute a neural simulation framework based on a multi-resolution hash grid representation to introduce hierarchical consideration of global and local information simultaneously.
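For readers unfamiliar with the multi-resolution hash grid representation mentioned just above, here is a heavily simplified sketch in the spirit of Instant-NGP-style encodings; it omits trilinear interpolation and all simulation-specific details, and every name and constant is an illustrative assumption.

    import torch
    import torch.nn as nn

    class HashGridEncoding(nn.Module):
        """Simplified multi-resolution hash grid encoding (nearest-vertex lookup only)."""
        def __init__(self, levels=4, table_size=2**14, feat_dim=2, base_res=16, growth=2.0):
            super().__init__()
            self.table_size = table_size
            self.resolutions = [int(base_res * growth ** l) for l in range(levels)]
            self.tables = nn.Parameter(torch.randn(levels, table_size, feat_dim) * 1e-3)

        def forward(self, x):                               # x: (n, 3) coordinates in [0, 1]
            feats = []
            for level, res in enumerate(self.resolutions):
                idx = (x.clamp(0.0, 1.0) * res).long()      # nearest grid vertex at this level
                h = (idx[:, 0] * 73856093) ^ (idx[:, 1] * 19349663) ^ (idx[:, 2] * 83492791)
                feats.append(self.tables[level][h % self.table_size])
            return torch.cat(feats, dim=-1)                 # concatenated coarse-to-fine features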
Furthermore, we propose two key strategies: 1) a numerical gradient method for computing high-order derivatives with boundary conditions; 2) a range analysis sample method for fast neural geometry boundary sampling with dynamic topologies. Our method shows much higher accuracy and strong flexibility for various simulation problems: e.g., large elastic deformations, complex fluid dynamics, and multi-scale phenomena which remain challenging for existing neural physical solvers. \ No newline at end of file diff --git a/data/2024/aaai/Neural Reasoning about Agents' Goals, Preferences, and Actions b/data/2024/aaai/Neural Reasoning about Agents' Goals, Preferences, and Actions new file mode 100644 index 0000000000..bebf62a873 --- /dev/null +++ b/data/2024/aaai/Neural Reasoning about Agents' Goals, Preferences, and Actions @@ -0,0 +1 @@ +We propose the Intuitive Reasoning Network (IRENE) - a novel neural model for intuitive psychological reasoning about agents' goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks - with up to 48.9% improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks. \ No newline at end of file diff --git a/data/2024/aaai/Neural Time-Reversed Generalized Riccati Equation b/data/2024/aaai/Neural Time-Reversed Generalized Riccati Equation new file mode 100644 index 0000000000..cdf7fd8937 --- /dev/null +++ b/data/2024/aaai/Neural Time-Reversed Generalized Riccati Equation @@ -0,0 +1 @@ +Optimal control deals with optimization problems in which variables steer a dynamical system, and its outcome contributes to the objective function. Two classical approaches to solving these problems are Dynamic Programming and the Pontryagin Maximum Principle. In both approaches, Hamiltonian equations offer an interpretation of optimality through auxiliary variables known as costates. However, Hamiltonian equations are rarely used due to their reliance on forward-backward algorithms across the entire temporal domain. This paper introduces a novel neural-based approach to optimal control. Neural networks are employed not only for implementing state dynamics but also for estimating costate variables. The parameters of the latter network are determined at each time step using a newly introduced local policy referred to as the time-reversed generalized Riccati equation. This policy is inspired by a result discussed in the Linear Quadratic (LQ) problem, which we conjecture stabilizes state dynamics. We support this conjecture by discussing experimental results from a range of optimal control case studies. 
\ No newline at end of file diff --git a/data/2024/aaai/Neuro-Symbolic Integration for Reasoning and Learning on Knowledge Graphs b/data/2024/aaai/Neuro-Symbolic Integration for Reasoning and Learning on Knowledge Graphs new file mode 100644 index 0000000000..b653c8363a --- /dev/null +++ b/data/2024/aaai/Neuro-Symbolic Integration for Reasoning and Learning on Knowledge Graphs @@ -0,0 +1 @@ +The goal of this thesis is to address knowledge graph completion tasks using neuro-symbolic methods. Neuro-symbolic methods allow the joint utilization of symbolic information defined as meta-rules in ontologies and knowledge graph embedding methods that represent entities and relations of the graph in a low-dimensional vector space. This approach has the potential to improve the resolution of knowledge graph completion tasks in terms of reliability, interpretability, data-efficiency and robustness. \ No newline at end of file diff --git a/data/2024/aaai/Neuroevolution of a Multi-Generator GAN (Student Abstract) b/data/2024/aaai/Neuroevolution of a Multi-Generator GAN (Student Abstract) new file mode 100644 index 0000000000..b78d81259a --- /dev/null +++ b/data/2024/aaai/Neuroevolution of a Multi-Generator GAN (Student Abstract) @@ -0,0 +1 @@ +Evolutionary Algorithms (EA) have been leveraged to tackle the challenges faced while using GANs such as mode collapse, vanishing gradient, latent space search, etc. However, the existing techniques of using EA with GANs operate backpropagation and EA in isolation from each other, leaving ample room for further exploration. This paper creates a collaborative bridge between EA and GANs by exploring a neuroevolution method for utilising both EA and backpropagation-based optimisation, simultaneously, for a multi-generator GAN architecture. Experiments conducted using a standard dataset with variants of the proposed method highlight the towering impact of each of the components involved in the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Neuromorphic Event Signal-Driven Network for Video De-raining b/data/2024/aaai/Neuromorphic Event Signal-Driven Network for Video De-raining new file mode 100644 index 0000000000..65e60e9b01 --- /dev/null +++ b/data/2024/aaai/Neuromorphic Event Signal-Driven Network for Video De-raining @@ -0,0 +1 @@ +Convolutional neural networks-based video de-raining methods commonly rely on dense intensity frames captured by CMOS sensors. However, the limited temporal resolution of these sensors hinders the capture of dynamic rainfall information, limiting further improvement in de-raining performance. This study aims to overcome this issue by incorporating the neuromorphic event signal into the video de-raining to enhance the dynamic information perception. Specifically, we first utilize the dynamic information from the event signal as prior knowledge, and integrate it into existing de-raining objectives to better constrain the solution space. We then design an optimization algorithm to solve the objective, and construct a de-raining network with CNNs as the backbone architecture using a modular strategy to mimic the optimization process. To further explore the temporal correlation of the event signal, we incorporate a spiking self-attention module into our network. 
By leveraging the low latency and high temporal resolution of the event signal, along with the spatial and temporal representation capabilities of convolutional and spiking neural networks, our model captures more accurate dynamic information and significantly improves de-raining performance. For example, our network achieves a 1.24dB improvement on the SynHeavy25 dataset compared to the previous state-of-the-art method, while utilizing only 39% of the parameters. \ No newline at end of file diff --git a/data/2024/aaai/New Classes of the Greedy-Applicable Arm Feature Distributions in the Sparse Linear Bandit Problem b/data/2024/aaai/New Classes of the Greedy-Applicable Arm Feature Distributions in the Sparse Linear Bandit Problem new file mode 100644 index 0000000000..62e2e67431 --- /dev/null +++ b/data/2024/aaai/New Classes of the Greedy-Applicable Arm Feature Distributions in the Sparse Linear Bandit Problem @@ -0,0 +1 @@ +We consider the sparse contextual bandit problem where the arm feature affects the reward through the inner product with sparse parameters. Recent studies have developed sparsity-agnostic algorithms based on the greedy arm selection policy. However, the analysis of these algorithms requires strong assumptions on the arm feature distribution to ensure that the greedily selected samples are sufficiently diverse; one of the most common assumptions, relaxed symmetry, imposes approximate origin-symmetry on the distribution, which excludes distributions with origin-asymmetric support. In this paper, we show that the greedy algorithm is applicable to a wider range of arm feature distributions from two aspects. First, we show that a mixture distribution that has a greedy-applicable component is also greedy-applicable. Second, we propose new distribution classes, related to Gaussian mixture, discrete, and radial distributions, for which sample diversity is guaranteed. The proposed classes can describe distributions with origin-asymmetric support and, in conjunction with the first claim, provide theoretical guarantees of the greedy policy for a very wide range of arm feature distributions. \ No newline at end of file diff --git a/data/2024/aaai/NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction b/data/2024/aaai/NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction new file mode 100644 index 0000000000..58ccfe0161 --- /dev/null +++ b/data/2024/aaai/NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction @@ -0,0 +1 @@ +Existing deep-learning-based methods for nighttime video deraining rely on synthetic data due to the absence of real-world paired data. However, the intricacies of the real world, particularly with the presence of light effects and low-light regions affected by noise, create significant domain gaps, hampering synthetic-trained models in removing rain streaks properly and leading to over-saturation and color shifts. Motivated by this, we introduce NightRain, a novel nighttime video deraining method with adaptive-rain-removal and adaptive-correction. Our adaptive-rain-removal uses unlabeled rain videos to enable our model to derain real-world rain videos, particularly in regions affected by complex light effects. The idea is to allow our model to obtain rain-free regions based on the confidence scores. Once rain-free regions and the corresponding regions from our input are obtained, we can have region-based paired real data.
These paired data are used to train our model using a teacher-student framework, allowing the model to iteratively learn from less challenging regions to more challenging regions. Our adaptive-correction aims to rectify errors in our model's predictions, such as over-saturation and color shifts. The idea is to learn from clear night input training videos based on the differences or distance between those input videos and their corresponding predictions. Our model learns from these differences, compelling our model to correct the errors. From extensive experiments, our method demonstrates state-of-the-art performance. It achieves a PSNR of 26.73dB, surpassing existing nighttime video deraining methods by a substantial margin of 13.7%. \ No newline at end of file diff --git a/data/2024/aaai/No Head Left Behind - Multi-Head Alignment Distillation for Transformers b/data/2024/aaai/No Head Left Behind - Multi-Head Alignment Distillation for Transformers new file mode 100644 index 0000000000..98c9934837 --- /dev/null +++ b/data/2024/aaai/No Head Left Behind - Multi-Head Alignment Distillation for Transformers @@ -0,0 +1 @@ +Knowledge distillation aims at reducing model size without compromising much performance. Recent work has applied it to large vision-language (VL) Transformers, and has shown that attention maps in the multi-head attention modules of vision-language Transformers contain extensive intra-modal and cross-modal co-reference relations to be distilled. The standard approach is to apply a one-to-one attention map distillation loss, i.e. the Teacher's first attention head instructs the Student's first head, the second teaches the second, and so forth, but this only works when the numbers of attention heads in the Teacher and Student are the same. To remove this constraint, we propose a new Attention Map Alignment Distillation (AMAD) method for Transformers with multi-head attention, which works for a Teacher and a Student with different numbers of attention heads. Specifically, we soft-align different heads in Teacher and Student attention maps using a cosine similarity weighting. The Teacher head contributes more to the Student heads for which it has a higher similarity weight. Each Teacher head contributes to all the Student heads by minimizing the divergence between the attention activation distributions for the soft-aligned heads. No head is left behind. This distillation approach operates like cross-attention. We experiment on distilling VL-T5 and BLIP, and apply AMAD loss on their T5, BERT, and ViT sub-modules. We show, under vision-language setting, that AMAD outperforms conventional distillation methods on VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned by ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training for BLIP models. 
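To sketch the cosine-weighted, many-to-many head alignment that the AMAD abstract above describes, the snippet below computes a similarity-weighted KL distillation loss between teacher and student attention maps; flattening each map into a single distribution is a simplification, and the exact weighting used in the paper may differ.

    import torch
    import torch.nn.functional as F

    def amad_loss(teacher_attn, student_attn, eps=1e-8):
        """teacher_attn: (Ht, L, L) attention maps; student_attn: (Hs, L, L)."""
        t = teacher_attn.flatten(1)                       # (Ht, L*L)
        s = student_attn.flatten(1)                       # (Hs, L*L)
        w = F.cosine_similarity(t.unsqueeze(1), s.unsqueeze(0), dim=-1)   # (Ht, Hs) head similarity
        w = torch.softmax(w, dim=1)                       # soft alignment of each teacher head over student heads
        log_s = (s + eps).log().unsqueeze(0).expand(t.size(0), -1, -1)    # (Ht, Hs, L*L)
        tgt = (t + eps).unsqueeze(1).expand(-1, s.size(0), -1)            # (Ht, Hs, L*L)
        kl = F.kl_div(log_s, tgt, reduction='none').sum(-1)               # KL(teacher || student) per head pair
        return (w * kl).sum() / t.size(0)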
\ No newline at end of file diff --git a/data/2024/aaai/No Internal Regret with Non-convex Loss Functions b/data/2024/aaai/No Internal Regret with Non-convex Loss Functions new file mode 100644 index 0000000000..f0d4cea18b --- /dev/null +++ b/data/2024/aaai/No Internal Regret with Non-convex Loss Functions @@ -0,0 +1 @@ +Internal regret is a measure of performance of an online learning algorithm, which measures the change in performance by substituting every occurrence of a given action i by an alternative action j. Algorithms for minimizing internal regret are known for the finite experts setting, including a general reduction to the problem of minimizing external regret for this case. The reduction however crucially depends on the finiteness of the action space. In this work we approach the problem of minimizing internal regret for a continuous action space. For the full information setting, we show how to obtain O(sqrt(T)) internal regret for the class of Lipschitz functions, as well as non-Lipschitz dispersed functions, i.e. the non-Lipschitzness may not concentrate in a small region of the action space. We also consider extensions to partial feedback settings, and again obtain sublinear internal regret. Finally we discuss applications of internal regret minimization over continuous spaces to correlated equilibria in pricing problems and auction design, as well as to data-driven hyperparameter tuning. \ No newline at end of file diff --git a/data/2024/aaai/No More Shortcuts: Realizing the Potential of Temporal Self-Supervision b/data/2024/aaai/No More Shortcuts: Realizing the Potential of Temporal Self-Supervision new file mode 100644 index 0000000000..29777ffdc4 --- /dev/null +++ b/data/2024/aaai/No More Shortcuts: Realizing the Potential of Temporal Self-Supervision @@ -0,0 +1 @@ +Self-supervised approaches for video have shown impressive results in video understanding tasks. However, unlike early works that leverage temporal self-supervision, current state-of-the-art methods primarily rely on tasks from the image domain (e.g., contrastive learning) that do not explicitly promote the learning of temporal features. We identify two factors that limit existing temporal self-supervision: 1) tasks are too simple, resulting in saturated training performance, and 2) we uncover shortcuts based on local appearance statistics that hinder the learning of high-level features. To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts. Our model extends a representation of single video frames, pre-trained through contrastive learning, with a transformer that we train through temporal self-supervision. We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision. Our extensive experiments show state-of-the-art performance across 10 video understanding datasets, illustrating the generalization ability and robustness of our learned video representations. Project Page: https://daveishan.github.io/nms-webpage. \ No newline at end of file diff --git a/data/2024/aaai/No Prejudice! Fair Federated Graph Neural Networks for Personalized Recommendation b/data/2024/aaai/No Prejudice! 
Fair Federated Graph Neural Networks for Personalized Recommendation new file mode 100644 index 0000000000..64222c9144 --- /dev/null +++ b/data/2024/aaai/No Prejudice! Fair Federated Graph Neural Networks for Personalized Recommendation @@ -0,0 +1 @@ +Ensuring fairness in Recommendation Systems (RSs) across demographic groups is critical due to the increased integration of RSs in applications such as personalized healthcare, finance, and e-commerce. Graph-based RSs play a crucial role in capturing intricate higher-order interactions among entities. However, integrating these graph models into the Federated Learning (FL) paradigm with fairness constraints poses formidable challenges as this requires access to the entire interaction graph and sensitive user information (such as gender, age, etc.) at the central server. This paper addresses the pervasive issue of inherent bias within RSs for different demographic groups without compromising the privacy of sensitive user attributes in FL environment with the graph-based model. To address the group bias, we propose F2PGNN (Fair Federated Personalized Graph Neural Network), a novel framework that leverages the power of Personalized Graph Neural Network (GNN) coupled with fairness considerations. Additionally, we use differential privacy techniques to fortify privacy protection. Experimental evaluation on three publicly available datasets showcases the efficacy of F2PGNN in mitigating group unfairness by 47% ∼ 99% compared to the state-of-the-art while preserving privacy and maintaining the utility. The results validate the significance of our framework in achieving equitable and personalized recommendations using GNN within the FL landscape. Source code is at: https://github.com/nimeshagrawal/F2PGNN-AAAI24 \ No newline at end of file diff --git a/data/2024/aaai/No Prior Mask: Eliminate Redundant Action for Deep Reinforcement Learning b/data/2024/aaai/No Prior Mask: Eliminate Redundant Action for Deep Reinforcement Learning new file mode 100644 index 0000000000..7ee8eff6aa --- /dev/null +++ b/data/2024/aaai/No Prior Mask: Eliminate Redundant Action for Deep Reinforcement Learning @@ -0,0 +1 @@ +The large action space is one fundamental obstacle to deploying Reinforcement Learning methods in the real world. The numerous redundant actions will cause the agents to make repeated or invalid attempts, even leading to task failure. Although current algorithms conduct some initial explorations for this issue, they either suffer from rule-based systems or depend on expert demonstrations, which significantly limits their applicability in many real-world settings. In this work, we examine the theoretical analysis of what action can be eliminated in policy optimization and propose a novel redundant action filtering mechanism. Unlike other works, our method constructs the similarity factor by estimating the distance between the state distributions, which requires no prior knowledge. In addition, we combine the modified inverse model to avoid extensive computation in high-dimensional state space. We reveal the underlying structure of action spaces and propose a simple yet efficient redundant action filtering mechanism named No Prior Mask (NPM) based on the above techniques. We show the superior performance of our method by conducting extensive experiments on high-dimensional, pixel-input, and stochastic problems with various action redundancy tasks. Our code is public online at https://github.com/zhongdy15/npm. 
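As a toy illustration of the idea in the No Prior Mask abstract above, namely masking actions whose induced state distributions lie close to one another, the sketch below compares mean next-state features between actions; it is a naive stand-in under assumed inputs, not the NPM algorithm itself.

    import numpy as np

    def build_action_mask(next_state_feats, eps=0.1):
        """next_state_feats: dict action_id -> (n_i, d) array of sampled next-state features.
        Marks actions whose mean next-state features lie within eps of an already-kept
        action as redundant (illustrative heuristic only)."""
        actions = sorted(next_state_feats)
        means = {a: np.asarray(next_state_feats[a]).mean(axis=0) for a in actions}
        kept, mask = [], {}
        for a in actions:
            redundant = any(np.linalg.norm(means[a] - means[k]) < eps for k in kept)
            mask[a] = redundant            # True = filtered out as redundant
            if not redundant:
                kept.append(a)
        return mask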
\ No newline at end of file diff --git a/data/2024/aaai/Noise-Aware Image Captioning with Progressively Exploring Mismatched Words b/data/2024/aaai/Noise-Aware Image Captioning with Progressively Exploring Mismatched Words new file mode 100644 index 0000000000..add3d40ca0 --- /dev/null +++ b/data/2024/aaai/Noise-Aware Image Captioning with Progressively Exploring Mismatched Words @@ -0,0 +1 @@ +Image captioning aims to automatically generate captions for images by learning a cross-modal generator from vision to language. The large number of image-text pairs required for training is usually sourced from the internet due to the cost of manual annotation, which introduces noise in the form of mismatched relevance that affects the learning process. Unlike traditional noisy label learning, the key challenge in processing noisy image-text pairs is to finely identify the mismatched words so as to make the most use of trustworthy information in the text, rather than coarsely weighting entire examples. To tackle this challenge, we propose a Noise-aware Image Captioning method (NIC) to adaptively mitigate the erroneous guidance from noise by progressively exploring mismatched words. Specifically, NIC first identifies mismatched words by quantifying word-label reliability from two aspects: 1) inter-modal representativeness, which measures the significance of the current word by assessing cross-modal correlation via prediction certainty; 2) intra-modal informativeness, which amplifies the effect of the current prediction by combining the quality of subsequent word generation. During optimization, NIC constructs pseudo-word-labels considering the reliability of the original word-labels and model convergence to periodically coordinate mismatched words. As a result, NIC can effectively exploit both clean and noisy image-text pairs to learn a more robust mapping function. Extensive experiments conducted on the MS-COCO and Conceptual Caption datasets validate the effectiveness of our method in various noisy scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Noise-Free Optimization in Early Training Steps for Image Super-resolution b/data/2024/aaai/Noise-Free Optimization in Early Training Steps for Image Super-resolution new file mode 100644 index 0000000000..b4e55921f5 --- /dev/null +++ b/data/2024/aaai/Noise-Free Optimization in Early Training Steps for Image Super-resolution @@ -0,0 +1 @@ +Recent deep-learning-based single image super-resolution (SISR) methods have shown impressive performance, with typical methods training their networks by minimizing the pixel-wise distance with respect to a given high-resolution (HR) image. However, despite the basic training scheme being the predominant choice, its use in the context of ill-posed inverse problems has not been thoroughly investigated. In this work, we aim to provide a better comprehension of the underlying constituents by decomposing target HR images into two subcomponents: (1) the optimal centroid which is the expectation over multiple potential HR images, and (2) the inherent noise defined as the residual between the HR image and the centroid. Our findings show that the current training scheme cannot capture the ill-posed nature of SISR and becomes vulnerable to the inherent noise term, especially during early training steps. To tackle this issue, we propose a novel optimization method that can effectively remove the inherent noise term in the early steps of vanilla training by estimating the optimal centroid and directly optimizing toward the estimation.
Experimental results show that the proposed method can effectively enhance the stability of vanilla training, leading to overall performance gain. Codes are available at github.com/2minkyulee/ECO. \ No newline at end of file diff --git a/data/2024/aaai/Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation b/data/2024/aaai/Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation new file mode 100644 index 0000000000..7e8fff3ff5 --- /dev/null +++ b/data/2024/aaai/Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation @@ -0,0 +1,3 @@ +Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet are automatically harvested for training. +However, it inevitably includes mismatched pairs, i.e., noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage deep neural networks' memorization effect to address noisy correspondences, which overconfidently focus on similarity-guided training with hard negatives and suffer from self-reinforcing errors. In light of above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). +Specifically, by viewing sample matching as classification tasks within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and estimate model's sensitivity of selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage negative matches overlooked in hard-negative training, further improving model optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM. \ No newline at end of file diff --git a/data/2024/aaai/Non-exemplar Domain Incremental Object Detection via Learning Domain Bias b/data/2024/aaai/Non-exemplar Domain Incremental Object Detection via Learning Domain Bias new file mode 100644 index 0000000000..db270b8afe --- /dev/null +++ b/data/2024/aaai/Non-exemplar Domain Incremental Object Detection via Learning Domain Bias @@ -0,0 +1 @@ +Domain incremental object detection (DIOD) aims to gradually learn a unified object detection model from a dataset stream composed of different domains, achieving good performance in all encountered domains. The most critical obstacle to this goal is the catastrophic forgetting problem, where the performance of the model improves rapidly in new domains but deteriorates sharply in old ones after a few sessions. To address this problem, we propose a non-exemplar DIOD method named learning domain bias (LDB), which learns domain bias independently at each new session, avoiding saving examples from old domains. Concretely, a base model is first obtained through training during session 1. Then, LDB freezes the weights of the base model and trains individual domain bias for each new incoming domain, adapting the base model to the distribution of new domains. At test time, since the domain ID is unknown, we propose a domain selector based on nearest mean classifier (NMC), which selects the most appropriate domain bias for a test image. 
Extensive experimental evaluations on two series of datasets demonstrate the effectiveness of the proposed LDB method in achieving high accuracy on new and old domain datasets. The code is available at https://github.com/SONGX1997/LDB. \ No newline at end of file diff --git a/data/2024/aaai/Non-exemplar Online Class-Incremental Continual Learning via Dual-Prototype Self-Augment and Refinement b/data/2024/aaai/Non-exemplar Online Class-Incremental Continual Learning via Dual-Prototype Self-Augment and Refinement new file mode 100644 index 0000000000..469a21d14a --- /dev/null +++ b/data/2024/aaai/Non-exemplar Online Class-Incremental Continual Learning via Dual-Prototype Self-Augment and Refinement @@ -0,0 +1 @@ +This paper investigates a new, practical, but challenging problem named Non-exemplar Online Class-incremental continual Learning (NO-CL), which aims to preserve the discernibility of base classes without buffering data examples and efficiently learn novel classes continuously in a single-pass (i.e., online) data stream. The challenges of this task are mainly two-fold: (1) Both base and novel classes suffer from severe catastrophic forgetting as no previous samples are available for replay. (2) As the online data can only be observed once, there is no way to fully re-train the whole model, e.g., re-calibrate the decision boundaries via prototype alignment or feature distillation. In this paper, we propose a novel Dual-prototype Self-augment and Refinement method (DSR) for NO-CL problem, which consists of two strategies: 1) Dual class prototypes: vanilla and high-dimensional prototypes are exploited to utilize the pre-trained information and obtain robust quasi-orthogonal representations rather than example buffers for both privacy preservation and memory reduction. 2) Self-augment and refinement: Instead of updating the whole network, we optimize high-dimensional prototypes alternatively with the extra projection module based on self-augment vanilla prototypes, through a bi-level optimization problem. Extensive experiments demonstrate the effectiveness and superiority of the proposed DSR in NO-CL. \ No newline at end of file diff --git a/data/2024/aaai/Non-flat ABA Is an Instance of Bipolar Argumentation b/data/2024/aaai/Non-flat ABA Is an Instance of Bipolar Argumentation new file mode 100644 index 0000000000..97a5483941 --- /dev/null +++ b/data/2024/aaai/Non-flat ABA Is an Instance of Bipolar Argumentation @@ -0,0 +1,3 @@ +Assumption-based Argumentation (ABA) is a well-known structured argumentation formalism, whereby arguments and attacks between them are drawn from rules, defeasible assumptions and their contraries. +A common restriction imposed on ABA frameworks (ABAFs) is that they are flat, i.e. each of the defeasible assumptions can only be assumed, but not derived. While it is known that flat ABAFs can be translated into abstract argumentation frameworks (AFs) as proposed by Dung, no translation exists from general, possibly non-flat ABAFs into any kind of abstract argumentation formalism. +In this paper, we close this gap and show that bipolar AFs (BAFs) can instantiate general ABAFs. To this end we develop suitable, novel BAF semantics which borrow from the notion of deductive support. We investigate basic properties of our BAFs, including computational complexity, and prove the desired relation to ABAFs under several semantics. 
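Background note for the Non-flat ABA entry above: a bipolar argumentation framework (BAF) can be stored simply as a set of arguments together with attack and support relations. The sketch below is a generic illustration of that data structure and the standard conflict-freeness check with respect to attacks; it is not the deductive-support semantics developed in the paper, and all names are illustrative.

```python
from itertools import product

# A bipolar AF: arguments plus directed attack and support relations.
arguments = {"a", "b", "c"}
attacks = {("a", "b")}    # a attacks b
supports = {("c", "a")}   # c supports a

def is_conflict_free(S: set, attacks: set) -> bool:
    """A set S is conflict-free if no argument in S attacks another member of S."""
    return not any((x, y) in attacks for x, y in product(S, S))

print(is_conflict_free({"a", "c"}, attacks))  # True
print(is_conflict_free({"a", "b"}, attacks))  # False
```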
\ No newline at end of file diff --git a/data/2024/aaai/Non-monotone Sequential Submodular Maximization b/data/2024/aaai/Non-monotone Sequential Submodular Maximization new file mode 100644 index 0000000000..def9462eb9 --- /dev/null +++ b/data/2024/aaai/Non-monotone Sequential Submodular Maximization @@ -0,0 +1,2 @@ +In this paper, we study a fundamental problem in submodular optimization known as sequential submodular maximization. The primary objective of this problem is to select and rank a sequence of items to optimize a group of submodular functions. +The existing research on this problem has predominantly concentrated on the monotone setting, assuming that the submodular functions are non-decreasing. However, in various real-world scenarios, like diversity-aware recommendation systems, adding items to an existing set might negatively impact the overall utility. In response, we propose to study this problem with non-monotone submodular functions and develop approximation algorithms for both flexible and fixed length constraints, as well as a special case with identical utility functions. The empirical evaluations further validate the effectiveness of our proposed algorithms in the domain of video recommendations. \ No newline at end of file diff --git a/data/2024/aaai/Non-parametric Representation Learning with Kernels b/data/2024/aaai/Non-parametric Representation Learning with Kernels new file mode 100644 index 0000000000..b1e6dfdd3f --- /dev/null +++ b/data/2024/aaai/Non-parametric Representation Learning with Kernels @@ -0,0 +1 @@ +Unsupervised and self-supervised representation learning has become popular in recent years for learning useful features from unlabelled data. Representation learning has been mostly developed in the neural network literature, and other models for representation learning are surprisingly unexplored. In this work, we introduce and analyze several kernel-based representation learning approaches: Firstly, we define two kernel Self-Supervised Learning (SSL) models using contrastive loss functions and secondly, a Kernel Autoencoder (AE) model based on the idea of embedding and reconstructing data. We argue that the classical representer theorems for supervised kernel machines are not always applicable for (self-supervised) representation learning, and present new representer theorems, which show that the representations learned by our kernel models can be expressed in terms of kernel matrices. We further derive generalisation error bounds for representation learning with kernel SSL and AE, and empirically evaluate the performance of these methods in both small data regimes as well as in comparison with neural network based models. \ No newline at end of file diff --git a/data/2024/aaai/Non-stationary Projection-Free Online Learning with Dynamic and Adaptive Regret Guarantees b/data/2024/aaai/Non-stationary Projection-Free Online Learning with Dynamic and Adaptive Regret Guarantees new file mode 100644 index 0000000000..2d86096d24 --- /dev/null +++ b/data/2024/aaai/Non-stationary Projection-Free Online Learning with Dynamic and Adaptive Regret Guarantees @@ -0,0 +1 @@ +Projection-free online learning has drawn increasing interest due to its efficiency in solving high-dimensional problems with complicated constraints. However, most existing projection-free online methods focus on minimizing the static regret, which unfortunately fails to capture the challenge of changing environments. 
In this paper, we investigate non-stationary projection-free online learning, and choose dynamic regret and adaptive regret to measure the performance. Specifically, we first provide a novel dynamic regret analysis for an existing projection-free method named BOGD_IP, and establish an O(T^¾ (1+P_T)) dynamic regret bound, where P_T denotes the path-length of the comparator sequence. Then, we improve the upper bound to O(T^¾ (1+P_T)^¼) by running multiple BOGD_IP algorithms with different step sizes in parallel, and tracking the best one on the fly. Our results are the first general-case dynamic regret bounds for projection-free online learning, and can recover the existing O(T^¾) static regret by setting P_T = 0. Furthermore, we propose a projection-free method to attain an O(τ^¾) adaptive regret bound for any interval with length τ, which nearly matches the static regret over that interval. The essential idea is to maintain a set of BOGD_IP algorithms dynamically, and combine them by a meta algorithm. Moreover, we demonstrate that it is also equipped with an O(T^¾ (1+P_T)^¼) dynamic regret bound. Finally, empirical studies verify our theoretical findings. \ No newline at end of file diff --git a/data/2024/aaai/NondBREM: Nondeterministic Offline Reinforcement Learning for Large-Scale Order Dispatching b/data/2024/aaai/NondBREM: Nondeterministic Offline Reinforcement Learning for Large-Scale Order Dispatching new file mode 100644 index 0000000000..dc5cd8591b --- /dev/null +++ b/data/2024/aaai/NondBREM: Nondeterministic Offline Reinforcement Learning for Large-Scale Order Dispatching @@ -0,0 +1 @@ +One of the most important tasks in ride-hailing is order dispatching, i.e., assigning unserved orders to available drivers. Order dispatching has recently achieved significant improvement due to advances in reinforcement learning, which has been shown to effectively address sequential decision-making problems like order dispatching. However, most existing reinforcement learning methods require agents to learn the optimal policy by interacting with environments online, which is challenging or impractical for real-world deployment due to high costs or safety concerns. For example, due to the spatiotemporally unbalanced supply and demand, online reinforcement learning-based order dispatching may significantly impact the revenue of the ride-hailing platform and passenger experience during the policy learning period. Hence, in this work, we develop an offline deep reinforcement learning framework called NondBREM for large-scale order dispatching, which learns the policy from only the accumulated logged data to avoid costly and unsafe interactions with the environment. In NondBREM, a Nondeterministic Batch-Constrained Q-learning (NondBCQ) module is developed to reduce the algorithm's extrapolation error, and a Random Ensemble Mixture (REM) module that integrates multiple value networks with multi-head networks is utilized to improve the model's generalization and robustness. Extensive experiments on large-scale real-world ride-hailing datasets show the superiority of our design.
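Background note for the non-stationary projection-free entry above: dynamic regret and the path-length P_T are standard quantities in online learning; a generic (not paper-specific) statement is:

```latex
\[
\mathrm{D\text{-}Regret}(T) \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \sum_{t=1}^{T} f_t(u_t),
\qquad
P_T \;=\; \sum_{t=2}^{T} \lVert u_t - u_{t-1} \rVert,
\]
```

where x_t is the learner's decision, u_1, ..., u_T is an arbitrary comparator sequence, and choosing a fixed comparator u_1 = ... = u_T (so that P_T = 0) recovers the usual static regret.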
\ No newline at end of file diff --git a/data/2024/aaai/Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models b/data/2024/aaai/Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models new file mode 100644 index 0000000000..930bc84e44 --- /dev/null +++ b/data/2024/aaai/Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models @@ -0,0 +1 @@ +As the size of large language models (LLMs) continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision while being cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and channel-wise distance constraint to update the weights of normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-sourced LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the same level of accuracy at 2-bit quantization as their float ones. Our simple and effective approach makes it more practical for real-world applications. \ No newline at end of file diff --git a/data/2024/aaai/Novax or Novak? Estimating Social Media Stance towards Celebrity Vaccine Hesitancy (Student Abstract) b/data/2024/aaai/Novax or Novak? Estimating Social Media Stance towards Celebrity Vaccine Hesitancy (Student Abstract) new file mode 100644 index 0000000000..c90b763137 --- /dev/null +++ b/data/2024/aaai/Novax or Novak? Estimating Social Media Stance towards Celebrity Vaccine Hesitancy (Student Abstract) @@ -0,0 +1 @@ +On 15 January 2022, noted tennis player Novak Djokovic was deported from Australia due to his unvaccinated status for the COVID-19 vaccine. This paper presents a stance classifier and evaluates public reaction to this episode and the impact of this behavior on social media discourse on YouTube. We observed a significant spike of individuals who supported and opposed his behavior at the time of the episode. Supporters outnumbered those who opposed this behavior by over 4x. Our study reports a disturbing trend that following every major Djokovic win, even now, vaccine skeptics often conflate his tennis success as a fitting reply to vaccine mandates. 
\ No newline at end of file diff --git a/data/2024/aaai/Novel Class Discovery for Representation of Real-World Heritage Data as Neural Radiance Fields (Student Abstract) b/data/2024/aaai/Novel Class Discovery for Representation of Real-World Heritage Data as Neural Radiance Fields (Student Abstract) new file mode 100644 index 0000000000..c5781ef75d --- /dev/null +++ b/data/2024/aaai/Novel Class Discovery for Representation of Real-World Heritage Data as Neural Radiance Fields (Student Abstract) @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have been extensively explored as a leading approach for modeling and representing 3D data across various domains. Their ability to capture arbitrary-scale point clouds and generate novel views makes them particularly valuable for digitizing cultural heritage sites. However, despite their impressive rendering capabilities, prior methods have often overlooked a significant real-world challenge: handling open-world scenarios characterized by unstructured data containing multiple classes in a single set of unlabeled images. To address this challenge, we propose a novel method, NCD-NeRF, that leverages Novel-Class Discovery to effectively tackle the complexities inherent in real-world data with unlabeled classes while excelling in producing high-quality NeRF representations. To validate our approach, we conducted a benchmarking analysis using a custom-collected dataset featuring UNESCO World Heritage sites in India. We observe that our proposed NCD-NeRF can discover novel classes and render high-quality 3D volumes in parallel. \ No newline at end of file diff --git a/data/2024/aaai/Novel Class Discovery in Chest X-rays via Paired Images and Text b/data/2024/aaai/Novel Class Discovery in Chest X-rays via Paired Images and Text new file mode 100644 index 0000000000..c17a337b09 --- /dev/null +++ b/data/2024/aaai/Novel Class Discovery in Chest X-rays via Paired Images and Text @@ -0,0 +1 @@ +Novel class discovery (NCD) aims to identify new classes undefined during the model training phase with the help of knowledge of known classes. Many methods have been proposed and have notably boosted the performance of NCD on natural images. However, there has been no work on discovering new classes based on medical images and disease categories, which is crucial for understanding and diagnosing specific diseases. Moreover, most of the existing methods only utilize information from the image modality and use labels as the only supervisory information. In this paper, we propose a multi-modal novel class discovery method based on paired images and text, inspired by the low classification accuracy of chest X-ray images and the relatively higher accuracy of the paired text. Specifically, we first pretrain the image encoder and text encoder with multi-modal contrastive learning on the entire dataset, and then we generate pseudo-labels separately on the image branch and text branch. We utilize intra-modal consistency to assess the quality of pseudo-labels and adjust the weights of the pseudo-labels from both branches to generate the ultimate pseudo-labels for training. Experiments on eight subset splits of the MIMIC-CXR-JPG dataset show that our method improves the clustering performance of unlabeled classes by about 10% on average compared to state-of-the-art methods. Code is available at: https://github.com/zzzzzzzzjy/MMNCD-main. \ No newline at end of file diff --git a/data/2024/aaai/Novelty vs.
Potential Heuristics: A Comparison of Hardness Measures for Satisficing Planning b/data/2024/aaai/Novelty vs. Potential Heuristics: A Comparison of Hardness Measures for Satisficing Planning new file mode 100644 index 0000000000..336c95c452 --- /dev/null +++ b/data/2024/aaai/Novelty vs. Potential Heuristics: A Comparison of Hardness Measures for Satisficing Planning @@ -0,0 +1,4 @@ +Classical planning considers a given task and searches for a plan to solve it. Some tasks are harder to solve than others. We can measure the 'hardness' of a task with the novelty width and the correlation complexity. In this work, we compare these measures. +Additionally, we introduce the river measure, a new measure that is based on potential heuristics and therefore similar to the correlation complexity but also comparable to the novelty width. +We show that the river measure is upper bounded by the correlation complexity and by the novelty width +1. +Furthermore, we show that we can convert a planning task with a polynomial blowup of the task size to ensure that a heuristic of dimension 2 exists that gives rise to backtrack-free search. \ No newline at end of file diff --git a/data/2024/aaai/Nowcasting Temporal Trends Using Indirect Surveys b/data/2024/aaai/Nowcasting Temporal Trends Using Indirect Surveys new file mode 100644 index 0000000000..0e70fee98e --- /dev/null +++ b/data/2024/aaai/Nowcasting Temporal Trends Using Indirect Surveys @@ -0,0 +1 @@ +Indirect surveys, in which respondents provide information about other people they know, have been proposed for estimating (nowcasting) the size of a hidden population where privacy is important or the hidden population is hard to reach. Examples include estimating casualties in an earthquake, conditions among female sex workers, and the prevalence of drug use and infectious diseases. The Network Scale-up Method (NSUM) is the classical approach to developing estimates from indirect surveys, but it was designed for one-shot surveys. Further, it requires certain assumptions and asking for or estimating the number of individuals in each respondent's network. In recent years, surveys have been increasingly deployed online and can collect data continuously (e.g., COVID-19 surveys on Facebook during much of the pandemic). Conventional NSUM can be applied to these scenarios by analyzing the data independently at each point in time, but this misses the opportunity of leveraging the temporal dimension. We propose to use the responses from indirect surveys collected over time and develop analytical tools (i) to prove that indirect surveys can provide better estimates for the trends of the hidden population over time, as compared to direct surveys and (ii) to identify appropriate temporal aggregations to improve the estimates. We demonstrate through extensive simulations that our approach outperforms traditional NSUM and direct surveying methods. We also empirically demonstrate the superiority of our approach on a real indirect survey dataset of COVID-19 cases. 
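Background note for the indirect-survey entry above: the classical NSUM scale-up estimator mentioned there takes a simple ratio form. The sketch below is a generic, illustrative version (not the temporal estimator proposed in the paper); all names are hypothetical.

```python
def nsum_estimate(hidden_counts, network_sizes, total_population):
    """Classical network scale-up estimate of a hidden population's size.

    hidden_counts:    per-respondent number of known people in the hidden group.
    network_sizes:    per-respondent estimated personal network size (degree).
    total_population: size of the overall population the respondents belong to.
    """
    return total_population * sum(hidden_counts) / sum(network_sizes)

# Toy usage: 3 respondents.
print(nsum_estimate([2, 0, 1], [150, 200, 120], total_population=10_000))  # ~63.8
```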
\ No newline at end of file diff --git a/data/2024/aaai/NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario b/data/2024/aaai/NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario new file mode 100644 index 0000000000..d1cafbe75e --- /dev/null +++ b/data/2024/aaai/NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario @@ -0,0 +1 @@ +We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics prove that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at https://github.com/qiantianwen/NuScenes-QA. \ No newline at end of file diff --git a/data/2024/aaai/Null Space Matters: Range-Null Decomposition for Consistent Multi-Contrast MRI Reconstruction b/data/2024/aaai/Null Space Matters: Range-Null Decomposition for Consistent Multi-Contrast MRI Reconstruction new file mode 100644 index 0000000000..a8df0254a1 --- /dev/null +++ b/data/2024/aaai/Null Space Matters: Range-Null Decomposition for Consistent Multi-Contrast MRI Reconstruction @@ -0,0 +1 @@ +Consistency and interpretability have long been the critical issues in MRI reconstruction. While interpretability has been dramatically improved with the employment of deep unfolding networks (DUNs), current methods still suffer from inconsistencies and generate inferior anatomical structure. Especially in multi-contrast scenes, different imaging protocols often exacerbate the concerned issue. In this paper, we propose a range-null decomposition-assisted DUN architecture to ensure consistency while still providing desirable interpretability. Given the input decomposed, we argue that the inconsistency could be analytically relieved by feeding solely the null-space component into proximal mapping, while leaving the range-space counterpart fixed. More importantly, a correlation decoupling scheme is further proposed to narrow the information gap for multi-contrast fusion, which dynamically borrows isotropic features from the opponent while maintaining the modality-specific ones. Specifically, the two features are attached to different frequencies and learned individually by the newly designed isotropy encoder and anisotropy encoder. 
The former strives for the contrast-shared information, while the latter serves to capture the contrast-specific features. The quantitative and qualitative results show that our proposal outperforms most cutting-edge methods by a large margin. Codes will be released on https://github.com/chenjiachengzzz/RNU. \ No newline at end of file diff --git a/data/2024/aaai/OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning b/data/2024/aaai/OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning new file mode 100644 index 0000000000..ca751a8783 --- /dev/null +++ b/data/2024/aaai/OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning @@ -0,0 +1,7 @@ +Model-based offline reinforcement learning (RL) algorithms have emerged as a promising paradigm for offline RL. +These algorithms usually learn a dynamics model from a static dataset of transitions, use the model to generate synthetic trajectories, and perform conservative policy optimization within these trajectories. +However, our observations indicate that policy optimization methods used in these model-based offline RL algorithms are not effective at exploring the learned model and induce biased exploration, which ultimately impairs the performance of the algorithm. +To address this issue, we propose Offline Conservative ExplorAtioN (OCEAN), a novel rollout approach to model-based offline RL. +In our method, we incorporate additional exploration techniques and introduce three conservative constraints based on uncertainty estimation to mitigate the potential impact of significant dynamic errors resulting from exploratory transitions. +Our work is a plug-in method and can be combined with classical model-based RL algorithms, such as MOPO, COMBO, and RAMBO. +Experiment results of our method on the D4RL MuJoCo benchmark show that OCEAN significantly improves the performance of existing algorithms. \ No newline at end of file diff --git a/data/2024/aaai/ODTrack: Online Dense Temporal Token Learning for Visual Tracking b/data/2024/aaai/ODTrack: Online Dense Temporal Token Learning for Visual Tracking new file mode 100644 index 0000000000..d3046832cf --- /dev/null +++ b/data/2024/aaai/ODTrack: Online Dense Temporal Token Learning for Visual Tracking @@ -0,0 +1 @@ +Online contextual reasoning and association across consecutive video frames are critical to perceive instances in visual tracking. However, most current top-performing trackers persistently lean on sparse temporal relationships between reference and search frames via an offline mode. Consequently, they can only interact independently within each image-pair and establish limited temporal correlations. To alleviate the above problem, we propose a simple, flexible and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames in an online token propagation manner. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discrimination features (localization information) of a target into a token sequence to achieve frame-to-frame association. 
This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for the inference in the next video frame, whereby past information is leveraged to guide future inference; 2) the complex online update strategies are effectively avoided by the iterative propagation of token sequences, and thus we can achieve more efficient model representation and computation. ODTrack achieves a new SOTA performance on seven benchmarks, while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack. \ No newline at end of file diff --git a/data/2024/aaai/ORES: Open-Vocabulary Responsible Visual Synthesis b/data/2024/aaai/ORES: Open-Vocabulary Responsible Visual Synthesis new file mode 100644 index 0000000000..df753354f4 --- /dev/null +++ b/data/2024/aaai/ORES: Open-Vocabulary Responsible Visual Synthesis @@ -0,0 +1 @@ +Avoiding synthesizing specific visual concepts is an essential challenge in responsible visual synthesis. However, the visual concepts that need to be avoided for responsible visual synthesis tend to be diverse, depending on the region, context, and usage scenarios. In this work, we formalize a new task, Open-vocabulary Responsible Visual Synthesis (ORES), where the synthesis model is able to avoid forbidden visual concepts while allowing users to input any desired content. To address this problem, we present a Two-stage Intervention (TIN) framework. By introducing 1) rewriting with learnable instruction through a large-scale language model (LLM) and 2) synthesizing with prompt intervention on a diffusion synthesis model, it can effectively synthesize images that avoid the forbidden concepts while following the user's query as much as possible. To evaluate on ORES, we provide a publicly available dataset, baseline models, and benchmark. Experimental results demonstrate the effectiveness of our method in reducing the risks of image generation. Our work highlights the potential of LLMs in responsible visual synthesis. Our code and dataset are publicly available at https://github.com/kodenii/ORES. \ No newline at end of file diff --git a/data/2024/aaai/OSFFNet: Omni-Stage Feature Fusion Network for Lightweight Image Super-Resolution b/data/2024/aaai/OSFFNet: Omni-Stage Feature Fusion Network for Lightweight Image Super-Resolution new file mode 100644 index 0000000000..436e04976b --- /dev/null +++ b/data/2024/aaai/OSFFNet: Omni-Stage Feature Fusion Network for Lightweight Image Super-Resolution @@ -0,0 +1 @@ +Recently, several lightweight methods have been proposed to implement single-image super-resolution (SISR) on resource-constrained devices. However, these methods primarily focus on simplifying network structures without fully utilizing shallow features. The fact remains that shallow features encompass crucial details for the super-resolution task, including edges, textures, and colors. Therefore, developing a novel architecture that can effectively integrate features from different levels and capitalize on their mutual complementarity is necessary. We first analyze the relationship between multi-stage features and the restoration tasks in a classic lightweight SR method. Based on these observations, we propose an Omni-Stage Feature Fusion (OSFF) architecture, which incorporates Original Image Stacked Initialisation, Shallow Feature Global Connection, and Multi-Receptive Field Dynamic Fusion. An Attention-Enhanced Feature Distillation module is also designed to enhance the model performance.
Finally, leveraging these contributions, we construct an Omni-Stage Feature Fusion Network (OSFFNet). Through extensive experiments on various benchmark datasets, the proposed model outperforms state-of-the-art methods. Notably, it achieves a 0.26dB PSNR improvement over the second-best method for x2 SR on the Urban100 dataset. \ No newline at end of file diff --git a/data/2024/aaai/OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples b/data/2024/aaai/OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples new file mode 100644 index 0000000000..34731baf14 --- /dev/null +++ b/data/2024/aaai/OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples @@ -0,0 +1 @@ +Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors lack robustness against attacks: they degrade detection accuracy by simply paraphrasing LLM-generated texts. Furthermore, a malicious user might attempt to deliberately evade the detectors based on detection results, but this has not been assumed in previous studies. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points F1-score. Furthermore, the proposed detector shows a state-of-the-art detection performance: up to 96.9 points F1-score, beating existing detectors on non-attacked texts. Finally, the proposed attacker drastically degrades the performance of detectors by up to -57.0 points F1-score, massively outperforming the baseline paraphrasing method for evading detection. \ No newline at end of file diff --git a/data/2024/aaai/OVD-Explorer: Optimism Should Not Be the Sole Pursuit of Exploration in Noisy Environments b/data/2024/aaai/OVD-Explorer: Optimism Should Not Be the Sole Pursuit of Exploration in Noisy Environments new file mode 100644 index 0000000000..1456f65db8 --- /dev/null +++ b/data/2024/aaai/OVD-Explorer: Optimism Should Not Be the Sole Pursuit of Exploration in Noisy Environments @@ -0,0 +1 @@ +In reinforcement learning, the optimism in the face of uncertainty (OFU) is a mainstream principle for directing exploration towards less explored areas, characterized by higher uncertainty. However, in the presence of environmental stochasticity (noise), purely optimistic exploration may lead to excessive probing of high-noise areas, consequently impeding exploration efficiency. Hence, in exploring noisy environments, while optimism-driven exploration serves as a foundation, prudent attention to alleviating unnecessary over-exploration in high-noise areas becomes beneficial. 
In this work, we propose Optimistic Value Distribution Explorer (OVD-Explorer) to achieve a noise-aware optimistic exploration for continuous control. OVD-Explorer proposes a new measurement of the policy's exploration ability considering noise in optimistic perspectives, and leverages gradient ascent to drive exploration. Practically, OVD-Explorer can be easily integrated with continuous control RL algorithms. Extensive evaluations on the MuJoCo and GridChaos tasks demonstrate the superiority of OVD-Explorer in achieving noise-aware optimistic exploration. \ No newline at end of file diff --git a/data/2024/aaai/OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models b/data/2024/aaai/OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models new file mode 100644 index 0000000000..18380cccc4 --- /dev/null +++ b/data/2024/aaai/OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) with hundreds of billions of parameters require powerful server-grade GPUs for inference, limiting their practical deployment. To address this challenge, we introduce the outlier-aware weight quantization (OWQ) method, which aims to minimize LLM's footprint through low-precision representation. OWQ prioritizes a small subset of structured weights sensitive to quantization, storing them in high-precision, while applying highly tuned quantization to the remaining dense weights. This sensitivity-aware mixed-precision scheme reduces the quantization error notably, and extensive experiments demonstrate that 3.1-bit models using OWQ perform comparably to 4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a parameter-efficient fine-tuning for task-specific adaptation, called weak column tuning (WCT), enabling accurate task-specific LLM adaptation with minimal memory overhead in the optimized format. OWQ represents a notable advancement in the flexibility, efficiency, and practicality of LLM optimization literature. The source code is available at https://github.com/xvyaward/owq. \ No newline at end of file diff --git a/data/2024/aaai/Object Attribute Matters in Visual Question Answering b/data/2024/aaai/Object Attribute Matters in Visual Question Answering new file mode 100644 index 0000000000..989938e294 --- /dev/null +++ b/data/2024/aaai/Object Attribute Matters in Visual Question Answering @@ -0,0 +1 @@ +Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify them, which has been overlooked in previous research. In this paper, we propose a novel VQA approach from the perspective of utilizing object attribute, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features contribute to solving fine-grained problem like counting-question. 
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness. Furthermore, to augment scene understanding and the out-of-distribution performance, the contrastive knowledge distillation module introduces a series of implicit knowledge. We distill knowledge into attributes through contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-linguistic alignment. Intensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, show the superiority of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering b/data/2024/aaai/Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering new file mode 100644 index 0000000000..542d5f7be1 --- /dev/null +++ b/data/2024/aaai/Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering @@ -0,0 +1 @@ +This paper focuses on the Audio-Visual Question Answering (AVQA) task that aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and explore the multi-modal relations (i.e., the object, audio, and question) in terms of feature interaction and model optimization. For the former, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate audio/visual modalities on respective keywords of the question and designs a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the highly semantic-matched multi-modal pair as positivity. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selecting process is adaptive as the positivity pairs selected in each video frame may be different. These two object-aware objectives help the model understand which objects are exactly relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance. The code is available at https://github.com/zhangbin-ai/APL. \ No newline at end of file diff --git a/data/2024/aaai/Object-Aware Domain Generalization for Object Detection b/data/2024/aaai/Object-Aware Domain Generalization for Object Detection new file mode 100644 index 0000000000..ffd7505d41 --- /dev/null +++ b/data/2024/aaai/Object-Aware Domain Generalization for Object Detection @@ -0,0 +1 @@ +Single-domain generalization (S-DG) aims to generalize a model to unseen environments with a single-source domain. However, most S-DG approaches have been conducted in the field of classification.
When these approaches are applied to object detection, the semantic features of some objects can be damaged, which can lead to imprecise object localization and misclassification. To address these problems, we propose an object-aware domain generalization (OA-DG) method for single-domain generalization in object detection. Our method consists of data augmentation and training strategy, which are called OA-Mix and OA-Loss, respectively. OA-Mix generates multi-domain data with multi-level transformation and object-aware mixing strategy. OA-Loss enables models to learn domain-invariant representations for objects and backgrounds from the original and OA-Mixed images. Our proposed method outperforms state-of-the-art works on standard benchmarks. Our code is available at https://github.com/WoojuLee24/OA-DG. \ No newline at end of file diff --git a/data/2024/aaai/Occluded Person Re-identification via Saliency-Guided Patch Transfer b/data/2024/aaai/Occluded Person Re-identification via Saliency-Guided Patch Transfer new file mode 100644 index 0000000000..3e0714f43e --- /dev/null +++ b/data/2024/aaai/Occluded Person Re-identification via Saliency-Guided Patch Transfer @@ -0,0 +1 @@ +While generic person re-identification has made remarkable improvement in recent years, these methods are designed under the assumption that the entire body of the person is available. This assumption brings about a significant performance degradation when suffering from occlusion caused by various obstacles in real-world applications. To address this issue, data-driven strategies have emerged to enhance the model's robustness to occlusion. Following the random erasing paradigm, these strategies typically employ randomly generated noise to supersede randomly selected image regions to simulate obstacles. However, the random strategy is not sensitive to location and content, meaning they cannot mimic real-world occlusion cases in application scenarios. To overcome this limitation and fully exploit the real scene information in datasets, this paper proposes a more intuitive and effective data-driven strategy named Saliency-Guided Patch Transfer (SPT). Combined with the vision transformer, SPT divides person instances and background obstacles using salient patch selection. By transferring person instances to different background obstacles, SPT can easily generate photo-realistic occluded samples. Furthermore, we propose an occlusion-aware Intersection over Union (OIoU) with mask-rolling to filter the more suitable combination and a class-ignoring strategy to achieve more stable processing. Extensive experimental evaluations conducted on occluded and holistic person re-identification benchmarks demonstrate that SPT provides a significant performance gain among different ViT-based ReID algorithms on occluded ReID. \ No newline at end of file diff --git a/data/2024/aaai/OctOcc: High-Resolution 3D Occupancy Prediction with Octree b/data/2024/aaai/OctOcc: High-Resolution 3D Occupancy Prediction with Octree new file mode 100644 index 0000000000..5ebff3a2ce --- /dev/null +++ b/data/2024/aaai/OctOcc: High-Resolution 3D Occupancy Prediction with Octree @@ -0,0 +1,9 @@ +3D semantic occupancy has garnered considerable attention due to its abundant structural information encompassing the entire scene in autonomous driving. +However, existing 3D occupancy prediction methods contend with the constraint of low-resolution 3D voxel features arising from the limitation of computational memory. 
+To address this limitation and achieve a more fine-grained representation of 3D scenes, we propose OctOcc, a novel octree-based approach for 3D semantic occupancy prediction. +OctOcc is conceptually rooted in the observation that the vast majority of 3D space is left unoccupied. +Capitalizing on this insight, we endeavor to cultivate memory-efficient high-resolution 3D occupancy predictions by mitigating superfluous cross-attentions. +Specifically, we devise a hierarchical octree structure that selectively generates finer-grained cross-attentions solely in potentially occupied regions. +Extending our inquiry beyond 3D space, we identify analogous redundancies within another side of cross attentions, 2D images. +Consequently, a 2D image feature filtering network is conceived to expunge extraneous regions. +Experimental results demonstrate that the proposed OctOcc significantly outperforms existing methods on nuScenes and SemanticKITTI datasets with limited memory consumption. \ No newline at end of file diff --git a/data/2024/aaai/Offline Model-Based Optimization via Policy-Guided Gradient Search b/data/2024/aaai/Offline Model-Based Optimization via Policy-Guided Gradient Search new file mode 100644 index 0000000000..b265f61200 --- /dev/null +++ b/data/2024/aaai/Offline Model-Based Optimization via Policy-Guided Gradient Search @@ -0,0 +1 @@ +Offline optimization is an emerging problem in many experimental engineering domains including protein, drug or aircraft design, where online experimentation to collect evaluation data is too expensive or dangerous. To avoid that, one has to optimize an unknown function given only its offline evaluation at a fixed set of inputs. A naive solution to this problem is to learn a surrogate model of the unknown function and optimize this surrogate instead. However, such a naive optimizer is prone to erroneous overestimation of the surrogate (possibly due to over-fitting on a biased sample of function evaluation) on inputs outside the offline dataset. Prior approaches addressing this challenge have primarily focused on learning robust surrogate models. However, their search strategies are derived from the surrogate model rather than the actual offline data. To fill this important gap, we introduce a new learning-to-search perspective for offline optimization by reformulating it as an offline reinforcement learning problem. Our proposed policy-guided gradient search approach explicitly learns the best policy for a given surrogate model created from the offline data. Our empirical results on multiple benchmarks demonstrate that the learned optimization policy can be combined with existing offline surrogates to significantly improve the optimization performance. \ No newline at end of file diff --git a/data/2024/aaai/Offline and Online Optical Flow Enhancement for Deep Video Compression b/data/2024/aaai/Offline and Online Optical Flow Enhancement for Deep Video Compression new file mode 100644 index 0000000000..acb3efc04c --- /dev/null +++ b/data/2024/aaai/Offline and Online Optical Flow Enhancement for Deep Video Compression @@ -0,0 +1 @@ +Video compression relies heavily on exploiting the temporal redundancy between video frames, which is usually achieved by estimating and using the motion information. The motion information is represented as optical flows in most of the existing deep video compression networks. Indeed, these networks often adopt pre-trained optical flow estimation networks for motion estimation. 
The optical flows, however, may be less suitable for video compression due to the following two factors. First, the optical flow estimation networks were trained to perform inter-frame prediction as accurately as possible, but the optical flows themselves may cost too many bits to encode. Second, the optical flow estimation networks were trained on synthetic data, and may not generalize well enough to real-world videos. We address the twofold limitations by enhancing the optical flows in two stages: offline and online. In the offline stage, we fine-tune a trained optical flow estimation network with the motion information provided by a traditional (non-deep) video compression scheme, e.g. H.266/VVC, as we believe the motion information of H.266/VVC achieves a better rate-distortion trade-off. In the online stage, we further optimize the latent features of the optical flows with a gradient descent-based algorithm for the video to be compressed, so as to enhance the adaptivity of the optical flows. We conduct experiments on two state-of-the-art deep video compression schemes, DCVC and DCVC-DC. Experimental results demonstrate that the proposed offline and online enhancement together achieves on average 13.4% bitrate saving for DCVC and 4.1% bitrate saving for DCVC-DC on the tested videos, without increasing the model or computational complexity of the decoder side. \ No newline at end of file diff --git a/data/2024/aaai/Omega-Regular Decision Processes b/data/2024/aaai/Omega-Regular Decision Processes new file mode 100644 index 0000000000..9b62b7e1ea --- /dev/null +++ b/data/2024/aaai/Omega-Regular Decision Processes @@ -0,0 +1 @@ +Regular decision processes (RDPs) are a subclass of non-Markovian decision processes where the transition and reward functions are guarded by some regular property of the past (a lookback). While RDPs enable intuitive and succinct representation of non-Markovian decision processes, their expressive power coincides with finite-state Markov decision processes (MDPs). We introduce omega-regular decision processes (ODPs) where the non-Markovian aspect of the transition and reward functions are extended to an omega-regular lookahead over the system evolution. Semantically, these lookaheads can be considered as promises made by the decision maker or the learning agent about her future behavior. In particular, we assume that, if the promised lookaheads are not met, then the payoff to the decision maker is falsum (least desirable payoff), overriding any rewards collected by the decision maker. We enable optimization and learning for ODPs under the discounted-reward objective by reducing them to lexicographic optimization and learning over finite MDPs. We present experimental results demonstrating the effectiveness of the proposed reduction. \ No newline at end of file diff --git a/data/2024/aaai/Omni-Kernel Network for Image Restoration b/data/2024/aaai/Omni-Kernel Network for Image Restoration new file mode 100644 index 0000000000..2664f78612 --- /dev/null +++ b/data/2024/aaai/Omni-Kernel Network for Image Restoration @@ -0,0 +1 @@ +Image restoration aims to reconstruct a high-quality image from a degraded low-quality observation. Recently, Transformer models have achieved promising performance on image restoration tasks due to their powerful ability to model long-range dependencies. However, the quadratically growing complexity with respect to the input size makes them inapplicable to practical applications. 
In this paper, we develop an efficient convolutional network for image restoration by enhancing multi-scale representation learning. To this end, we propose an omni-kernel module that consists of three branches, i.e., global, large, and local branches, to learn global-to-local feature representations efficiently. Specifically, the global branch achieves a global perceptive field via the dual-domain channel attention and frequency-gated mechanism. Furthermore, to provide multi-grained receptive fields, the large branch is formulated via different shapes of depth-wise convolutions with unusually large kernel sizes. Moreover, we complement local information using a point-wise depth-wise convolution. Finally, the proposed network, dubbed OKNet, is established by inserting the omni-kernel module into the bottleneck position for efficiency. Extensive experiments demonstrate that our network achieves state-of-the-art performance on 11 benchmark datasets for three representative image restoration tasks, including image dehazing, image desnowing, and image defocus deblurring. The code is available at https://github.com/c-yn/OKNet. \ No newline at end of file diff --git a/data/2024/aaai/Omnidirectional Image Super-resolution via Bi-projection Fusion b/data/2024/aaai/Omnidirectional Image Super-resolution via Bi-projection Fusion new file mode 100644 index 0000000000..2fbd121719 --- /dev/null +++ b/data/2024/aaai/Omnidirectional Image Super-resolution via Bi-projection Fusion @@ -0,0 +1 @@ +With the rapid development of virtual reality, omnidirectional images (ODIs) have attracted much attention from both the industrial community and academia. However, due to storage and transmission limitations, the resolution of current ODIs is often insufficient to provide an immersive virtual reality experience. Previous approaches address this issue using conventional 2D super-resolution techniques on equirectangular projection without exploiting the unique geometric properties of ODIs. In particular, the equirectangular projection (ERP) provides a complete field-of-view but introduces significant distortion, while the cubemap projection (CMP) can reduce distortion yet has a limited field-of-view. In this paper, we present a novel Bi-Projection Omnidirectional Image Super-Resolution (BPOSR) network to take advantage of the geometric properties of the above two projections. Then, we design two tailored attention methods for these projections: Horizontal Striped Transformer Block (HSTB) for ERP and Perspective Shift Transformer Block (PSTB) for CMP. Furthermore, we propose a fusion module to make these projections complement each other. Extensive experiments demonstrate that BPOSR achieves state-of-the-art performance on omnidirectional image super-resolution. The code is available at https://github.com/W-JG/BPOSR. 
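Background note for the BPOSR entry above: the equirectangular projection (ERP) maps image pixels to longitude/latitude on the sphere, which is the geometric fact behind its distortion pattern. The sketch below is a generic, illustrative mapping from an ERP pixel to a spherical direction (not code from the paper); all names are hypothetical.

```python
import math

def erp_pixel_to_direction(x: float, y: float, width: int, height: int):
    """Map an equirectangular (ERP) pixel to longitude/latitude and a unit 3D ray.

    x, y:   pixel coordinates (0..width-1, 0..height-1).
    width:  ERP image width  (covers 360 degrees of longitude).
    height: ERP image height (covers 180 degrees of latitude).
    """
    lon = ((x + 0.5) / width - 0.5) * 2.0 * math.pi   # in (-pi, pi)
    lat = (0.5 - (y + 0.5) / height) * math.pi        # in (-pi/2, pi/2)
    dx = math.cos(lat) * math.sin(lon)
    dy = math.sin(lat)
    dz = math.cos(lat) * math.cos(lon)
    return (lon, lat), (dx, dy, dz)

# Toy usage: the center pixel of a 2048x1024 ERP image looks straight ahead (+z).
print(erp_pixel_to_direction(1023.5, 511.5, 2048, 1024))
```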
\ No newline at end of file diff --git a/data/2024/aaai/Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency b/data/2024/aaai/Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency new file mode 100644 index 0000000000..3df1284597 --- /dev/null +++ b/data/2024/aaai/Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency @@ -0,0 +1 @@ +Natural language video localization plays a pivotal role in video understanding, and leveraging weakly-labeled data is considered a promising approach to circumvent the labor-intensive process of manual annotation. However, this approach encounters two significant challenges: 1) limited input distribution, namely that the limited writing styles of the language queries, annotated by human annotators, hinder the model's generalization to real-world scenarios with diverse vocabularies and sentence structures; 2) incomplete ground truth, which provides insufficient supervision. To overcome these challenges, we propose an omnipotent distillation algorithm with large language models (LLMs). The distribution of the input samples is enriched to obtain diverse multi-view versions, and a consistency constraint then regularizes the agreement among their results for distillation. Specifically, we first train our teacher model with the proposed intra-model agreement, where multiple sub-models are supervised by each other. Then, we leverage the LLM to paraphrase the language query and distill the teacher model into a lightweight student model by enforcing consistency between the localization results of the paraphrased sentence and the original one. In addition, to assess the generalization of the model across different dimensions of language variation, we create extensive datasets by building upon existing datasets. Our experiments demonstrate substantial performance improvements that adapt to diverse kinds of language queries. \ No newline at end of file diff --git a/data/2024/aaai/On Alternating-Time Temporal Logic, Hyperproperties, and Strategy Sharing b/data/2024/aaai/On Alternating-Time Temporal Logic, Hyperproperties, and Strategy Sharing new file mode 100644 index 0000000000..6418fdc4d2 --- /dev/null +++ b/data/2024/aaai/On Alternating-Time Temporal Logic, Hyperproperties, and Strategy Sharing @@ -0,0 +1,7 @@ +Alternating-time temporal logic (ATL*) is a well-established framework for formal reasoning about multi-agent systems. +However, while ATL* can reason about the strategic ability of agents (e.g., some coalition A can ensure that a goal is reached eventually), we cannot compare multiple strategic interactions, nor can we require multiple agents to follow the same strategy. +For example, we cannot state that coalition A can reach a goal sooner (or more often) than some other coalition A'. +In this paper, we propose HyperATL*_S, an extension of ATL* in which we can (1) compare the outcome of multiple strategic interactions w.r.t. a hyperproperty, i.e., a property that refers to multiple paths at the same time, and (2) enforce that some agents share the same strategy. +We show that HyperATL*_S is a rich specification language that captures important AI-related properties that were out of reach of existing logics. +We prove that model checking of HyperATL*_S on concurrent game structures is decidable.
+We implement our model-checking algorithm in a tool we call HyMASMC and evaluate it on a range of benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/On Computing Makespan-Optimal Solutions for Generalized Sliding-Tile Puzzles b/data/2024/aaai/On Computing Makespan-Optimal Solutions for Generalized Sliding-Tile Puzzles new file mode 100644 index 0000000000..d9255a4c7f --- /dev/null +++ b/data/2024/aaai/On Computing Makespan-Optimal Solutions for Generalized Sliding-Tile Puzzles @@ -0,0 +1 @@ +In the 15-puzzle game, 15 labeled square tiles are reconfigured on a 4 × 4 board through an escort, wherein in each (time) step, a single tile neighboring it may slide into it, leaving the space previously occupied by the tile as the new escort. We study a generalized sliding-tile puzzle (GSTP) in which (1) there are 1+ escorts and (2) multiple tiles can move synchronously in a single time step. Compared with popular discrete multi-agent/robot motion models, GSTP provides a more accurate model for a broad array of high-utility applications, including warehouse automation and autonomous garage parking, but is less studied due to the more involved tile interactions. In this work, we analyze optimal GSTP solution structures, establishing that computing makespan-optimal solutions for GSTP is NP-complete and developing polynomial time algorithms yielding makespans approximating the minimum with expected/high probability constant factors, assuming randomized start and goal configurations. \ No newline at end of file diff --git a/data/2024/aaai/On Disentanglement of Asymmetrical Knowledge Transfer for Modality-Task Agnostic Federated Learning b/data/2024/aaai/On Disentanglement of Asymmetrical Knowledge Transfer for Modality-Task Agnostic Federated Learning new file mode 100644 index 0000000000..296f24f2e6 --- /dev/null +++ b/data/2024/aaai/On Disentanglement of Asymmetrical Knowledge Transfer for Modality-Task Agnostic Federated Learning @@ -0,0 +1 @@ +There has been growing concern regarding data privacy during the development and deployment of Multimodal Foundation Models for Artificial General Intelligence (AGI), while Federated Learning (FL) allows multiple clients to collaboratively train models in a privacy-preserving manner. This paper formulates and studies Modality-task Agnostic Federated Learning (AFL) to pave the way toward privacy-preserving AGI. A unique property of AFL is the asymmetrical knowledge relationships among clients due to modality gaps, task gaps, and domain shifts between clients. This raises a challenge in learning an optimal inter-client information-sharing scheme that maximizes positive transfer and minimizes negative transfer for AFL. However, prior FL methods, mostly focusing on symmetrical knowledge transfer, tend to exhibit insufficient positive transfer and fail to fully avoid negative transfer during inter-client collaboration. To address this issue, we propose DisentAFL, which leverages a two-stage Knowledge Disentanglement and Gating mechanism to explicitly decompose the original asymmetrical inter-client information-sharing scheme into several independent symmetrical inter-client information-sharing schemes, each of which corresponds to a certain semantic knowledge type learned from the local tasks. Experimental results demonstrate the superiority of our method on AFL over baselines.
\ No newline at end of file diff --git a/data/2024/aaai/On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design b/data/2024/aaai/On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design new file mode 100644 index 0000000000..d05bb6a744 --- /dev/null +++ b/data/2024/aaai/On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design @@ -0,0 +1 @@ +Bayesian Experimental Design (BED), which aims to find the optimal experimental conditions for Bayesian inference, is usually posed as optimizing the expected information gain (EIG). The gradient information is often needed for efficient EIG optimization, and as a result the ability to estimate the gradient of EIG is essential for BED problems. The primary goal of this work is to develop methods for estimating the gradient of EIG, which, combined with stochastic gradient descent algorithms, result in efficient optimization of EIG. Specifically, we first introduce a posterior expected representation of the EIG gradient with respect to the design variables. Based on this, we propose two methods for estimating the EIG gradient, UEEG-MCMC that leverages posterior samples generated through Markov Chain Monte Carlo (MCMC) to estimate the EIG gradient, and BEEG-AP that focuses on achieving high simulation efficiency by repeatedly using parameter samples. Theoretical analysis and numerical studies illustrate that UEEG-MCMC is robust against the actual EIG value, while BEEG-AP is more efficient when the EIG value to be optimized is small. Moreover, both methods show superior performance compared to several popular benchmarks in our numerical experiments. \ No newline at end of file diff --git a/data/2024/aaai/On Inference Stability for Diffusion Models b/data/2024/aaai/On Inference Stability for Diffusion Models new file mode 100644 index 0000000000..d00fa9caa4 --- /dev/null +++ b/data/2024/aaai/On Inference Stability for Diffusion Models @@ -0,0 +1 @@ +Denoising Probabilistic Models (DPMs) represent an emerging domain of generative models that excel in generating diverse and high-quality images. However, most current training methods for DPMs often neglect the correlation between timesteps, limiting the model's performance in generating images effectively. Notably, we theoretically point out that this issue can be caused by the cumulative estimation gap between the predicted and the actual trajectory. To minimize that gap, we propose a novel sequence-aware loss that aims to reduce the estimation gap to enhance the sampling quality. Furthermore, we theoretically show that our proposed loss function is a tighter upper bound of the estimation loss in comparison with the conventional loss in DPMs. Experimental results on several benchmark datasets including CIFAR10, CelebA, and CelebA-HQ consistently show a remarkable improvement of our proposed method regarding the image generation quality measured by FID and Inception Score compared to several DPM baselines. Our code and pre-trained checkpoints are available at https://github.com/VinAIResearch/SA-DPM.
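For readers unfamiliar with the "conventional loss in DPMs" that the sequence-aware loss above is contrasted against, the following is a minimal sketch of the standard per-timestep noise-prediction (DDPM-style) objective; the function name, tensor shapes, and schedule handling are illustrative assumptions, and the paper's sequence-aware loss, which couples timesteps, is not reproduced here.

```python
import torch

def ddpm_simple_loss(model, x0, alphas_cumprod, num_timesteps=1000):
    """Conventional per-timestep DDPM objective: predict the injected noise.

    model          : eps_theta(x_t, t), a noise-prediction network (placeholder).
    x0             : clean images, shape (B, C, H, W).
    alphas_cumprod : 1-D tensor of cumulative noise-schedule products.
    """
    b = x0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion
    return torch.nn.functional.mse_loss(model(x_t, t), noise)

# Toy usage with a dummy noise predictor and a linear beta schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
dummy_model = lambda x_t, t: torch.zeros_like(x_t)
x0 = torch.randn(2, 3, 8, 8)
print(ddpm_simple_loss(dummy_model, x0, alphas_cumprod))
```

Each timestep is treated independently in this conventional objective, which is exactly the property the sequence-aware loss above aims to go beyond.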
\ No newline at end of file diff --git a/data/2024/aaai/On Optimal Tradeoffs between EFX and Nash Welfare b/data/2024/aaai/On Optimal Tradeoffs between EFX and Nash Welfare new file mode 100644 index 0000000000..3670181cb0 --- /dev/null +++ b/data/2024/aaai/On Optimal Tradeoffs between EFX and Nash Welfare @@ -0,0 +1 @@ +A major problem in fair division is how to allocate a set of indivisible resources among agents fairly and efficiently. The goal of this work is to characterize the tradeoffs between two well-studied measures of fairness and efficiency --- envy freeness up to any item (EFX) for fairness, and Nash welfare for efficiency --- by saying, for given constants α and β, whether there exists an α-EFX allocation that guarantees a β-fraction of the maximum Nash welfare (β-MNW). For additive valuations, we show that for any α ∈ [0,1], there exists a partial allocation that is α-EFX and 1/(α+1)-MNW. This tradeoff turns out to be tight (for every α) as demonstrated by an impossibility result that we give. We also show that for α ∈ [0, φ-1 ≃ 0.618] these partial allocations can be turned into complete allocations where all items are assigned. Furthermore, for any α ∈ [0, 1/2], we show that the tight tradeoff of α-EFX and 1/(α+1)-MNW with complete allocations holds for the more general setting of subadditive valuations. Our results improve upon the current state of the art, for both additive and subadditive valuations, and match the best-known approximations of EFX under complete allocations, regardless of Nash welfare guarantees. Notably, our constructions for additive valuations also provide EF1 and constant approximations for maximin share guarantees. \ No newline at end of file diff --git a/data/2024/aaai/On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and Efficient Gradient Methods b/data/2024/aaai/On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and Efficient Gradient Methods new file mode 100644 index 0000000000..2f903ae368 --- /dev/null +++ b/data/2024/aaai/On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and Efficient Gradient Methods @@ -0,0 +1 @@ +This paper studies the Partial Optimal Transport (POT) problem between two unbalanced measures with at most n supports and its applications in various AI tasks such as color transfer or domain adaptation. There is hence a need for fast approximations of POT with increasingly large problem sizes in arising applications. We first theoretically and experimentally investigate the infeasibility of the state-of-the-art Sinkhorn algorithm for POT, which consequently degrades its qualitative performance in real world applications like point-cloud registration. To this end, we propose a novel rounding algorithm for POT, and then provide a feasible Sinkhorn procedure with a revised computation complexity of O(n^2/epsilon^4). Our rounding algorithm also permits the development of two first-order methods to approximate the POT problem. The first algorithm, Adaptive Primal-Dual Accelerated Gradient Descent (APDAGD), finds an epsilon-approximate solution to the POT problem in O(n^2.5/epsilon). The second method, Dual Extrapolation, achieves the computation complexity of O(n^2/epsilon), thereby being the best in the literature. We further demonstrate the flexibility of POT compared to standard OT as well as the practicality of our algorithms on real applications where two marginal distributions are unbalanced. 
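As background for the Sinkhorn discussion above, here is a minimal sketch of the standard balanced entropic-OT Sinkhorn iterations; it is not the revised, feasibility-corrected POT procedure nor the APDAGD/Dual Extrapolation methods from the abstract, and the variable names (cost matrix C, marginals a and b, regularizer eps) are illustrative assumptions. The partial/unbalanced setting studied above relaxes exactly the equal-mass requirement this balanced version relies on.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iters=500):
    """Standard balanced entropic-OT Sinkhorn iterations (background sketch).

    a, b : marginal histograms, assumed here to have equal total mass.
    C    : n x m ground-cost matrix.
    eps  : entropic regularization strength.
    """
    K = np.exp(-C / eps)                # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                 # scale rows to match marginal a
        v = b / (K.T @ u)               # scale columns to match marginal b
    return u[:, None] * K * v[None, :]  # transport plan

# Toy usage: two 4-point distributions with equal total mass.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 1)), rng.normal(size=(4, 1))
C = (x - y.T) ** 2
P = sinkhorn(np.full(4, 0.25), np.full(4, 0.25), C)
print(P.sum())  # ~1.0: rows and columns match the prescribed marginals
```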
\ No newline at end of file diff --git a/data/2024/aaai/On Unsupervised Domain Adaptation: Pseudo Label Guided Mixup for Adversarial Prompt Tuning b/data/2024/aaai/On Unsupervised Domain Adaptation: Pseudo Label Guided Mixup for Adversarial Prompt Tuning new file mode 100644 index 0000000000..65b793cd17 --- /dev/null +++ b/data/2024/aaai/On Unsupervised Domain Adaptation: Pseudo Label Guided Mixup for Adversarial Prompt Tuning @@ -0,0 +1 @@ +To date, a backbone of methods for unsupervised domain adaptation (UDA) involves learning label-discriminative features via a label classifier and domain-invariant features through a domain discriminator in an adversarial scheme. However, these methods lack explicit control for aligning the source data and target data within the same label class, degrading the classifier's performance in the target domain. In this paper, we propose PL-Mix, a pseudo label guided Mixup method based on adversarial prompt tuning. Specifically, our PL-Mix facilitates class-dependent alignment and can alleviate the impact of noisy pseudo-labels. We then theoretically justify that PL-Mix can improve the generalization for UDA. Extensive experiments of the comparison with existing models also demonstrate the effectiveness of PL-Mix. \ No newline at end of file diff --git a/data/2024/aaai/On the Actionability of Outcome Prediction b/data/2024/aaai/On the Actionability of Outcome Prediction new file mode 100644 index 0000000000..b0cffe3345 --- /dev/null +++ b/data/2024/aaai/On the Actionability of Outcome Prediction @@ -0,0 +1,6 @@ +Predicting future outcomes is a prevalent application of machine learning in social impact domains. Examples range from predicting student success in education to predicting disease risk in healthcare. Practitioners recognize that the ultimate goal is not just to predict but to act effectively. Increasing evidence suggests that relying on outcome predictions for downstream interventions may not have desired results. + +In most domains there exists a multitude of possible interventions for each individual, making the challenge of taking effective action more acute. Even when causal mechanisms connecting the individual's latent states to outcomes are well understood, in any given instance (a specific student or patient), practitioners still need to infer---from budgeted measurements of latent states---which of many possible interventions will be most effective for this individual. With this in mind, we ask: when are accurate predictors of outcomes helpful for identifying the most suitable intervention? + +Through a simple model encompassing actions, latent states, and measurements, we demonstrate that pure outcome prediction rarely results in the most effective policy for taking actions, even when combined with other measurements. +We find that except in cases where there is a single decisive action for improving the outcome, outcome prediction never maximizes "action value", the utility of taking actions. Making measurements of actionable latent states, where specific actions lead to desired outcomes, may considerably enhance the action value compared to outcome prediction, and the degree of improvement depends on action costs and the outcome model. This analysis emphasizes the need to go beyond generic outcome prediction in interventional settings by incorporating knowledge of plausible actions and latent states. 
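The claim above, that pure outcome prediction rarely yields the most effective action policy, can be illustrated with a toy simulation; the two latent states, the additive outcome, and the fixed action effects below are our own illustrative assumptions rather than the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two latent states (e.g., two distinct deficits) and an additive outcome.
s1 = rng.uniform(0, 1, n)
s2 = rng.uniform(0, 1, n)

def improvement(action, s1, s2):
    # Action 0 boosts s1 (capped at 1), action 1 boosts s2 (capped at 1).
    return np.where(action == 0, np.minimum(0.3, 1 - s1), np.minimum(0.3, 1 - s2))

# Policy 1: outcome prediction only -- even a perfect predictor of s1 + s2
# gives no clue which action to take, so everyone receives a fixed action.
gain_outcome_policy = improvement(np.zeros(n, dtype=int), s1, s2).mean()

# Policy 2: measure the latent states and target the weaker one.
best_action = (s2 < s1).astype(int)
gain_latent_policy = improvement(best_action, s1, s2).mean()

print(f"action value, outcome-prediction policy: {gain_outcome_policy:.3f}")
print(f"action value, latent-state policy:       {gain_latent_policy:.3f}")
```

In this toy setting the latent-state policy attains a strictly higher average improvement, mirroring the abstract's point that measuring actionable latent states can dominate outcome prediction.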
\ No newline at end of file diff --git a/data/2024/aaai/On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling b/data/2024/aaai/On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling new file mode 100644 index 0000000000..29cda18590 --- /dev/null +++ b/data/2024/aaai/On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling @@ -0,0 +1 @@ +Hierarchical topic modeling aims to discover latent topics from a corpus and organize them into a hierarchy to understand documents with desirable semantic granularity. However, existing work struggles with producing topic hierarchies of low affinity, rationality, and diversity, which hampers document understanding. To overcome these challenges, we in this paper propose Transport Plan and Context-aware Hierarchical Topic Model (TraCo). Instead of early simple topic dependencies, we propose a transport plan dependency method. It constrains dependencies to ensure their sparsity and balance, and also regularizes topic hierarchy building with them. This improves affinity and diversity of hierarchies. We further propose a context-aware disentangled decoder. Rather than previously entangled decoding, it distributes different semantic granularity to topics at different levels by disentangled decoding. This facilitates the rationality of hierarchies. Experiments on benchmark datasets demonstrate that our method surpasses state-of-the-art baselines, effectively improving the affinity, rationality, and diversity of hierarchical topic modeling with better performance on downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/On the Computational Complexity of Plan Verification, (Bounded) Plan-Optimality Verification, and Bounded Plan Existence b/data/2024/aaai/On the Computational Complexity of Plan Verification, (Bounded) Plan-Optimality Verification, and Bounded Plan Existence new file mode 100644 index 0000000000..2823360653 --- /dev/null +++ b/data/2024/aaai/On the Computational Complexity of Plan Verification, (Bounded) Plan-Optimality Verification, and Bounded Plan Existence @@ -0,0 +1 @@ +In this paper we study the computational complexity of several reasoning tasks centered around the bounded plan existence problem. We do this for standard classical planning and hierarchical task network (HTN) planning and each for a grounded and a lifted representation. Whereas bounded plan existence complexity is known for classical planning, it has not yet been studied for HTN planning. For plan verification, results were available for both formalisms except for the lifted HTN planning. We will present lower and upper bounds of the complexity of plan verification in lifted HTN planning and provide novel insights into its grounded counterpart, in which we show that verification is not just NP-complete in the general case, but already for a severely restricted special case. Finally, we show the complexity concerning verifying the optimality of a given plan and discuss its connection to the bounded plan existence problem. 
\ No newline at end of file diff --git a/data/2024/aaai/On the Concept Trustworthiness in Concept Bottleneck Models b/data/2024/aaai/On the Concept Trustworthiness in Concept Bottleneck Models new file mode 100644 index 0000000000..587a6bb498 --- /dev/null +++ b/data/2024/aaai/On the Concept Trustworthiness in Concept Bottleneck Models @@ -0,0 +1 @@ +Concept Bottleneck Models (CBMs), which break down the reasoning process into the input-to-concept mapping and the concept-to-label prediction, have garnered significant attention due to their remarkable interpretability achieved by the interpretable concept bottleneck. However, despite the transparency of the concept-to-label prediction, the mapping from the input to the intermediate concept remains a black box, giving rise to concerns about the trustworthiness of the learned concepts (i.e., these concepts may be predicted based on spurious cues). The issue of concept untrustworthiness greatly hampers the interpretability of CBMs, thereby hindering their further advancement. To conduct a comprehensive analysis on this issue, in this study we establish a benchmark to assess the trustworthiness of concepts in CBMs. A pioneering metric, referred to as concept trustworthiness score, is proposed to gauge whether the concepts are derived from relevant regions. Additionally, an enhanced CBM is introduced, enabling concept predictions to be made specifically from distinct parts of the feature map, thereby facilitating the exploration of their related regions. Besides, we introduce three modules, namely the cross-layer alignment (CLA) module, the cross-image alignment (CIA) module, and the prediction alignment (PA) module, to further enhance the concept trustworthiness within the elaborated CBM. The experiments on five datasets across ten architectures demonstrate that without using any concept localization annotations during training, our model improves the concept trustworthiness by a large margin, meanwhile achieving superior accuracy to the state-of-the-arts. Our code is available at https://github.com/hqhQAQ/ProtoCBM. \ No newline at end of file diff --git a/data/2024/aaai/On the Convergence of an Adaptive Momentum Method for Adversarial Attacks b/data/2024/aaai/On the Convergence of an Adaptive Momentum Method for Adversarial Attacks new file mode 100644 index 0000000000..980cdcabe8 --- /dev/null +++ b/data/2024/aaai/On the Convergence of an Adaptive Momentum Method for Adversarial Attacks @@ -0,0 +1 @@ +Adversarial examples are commonly created by solving a constrained optimization problem, typically using sign-based methods like Fast Gradient Sign Method (FGSM). These attacks can benefit from momentum with a constant parameter, such as Momentum Iterative FGSM (MI-FGSM), to enhance black-box transferability. However, the monotonic time-varying momentum parameter is required to guarantee convergence in theory, creating a theory-practice gap. Additionally, recent work shows that sign-based methods fail to converge to the optimum in several convex settings, exacerbating the issue. To address these concerns, we propose a novel method which incorporates both an innovative adaptive momentum parameter without monotonicity assumptions and an adaptive step-size scheme that replaces the sign operation. Furthermore, we derive a regret upper bound for general convex functions. 
Experiments on multiple models demonstrate the efficacy of our method in generating adversarial examples with human-imperceptible noise while achieving high attack success rates, indicating its superiority over previous adversarial example generation methods. \ No newline at end of file diff --git a/data/2024/aaai/On the Expressivity of Recurrent Neural Cascades b/data/2024/aaai/On the Expressivity of Recurrent Neural Cascades new file mode 100644 index 0000000000..06f4830a5d --- /dev/null +++ b/data/2024/aaai/On the Expressivity of Recurrent Neural Cascades @@ -0,0 +1,2 @@ +Recurrent Neural Cascades (RNCs) are the recurrent neural networks with no cyclic dependencies among recurrent neurons. This class of recurrent networks has received a lot of attention in practice. Besides training methods for a fixed architecture such as backpropagation, the cascade architecture naturally allows for constructive learning methods, where recurrent nodes are added incrementally one at a time, often yielding smaller networks. Furthermore, acyclicity amounts to a structural prior that even for the same number of neurons yields a more favourable sample complexity compared to a fully-connected architecture. +A central question is whether the advantages of the cascade architecture come at the cost of a reduced expressivity. We provide new insights into this question. We show that the regular languages captured by RNCs with sign and tanh activation with positive recurrent weights are the star-free regular languages. In order to establish our results we developed a novel framework where capabilities of RNCs are assessed by analysing which semigroups and groups a single neuron is able to implement. A notable implication of our framework is that RNCs can achieve the expressivity of all regular languages by introducing neurons that can implement groups. \ No newline at end of file diff --git a/data/2024/aaai/On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods b/data/2024/aaai/On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods new file mode 100644 index 0000000000..7f1a7a90ea --- /dev/null +++ b/data/2024/aaai/On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods @@ -0,0 +1 @@ +Most existing evaluations of explainable machine learning (ML) methods rely on simplifying assumptions or proxies that do not reflect real-world use cases; the handful of more robust evaluations on real-world settings have shortcomings in their design, generally leading to overestimation of methods' real-world utility. In this work, we seek to address this by conducting a study that evaluates post-hoc explainable ML methods in a setting consistent with the application context and provide a template for future evaluation studies. We modify and improve a prior study on e-commerce fraud detection by relaxing the original work's simplifying assumptions that departed from the deployment context. Our study finds no evidence for the utility of the tested explainable ML methods in the context, which is a drastically different conclusion from the earlier work. This highlights how seemingly trivial experimental design choices can yield misleading conclusions about method utility. 
In addition, our work carries lessons about the necessity of not only evaluating explainable ML methods using tasks, data, users, and metrics grounded in the intended application context but also developing methods tailored to specific applications, moving beyond general-purpose explainable ML methods. \ No newline at end of file diff --git a/data/2024/aaai/On the Robustness of Neural-Enhanced Video Streaming against Adversarial Attacks b/data/2024/aaai/On the Robustness of Neural-Enhanced Video Streaming against Adversarial Attacks new file mode 100644 index 0000000000..70e92316aa --- /dev/null +++ b/data/2024/aaai/On the Robustness of Neural-Enhanced Video Streaming against Adversarial Attacks @@ -0,0 +1 @@ +The explosive growth of video traffic on today's Internet promotes the rise of Neural-enhanced Video Streaming (NeVS), which effectively improves the rate-distortion trade-off by employing a cheap neural super-resolution model for quality enhancement on the receiver side. We reveal a practical threat, missed by existing work, in which the crucial codec component (i.e., encoder for compression and decoder for restoration) can trigger adversarial attacks in a man-in-the-middle manner to significantly destroy video recovery performance and finally incur the malfunction of downstream video perception tasks. In this paper, we make the first attempt to inspect the vulnerability of NeVS and discover a novel adversarial attack, called codec hijacking, where the injected invisible perturbation conspires with the malicious encoding matrix by reorganizing the spatial-temporal bit allocation within the bitstream size budget. Such a zero-day vulnerability makes our attack hard to defend against because there is no visual distortion on the recovered videos until the attack happens. More seriously, this attack can be extended to diverse enhancement models, thus putting a wide range of video perception tasks under threat. Evaluation on a state-of-the-art video codec benchmark illustrates that our attack significantly degrades the recovery performance of NeVS over previous attack methods. The damaged video quality finally leads to obvious malfunction of downstream tasks with over 75% success rate. We hope to draw public attention to codec hijacking and its defence. \ No newline at end of file diff --git a/data/2024/aaai/On the Role of Server Momentum in Federated Learning b/data/2024/aaai/On the Role of Server Momentum in Federated Learning new file mode 100644 index 0000000000..370fb4cf0d --- /dev/null +++ b/data/2024/aaai/On the Role of Server Momentum in Federated Learning @@ -0,0 +1 @@ +Federated Averaging (FedAvg) is known to experience convergence issues when encountering significant client system heterogeneity and data heterogeneity. Server momentum has been proposed as an effective mitigation. However, existing server momentum works are restrictive in the momentum formulation, do not properly schedule hyperparameters, and focus only on system homogeneous settings, which leaves the role of server momentum an under-explored problem. In this paper, we propose a general framework for server momentum that (a) covers a large class of momentum schemes that are unexplored in federated learning (FL), (b) enables a popular stagewise hyperparameter scheduler, and (c) allows heterogeneous and asynchronous local computing. We provide rigorous convergence analysis for the proposed framework.
To the best of our knowledge, this is the first work that thoroughly analyzes the performance of server momentum with a hyperparameter scheduler and system heterogeneity. Extensive experiments validate the effectiveness of our proposed framework. Due to the page limit, we leave all proofs to the full version https://arxiv.org/abs/2312.12670. \ No newline at end of file diff --git a/data/2024/aaai/On the Structural Hardness of Answer Set Programming: Can Structure Efficiently Confine the Power of Disjunctions? b/data/2024/aaai/On the Structural Hardness of Answer Set Programming: Can Structure Efficiently Confine the Power of Disjunctions? new file mode 100644 index 0000000000..88d21c098b --- /dev/null +++ b/data/2024/aaai/On the Structural Hardness of Answer Set Programming: Can Structure Efficiently Confine the Power of Disjunctions? @@ -0,0 +1,2 @@ +Answer Set Programming (ASP) is a generic problem modeling and solving framework with a strong focus on knowledge representation and a rapid growth of industrial applications. So far, the study of complexity has resulted in characterizing hardness and determining its sources, fine-grained insights in the form of dichotomy-style results, as well as detailed parameterized complexity landscapes. Unfortunately, for the well-known parameter treewidth, disjunctive programs require double-exponential runtime under reasonable complexity assumptions. This quickly becomes out of reach. We deal with the classification of structural parameters for disjunctive ASP on the program's rule structure (incidence graph). +First, we provide a polynomial kernel to obtain single-exponential runtime in terms of vertex cover size, despite subset-minimization being not represented in the program’s structure. Then we turn our attention to strictly better structural parameters between vertex cover size and treewidth. Here, we provide double-exponential lower bounds for the most prominent parameters in that range: treedepth, feedback vertex size, and cliquewidth. Based on this, we argue that unfortunately our options beyond vertex cover size are limited. Our results provide an in-depth hardness study, relying on a novel reduction from normal to disjunctive programs, trading the increase of complexity for an exponential parameter compression. \ No newline at end of file diff --git a/data/2024/aaai/On the Unstable Convergence Regime of Gradient Descent b/data/2024/aaai/On the Unstable Convergence Regime of Gradient Descent new file mode 100644 index 0000000000..40c7e8f99a --- /dev/null +++ b/data/2024/aaai/On the Unstable Convergence Regime of Gradient Descent @@ -0,0 +1 @@ +Traditional gradient descent (GD) has been fully investigated for convex or L-smooth functions, and it is widely utilized in current neural network optimization. The classical descent lemma ensures that for a function with L-smoothness, the GD trajectory converges stably towards the minimum when the learning rate is below 2 / L. This convergence is marked by a consistent reduction in the loss function throughout the iterations. However, recent experimental studies have demonstrated that even when the L-smoothness condition is not met, or if the learning rate is increased, leading to oscillations in the loss function during iterations, the GD trajectory still exhibits convergence over the long run. This phenomenon is referred to as the unstable convergence regime of GD. In this paper, we present a theoretical perspective to offer a qualitative analysis of this phenomenon.
The unstable convergence is in fact an inherent property of GD for general twice differentiable functions. Specifically, the forward-invariance of GD is established, i.e., it ensures that any point within a local region will always remain within this region under GD iteration. Then, based on the forward-invariance, for the initialization outside an open set containing the local minimum, the loss function will oscillate during the first several iterations and then decrease monotonically after the GD trajectory has jumped into the open set. This work theoretically clarifies the unstable convergence phenomenon of GD discussed in previous experimental works. The unstable convergence of GD mainly depends on the selection of the initialization, and it is actually inevitable due to the complex nature of the loss function. \ No newline at end of file diff --git a/data/2024/aaai/Once and for All: Universal Transferable Adversarial Perturbation against Deep Hashing-Based Facial Image Retrieval b/data/2024/aaai/Once and for All: Universal Transferable Adversarial Perturbation against Deep Hashing-Based Facial Image Retrieval new file mode 100644 index 0000000000..b336d93b1e --- /dev/null +++ b/data/2024/aaai/Once and for All: Universal Transferable Adversarial Perturbation against Deep Hashing-Based Facial Image Retrieval @@ -0,0 +1 @@ +Deep Hashing (DH)-based image retrieval has been widely applied to face-matching systems due to its accuracy and efficiency. However, this convenience comes with an increased risk of privacy leakage. DH models inherit the vulnerability to adversarial attacks, which can be used to prevent the retrieval of private images. Existing adversarial attacks against DH typically target a single image or a specific class of images, lacking a universal adversarial perturbation for the entire hash dataset. In this paper, we propose the first universal transferable adversarial perturbation against DH-based facial image retrieval: a single perturbation can protect all images. Specifically, we explore the relationship between clusters learned by different DH models and define the optimization objective of the universal perturbation as moving away from the overall hash center. To mitigate the challenge of single-objective optimization, we randomly obtain sub-cluster centers and further propose sub-task-based meta-learning to aid in overall optimization. We test our method with popular facial datasets and DH models, indicating impressive cross-image, -identity, -model, and -scheme universal anti-retrieval performance. Compared to state-of-the-art methods, our performance is competitive in white-box settings and exhibits significant improvements of 10%-70% in transferability in all black-box settings. \ No newline at end of file diff --git a/data/2024/aaai/One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems b/data/2024/aaai/One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems new file mode 100644 index 0000000000..c608041928 --- /dev/null +++ b/data/2024/aaai/One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems @@ -0,0 +1 @@ +Abstract Visual Reasoning (AVR) comprises a wide selection of various problems similar to those used in human IQ tests. Recent years have brought dynamic progress in solving particular AVR tasks; however, in the contemporary literature, AVR problems are largely dealt with in isolation, leading to highly specialized task-specific methods.
With the aim of developing universal learning systems in the AVR domain, we propose the unified model for solving Single-Choice Abstract visual Reasoning tasks (SCAR), capable of solving various single-choice AVR tasks, without making any a priori assumptions about the task structure, in particular the number and location of panels. The proposed model relies on a novel Structure-Aware dynamic Layer (SAL), which adapts its weights to the structure of the considered AVR problem. Experiments conducted on Raven's Progressive Matrices, Visual Analogy Problems, and Odd One Out problems show that SCAR (SAL-based models, in general) effectively solves diverse AVR tasks, and its performance is on par with the state-of-the-art task-specific baselines. What is more, SCAR demonstrates effective knowledge reuse in multi-task and transfer learning settings. To our knowledge, this work is the first successful attempt to construct a general single-choice AVR solver relying on self-configurable architecture and unified solving method. With this work we aim to stimulate and foster progress on task-independent research paths in the AVR domain, with the long-term goal of development of a general AVR solver. \ No newline at end of file diff --git a/data/2024/aaai/One Step Closer to Unbiased Aleatoric Uncertainty Estimation b/data/2024/aaai/One Step Closer to Unbiased Aleatoric Uncertainty Estimation new file mode 100644 index 0000000000..655911b393 --- /dev/null +++ b/data/2024/aaai/One Step Closer to Unbiased Aleatoric Uncertainty Estimation @@ -0,0 +1 @@ +Neural networks are powerful tools in various applications, and quantifying their uncertainty is crucial for reliable decision-making. In the deep learning field, the uncertainties are usually categorized into aleatoric (data) and epistemic (model) uncertainty. In this paper, we point out that the existing popular variance attenuation method highly overestimates aleatoric uncertainty. To address this issue, we proposed a new estimation method by actively de-noising the observed data. By conducting a broad range of experiments, we demonstrate that our proposed approach provides a much closer approximation to the actual data uncertainty than the standard method. \ No newline at end of file diff --git a/data/2024/aaai/One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception b/data/2024/aaai/One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception new file mode 100644 index 0000000000..eb69414cd9 --- /dev/null +++ b/data/2024/aaai/One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception @@ -0,0 +1 @@ +Numerous studies have investigated the pivotal role of reliable 3D volume representation in scene perception tasks, such as multi-view stereo (MVS) and semantic scene completion (SSC). They typically construct 3D probability volumes directly with geometric correspondence, attempting to fully address the scene perception tasks in a single forward pass. However, such a single-step solution makes it hard to learn accurate and convincing volumetric probability, especially in challenging regions like unexpected occlusions and complicated light reflections. Therefore, this paper proposes to decompose the complicated 3D volume representation learning into a sequence of generative steps to facilitate fine and reliable scene perception. 
Considering the recent advances achieved by strong generative diffusion models, we introduce a multi-step learning framework, dubbed as VPD, dedicated to progressively refining the Volumetric Probability in a Diffusion process. Specifically, we first build a coarse probability volume from input images with the off-the-shelf scene perception baselines, which is then conditioned as the basic geometry prior before being fed into a 3D diffusion UNet, to progressively achieve accurate probability distribution modeling. To handle the corner cases in challenging areas, a Confidence-Aware Contextual Collaboration (CACC) module is developed to correct the uncertain regions for reliable volumetric learning based on multi-scale contextual contents. Moreover, an Online Filtering (OF) strategy is designed to maintain representation consistency for stable diffusion sampling. Extensive experiments are conducted on scene perception tasks including multi-view stereo (MVS) and semantic scene completion (SSC), to validate the efficacy of our method in learning reliable volumetric representations. Notably, for the SSC task, our work stands out as the first to surpass LiDAR-based methods on the SemanticKITTI dataset. \ No newline at end of file diff --git a/data/2024/aaai/Online Boosting Adaptive Learning under Concept Drift for Multistream Classification b/data/2024/aaai/Online Boosting Adaptive Learning under Concept Drift for Multistream Classification new file mode 100644 index 0000000000..29f5cd6586 --- /dev/null +++ b/data/2024/aaai/Online Boosting Adaptive Learning under Concept Drift for Multistream Classification @@ -0,0 +1 @@ +Multistream classification poses significant challenges due to the necessity for rapid adaptation in dynamic streaming processes with concept drift. Despite the growing research outcomes in this area, there has been a notable oversight regarding the temporal dynamic relationships between these streams, leading to the issue of negative transfer arising from irrelevant data. In this paper, we propose a novel Online Boosting Adaptive Learning (OBAL) method that effectively addresses this limitation by adaptively learning the dynamic correlation among different streams. Specifically, OBAL operates in a dual-phase mechanism, in the first of which we design an Adaptive COvariate Shift Adaptation (AdaCOSA) algorithm to construct an initialized ensemble model using archived data from various source streams, thus mitigating the covariate shift while learning the dynamic correlations via an adaptive re-weighting strategy. During the online process, we employ a Gaussian Mixture Model-based weighting mechanism, which is seamlessly integrated with the acquired correlations via AdaCOSA to effectively handle asynchronous drift. This approach significantly improves the predictive performance and stability of the target stream. We conduct comprehensive experiments on several synthetic and real-world data streams, encompassing various drifting scenarios and types. The results clearly demonstrate that OBAL achieves remarkable advancements in addressing multistream classification problems by effectively leveraging positive knowledge derived from multiple sources. 
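As a rough illustration of what a "Gaussian Mixture Model-based weighting mechanism" over source streams could look like, the sketch below weights per-source classifiers by how likely an incoming target sample is under each source's fitted GMM; the densities, weighting rule, and scikit-learn models are guesses for illustration only and not the actual OBAL/AdaCOSA formulation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two toy source streams with shifted feature distributions.
Xs1, ys1 = rng.normal(0.0, 1.0, (500, 2)), rng.integers(0, 2, 500)
Xs2, ys2 = rng.normal(1.5, 1.0, (500, 2)), rng.integers(0, 2, 500)
sources = [(Xs1, ys1), (Xs2, ys2)]

# Per-source: a classifier plus a GMM density model of that stream's features.
models, densities = [], []
for X, y in sources:
    models.append(LogisticRegression().fit(X, y))
    densities.append(GaussianMixture(n_components=2, random_state=0).fit(X))

def predict_proba_target(x):
    """Weight each source model by how likely x is under that source's GMM."""
    x = x.reshape(1, -1)
    log_liks = np.array([d.score_samples(x)[0] for d in densities])
    w = np.exp(log_liks - log_liks.max())
    w /= w.sum()                                  # normalized source weights
    probs = np.stack([m.predict_proba(x)[0] for m in models])
    return w @ probs                              # weighted ensemble prediction

print(predict_proba_target(rng.normal(0.5, 1.0, 2)))
```

Sources whose feature density better explains the incoming target sample receive larger weights, which is one plausible way to down-weight irrelevant streams and limit negative transfer.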
\ No newline at end of file diff --git a/data/2024/aaai/Online Conversion Rate Prediction via Multi-Interval Screening and Synthesizing under Delayed Feedback b/data/2024/aaai/Online Conversion Rate Prediction via Multi-Interval Screening and Synthesizing under Delayed Feedback new file mode 100644 index 0000000000..90fa10de37 --- /dev/null +++ b/data/2024/aaai/Online Conversion Rate Prediction via Multi-Interval Screening and Synthesizing under Delayed Feedback @@ -0,0 +1 @@ +Due to the widespread adoption of the cost-per-action (CPA) display strategy that demands real-time conversion rate (CVR) prediction, delayed feedback is becoming one of the major challenges in online advertising. As the true labels of a significant quantity of samples are only available after long delays, the observed training data are usually biased, harming the performance of models. Recent studies show that integrating models with varying waiting windows to observe true labels is beneficial, but the aggregation framework remains far from reaching a consensus. In this work, we propose the Multi-Interval Screening and Synthesizing model (MISS for short) for online CVR prediction. We first design a multi-interval screening model with various output heads to produce accurate and distinctive estimates. Then a light-weight synthesizing model with an assembled training pipeline is applied to thoroughly exploit the knowledge and relationship among heads, obtaining reliable predictions. Extensive experiments on two real-world advertising datasets validate the effectiveness of our model. \ No newline at end of file diff --git a/data/2024/aaai/Online Markov Decision Processes Configuration with Continuous Decision Space b/data/2024/aaai/Online Markov Decision Processes Configuration with Continuous Decision Space new file mode 100644 index 0000000000..6348c8b1cd --- /dev/null +++ b/data/2024/aaai/Online Markov Decision Processes Configuration with Continuous Decision Space @@ -0,0 +1 @@ +In this paper, we investigate the optimal online configuration of episodic Markov decision processes when the space of the possible configurations is continuous. Specifically, we study the interaction between a learner (referred to as the configurator) and an agent with a fixed, unknown policy, when the learner aims to minimize her losses by choosing transition functions in an online fashion. The losses may be unrelated to the agent's rewards. This problem applies to many real-world scenarios where the learner seeks to manipulate the Markov decision process to her advantage. We study both deterministic and stochastic settings, where the losses are either fixed or sampled from an unknown probability distribution. We design two algorithms whose key feature is to rely on occupancy measures to optimistically explore the continuous space of transition functions, achieving constant regret in deterministic settings and sublinear regret in stochastic settings, respectively. Moreover, we prove that the regret bound is tight with respect to any constant factor in deterministic settings. Finally, we compare the empirical performance of our algorithms with a baseline in synthetic experiments.
\ No newline at end of file diff --git a/data/2024/aaai/Online Reinforcement Learning-Based Pedagogical Planning for Narrative-Centered Learning Environments b/data/2024/aaai/Online Reinforcement Learning-Based Pedagogical Planning for Narrative-Centered Learning Environments new file mode 100644 index 0000000000..c2f1f463e7 --- /dev/null +++ b/data/2024/aaai/Online Reinforcement Learning-Based Pedagogical Planning for Narrative-Centered Learning Environments @@ -0,0 +1 @@ +Pedagogical planners can provide adaptive support to students in narrative-centered learning environments by dynamically scaffolding student learning and tailoring problem scenarios. Reinforcement learning (RL) is frequently used for pedagogical planning in narrative-centered learning environments. However, RL-based pedagogical planning raises significant challenges due to the scarcity of data for training RL policies. Most prior work has relied on limited-size datasets and offline RL techniques for policy learning. Unfortunately, offline RL techniques do not support on-demand exploration and evaluation, which can adversely impact the quality of induced policies. To address the limitation of data scarcity and offline RL, we propose INSIGHT, an online RL framework for training data-driven pedagogical policies that optimize student learning in narrative-centered learning environments. The INSIGHT framework consists of three components: a narrative-centered learning environment simulator, a simulated student agent, and an RL-based pedagogical planner agent, which uses a reward metric that is associated with effective student learning processes. The framework enables the generation of synthetic data for on-demand exploration and evaluation of RL-based pedagogical planning. We have implemented INSIGHT with OpenAI Gym for a narrative-centered learning environment testbed with rule-based simulated student agents and a deep Q-learning-based pedagogical planner. Our results show that online deep RL algorithms can induce near-optimal pedagogical policies in the INSIGHT framework, while offline deep RL algorithms only find suboptimal policies even with large amounts of data. \ No newline at end of file diff --git a/data/2024/aaai/Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints b/data/2024/aaai/Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints new file mode 100644 index 0000000000..17e923ae07 --- /dev/null +++ b/data/2024/aaai/Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints @@ -0,0 +1 @@ +Restless multi-armed bandits (RMAB) have been widely used to model sequential decision making problems with constraints. The decision maker (DM) aims to maximize the expected total reward over an infinite horizon under an “instantaneous activation constraint” that at most B arms can be activated at any decision epoch, where the state of each arm evolves stochastically according to a Markov decision process (MDP). However, this basic model fails to provide any fairness guarantee among arms. In this paper, we introduce RMAB-F, a new RMAB model with “long-term fairness constraints”, where the objective now is to maximize the longterm reward while a minimum long-term activation fraction for each arm must be satisfied. For the online RMAB-F setting (i.e., the underlying MDPs associated with each arm are unknown to the DM), we develop a novel reinforcement learning (RL) algorithm named Fair-UCRL. 
We prove that Fair-UCRL ensures probabilistic sublinear bounds on both the reward regret and the fairness violation regret. Compared with off-the-shelf RL methods, our Fair-UCRL is much more computationally efficient since it contains a novel exploitation that leverages a low-complexity index policy for making decisions. Experimental results further demonstrate the effectiveness of our Fair-UCRL. \ No newline at end of file diff --git a/data/2024/aaai/Online Sensitivity Optimization in Differentially Private Learning b/data/2024/aaai/Online Sensitivity Optimization in Differentially Private Learning new file mode 100644 index 0000000000..888ce7ee9c --- /dev/null +++ b/data/2024/aaai/Online Sensitivity Optimization in Differentially Private Learning @@ -0,0 +1 @@ +Training differentially private machine learning models requires constraining an individual's contribution to the optimization process. This is achieved by clipping the 2-norm of their gradient at a predetermined threshold prior to averaging and batch sanitization. This selection adversely influences optimization in two opposing ways: it either exacerbates the bias due to excessive clipping at lower values, or augments sanitization noise at higher values. The choice significantly hinges on factors such as the dataset, model architecture, and even varies within the same optimization, demanding meticulous tuning usually accomplished through a grid search. In order to circumvent the privacy expenses incurred in hyperparameter tuning, we present a novel approach to dynamically optimize the clipping threshold. We treat this threshold as an additional learnable parameter, establishing a clean relationship between the threshold and the cost function. This allows us to optimize the former with gradient descent, with minimal repercussions on the overall privacy analysis. Our method is thoroughly assessed against alternative fixed and adaptive strategies across diverse datasets, tasks, model dimensions, and privacy levels. Our results indicate that it performs comparably or better in the evaluated scenarios, given the same privacy requirements. \ No newline at end of file diff --git a/data/2024/aaai/OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs via Ontology-Driven Reinforcement Learning b/data/2024/aaai/OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs via Ontology-Driven Reinforcement Learning new file mode 100644 index 0000000000..65adab6ba5 --- /dev/null +++ b/data/2024/aaai/OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs via Ontology-Driven Reinforcement Learning @@ -0,0 +1 @@ +Large language models (LLMs) have demonstrated impressive proficiency in information retrieval, while they are prone to generating incorrect responses that conflict with reality, a phenomenon known as intrinsic hallucination. The critical challenge lies in the unclear and unreliable fact distribution within LLMs trained on vast amounts of data. The prevalent approach frames the factual detection task as a question-answering paradigm, where the LLMs are asked about factual knowledge and examined for correctness. However, existing studies primarily focused on deriving test cases only from several specific domains, such as movies and sports, limiting the comprehensive observation of missing knowledge and the analysis of unexpected hallucinations. To address this issue, we propose OntoFact, an adaptive framework for detecting unknown facts of LLMs, devoted to mining the ontology-level skeleton of the missing knowledge. 
Specifically, we argue that LLMs could expose the ontology-based similarity among missing facts and introduce five representative knowledge graphs (KGs) as benchmarks. We further devise a sophisticated ontology-driven reinforcement learning (ORL) mechanism to produce error-prone test cases with specific entities and relations automatically. The ORL mechanism rewards the KGs for navigating toward a feasible direction for unveiling factual errors. Moreover, empirical efforts demonstrate that dominant LLMs are biased towards answering Yes rather than No, regardless of whether this knowledge is included. To mitigate the overconfidence of LLMs, we leverage a hallucination-free detection (HFD) strategy to tackle unfair comparisons between baselines, thereby boosting the result robustness. Experimental results on 5 datasets, using 32 representative LLMs, reveal a general lack of fact in current LLMs. Notably, ChatGPT exhibits fact error rates of 51.6% on DBpedia and 64.7% on YAGO, respectively. Additionally, the ORL mechanism demonstrates promising error prediction scores, with F1 scores ranging from 70% to 90% across most LLMs. Compared to the exhaustive testing, ORL achieves an average recall of 80% while reducing evaluation time by 35.29% to 63.12%. \ No newline at end of file diff --git a/data/2024/aaai/Open-Set Facial Expression Recognition b/data/2024/aaai/Open-Set Facial Expression Recognition new file mode 100644 index 0000000000..3cf4bfc785 --- /dev/null +++ b/data/2024/aaai/Open-Set Facial Expression Recognition @@ -0,0 +1 @@ +Facial expression recognition (FER) models are typically trained on datasets with a fixed number of seven basic classes. However, recent research works (Cowen et al. 2021; Bryant et al. 2022; Kollias 2023) point out that there are far more expressions than the basic ones. Thus, when these models are deployed in the real world, they may encounter unknown classes, such as compound expressions that cannot be classified into existing basic classes. To address this issue, we propose the open-set FER task for the first time. Though there are many existing open-set recognition methods, we argue that they do not work well for open-set FER because FER data are all human faces with very small inter-class distances, which makes the open-set samples very similar to close-set samples. In this paper, we are the first to transform the disadvantage of small inter-class distance into an advantage by proposing a new way for open-set FER. Specifically, we find that small inter-class distance allows for sparsely distributed pseudo labels of open-set samples, which can be viewed as symmetric noisy labels. Based on this novel observation, we convert the open-set FER to a noisy label detection problem. We further propose a novel method that incorporates attention map consistency and cycle training to detect the open-set samples. Extensive experiments on various FER datasets demonstrate that our method clearly outperforms state-of-the-art open-set recognition methods by large margins. Code is available at https://github.com/zyh-uaiaaaa. 
\ No newline at end of file diff --git a/data/2024/aaai/Open-Set Graph Domain Adaptation via Separate Domain Alignment b/data/2024/aaai/Open-Set Graph Domain Adaptation via Separate Domain Alignment new file mode 100644 index 0000000000..5dd785ea8d --- /dev/null +++ b/data/2024/aaai/Open-Set Graph Domain Adaptation via Separate Domain Alignment @@ -0,0 +1 @@ +Domain adaptation has become an attractive learning paradigm, as it can leverage source domains with rich labels to deal with classification tasks in an unlabeled target domain. A few recent studies develop domain adaptation approaches for graph-structured data. In the case of node classification task, current domain adaptation methods only focus on the closed-set setting, where source and target domains share the same label space. A more practical assumption is that the target domain may contain new classes that are not included in the source domain. Therefore, in this paper, we introduce a novel and challenging problem for graphs, i.e., open-set domain adaptive node classification, and propose a new approach to solve it. Specifically, we develop an algorithm for efficient knowledge transfer from a labeled source graph to an unlabeled target graph under a separate domain alignment (SDA) strategy, in order to learn discriminative feature representations for the target graph. Our goal is to not only correctly classify target nodes into the known classes, but also classify unseen types of nodes into an unknown class. Experimental results on real-world datasets show that our method outperforms existing methods on graph domain adaptation. \ No newline at end of file diff --git a/data/2024/aaai/Open-Vocabulary Video Relation Extraction b/data/2024/aaai/Open-Vocabulary Video Relation Extraction new file mode 100644 index 0000000000..37264ef600 --- /dev/null +++ b/data/2024/aaai/Open-Vocabulary Video Relation Extraction @@ -0,0 +1,2 @@ +A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in a superficial understanding of the action. +Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on pairwise relations that take part in the action and describes these relation triplets with natural languages. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model to generate relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE. Our code and dataset are available at https://github.com/Iriya99/OVRE. 
\ No newline at end of file diff --git a/data/2024/aaai/Opening the Black Box: Unraveling the Classroom Dialogue Analysis (Student Abstract) b/data/2024/aaai/Opening the Black Box: Unraveling the Classroom Dialogue Analysis (Student Abstract) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Operationalizing Essential Characteristics of Creativity in a Computational System for Music Composition b/data/2024/aaai/Operationalizing Essential Characteristics of Creativity in a Computational System for Music Composition new file mode 100644 index 0000000000..deb1cea6ac --- /dev/null +++ b/data/2024/aaai/Operationalizing Essential Characteristics of Creativity in a Computational System for Music Composition @@ -0,0 +1 @@ +We address the problem of building and evaluating a computational system whose primary objective is creativity. We illustrate seven characteristics for computational creativity in the context of a system that autonomously composes Western lyrical music. We conduct an external evaluation of the system in which respondents rated the system with regard to each characteristic as well as with regard to overall creativity. Average scores for overall creativity exceeded the ratings for any single characteristic, suggesting that creativity may be an emergent property and that unique research opportunities exist for building CC systems whose design attempts to comprehend all known characteristics of creativity. \ No newline at end of file diff --git a/data/2024/aaai/Operator-Learning-Inspired Modeling of Neural Ordinary Differential Equations b/data/2024/aaai/Operator-Learning-Inspired Modeling of Neural Ordinary Differential Equations new file mode 100644 index 0000000000..0fdf825867 --- /dev/null +++ b/data/2024/aaai/Operator-Learning-Inspired Modeling of Neural Ordinary Differential Equations @@ -0,0 +1 @@ +Neural ordinary differential equations (NODEs), one of the most influential works of the differential equation-based deep learning, are to continuously generalize residual networks and opened a new field. They are currently utilized for various downstream tasks, e.g., image classification, time series classification, image generation, etc. Its key part is how to model the time-derivative of the hidden state, denoted dh(t)/dt. People have habitually used conventional neural network architectures, e.g., fully-connected layers followed by non-linear activations. In this paper, however, we present a neural operator-based method to define the time-derivative term. Neural operators were initially proposed to model the differential operator of partial differential equations (PDEs). Since the time-derivative of NODEs can be understood as a special type of the differential operator, our proposed method, called branched Fourier neural operator (BFNO), makes sense. In our experiments with general downstream tasks, our method significantly outperforms existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Opponent-Model Search in Games with Incomplete Information b/data/2024/aaai/Opponent-Model Search in Games with Incomplete Information new file mode 100644 index 0000000000..a42604ef78 --- /dev/null +++ b/data/2024/aaai/Opponent-Model Search in Games with Incomplete Information @@ -0,0 +1 @@ +Games with incomplete information are games that model situations where players do not have common knowledge about the game they play, e.g. card games such as poker or bridge. Opponent models can be of crucial importance for decision-making in such games. 
We propose algorithms for computing optimal and/or robust strategies in games with incomplete information, given various types of knowledge about opponent models. As an application, we describe a framework for reasoning about an opponent's reasoning in such games, where opponent models arise naturally. \ No newline at end of file diff --git a/data/2024/aaai/Optical Flow for Spike Camera with Hierarchical Spatial-Temporal Spike Fusion b/data/2024/aaai/Optical Flow for Spike Camera with Hierarchical Spatial-Temporal Spike Fusion new file mode 100644 index 0000000000..13ddf30b41 --- /dev/null +++ b/data/2024/aaai/Optical Flow for Spike Camera with Hierarchical Spatial-Temporal Spike Fusion @@ -0,0 +1 @@ +As an emerging neuromorphic camera with an asynchronous working mechanism, spike camera shows good potential for high-speed vision tasks. Each pixel in spike camera accumulates photons persistently and fires a spike whenever the accumulation exceeds a threshold. Such high-frequency fine-granularity photon recording facilitates the analysis and recovery of dynamic scenes with high-speed motion. This paper considers the optical flow estimation problem for spike cameras. Due to the Poisson nature of incoming photons, the occurrence of spikes is random and fluctuating, making conventional image matching inefficient. We propose a Hierarchical Spatial-Temporal (HiST) fusion module for spike representation to pursue reliable feature matching and develop a robust optical flow network, dubbed as HiST-SFlow. The HiST extracts features at multiple moments and hierarchically fuses the spatial-temporal information. We also propose an intra-moment filtering module to further extract the feature and suppress the influence of randomness in spikes. A scene loss is proposed to ensure that this hierarchical representation recovers the essential visual information in the scene. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared with the existing methods. The source codes are available at https://github.com/ruizhao26/HiST-SFlow. \ No newline at end of file diff --git a/data/2024/aaai/Optimal Attack and Defense for Reinforcement Learning b/data/2024/aaai/Optimal Attack and Defense for Reinforcement Learning new file mode 100644 index 0000000000..bbf2978764 --- /dev/null +++ b/data/2024/aaai/Optimal Attack and Defense for Reinforcement Learning @@ -0,0 +1 @@ +To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. 
We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Optimal Makespan in a Minute Timespan! A Scalable Multi-Robot Goal Assignment Algorithm for Minimizing Mission Time b/data/2024/aaai/Optimal Makespan in a Minute Timespan! A Scalable Multi-Robot Goal Assignment Algorithm for Minimizing Mission Time new file mode 100644 index 0000000000..3189e58e76 --- /dev/null +++ b/data/2024/aaai/Optimal Makespan in a Minute Timespan! A Scalable Multi-Robot Goal Assignment Algorithm for Minimizing Mission Time @@ -0,0 +1 @@ +We study a variant of the multi-robot goal assignment problem where a unique goal to each robot needs to be assigned while minimizing the largest cost of movement among the robots, called makespan. A significant step in solving this problem is to find the cost associated with the robot-goal pairs, which requires solving a complex path planning problem. We present OM, a scalable optimal algorithm that solves the multi-robot goal assignment problem by computing the paths for a significantly smaller number of robot-goal pairs compared to the state-of-the-art algorithms, leading to a computationally superior mechanism to solve the problem. We extensively evaluate our algorithm for hundreds of robots on randomly generated and standard workspaces. Our experimental results demonstrate that the proposed algorithm achieves a noticeable speedup over two state-of-the-art baseline algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Optimal Quasi-clique: Hardness, Equivalence with Densest-k-Subgraph, and Quasi-partitioned Community Mining b/data/2024/aaai/Optimal Quasi-clique: Hardness, Equivalence with Densest-k-Subgraph, and Quasi-partitioned Community Mining new file mode 100644 index 0000000000..79aa3e8957 --- /dev/null +++ b/data/2024/aaai/Optimal Quasi-clique: Hardness, Equivalence with Densest-k-Subgraph, and Quasi-partitioned Community Mining @@ -0,0 +1 @@ +Dense subgraph discovery (DSD) is a key primitive in graph mining that typically deals with extracting cliques and near-cliques. In this paper, we revisit the optimal quasi-clique (OQC) formulation for DSD and establish that it is NP-hard. In addition, we reveal the hitherto unknown property that OQC can be used to explore the entire spectrum of densest subgraphs of all distinct sizes by appropriately varying a single hyperparameter, thereby forging an intimate link with the classic densest-k-subgraph problem (DkS). We corroborate these findings on real-world graphs by applying the simple greedy algorithm for OQC with improved hyperparameter tuning to quickly generate high-quality approximations of the size-density frontier. Our findings indicate that OQC not only extracts high-quality (near-)cliques, but also large and loosely-connected subgraphs that exhibit well-defined local community structure. The latter discovery is particularly intriguing, since OQC is not explicitly geared towards community detection.
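For readers unfamiliar with the OQC objective, the sketch below shows the classic greedy peeling heuristic on the edges-minus-alpha-pairs surrogate f_alpha(S) = e[S] - alpha * |S|(|S|-1)/2; the exact objective, the choice of alpha, and the improved tuning used in the paper may differ, so treat this as a minimal illustration rather than the authors' algorithm.

import networkx as nx

def oqc_score(H, alpha):
    # f_alpha(S) = edges(S) - alpha * |S| * (|S| - 1) / 2
    n = H.number_of_nodes()
    return H.number_of_edges() - alpha * n * (n - 1) / 2

def greedy_quasi_clique(G, alpha=1/3):
    # Repeatedly peel the minimum-degree vertex, remembering the best-scoring subgraph seen.
    H = G.copy()
    best_nodes, best_score = set(H.nodes), oqc_score(H, alpha)
    while H.number_of_nodes() > 1:
        v = min(H.nodes, key=H.degree)
        H.remove_node(v)
        s = oqc_score(H, alpha)
        if s > best_score:
            best_nodes, best_score = set(H.nodes), s
    return best_nodes, best_score

print(greedy_quasi_clique(nx.karate_club_graph(), alpha=1/3))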
\ No newline at end of file diff --git a/data/2024/aaai/Optimal Transport with Cyclic Symmetry b/data/2024/aaai/Optimal Transport with Cyclic Symmetry new file mode 100644 index 0000000000..364164caaa --- /dev/null +++ b/data/2024/aaai/Optimal Transport with Cyclic Symmetry @@ -0,0 +1 @@ +We propose novel fast algorithms for optimal transport (OT) utilizing a cyclic symmetry structure of input data. Such OT with cyclic symmetry appears universally in various real-world examples: image processing, urban planning, and graph processing. Our main idea is to reduce OT to a small optimization problem that has significantly fewer variables by utilizing cyclic symmetry and various optimization techniques. On the basis of this reduction, our algorithms solve the small optimization problem instead of the original OT. As a result, our algorithms obtain the optimal solution and the objective function value of the original OT faster than solving the original OT directly. In this paper, our focus is on two crucial OT formulations: the linear programming OT (LOT) and the strongly convex-regularized OT, which includes the well-known entropy-regularized OT (EROT). Experiments show the effectiveness of our algorithms for LOT and EROT in synthetic/real-world data that has a strict/approximate cyclic symmetry structure. Through theoretical and experimental results, this paper successfully introduces the concept of symmetry into the OT research field for the first time. \ No newline at end of file diff --git a/data/2024/aaai/Optimal Transport with Tempered Exponential Measures b/data/2024/aaai/Optimal Transport with Tempered Exponential Measures new file mode 100644 index 0000000000..3bad061289 --- /dev/null +++ b/data/2024/aaai/Optimal Transport with Tempered Exponential Measures @@ -0,0 +1 @@ +In the field of optimal transport, two prominent subfields face each other: (i) unregularized optimal transport, ``a-la-Kantorovich'', which leads to extremely sparse plans but with algorithms that scale poorly, and (ii) entropic-regularized optimal transport, ``a-la-Sinkhorn-Cuturi'', which gets near-linear approximation algorithms but leads to maximally un-sparse plans. In this paper, we show that an extension of the latter to tempered exponential measures, a generalization of exponential families with indirect measure normalization, gets to a very convenient middle ground, with both very fast approximation algorithms and sparsity, which is under control up to sparsity patterns. In addition, our formulation fits naturally in the unbalanced optimal transport problem setting. \ No newline at end of file diff --git a/data/2024/aaai/Optimised Storage for Datalog Reasoning b/data/2024/aaai/Optimised Storage for Datalog Reasoning new file mode 100644 index 0000000000..896932fead --- /dev/null +++ b/data/2024/aaai/Optimised Storage for Datalog Reasoning @@ -0,0 +1 @@ +Materialisation facilitates Datalog reasoning by precomputing all consequences of the facts and the rules so that queries can be directly answered over the materialised facts. However, storing all materialised facts may be infeasible in practice, especially when the rules are complex and the given set of facts is large. We observe that for certain combinations of rules, there exist data structures that compactly represent the reasoning result and can be efficiently queried when necessary. In this paper, we present a general framework that allows for the integration of such optimised storage schemes with standard materialisation algorithms. 
Moreover, we devise optimised storage schemes targeting at transitive rules and union rules, two types of (combination of) rules that commonly occur in practice. Our experimental evaluation shows that our approach significantly improves memory consumption, sometimes by orders of magnitude, while remaining competitive in terms of query answering time. \ No newline at end of file diff --git a/data/2024/aaai/Optimistic Model Rollouts for Pessimistic Offline Policy Optimization b/data/2024/aaai/Optimistic Model Rollouts for Pessimistic Offline Policy Optimization new file mode 100644 index 0000000000..649524cdec --- /dev/null +++ b/data/2024/aaai/Optimistic Model Rollouts for Pessimistic Offline Policy Optimization @@ -0,0 +1 @@ +Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism for policy optimization, usually via constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages the policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We initially observed the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. Then we relabel the sampled state-action pairs with penalized rewards, and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely-used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization. \ No newline at end of file diff --git a/data/2024/aaai/Optimistic Policy Gradient in Multi-Player Markov Games with a Single Controller: Convergence beyond the Minty Property b/data/2024/aaai/Optimistic Policy Gradient in Multi-Player Markov Games with a Single Controller: Convergence beyond the Minty Property new file mode 100644 index 0000000000..b4f095cbb7 --- /dev/null +++ b/data/2024/aaai/Optimistic Policy Gradient in Multi-Player Markov Games with a Single Controller: Convergence beyond the Minty Property @@ -0,0 +1 @@ +Policy gradient methods enjoy strong practical performance in numerous tasks in reinforcement learning. Their theoretical understanding in multiagent settings, however, remains limited, especially beyond two-player competitive and potential Markov games. In this paper, we develop a new framework to characterize optimistic policy gradient methods in multi-player Markov games with a single controller. Specifically, under the further assumption that the game exhibits an equilibrium collapse, in that the marginals of coarse correlated equilibria (CCE) induce Nash equilibria (NE), we show convergence to stationary epsilon-NE in O(1/epsilon^2) iterations, where O suppresses polynomial factors in the natural parameters of the game. 
Such an equilibrium collapse is well-known to manifest itself in two-player zero-sum Markov games, but also occurs even in a class of multi-player Markov games with separable interactions, as established by recent work. As a result, we bypass known complexity barriers for computing stationary NE when either of our assumptions fails. Our approach relies on a natural generalization of the classical Minty property that we introduce, which we anticipate to have further applications beyond Markov games. \ No newline at end of file diff --git a/data/2024/aaai/Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning b/data/2024/aaai/Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..2d65829e9f --- /dev/null +++ b/data/2024/aaai/Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +In cooperative multi-agent reinforcement learning, decentralized agents hold the promise of overcoming the combinatorial explosion of the joint action space and enabling greater scalability. However, they are susceptible to a game-theoretic pathology called relative overgeneralization that shadows the optimal joint action. Although recent value-decomposition algorithms guide decentralized agents by learning a factored global action value function, the representational limitation and the inaccurate sampling of optimal joint actions during the learning process leave this problem unresolved. To address this limitation, this paper proposes a novel algorithm called Optimistic Value Instructors (OVI). The main idea behind OVI is to introduce multiple optimistic instructors into the value-decomposition paradigm, which are capable of suggesting potentially optimal joint actions and rectifying the factored global action value function to recover these optimal actions. Specifically, the instructors maintain optimistic value estimations of per-agent local actions and thus eliminate the negative effects caused by other agents' exploratory or sub-optimal non-cooperation, enabling accurate identification and suggestion of optimal joint actions. Based on the instructors' suggestions, the paper further presents two instructive constraints to rectify the factored global action value function to recover these optimal joint actions, thus overcoming the relative overgeneralization problem. Experimental evaluation of OVI on various cooperative multi-agent tasks demonstrates its superior performance against multiple baselines, highlighting its effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Optimize & Reduce: A Top-Down Approach for Image Vectorization b/data/2024/aaai/Optimize & Reduce: A Top-Down Approach for Image Vectorization new file mode 100644 index 0000000000..62903ad599 --- /dev/null +++ b/data/2024/aaai/Optimize & Reduce: A Top-Down Approach for Image Vectorization @@ -0,0 +1 @@ +Vector image representation is a popular choice when editability and flexibility in resolution are desired. However, most images are only available in raster form, making raster-to-vector image conversion (vectorization) an important task. Classical methods for vectorization are either domain-specific or yield an abundance of shapes which limits editability and interpretability.
Learning-based methods, that use differentiable rendering, have revolutionized vectorization, at the cost of poor generalization to out-of-training distribution domains, and optimization-based counterparts are either slow or produce non-editable and redundant shapes. In this work, we propose Optimize & Reduce (O&R), a top-down approach to vectorization that is both fast and domain-agnostic. O&R aims to attain a compact representation of input images by iteratively optimizing Bezier curve parameters and significantly reducing the number of shapes, using a devised importance measure. We contribute a benchmark of five datasets comprising images from a broad spectrum of image complexities - from emojis to natural-like images. Through extensive experiments on hundreds of images, we demonstrate that our method is domain agnostic and outperforms existing works in both reconstruction and perceptual quality for a fixed number of shapes. Moreover, we show that our algorithm is x10 faster than the state-of-the-art optimization-based method. Our code is publicly available: https://github.com/ajevnisek/optimize-and-reduce \ No newline at end of file diff --git a/data/2024/aaai/Optimizing ADMM and Over-Relaxed ADMM Parameters for Linear Quadratic Problems b/data/2024/aaai/Optimizing ADMM and Over-Relaxed ADMM Parameters for Linear Quadratic Problems new file mode 100644 index 0000000000..11afc3a125 --- /dev/null +++ b/data/2024/aaai/Optimizing ADMM and Over-Relaxed ADMM Parameters for Linear Quadratic Problems @@ -0,0 +1 @@ +The Alternating Direction Method of Multipliers (ADMM) has gained significant attention across a broad spectrum of machine learning applications. Incorporating the over-relaxation technique shows potential for enhancing the convergence rate of ADMM. However, determining optimal algorithmic parameters, including both the associated penalty and relaxation parameters, often relies on empirical approaches tailored to specific problem domains and contextual scenarios. Incorrect parameter selection can significantly hinder ADMM's convergence rate. To address this challenge, in this paper we first propose a general approach to optimize the value of penalty parameter, followed by a novel closed-form formula to compute the optimal relaxation parameter in the context of linear quadratic problems (LQPs). We then experimentally validate our parameter selection methods through random instantiations and diverse imaging applications, encompassing diffeomorphic image registration, image deblurring, and MRI reconstruction. \ No newline at end of file diff --git a/data/2024/aaai/Optimizing IT FinOps and Sustainability through Unsupervised Workload Characterization b/data/2024/aaai/Optimizing IT FinOps and Sustainability through Unsupervised Workload Characterization new file mode 100644 index 0000000000..eac73aeb1b --- /dev/null +++ b/data/2024/aaai/Optimizing IT FinOps and Sustainability through Unsupervised Workload Characterization @@ -0,0 +1 @@ +The widespread adoption of public and hybrid clouds, along with elastic resources and various automation tools for dynamic deployment, has accelerated the rapid provisioning of compute resources as needed. Despite these advancements, numerous resources persist unnecessarily due to factors such as poor digital hygiene, risk aversion, or the absence of effective tools, resulting in substantial costs and energy consumption. Existing threshold-based techniques prove inadequate in effectively addressing this challenge. 
To address this issue, we propose an unsupervised machine learning framework to automatically identify resources that can be de-provisioned completely or summoned on a schedule. Application of this approach to enterprise data has yielded promising initial results, facilitating the segregation of productive workloads with recurring demands from non-productive ones. \ No newline at end of file diff --git a/data/2024/aaai/Optimizing Local Satisfaction of Long-Run Average Objectives in Markov Decision Processes b/data/2024/aaai/Optimizing Local Satisfaction of Long-Run Average Objectives in Markov Decision Processes new file mode 100644 index 0000000000..4f561f6c7f --- /dev/null +++ b/data/2024/aaai/Optimizing Local Satisfaction of Long-Run Average Objectives in Markov Decision Processes @@ -0,0 +1 @@ +Long-run average optimization problems for Markov decision processes (MDPs) require constructing policies with optimal steady-state behavior, i.e., optimal limit frequency of visits to the states. However, such policies may suffer from local instability in the sense that the frequency of states visited in a bounded time horizon along a run differs significantly from the limit frequency. In this work, we propose an efficient algorithmic solution to this problem. \ No newline at end of file diff --git a/data/2024/aaai/Optimizing Recall in Deep Graph Hashing Framework for Item Retrieval (Student Abstract) b/data/2024/aaai/Optimizing Recall in Deep Graph Hashing Framework for Item Retrieval (Student Abstract) new file mode 100644 index 0000000000..b73aae8f7f --- /dev/null +++ b/data/2024/aaai/Optimizing Recall in Deep Graph Hashing Framework for Item Retrieval (Student Abstract) @@ -0,0 +1 @@ +Hashing-based recommendation (HR) methods, whose core idea is mapping users and items into hamming space, are common practice to improve item retrieval efficiency. However, existing HR fails to align optimization objective (i.e., Bayesian Personalized Ranking) and evaluation metric (i.e., Recall), leading to suboptimal performance. In this paper, we propose a smooth recall loss (termed as SRLoss), which targets Recall as the optimization objective. Due to the existence of discrete constraints, the optimization problem is NP-hard. To this end, we propose an approximation-adjustable gradient estimator to solve our problem. Experimental Results demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Optimizing the Optimization of Planning Domains by Automatic Action Schema Splitting b/data/2024/aaai/Optimizing the Optimization of Planning Domains by Automatic Action Schema Splitting new file mode 100644 index 0000000000..10242e41d4 --- /dev/null +++ b/data/2024/aaai/Optimizing the Optimization of Planning Domains by Automatic Action Schema Splitting @@ -0,0 +1,7 @@ +Most planners are based on grounding, that is, generating all instances of a parameterized action during a preprocessing phase. +For some problems the number of ground actions is too high, causing a performance bottleneck. +Building upon an existing approach, we present an enhanced method to split action schemas automatically during the grounding phase, to reduce the number of ground actions. +First, we propose to exploit the structural knowledge of the problems to have a more informative dependency graph. +Then, we suggest a better objective function to define and choose the best split. +Finally, we present a more effective search to find it. 
+We experimentally measure the impact of each of these improvements, and show that our approach significantly outperforms the state of the art. \ No newline at end of file diff --git a/data/2024/aaai/Orthogonal Dictionary Guided Shape Completion Network for Point Cloud b/data/2024/aaai/Orthogonal Dictionary Guided Shape Completion Network for Point Cloud new file mode 100644 index 0000000000..f10a59e247 --- /dev/null +++ b/data/2024/aaai/Orthogonal Dictionary Guided Shape Completion Network for Point Cloud @@ -0,0 +1 @@ +Point cloud shape completion, which aims to reconstruct the missing regions of the incomplete point clouds with plausible shapes, is an ill-posed and challenging task that benefits many downstream 3D applications. Prior approaches achieve this goal by employing a two-stage completion framework, generating a coarse yet complete seed point cloud through an encoder-decoder network, followed by refinement and upsampling. However, the encoded features suffer from information loss of the missing portion, leading to an inability of the decoder to reconstruct seed points with detailed geometric clues. To tackle this issue, we propose a novel Orthogonal Dictionary Guided Shape Completion Network (ODGNet). The proposed ODGNet consists of a Seed Generation U-Net, which leverages multi-level feature extraction and concatenation to significantly enhance the representation capability of seed points, and Orthogonal Dictionaries that can learn shape priors from training samples and thus compensate for the information loss of the missing portions during inference. Our design is simple but to the point; extensive experimental results indicate that the proposed method can reconstruct point clouds with more details and outperform previous state-of-the-art counterparts. The implementation code is available at https://github.com/corecai163/ODGNet. \ No newline at end of file diff --git a/data/2024/aaai/Out of Thin Air: Exploring Data-Free Adversarial Robustness Distillation b/data/2024/aaai/Out of Thin Air: Exploring Data-Free Adversarial Robustness Distillation new file mode 100644 index 0000000000..462164d0c4 --- /dev/null +++ b/data/2024/aaai/Out of Thin Air: Exploring Data-Free Adversarial Robustness Distillation @@ -0,0 +1 @@ +Adversarial Robustness Distillation (ARD) is a promising task to solve the issue of limited adversarial robustness of small-capacity models while optimizing the expensive computational costs of Adversarial Training (AT). Despite the good robust performance, the existing ARD methods are still impractical to deploy in natural high-security scenes because these methods rely entirely on original or publicly available data with a similar distribution. In fact, these data are almost always private, specific, and distinctive for scenes that require high robustness. To tackle these issues, we propose a challenging but significant task called Data-Free Adversarial Robustness Distillation (DFARD), which aims to train small, easily deployable, robust models without relying on data. We demonstrate that the challenge lies in the lower upper bound of knowledge transfer information, making it crucial to mine and transfer knowledge more efficiently. Inspired by human education, we design a plug-and-play Interactive Temperature Adjustment (ITA) strategy to improve the efficiency of knowledge transfer and propose an Adaptive Generator Balance (AGB) module to retain more data information.
Our method uses adaptive hyperparameters to avoid a large number of parameter tuning, which significantly outperforms the combination of existing techniques. Meanwhile, our method achieves stable and reliable performance on multiple benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Out-of-Distribution Detection in Long-Tailed Recognition with Calibrated Outlier Class Learning b/data/2024/aaai/Out-of-Distribution Detection in Long-Tailed Recognition with Calibrated Outlier Class Learning new file mode 100644 index 0000000000..fe37ceaaa2 --- /dev/null +++ b/data/2024/aaai/Out-of-Distribution Detection in Long-Tailed Recognition with Calibrated Outlier Class Learning @@ -0,0 +1 @@ +Existing out-of-distribution (OOD) methods have shown great success on balanced datasets but become ineffective in long-tailed recognition (LTR) scenarios where 1) OOD samples are often wrongly classified into head classes and/or 2) tail-class samples are treated as OOD samples. To address these issues, current studies fit a prior distribution of auxiliary/pseudo OOD data to the long-tailed in-distribution (ID) data. However, it is difficult to obtain such an accurate prior distribution given the unknowingness of real OOD samples and heavy class imbalance in LTR. A straightforward solution to avoid the requirement of this prior is to learn an outlier class to encapsulate the OOD samples. The main challenge is then to tackle the aforementioned confusion between OOD samples and head/tail-class samples when learning the outlier class. To this end, we introduce a novel calibrated outlier class learning (COCL) approach, in which 1) a debiased large margin learning method is introduced in the outlier class learning to distinguish OOD samples from both head and tail classes in the representation space and 2) an outlier-class-aware logit calibration method is defined to enhance the long-tailed classification confidence. Extensive empirical results on three popular benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that COCL substantially outperforms existing state-of-the-art OOD detection methods in LTR while being able to improve the classification accuracy on ID data. Code is available at https://github.com/mala-lab/COCL. \ No newline at end of file diff --git a/data/2024/aaai/Outlier Ranking for Large-Scale Public Health Data b/data/2024/aaai/Outlier Ranking for Large-Scale Public Health Data new file mode 100644 index 0000000000..5eb01487b7 --- /dev/null +++ b/data/2024/aaai/Outlier Ranking for Large-Scale Public Health Data @@ -0,0 +1 @@ +Disease control experts inspect public health data streams daily for outliers worth investigating, like those corresponding to data quality issues or disease outbreaks. However, they can only examine a few of the thousands of maximally-tied outliers returned by univariate outlier detection methods applied to large-scale public health data streams. To help experts distinguish the most important outliers from these thousands of tied outliers, we propose a new task for algorithms to rank the outputs of any univariate method applied to each of many streams. Our novel algorithm for this task, which leverages hierarchical networks and extreme value analysis, performed the best across traditional outlier detection metrics in a human-expert evaluation using public health data streams. 
Most importantly, experts have used our open-source Python implementation since April 2023 and report identifying outliers worth investigating 9.1x faster than their prior baseline. Other organizations can readily adapt this implementation to create rankings from the outputs of their tailored univariate methods across large-scale streams. \ No newline at end of file diff --git a/data/2024/aaai/P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL b/data/2024/aaai/P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL new file mode 100644 index 0000000000..aaf932518a --- /dev/null +++ b/data/2024/aaai/P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL @@ -0,0 +1,3 @@ +Safe Reinforcement Learning (SRL) algorithms aim to learn a policy that maximizes the reward while satisfying the safety constraints. One of the challenges in SRL is that it is often difficult to balance the two objectives of reward maximization and safety constraint satisfaction. Existing algorithms utilize constraint optimization techniques like penalty-based, barrier penalty-based, and Lagrangian-based dual or primal policy optimizations methods. However, they suffer from training oscillations and approximation errors, which impact the overall learning objectives. + +This paper proposes the Permeable Penalty Barrier-based Policy Optimization (P2BPO) algorithm that addresses this issue by allowing a small fraction of penalty beyond the penalty barrier, and a parameter is used to control this permeability. In addition, an adaptive penalty parameter is used instead of a constant one, which is initialized with a low value and increased gradually as the agent violates the safety constraints. We have also provided a theoretical proof of the proposed method's performance guarantee bound, which ensures that P2BPO can learn a policy satisfying the safety constraints with high probability while achieving a higher expected reward. Furthermore, we compare P2BPO with other SRL algorithms on various SRL tasks and demonstrate that it achieves better rewards while adhering to the constraints. \ No newline at end of file diff --git a/data/2024/aaai/PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning b/data/2024/aaai/PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning new file mode 100644 index 0000000000..cb144d249d --- /dev/null +++ b/data/2024/aaai/PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning @@ -0,0 +1 @@ +Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations to the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action space. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of Pareto policy set. The proposed method leverages Pareto ascent direction to select the scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. 
Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes. \ No newline at end of file diff --git a/data/2024/aaai/PAC-Bayes Generalisation Bounds for Dynamical Systems including Stable RNNs b/data/2024/aaai/PAC-Bayes Generalisation Bounds for Dynamical Systems including Stable RNNs new file mode 100644 index 0000000000..68c564057a --- /dev/null +++ b/data/2024/aaai/PAC-Bayes Generalisation Bounds for Dynamical Systems including Stable RNNs @@ -0,0 +1,4 @@ +In this paper, we derive a PAC-Bayes bound on the generalisation gap, in a supervised time-series setting for a special class of discrete-time non-linear dynamical systems. This class includes stable recurrent neural networks (RNN), and the motivation for this work was its application to RNNs. In order to achieve the results, we impose some stability constraints, on the allowed models. +Here, stability is understood in the sense of dynamical systems. For RNNs, these stability conditions can be expressed in terms of conditions on the weights. +We assume the processes involved are essentially bounded and the loss functions are Lipschitz. The proposed bound on the generalisation gap depends on the mixing coefficient of the data distribution, and the essential supremum of the data. Furthermore, the bound converges to zero as the dataset size increases. +In this paper, we 1) formalize the learning problem, 2) derive a PAC-Bayesian error bound for such systems, 3) discuss various consequences of this error bound, and 4) show an illustrative example, with discussions on computing the proposed bound. Unlike other available bounds the derived bound holds for non i.i.d. data (time-series) and it does not grow with the number of steps of the RNN. \ No newline at end of file diff --git a/data/2024/aaai/PARSAC: Accelerating Robust Multi-Model Fitting with Parallel Sample Consensus b/data/2024/aaai/PARSAC: Accelerating Robust Multi-Model Fitting with Parallel Sample Consensus new file mode 100644 index 0000000000..5ff08e4e88 --- /dev/null +++ b/data/2024/aaai/PARSAC: Accelerating Robust Multi-Model Fitting with Parallel Sample Consensus @@ -0,0 +1,9 @@ +We present a real-time method for robust estimation of multiple instances of geometric models from noisy data. +Geometric models such as vanishing points, planar homographies or fundamental matrices are essential for 3D scene analysis. +Previous approaches discover distinct model instances in an iterative manner, thus limiting their potential for speedup via parallel computation. +In contrast, our method detects all model instances independently and in parallel. +A neural network segments the input data into clusters representing potential model instances by predicting multiple sets of sample and inlier weights. +Using the predicted weights, we determine the model parameters for each potential instance separately in a RANSAC-like fashion. +We train the neural network via task-specific loss functions, i.e. we do not require a ground-truth segmentation of the input data. 
+As suitable training data for homography and fundamental matrix fitting is scarce, we additionally present two new synthetic datasets. +We demonstrate state-of-the-art performance on these as well as multiple established datasets, with inference times as small as five milliseconds per image. \ No newline at end of file diff --git a/data/2024/aaai/PC-Conv: Unifying Homophily and Heterophily with Two-Fold Filtering b/data/2024/aaai/PC-Conv: Unifying Homophily and Heterophily with Two-Fold Filtering new file mode 100644 index 0000000000..459bd8db2f --- /dev/null +++ b/data/2024/aaai/PC-Conv: Unifying Homophily and Heterophily with Two-Fold Filtering @@ -0,0 +1 @@ +Recently, many carefully designed graph representation learning methods have achieved impressive performance on either strong heterophilic or homophilic graphs, but not both. Therefore, they are incapable of generalizing well across real-world graphs with different levels of homophily. This is attributed to their neglect of homophily in heterophilic graphs, and vice versa. In this paper, we propose a two-fold filtering mechanism to mine homophily in heterophilic graphs, and vice versa. In particular, we extend the graph heat equation to perform heterophilic aggregation of global information from a long distance. The resultant filter can be exactly approximated by the Poisson-Charlier (PC) polynomials. To further exploit information at multiple orders, we introduce a powerful graph convolution PC-Conv and its instantiation PCNet for the node classification task. Compared to the state-of-the-art GNNs, PCNet shows competitive performance on well-known homophilic and heterophilic graphs. Our implementation is available at https://github.com/uestclbh/PC-Conv. \ No newline at end of file diff --git a/data/2024/aaai/PCE-Palm: Palm Crease Energy Based Two-Stage Realistic Pseudo-Palmprint Generation b/data/2024/aaai/PCE-Palm: Palm Crease Energy Based Two-Stage Realistic Pseudo-Palmprint Generation new file mode 100644 index 0000000000..4b924abddc --- /dev/null +++ b/data/2024/aaai/PCE-Palm: Palm Crease Energy Based Two-Stage Realistic Pseudo-Palmprint Generation @@ -0,0 +1 @@ +The lack of large-scale data seriously hinders the development of palmprint recognition. Recent approaches address this issue by generating large-scale realistic pseudo palmprints from Bézier curves. However, the significant difference between Bézier curves and real palmprints limits their effectiveness. In this paper, we divide the Bézier-Real difference into crease and texture differences, thus reducing the generation difficulty. We introduce a new palm crease energy (PCE) domain as a bridge from Bézier curves to real palmprints and propose a two-stage generation model. The first stage generates PCE images (realistic creases) from Bézier curves, and the second stage outputs realistic palmprints (realistic texture) with PCE images as input. In addition, we also design a lightweight plug-and-play line feature enhancement block to facilitate domain transfer and improve recognition performance. Extensive experimental results demonstrate that the proposed method surpasses state-of-the-art methods. Under extremely data-scarce settings such as 40 IDs (only 2.5% of the total training set), our model achieves a 29% improvement over RPG-Palm and outperforms ArcFace trained with 100% of the training set by more than 6% in terms of TAR@FAR=1e-6.
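To make the Bézier starting point of such pseudo-palmprint pipelines concrete, the sketch below samples one cubic Bézier curve and rasterizes it onto a blank canvas as a synthetic crease; the control-point ranges and canvas size are illustrative assumptions, not the settings used by PCE-Palm or RPG-Palm.

import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=200):
    # Evaluate a cubic Bezier curve at n parameter values in [0, 1].
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

rng = np.random.default_rng(0)
p0, p1, p2, p3 = (rng.uniform(0, 128, size=2) for _ in range(4))  # random control points
pts = cubic_bezier(p0, p1, p2, p3)

canvas = np.zeros((128, 128), dtype=np.uint8)
rows = np.clip(pts[:, 1].astype(int), 0, 127)
cols = np.clip(pts[:, 0].astype(int), 0, 127)
canvas[rows, cols] = 255  # one synthetic crease; real pipelines draw many such curves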
\ No newline at end of file diff --git a/data/2024/aaai/PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion b/data/2024/aaai/PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion new file mode 100644 index 0000000000..4d9f704d6c --- /dev/null +++ b/data/2024/aaai/PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion @@ -0,0 +1 @@ +The generalization of neural networks is a central challenge in machine learning, especially concerning the performance under distributions that differ from training ones. Current methods, mainly based on the data-driven paradigm such as data augmentation, adversarial training, and noise injection, may encounter limited generalization due to model non-smoothness. In this paper, we propose to investigate generalization from a Partial Differential Equation (PDE) perspective, aiming to enhance it directly through the underlying function of neural networks, rather than focusing on adjusting input data. Specifically, we first establish the connection between neural network generalization and the smoothness of the solution to a specific PDE, namely transport equation. Building upon this, we propose a general framework that introduces adaptive distributional diffusion into transport equation to enhance the smoothness of its solution, thereby improving generalization. In the context of neural networks, we put this theoretical framework into practice as PDE+ (PDE with Adaptive Distributional Diffusion) which diffuses each sample into a distribution covering semantically similar inputs. This enables better coverage of potentially unobserved distributions in training, thus improving generalization beyond merely data-driven methods. The effectiveness of PDE+ is validated through extensive experimental settings, demonstrating its superior performance compared to state-of-the-art methods. Our code is available at https://github.com/yuanyige/pde-add. \ No newline at end of file diff --git a/data/2024/aaai/PG-LBO: Enhancing High-Dimensional Bayesian Optimization with Pseudo-Label and Gaussian Process Guidance b/data/2024/aaai/PG-LBO: Enhancing High-Dimensional Bayesian Optimization with Pseudo-Label and Gaussian Process Guidance new file mode 100644 index 0000000000..a84e299adc --- /dev/null +++ b/data/2024/aaai/PG-LBO: Enhancing High-Dimensional Bayesian Optimization with Pseudo-Label and Gaussian Process Guidance @@ -0,0 +1 @@ +Variational Autoencoder based Bayesian Optimization (VAE-BO) has demonstrated its excellent performance in addressing high-dimensional structured optimization problems. However, current mainstream methods overlook the potential of utilizing a pool of unlabeled data to construct the latent space, while only concentrating on designing sophisticated models to leverage the labeled data. Despite their effective usage of labeled data, these methods often require extra network structures, additional procedure, resulting in computational inefficiency. To address this issue, we propose a novel method to effectively utilize unlabeled data with the guidance of labeled data. Specifically, we tailor the pseudo-labeling technique from semi-supervised learning to explicitly reveal the relative magnitudes of optimization objective values hidden within the unlabeled data. Based on this technique, we assign appropriate training weights to unlabeled data to enhance the construction of a discriminative latent space. 
Furthermore, we treat the VAE encoder and the Gaussian Process (GP) in Bayesian optimization as a unified deep kernel learning process, allowing the direct utilization of labeled data, which we term as Gaussian Process guidance. This directly and effectively integrates the goal of improving GP accuracy into the VAE training, thereby guiding the construction of the latent space. The extensive experiments demonstrate that our proposed method outperforms existing VAE-BO algorithms in various optimization scenarios. Our code will be published at https://github.com/TaicaiChen/PG-LBO. \ No newline at end of file diff --git a/data/2024/aaai/PHFormer: Multi-Fragment Assembly Using Proxy-Level Hybrid Transformer b/data/2024/aaai/PHFormer: Multi-Fragment Assembly Using Proxy-Level Hybrid Transformer new file mode 100644 index 0000000000..a9866af3b3 --- /dev/null +++ b/data/2024/aaai/PHFormer: Multi-Fragment Assembly Using Proxy-Level Hybrid Transformer @@ -0,0 +1 @@ +Fragment assembly involves restoring broken objects to their original geometries, and has many applications, such as archaeological restoration. Existing learning based frameworks have shown potential for solving part assembly problems with semantic decomposition, but cannot handle such geometrical decomposition problems. In this work, we propose a novel assembly framework, proxy level hybrid Transformer, with the core idea of using a hybrid graph to model and reason complex structural relationships between patches of fragments, dubbed as proxies. To this end, we propose a hybrid attention module, composed of intra and inter attention layers, enabling capturing of crucial contextual information within fragments and relative structural knowledge across fragments. Furthermore, we propose an adjacency aware hierarchical pose estimator, exploiting a decompose and integrate strategy. It progressively predicts adjacent probability and relative poses between fragments, and then implicitly infers their absolute poses by dynamic information integration. Extensive experimental results demonstrate that our method effectively reduces assembly errors while maintaining fast inference speed. The code is available at https://github.com/521piglet/PHFormer. \ No newline at end of file diff --git a/data/2024/aaai/PICNN: A Pathway towards Interpretable Convolutional Neural Networks b/data/2024/aaai/PICNN: A Pathway towards Interpretable Convolutional Neural Networks new file mode 100644 index 0000000000..274a5bdb49 --- /dev/null +++ b/data/2024/aaai/PICNN: A Pathway towards Interpretable Convolutional Neural Networks @@ -0,0 +1 @@ +Convolutional Neural Networks (CNNs) have exhibited great performance in discriminative feature learning for complex visual tasks. Besides discrimination power, interpretability is another important yet under-explored property for CNNs. One difficulty in the CNN interpretability is that filters and image classes are entangled. In this paper, we introduce a novel pathway to alleviate the entanglement between filters and image classes. The proposed pathway groups the filters in a late conv-layer of CNN into class-specific clusters. Clusters and classes are in a one-to-one relationship. Specifically, we use the Bernoulli sampling to generate the filter-cluster assignment matrix from a learnable filter-class correspondence matrix. To enable end-to-end optimization, we develop a novel reparameterization trick for handling the non-differentiable Bernoulli sampling. 
We evaluate the effectiveness of our method on ten widely used network architectures (including nine CNNs and a ViT) and five benchmark datasets. Experimental results have demonstrated that our method PICNN (the combination of standard CNNs with our proposed pathway) exhibits greater interpretability than standard CNNs while achieving higher or comparable discrimination power. \ No newline at end of file diff --git a/data/2024/aaai/PICSR: Prototype-Informed Cross-Silo Router for Federated Learning (Student Abstract) b/data/2024/aaai/PICSR: Prototype-Informed Cross-Silo Router for Federated Learning (Student Abstract) new file mode 100644 index 0000000000..0819e8fa44 --- /dev/null +++ b/data/2024/aaai/PICSR: Prototype-Informed Cross-Silo Router for Federated Learning (Student Abstract) @@ -0,0 +1 @@ +Federated Learning is an effective approach for learning from data distributed across multiple institutions. While most existing studies are aimed at improving predictive accuracy of models, little work has been done to explain knowledge differences between institutions and the benefits of collaboration. Understanding these differences is critical in cross-silo federated learning domains, e.g., in healthcare or banking, where each institution or silo has a different underlying distribution and stakeholders want to understand how their institution compares to their partners. We introduce Prototype-Informed Cross-Silo Router (PICSR) which utilizes a mixture of experts approach to combine local models derived from multiple silos. Furthermore, by computing data similarity to prototypical samples from each silo, we are able to ground the router’s predictions in the underlying dataset distributions. Experiments on a real-world heart disease prediction dataset show that PICSR retains high performance while enabling further explanations on the differences among institutions compared to a single black-box model. \ No newline at end of file diff --git a/data/2024/aaai/PM-INR: Prior-Rich Multi-Modal Implicit Large-Scale Scene Neural Representation b/data/2024/aaai/PM-INR: Prior-Rich Multi-Modal Implicit Large-Scale Scene Neural Representation new file mode 100644 index 0000000000..dbadbd1186 --- /dev/null +++ b/data/2024/aaai/PM-INR: Prior-Rich Multi-Modal Implicit Large-Scale Scene Neural Representation @@ -0,0 +1,5 @@ +Recent advancements in implicit neural representations have contributed to high-fidelity surface reconstruction and photorealistic novel view synthesis. However, with the expansion of the scene scale, such as block or city level, existing methods +will encounter challenges because traditional sampling cannot cope with the cubically growing sampling space. To alleviate the dependence on filling the sampling space, we explore using multi-modal priors to assist individual points to +obtain more global semantic information and propose a prior-rich multi-modal implicit neural representation network, PM-INR, for the outdoor unbounded large-scale scene. The core of our method is multi-modal prior extraction and cross-modal prior fusion modules. The former encodes codebooks from different modality inputs and extracts valuable priors, while the latter fuses priors to maintain view consistency and preserve unique features among multi-modal priors. Finally, feature-rich cross-modal priors are injected into the sampling +regions to allow each region to perceive global information without filling the sampling space.
Extensive experiments have demonstrated the effectiveness and robustness of our method for outdoor unbounded large-scale scene novel +view synthesis, which outperforms state-of-the-art methods in terms of PSNR, SSIM, and LPIPS. \ No newline at end of file diff --git a/data/2024/aaai/PMAC: Personalized Multi-Agent Communication b/data/2024/aaai/PMAC: Personalized Multi-Agent Communication new file mode 100644 index 0000000000..c490824aaf --- /dev/null +++ b/data/2024/aaai/PMAC: Personalized Multi-Agent Communication @@ -0,0 +1 @@ +Communication plays a crucial role in information sharing within the field of multi-agent reinforcement learning (MARL). However, how to transmit information that meets individual needs remains a long-standing challenge. Some existing works focus on using a common channel for information transfer, which limits the capability for local communication. Meanwhile, other works attempt to establish peer-to-peer communication topologies but suffer from quadratic complexity. In this paper, we propose Personalized Multi-Agent Communication (PMAC), which enables the formation of peer-to-peer communication topologies, personalized message sending, and personalized message receiving. All these modules in PMAC are performed using only multilayer perceptrons (MLPs) with linear computational complexity. Empirically, we show the strength of personalized communication in a variety of cooperative scenarios. Our approach exhibits competitive performance compared to existing methods while maintaining notable computational efficiency. \ No newline at end of file diff --git a/data/2024/aaai/PMET: Precise Model Editing in a Transformer b/data/2024/aaai/PMET: Precise Model Editing in a Transformer new file mode 100644 index 0000000000..dd27aff960 --- /dev/null +++ b/data/2024/aaai/PMET: Precise Model Editing in a Transformer @@ -0,0 +1 @@ +Model editing techniques, which modify a minor proportion of knowledge in Large Language Models (LLMs) at a relatively low cost, have demonstrated notable success. Existing methods assume Transformer Layer (TL) hidden states are values of key-value memories of the Feed-Forward Network (FFN). They usually optimize the TL hidden states to memorize target knowledge and use it to update the weights of the FFN in LLMs. However, the information flow of TL hidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN, and residual connections. Existing methods neglect the fact that the TL hidden states contain information not specifically required by the FFN. Consequently, the performance of model editing decreases. To achieve more precise model editing, we analyze hidden states of MHSA and FFN, finding that MHSA encodes certain general knowledge extraction patterns. This implies that MHSA weights do not require updating when new knowledge is introduced. Based on the above findings, we introduce PMET, which simultaneously optimizes Transformer Component (TC, namely MHSA and FFN) hidden states, while only using the optimized TC hidden states of FFN to precisely update FFN weights. Our experiments demonstrate that PMET exhibits state-of-the-art performance on both the COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the effectiveness of our enhancements, further reinforcing the finding that the MHSA encodes certain general knowledge extraction patterns and indicating its storage of a small amount of factual knowledge. Our code is available at https://github.com/xpq-tech/PMET.
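PMET builds on the view of FFN weights as a linear key-value memory. As background only (the sketch below is the generic minimum-norm rank-one insertion used by locate-and-edit methods, not PMET's actual update, and the dimensions are illustrative), it shows how a single key-value pair can be written into a linear layer so that the edited weights map key k exactly to value v.

import numpy as np

def insert_key_value(W, k, v):
    # Minimum Frobenius-norm change to W such that (W + dW) @ k == v.
    residual = v - W @ k
    return W + np.outer(residual, k) / (k @ k)

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))   # stand-in for an FFN output projection
k = rng.standard_normal(d)        # key: hidden state encoding the edited subject
v = rng.standard_normal(d)        # value: optimized hidden state carrying the new fact
W_new = insert_key_value(W, k, v)
assert np.allclose(W_new @ k, v)  # the edited layer now maps the key to the target value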
\ No newline at end of file diff --git a/data/2024/aaai/PMRC: Prompt-Based Machine Reading Comprehension for Few-Shot Named Entity Recognition b/data/2024/aaai/PMRC: Prompt-Based Machine Reading Comprehension for Few-Shot Named Entity Recognition new file mode 100644 index 0000000000..152e46693c --- /dev/null +++ b/data/2024/aaai/PMRC: Prompt-Based Machine Reading Comprehension for Few-Shot Named Entity Recognition @@ -0,0 +1 @@ +The prompt-based method has been proven effective in improving the performance of pre-trained language models (PLMs) on sentence-level few-shot tasks. However, when applying prompting to token-level tasks such as Named Entity Recognition (NER), specific templates need to be designed, and all possible segments of the input text need to be enumerated. These methods have high computational complexity in both training and inference processes, making them difficult to apply in real-world scenarios. To address these issues, we redefine the NER task as a Machine Reading Comprehension (MRC) task and incorporate prompting into the MRC framework. Specifically, we sequentially insert boundary markers for various entity types into the templates and use these markers as anchors during the inference process to differentiate entity types. In contrast to the traditional multi-turn question-answering extraction in the MRC framework, our method can extract all spans of entity types in one round. Furthermore, we propose word-based template and example-based template that enhance the MRC framework's perception of entity start and end positions while significantly reducing the manual effort required for template design. It is worth noting that in cross-domain scenarios, PMRC does not require redesigning the model architecture and can continue training by simply replacing the templates to recognize entity types in the target domain. Experimental results demonstrate that our approach outperforms state-of-the-art models in low-resource settings, achieving an average performance improvement of +5.2% in settings where access to source domain data is limited. Particularly, on the ATIS dataset with a large number of entity types and 10-shot setting, PMRC achieves a performance improvement of +15.7%. Moreover, our method achieves a decoding speed 40.56 times faster than the template-based cloze-style approach. \ No newline at end of file diff --git a/data/2024/aaai/PNeRFLoc: Visual Localization with Point-Based Neural Radiance Fields b/data/2024/aaai/PNeRFLoc: Visual Localization with Point-Based Neural Radiance Fields new file mode 100644 index 0000000000..291ad463ff --- /dev/null +++ b/data/2024/aaai/PNeRFLoc: Visual Localization with Point-Based Neural Radiance Fields @@ -0,0 +1 @@ +Due to the ability to synthesize high-quality novel views, Neural Radiance Fields (NeRF) has been recently exploited to improve visual localization in a known environment. However, the existing methods mostly utilize NeRF for data augmentation to improve the regression model training, and their performances on novel viewpoints and appearances are still limited due to the lack of geometric constraints. In this paper, we propose a novel visual localization framework, i.e., PNeRFLoc, based on a unified point-based representation. On one hand, PNeRFLoc supports the initial pose estimation by matching 2D and 3D feature points as traditional structure-based methods; on the other hand, it also enables pose refinement with novel view synthesis using rendering-based optimization. 
Specifically, we propose a novel feature adaptation module to close the gaps between the features for visual localization and neural rendering. To improve the efficacy and efficiency of neural rendering-based optimization, we also develop an efficient rendering-based framework with a warping loss function. Extensive experiments demonstrate that PNeRFLoc performs the best on the synthetic dataset when the 3D NeRF model can be well learned, and significantly outperforms all the NeRF-boosted localization methods with on-par SOTA performance on the real-world benchmark localization datasets. Project webpage: https://zju3dv.github.io/PNeRFLoc/. \ No newline at end of file diff --git a/data/2024/aaai/PNeSM: Arbitrary 3D Scene Stylization via Prompt-Based Neural Style Mapping b/data/2024/aaai/PNeSM: Arbitrary 3D Scene Stylization via Prompt-Based Neural Style Mapping new file mode 100644 index 0000000000..0e93cfdf45 --- /dev/null +++ b/data/2024/aaai/PNeSM: Arbitrary 3D Scene Stylization via Prompt-Based Neural Style Mapping @@ -0,0 +1 @@ +3D scene stylization refers to transforming the appearance of a 3D scene to match a given style image, ensuring that images rendered from different viewpoints exhibit the same style as the given style image, while maintaining the 3D consistency of the stylized scene. Several existing methods have obtained impressive results in stylizing 3D scenes. However, the models proposed by these methods need to be re-trained when applied to a new scene. In other words, their models are coupled with a specific scene and cannot adapt to arbitrary other scenes. To address this issue, we propose a novel 3D scene stylization framework to transfer an arbitrary style to an arbitrary scene, without any style-related or scene-related re-training. Concretely, we first map the appearance of the 3D scene into a 2D style pattern space, which realizes complete disentanglement of the geometry and appearance of the 3D scene and makes our model generalize to arbitrary 3D scenes. Then we stylize the appearance of the 3D scene in the 2D style pattern space via a prompt-based 2D stylization algorithm. Experimental results demonstrate that our proposed framework is superior to SOTA methods in both visual quality and generalization. \ No newline at end of file diff --git a/data/2024/aaai/PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning b/data/2024/aaai/PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning new file mode 100644 index 0000000000..0fdb59162b --- /dev/null +++ b/data/2024/aaai/PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning @@ -0,0 +1 @@ +Despite many breakthroughs in recent years, it is still hard for MultiAgent Reinforcement Learning (MARL) algorithms to directly solve complex tasks in MultiAgent Systems (MASs) from scratch. In this work, we study how to use Automatic Curriculum Learning (ACL) to reduce the number of environmental interactions required to learn a good policy. In order to solve a difficult task, ACL methods automatically select a sequence of tasks (i.e., curricula). The idea is to obtain maximum learning progress towards the final task by continuously learning on tasks that match the current capabilities of the learners. The key question is how to measure the learning progress of the learner for better curriculum selection. We propose a novel ACL framework, PrOgRessive mulTiagent Automatic curricuLum (PORTAL), for MASs.
PORTAL selects curricula according to two criteria: 1) How difficult is a task, relative to the learners’ current abilities? 2) How similar is a task, relative to the final task? By learning a shared feature space between tasks, PORTAL is able to characterize different tasks based on the distribution of features and select those that are similar to the final task. Also, the shared feature space can effectively facilitate the policy transfer between curricula. Experimental results show that PORTAL can train agents to master extremely hard cooperative tasks, which cannot be achieved with previous state-of-the-art MARL algorithms. \ No newline at end of file diff --git a/data/2024/aaai/PPEA-Depth: Progressive Parameter-Efficient Adaptation for Self-Supervised Monocular Depth Estimation b/data/2024/aaai/PPEA-Depth: Progressive Parameter-Efficient Adaptation for Self-Supervised Monocular Depth Estimation new file mode 100644 index 0000000000..4748942b41 --- /dev/null +++ b/data/2024/aaai/PPEA-Depth: Progressive Parameter-Efficient Adaptation for Self-Supervised Monocular Depth Estimation @@ -0,0 +1,2 @@ +Self-supervised monocular depth estimation is of significant importance with applications spanning across autonomous driving and robotics. However, the reliance on self-supervision introduces a strong static-scene assumption, thereby posing challenges in achieving optimal performance in dynamic scenes, which are prevalent in most real-world situations. +To address these issues, we propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to transfer a pre-trained image model for self-supervised depth estimation. The training comprises two sequential stages: an initial phase trained on a dataset primarily composed of static scenes, succeeded by an expansion to more intricate datasets involving dynamic scenes. To facilitate this process, we design compact encoder and decoder adapters to enable parameter-efficient tuning, allowing the network to adapt effectively. They not only uphold generalized patterns from pre-trained image models but also retain knowledge gained from the preceding phase into the subsequent one. Extensive experiments demonstrate that PPEA-Depth achieves state-of-the-art performance on the KITTI, CityScapes, and DDAD datasets. \ No newline at end of file diff --git a/data/2024/aaai/PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine b/data/2024/aaai/PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine new file mode 100644 index 0000000000..c5e3d9b158 --- /dev/null +++ b/data/2024/aaai/PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine @@ -0,0 +1 @@ +As an effective tool for eliciting the power of Large Language Models (LLMs), prompting has recently demonstrated unprecedented abilities across a variety of complex tasks. To further improve the performance, prompt ensemble has attracted substantial interest for tackling the hallucination and instability of LLMs. However, existing methods usually adopt a two-stage paradigm, which requires a pre-prepared set of prompts with substantial manual effort, and is unable to perform directed optimization for different weak learners. In this paper, we propose a simple, universal, and automatic method named PREFER (Prompt Ensemble learning via Feedback-Reflect-Refine) to address the stated limitations. Specifically, given the fact that weak learners are supposed to focus on hard examples during boosting, PREFER builds a feedback mechanism for reflecting on the inadequacies of existing weak learners.
Based on this, the LLM is required to automatically synthesize new prompts for iterative refinement. Moreover, to enhance stability of the prompt effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and is beneficial for both feedback and weight calculation in boosting. Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin. We have made our code publicly available. \ No newline at end of file diff --git a/data/2024/aaai/PRP Rebooted: Advancing the State of the Art in FOND Planning b/data/2024/aaai/PRP Rebooted: Advancing the State of the Art in FOND Planning new file mode 100644 index 0000000000..5af0626f7d --- /dev/null +++ b/data/2024/aaai/PRP Rebooted: Advancing the State of the Art in FOND Planning @@ -0,0 +1 @@ +Fully Observable Non-Deterministic (FOND) planning is a variant of classical symbolic planning in which actions are nondeterministic, with an action's outcome known only upon execution. It is a popular planning paradigm with applications ranging from robot planning to dialogue-agent design and reactive synthesis. Over the last 20 years, a number of approaches to FOND planning have emerged. In this work, we establish a new state of the art, following in the footsteps of some of the most powerful FOND planners to date. Our planner, PR2, decisively outperforms the four leading FOND planners, at times by a large margin, in 17 of 18 domains that represent a comprehensive benchmark suite. Ablation studies demonstrate the impact of various techniques we introduce, with the largest improvement coming from our novel FOND-aware heuristic. \ No newline at end of file diff --git a/data/2024/aaai/PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction b/data/2024/aaai/PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction new file mode 100644 index 0000000000..ffc9972df4 --- /dev/null +++ b/data/2024/aaai/PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction @@ -0,0 +1 @@ +Compound-Protein Interaction (CPI) prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery. Existing deep learning-based methods utilize only the single modality of protein sequences or structures and lack the co-modeling of the joint distribution of the two modalities, which may lead to significant performance drops in complex real-world scenarios due to various factors, e.g., modality missing and domain shifting. More importantly, these methods only model protein sequences and structures at a single fixed scale, neglecting more fine-grained multi-scale information, such as those embedded in key protein fragments. In this paper, we propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction (PSC-CPI), which captures the dependencies between protein sequences and structures through both intra-modality and cross-modality contrasting. We further apply length-variable protein augmentation to allow contrasting to be performed at different scales, from the amino acid level to the sequence level. 
Finally, in order to more fairly evaluate the model generalizability, we split the test data into four settings based on whether compounds and proteins have been observed during the training stage. Extensive experiments have shown that PSC-CPI generalizes well in all four settings, particularly in the more challenging "Unseen-Both" setting, where neither compounds nor proteins have been observed during training. Furthermore, even when encountering a situation of modality missing, i.e., inference with only single-modality protein data, PSC-CPI still exhibits comparable or even better performance than previous approaches. \ No newline at end of file diff --git a/data/2024/aaai/PTMQ: Post-training Multi-Bit Quantization of Neural Networks b/data/2024/aaai/PTMQ: Post-training Multi-Bit Quantization of Neural Networks new file mode 100644 index 0000000000..edf383877b --- /dev/null +++ b/data/2024/aaai/PTMQ: Post-training Multi-Bit Quantization of Neural Networks @@ -0,0 +1 @@ +The ability of model quantization with arbitrary bit-width to dynamically meet diverse bit-width requirements during runtime has attracted significant attention. Recent research has focused on optimizing large-scale training methods to achieve robust bit-width adaptation, which is a time-consuming process requiring hundreds of GPU hours. Furthermore, converting bit-widths requires recalculating the statistical parameters of the norm layers, thereby impeding real-time switching of the bit-width. To overcome these challenges, we propose an efficient Post-Training Multi-bit Quantization (PTMQ) scheme that requires only a small amount of calibration data to perform block-wise reconstruction of multi-bit quantization errors. It eliminates the influence of statistical parameters by fusing norm layers, and supports real-time switching of bit-widths in uniform quantization and mixed-precision quantization. To improve quantization accuracy and robustness, we propose a Multi-bit Feature Mixer technique (MFM) for fusing features of different bit-widths to enhance robustness across varying bit-widths. Moreover, we introduce the Group-wise Distillation Loss (GD-Loss) to enhance the correlation between different bit-width groups and further improve the overall performance of PTMQ. Extensive experiments demonstrate that PTMQ achieves comparable performance to existing state-of-the-art post-training quantization methods, while its optimization is 100× faster than recent multi-bit quantization works. Code is available at https://github.com/xuke225/PTMQ. \ No newline at end of file diff --git a/data/2024/aaai/PTUS: Photo-Realistic Talking Upper-Body Synthesis via 3D-Aware Motion Decomposition Warping b/data/2024/aaai/PTUS: Photo-Realistic Talking Upper-Body Synthesis via 3D-Aware Motion Decomposition Warping new file mode 100644 index 0000000000..02d4dd6e03 --- /dev/null +++ b/data/2024/aaai/PTUS: Photo-Realistic Talking Upper-Body Synthesis via 3D-Aware Motion Decomposition Warping @@ -0,0 +1 @@ +Talking upper-body synthesis is a promising task due to its versatile potential for video creation and consists of animating the body and face from a source image with the motion from a given driving video. However, prior synthesis approaches fall short in addressing this task and have been either limited to animating heads of a target person only, or have animated the upper body but neglected the synthesis of precise facial details.
To tackle this task, we propose a Photo-realistic Talking Upper-body Synthesis method via 3D-aware motion decomposition warping, named PTUS, to precisely synthesize the upper body as well as recover the details of the face such as blinking and lip synchronization. In particular, the motion decomposition mechanism consists of a face-body motion decomposition, which decouples the 3D motion estimation of the face and body, and a local-global motion decomposition, which decomposes the 3D face motion into global and local motions, resulting in the transfer of facial expressions. The 3D-aware warping module transfers the large-scale and subtle 3D motions to the extracted 3D depth-aware features in a coarse-to-fine manner. Moreover, we present a new dataset, Talking-UB, which includes upper-body images with high-resolution faces, addressing the limitations of prior datasets that either consist of only facial images or upper-body images with blurry faces. Experimental results demonstrate that our proposed method can synthesize high-quality videos that preserve facial details, and achieves superior results compared to state-of-the-art cross-person motion transfer approaches. Code and the collected dataset are released at https://github.com/cooluoluo/PTUS. \ No newline at end of file diff --git a/data/2024/aaai/PVALane: Prior-Guided 3D Lane Detection with View-Agnostic Feature Alignment b/data/2024/aaai/PVALane: Prior-Guided 3D Lane Detection with View-Agnostic Feature Alignment new file mode 100644 index 0000000000..99ab6f5838 --- /dev/null +++ b/data/2024/aaai/PVALane: Prior-Guided 3D Lane Detection with View-Agnostic Feature Alignment @@ -0,0 +1 @@ +Monocular 3D lane detection is essential for a reliable autonomous driving system and has recently been rapidly developing. Existing popular methods mainly employ a predefined 3D anchor for lane detection based on front-viewed (FV) space, aiming to mitigate the effects of view transformations. However, the perspective geometric distortion between FV and 3D space in this FV-based approach introduces extremely dense anchor designs, which ultimately leads to confusing lane representations. In this paper, we introduce a novel prior-guided perspective on lane detection and propose an end-to-end framework named PVALane, which utilizes 2D prior knowledge to achieve precise and efficient 3D lane detection. Since 2D lane predictions can provide strong priors for lane existence, PVALane exploits FV features to generate sparse prior anchors with potential lanes in 2D space. These dynamic prior anchors help PVALane to achieve distinct lane representations and effectively improve the precision of PVALane due to the reduced lane search space. Additionally, by leveraging these prior anchors and representing lanes in both FV and bird-eye-viewed (BEV) spaces, we effectively align and merge semantic and geometric information from FV and BEV features. Extensive experiments conducted on the OpenLane and ONCE-3DLanes datasets demonstrate the superior performance of our method compared to existing state-of-the-art approaches and exhibit excellent robustness.
\ No newline at end of file diff --git a/data/2024/aaai/PaintHuman: Towards High-Fidelity Text-to-3D Human Texturing via Denoised Score Distillation b/data/2024/aaai/PaintHuman: Towards High-Fidelity Text-to-3D Human Texturing via Denoised Score Distillation new file mode 100644 index 0000000000..eaabdbe03b --- /dev/null +++ b/data/2024/aaai/PaintHuman: Towards High-Fidelity Text-to-3D Human Texturing via Denoised Score Distillation @@ -0,0 +1 @@ +Recent advances in zero-shot text-to-3D human generation, which employ the human model prior (e.g., SMPL) or Score Distillation Sampling (SDS) with pre-trained text-to-image diffusion models, have been groundbreaking. However, SDS may provide inaccurate gradient directions under the weak diffusion guidance, as it tends to produce over-smoothed results and generate body textures that are inconsistent with the detailed mesh geometry. Therefore, directly leveraging existing strategies for high-fidelity text-to-3D human texturing is challenging. In this work, we propose a model called PaintHuman to address the challenges from two perspectives. We first propose a novel score function, Denoised Score Distillation (DSD), which directly modifies the SDS by introducing negative gradient components to iteratively correct the gradient direction and generate high-quality textures. In addition, we use the depth map as a geometric guide to ensure that the texture is semantically aligned to human mesh surfaces. To guarantee the quality of rendered results, we employ geometry-aware networks to predict surface materials and render realistic human textures. Extensive experiments, benchmarked against state-of-the-art (SoTA) methods, validate the efficacy of our approach. Project page: https://painthuman.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/Painterly Image Harmonization by Learning from Painterly Objects b/data/2024/aaai/Painterly Image Harmonization by Learning from Painterly Objects new file mode 100644 index 0000000000..4c487abefd --- /dev/null +++ b/data/2024/aaai/Painterly Image Harmonization by Learning from Painterly Objects @@ -0,0 +1 @@ +Given a composite image with a photographic object and a painterly background, painterly image harmonization aims at stylizing the composite object to be compatible with the background. Despite the competitive performance of existing painterly harmonization works, they did not fully leverage the painterly objects in artistic paintings. In this work, we explore learning from painterly objects for painterly image harmonization. In particular, we learn a mapping from background style and object information to object style based on painterly objects in artistic paintings. With the learnt mapping, we can hallucinate the target style of the composite object, which is used to harmonize encoder feature maps to produce the harmonized image. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our proposed method.
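The harmonization of encoder feature maps described in the abstract above can be pictured with AdaIN-style statistics matching over the composite-object region. The sketch below is an illustrative assumption (the mask, feature shapes, and target statistics are made up, and the paper's actual operator may differ).

```python
import numpy as np

def adain_region(feat, mask, target_mean, target_std, eps=1e-5):
    """Match the channel-wise mean/std of the masked (composite-object)
    region of a feature map to hallucinated target-style statistics.
    feat: (C, H, W); mask: (H, W) booleans; target_mean/std: (C,)."""
    out = feat.copy()
    region = feat[:, mask]                           # (C, N) masked features
    mu = region.mean(axis=1, keepdims=True)
    sigma = region.std(axis=1, keepdims=True) + eps
    normalized = (region - mu) / sigma
    out[:, mask] = normalized * target_std[:, None] + target_mean[:, None]
    return out

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 16, 16))
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True                              # hypothetical object region
harmonized = adain_region(feat, mask, target_mean=np.ones(4), target_std=0.5 * np.ones(4))
print(harmonized[:, mask].mean(axis=1), harmonized[:, mask].std(axis=1))
```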
\ No newline at end of file diff --git a/data/2024/aaai/Pairwise-Label-Based Deep Incremental Hashing with Simultaneous Code Expansion b/data/2024/aaai/Pairwise-Label-Based Deep Incremental Hashing with Simultaneous Code Expansion new file mode 100644 index 0000000000..d6268d7cc3 --- /dev/null +++ b/data/2024/aaai/Pairwise-Label-Based Deep Incremental Hashing with Simultaneous Code Expansion @@ -0,0 +1,4 @@ +Deep incremental hashing has become a subject of considerable interest due to its capability to learn hash codes in an incremental manner, eliminating the need to generate codes for classes that have already been learned. However, accommodating more classes requires longer hash codes, and regenerating database codes becomes inevitable when code expansion is required. +In this paper, we present a unified deep hash framework that can simultaneously learn new classes and increase hash code capacity. Specifically, we design a triple-channel asymmetric framework to optimize a new CNN model with a target code length and a code projection matrix. This enables us to directly generate hash codes for new images, and efficiently generate expanded hash codes for original database images from the old ones with the learned projection matrix. +Meanwhile, we propose a pairwise-label-based incremental similarity-preserving loss to optimize the new CNN model, which can incrementally preserve new similarities while maintaining the old ones. Additionally, we design a double-end quantization loss to reduce the quantization error from new and original query images. As a result, our method efficiently embeds both new and original similarities into the expanded hash codes, while keeping the original database codes unchanged. +We conduct extensive experiments on three widely-used image retrieval benchmarks, demonstrating that our method can significantly reduce the time required to expand existing database codes, while maintaining state-of-the-art retrieval performance. \ No newline at end of file diff --git a/data/2024/aaai/Pandora's Problem with Deadlines b/data/2024/aaai/Pandora's Problem with Deadlines new file mode 100644 index 0000000000..e81324a7dd --- /dev/null +++ b/data/2024/aaai/Pandora's Problem with Deadlines @@ -0,0 +1,3 @@ +Pandora’s problem is a fundamental model that studies optimal search under costly inspection. In the classic version, there are n boxes, each associated with a known cost and a known distribution over values. A strategy inspects the boxes sequentially and obtains a utility that equals the difference between the maximum value of an inspected box and the total inspection cost. Weitzman (1979) presented a surprisingly simple strategy that obtains the optimal expected utility. + +In this work we introduce a new variant of Pandora’s problem in which every box is also associated with a publicly known deadline, indicating the final round by which its value may be chosen. This model captures many real-life scenarios where alternatives admit deadlines, such as candidate interviews and college admissions. Our main result is an efficient threshold-based strategy that achieves a constant approximation relative to the performance of the optimal strategy for the deadlines setting. 
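As background for the Pandora's-problem abstract above, here is a minimal sketch of Weitzman's index strategy for the classic (deadline-free) setting: each box gets a reservation value sigma solving c = E[max(X - sigma, 0)], boxes are inspected in decreasing sigma order, and search stops once the best observed value exceeds every remaining index (outside option 0). The example distributions, costs, and bisection tolerance are illustrative assumptions.

```python
import random

def reservation_value(values, probs, cost, lo=-1e6, hi=1e6, iters=100):
    """Solve c = E[max(X - sigma, 0)] for sigma by bisection.
    The expectation on the left is decreasing in sigma."""
    def excess(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if excess(mid) > cost:
            lo = mid          # sigma still too small
        else:
            hi = mid
    return (lo + hi) / 2.0

def weitzman_utility(boxes, rng=random.Random(0)):
    """boxes: list of (values, probs, cost). Open boxes in decreasing
    reservation-value order; stop once the best value found so far exceeds
    every remaining reservation value (outside option assumed to be 0)."""
    sigmas = [reservation_value(*b) for b in boxes]
    order = sorted(range(len(boxes)), key=lambda i: sigmas[i], reverse=True)
    best, total_cost = 0.0, 0.0
    for i in order:
        if best >= sigmas[i]:
            break
        values, probs, cost = boxes[i]
        total_cost += cost
        best = max(best, rng.choices(values, probs)[0])
    return best - total_cost

# Two hypothetical boxes: a cheap low-variance one and a costly risky one.
boxes = [([4.0, 6.0], [0.5, 0.5], 0.5),
         ([0.0, 20.0], [0.8, 0.2], 1.0)]
print(weitzman_utility(boxes))
```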
\ No newline at end of file diff --git a/data/2024/aaai/Pano-NeRF: Synthesizing High Dynamic Range Novel Views with Geometry from Sparse Low Dynamic Range Panoramic Images b/data/2024/aaai/Pano-NeRF: Synthesizing High Dynamic Range Novel Views with Geometry from Sparse Low Dynamic Range Panoramic Images new file mode 100644 index 0000000000..c5cc1da51a --- /dev/null +++ b/data/2024/aaai/Pano-NeRF: Synthesizing High Dynamic Range Novel Views with Geometry from Sparse Low Dynamic Range Panoramic Images @@ -0,0 +1 @@ +Panoramic imaging research on geometry recovery and High Dynamic Range (HDR) reconstruction is becoming a trend with the development of Extended Reality (XR). Neural Radiance Fields (NeRF) provide a promising scene representation for both tasks without requiring extensive prior data. However, in the case of inputting sparse Low Dynamic Range (LDR) panoramic images, NeRF often degrades with under-constrained geometry and is unable to reconstruct HDR radiance from LDR inputs. We observe that the radiance from each pixel in panoramic images can be modeled as both a signal to convey scene lighting information and a light source to illuminate other pixels. Hence, we propose the irradiance fields from sparse LDR panoramic images, which increases the observation counts for faithful geometry recovery and leverages the irradiance-radiance attenuation for HDR reconstruction. Extensive experiments demonstrate that the irradiance fields outperform state-of-the-art methods on both geometry recovery and HDR reconstruction and validate their effectiveness. Furthermore, we show a promising byproduct of spatially-varying lighting estimation. The code is available at https://github.com/Lu-Zhan/Pano-NeRF. \ No newline at end of file diff --git a/data/2024/aaai/Panoptic Scene Graph Generation with Semantics-Prototype Learning b/data/2024/aaai/Panoptic Scene Graph Generation with Semantics-Prototype Learning new file mode 100644 index 0000000000..695fed9873 --- /dev/null +++ b/data/2024/aaai/Panoptic Scene Graph Generation with Semantics-Prototype Learning @@ -0,0 +1,5 @@ +Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicates) to connect human language and visual scenes. +However, different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations in the dataset, i.e., different predicates for the same object pairs. +Biased predicate annotations make PSG models struggle in constructing a clear decision plane among predicates, which greatly hinders the real application of PSG models. +To address the intrinsic bias above, we propose a novel framework named ADTrans to adaptively transfer biased predicate annotations to informative and unified ones. To ensure consistency and accuracy during the transfer process, we propose to observe the invariance degree of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities. Meanwhile, we continuously measure the distribution changes between each representation and its prototype, and constantly screen potentially biased data. Finally, with the unbiased predicate-prototype representation embedding space, biased annotations are easily identified. +Experiments show that ADTrans significantly improves the performance of benchmark models, achieving a new state-of-the-art performance, and shows great generalization and effectiveness on multiple datasets.
Our code is released at https://github.com/lili0415/PSG-biased-annotation. \ No newline at end of file diff --git a/data/2024/aaai/Pantypes: Diverse Representatives for Self-Explainable Models b/data/2024/aaai/Pantypes: Diverse Representatives for Self-Explainable Models new file mode 100644 index 0000000000..f0c8c6e745 --- /dev/null +++ b/data/2024/aaai/Pantypes: Diverse Representatives for Self-Explainable Models @@ -0,0 +1,2 @@ +Prototypical self-explainable classifiers have emerged to meet the growing demand for interpretable AI systems. These classifiers are designed to incorporate high transparency in their decisions by basing inference on similarity with learned prototypical objects. While these models are designed with diversity in mind, the learned prototypes often do not sufficiently represent all aspects of the input distribution, particularly those in low density regions. +Such lack of sufficient data representation, known as representation bias, has been associated with various detrimental properties related to machine learning diversity and fairness. In light of this, we introduce pantypes, a new family of prototypical objects designed to capture the full diversity of the input distribution through a sparse set of objects. We show that pantypes can empower prototypical self-explainable models by occupying divergent regions of the latent space and thus fostering high diversity, interpretability and fairness. \ No newline at end of file diff --git a/data/2024/aaai/ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer b/data/2024/aaai/ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer new file mode 100644 index 0000000000..5870c9d8bd --- /dev/null +++ b/data/2024/aaai/ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer @@ -0,0 +1 @@ +Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. Target "styles" can be defined in numerous ways, ranging from single attributes (e.g. formality) to authorship (e.g. Shakespeare). Previous unsupervised style-transfer approaches generally rely on significant amounts of labeled data for only a fixed set of styles or require large language models. In contrast, we introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles at inference time. Our parameter-efficient approach, ParaGuide, leverages paraphrase-conditioned diffusion models alongside gradient-based guidance from both off-the-shelf classifiers and strong existing style embedders to transform the style of text while preserving semantic information. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer. 
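The gradient-based guidance mentioned in the ParaGuide abstract above can be pictured as nudging each denoising step along the gradient of an attribute scorer with respect to the latent. The sketch below uses toy stand-ins for the denoiser and the classifier and is not the paper's actual procedure; names and the guidance scale are illustrative assumptions.

```python
import torch

def guided_step(x_t, t, denoise, attr_log_prob, guidance_scale=1.0):
    """One denoising step with gradient-based attribute guidance: take the
    model's proposal, then nudge it along the gradient of an attribute
    scorer (e.g., a formality classifier) evaluated on the current latent."""
    with torch.no_grad():
        x_proposal = denoise(x_t, t)          # unguided, paraphrase-conditioned proposal
    x_t = x_t.detach().requires_grad_(True)
    score = attr_log_prob(x_t)                # log p(target attribute | x_t)
    grad, = torch.autograd.grad(score.sum(), x_t)
    return x_proposal + guidance_scale * grad

# Toy stand-ins: a "denoiser" that shrinks the latent and a linear attribute scorer.
denoise = lambda x, t: 0.9 * x
w = torch.randn(16)
attr_log_prob = lambda x: x @ w
x = torch.randn(4, 16)
print(guided_step(x, t=10, denoise=denoise, attr_log_prob=attr_log_prob).shape)
```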
\ No newline at end of file diff --git a/data/2024/aaai/Parallel Beam Search Algorithms for Domain-Independent Dynamic Programming b/data/2024/aaai/Parallel Beam Search Algorithms for Domain-Independent Dynamic Programming new file mode 100644 index 0000000000..7df6fad65e --- /dev/null +++ b/data/2024/aaai/Parallel Beam Search Algorithms for Domain-Independent Dynamic Programming @@ -0,0 +1 @@ +Domain-independent dynamic programming (DIDP), a model-based paradigm based on dynamic programming, has shown promising performance on multiple combinatorial optimization problems compared with mixed integer programming (MIP) and constraint programming (CP). The current DIDP solvers are based on heuristic search, and the state-of-the-art solver, complete anytime beam search (CABS), uses beam search. However, the current DIDP solvers cannot utilize multiple threads, unlike state-of-the-art MIP and CP solvers. In this paper, we propose three parallel beam search algorithms and develop multi-thread implementations of CABS. With 32 threads, our multi-thread DIDP solvers achieve a 9 to 39 times speedup on average and significant performance improvement over the sequential solver, finding new best solutions for two instances of the traveling salesperson problem with time windows. In addition, our solvers outperform multi-thread MIP and CP solvers in four of the six combinatorial optimization problems evaluated. \ No newline at end of file diff --git a/data/2024/aaai/Parallel Empirical Evaluations: Resilience despite Concurrency b/data/2024/aaai/Parallel Empirical Evaluations: Resilience despite Concurrency new file mode 100644 index 0000000000..cb7a6f9ff0 --- /dev/null +++ b/data/2024/aaai/Parallel Empirical Evaluations: Resilience despite Concurrency @@ -0,0 +1,2 @@ +Computational evaluations are crucial in modern problem-solving when we surpass theoretical algorithms or bounds. These experiments frequently require substantial effort, and the sheer amount of needed resources makes it impossible to execute them on a single personal computer or laptop. Cluster schedulers allow for automating these tasks and scaling them to many computers. However, when we evaluate implementations of combinatorial algorithms, we depend on stable runtime results. Common approaches either limit parallelism or suffer from unstable runtime measurements due to interference among jobs on modern hardware. The former is inefficient and not sustainable. The latter results in unreplicable experiments. +In this work, we address this issue and offer an acceptable balance between efficiency, software and hardware complexity, reliability, and replicability. We investigate effects on replicability and runtime stability and illustrate how to efficiently use widely employed cluster resources for parallel evaluations. Furthermore, we present solutions which mitigate issues that emerge from the concurrent execution of benchmark jobs. Our experimental evaluation shows that – despite parallel execution – our approach reduces the runtime instability on the majority of instances to one second. \ No newline at end of file diff --git a/data/2024/aaai/Parallel Ranking of Ads and Creatives in Real-Time Advertising Systems b/data/2024/aaai/Parallel Ranking of Ads and Creatives in Real-Time Advertising Systems new file mode 100644 index 0000000000..fa9c0482d7 --- /dev/null +++ b/data/2024/aaai/Parallel Ranking of Ads and Creatives in Real-Time Advertising Systems @@ -0,0 +1 @@ +Creativity is the heart and soul of advertising services.
Effective creatives can create a win-win scenario: advertisers each target users and achieve marketing objectives more effectively, users more quickly find products of interest, and platforms generate more advertising revenue. With the advent of AI-Generated Content, advertisers now can produce vast amounts of creative content at a minimal cost. The current challenge lies in how advertising systems can select the most pertinent creative in real-time for each user personally. Existing methods typically perform serial ranking of ads or creatives, limiting the creative module in terms of both effectiveness and efficiency. In this paper, we propose for the first time a novel architecture for online parallel estimation of ads and creatives ranking, as well as the corresponding offline joint optimization model. The online architecture enables sophisticated personalized creative modeling while reducing overall latency. The offline joint model for CTR estimation allows mutual awareness and collaborative optimization between ads and creatives. Additionally, we optimize the offline evaluation metrics for the implicit feedback sorting task involved in ad creative ranking. We conduct extensive experiments to compare ours with two state-of-the-art approaches. The results demonstrate the effectiveness of our approach in both offline evaluations and real-world advertising platforms online in terms of response time, CTR, and CPM. \ No newline at end of file diff --git a/data/2024/aaai/Parallel Vertex Diffusion for Unified Visual Grounding b/data/2024/aaai/Parallel Vertex Diffusion for Unified Visual Grounding new file mode 100644 index 0000000000..0a98e22ee8 --- /dev/null +++ b/data/2024/aaai/Parallel Vertex Diffusion for Unified Visual Grounding @@ -0,0 +1 @@ +Unified visual grounding (UVG) capitalizes on a wealth of task-related knowledge across various grounding tasks via one-shot training, which curtails retraining costs and task-specific architecture design efforts. Vertex generation-based UVG methods achieve this versatility by unified modeling object box and contour prediction and provide a text-powered interface to vast related multi-modal tasks, e.g., visual question answering and captioning. However, these methods typically generate vertexes sequentially through autoregression, which is prone to be trapped in error accumulation and heavy computation, especially for high-dimension sequence generation in complex scenarios. In this paper, we develop Parallel Vertex Diffusion (PVD) based on the parallelizability of diffusion models to accurately and efficiently generate vertexes in a parallel and scalable manner. Since the coordinates fluctuate greatly, it typically encounters slow convergence when training diffusion models without geometry constraints. Therefore, we consummate our PVD by two critical components, i.e., center anchor mechanism and angle summation loss, which serve to normalize coordinates and adopt a differentiable geometry descriptor from the point-in-polygon problem of computational geometry to constrain the overall difference of prediction and label vertexes. These innovative designs empower our PVD to demonstrate its superiority with state-of-the-art performance across various grounding tasks. 
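The angle summation loss in the Parallel Vertex Diffusion abstract above builds on the classical point-in-polygon descriptor from computational geometry. The snippet below illustrates that underlying test only (not the paper's differentiable loss); the example polygon and query points are illustrative.

```python
import math

def angle_summation(point, polygon):
    """Classical point-in-polygon descriptor: sum the signed angles that
    consecutive polygon edges subtend at `point`. The sum is roughly ±2π
    when the point lies inside the polygon and roughly 0 when outside."""
    px, py = point
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i][0] - px, polygon[i][1] - py
        x2, y2 = polygon[(i + 1) % n][0] - px, polygon[(i + 1) % n][1] - py
        # Signed angle between the two vectors via atan2(cross, dot).
        total += math.atan2(x1 * y2 - y1 * x2, x1 * x2 + y1 * y2)
    return total

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(angle_summation((0.5, 0.5), square))  # ~2π -> inside
print(angle_summation((2.0, 2.0), square))  # ~0  -> outside
```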
\ No newline at end of file diff --git a/data/2024/aaai/Parameterization of (Partial) Maximum Satisfiability above Matching in a Variable-Clause Graph b/data/2024/aaai/Parameterization of (Partial) Maximum Satisfiability above Matching in a Variable-Clause Graph new file mode 100644 index 0000000000..4cc80ead10 --- /dev/null +++ b/data/2024/aaai/Parameterization of (Partial) Maximum Satisfiability above Matching in a Variable-Clause Graph @@ -0,0 +1 @@ +In this paper, we study the Maximum Satisfiability and the Partial Maximum Satisfiability problems. Using Gallai–Edmonds decomposition, we significantly improve the upper bound for the Maximum Satisfiability problem parameterized above maximum matching in the variable-clause graph. Our algorithm operates with a runtime of O*(2.83^k'), a substantial improvement compared to the previous approach requiring O*(4^k'), where k' denotes the relevant parameter. Moreover, this result immediately implies O*(1.14977^m) and O*(1.27895^m) time algorithms for (n, 3)-MaxSAT and (n, 4)-MaxSAT, where m is the overall number of clauses. These upper bounds improve the previously known upper bounds of O*(1.1554^m) and O*(1.2872^m). We also adapt the algorithm so that it can handle instances of Partial Maximum Satisfiability without losing performance in some cases. Note that this is somewhat surprising, as the existence of even one hard clause can significantly increase the hardness of a problem. \ No newline at end of file diff --git a/data/2024/aaai/Parameterized Approximation Algorithms for Sum of Radii Clustering and Variants b/data/2024/aaai/Parameterized Approximation Algorithms for Sum of Radii Clustering and Variants new file mode 100644 index 0000000000..3d4bab0563 --- /dev/null +++ b/data/2024/aaai/Parameterized Approximation Algorithms for Sum of Radii Clustering and Variants @@ -0,0 +1,3 @@ +Clustering is one of the most fundamental tools in artificial intelligence, machine learning, and data mining. In this paper, we follow one of the recent mainstream topics of clustering, Sum of Radii (SoR), which naturally arises as a balance between the folklore k-center and k-median. SoR aims to determine a set of k balls, each centered at a point in a given dataset, such that their union covers the entire dataset while minimizing the sum of radii of the k balls. +We propose a general technical framework to overcome the challenge posed by varying radii in SoR, which yields fixed-parameter tractable (fpt) algorithms with respect to k (i.e., whose running time is f(k) poly(n) for some f). +Our framework is versatile and obtains fpt approximation algorithms with constant approximation ratios for SoR as well as its variants in general metrics, such as Fair SoR and Matroid SoR, which significantly improve the previous results. \ No newline at end of file diff --git a/data/2024/aaai/Parameterized Projected Bellman Operator b/data/2024/aaai/Parameterized Projected Bellman Operator new file mode 100644 index 0000000000..5f9100cf0a --- /dev/null +++ b/data/2024/aaai/Parameterized Projected Bellman Operator @@ -0,0 +1 @@ +Approximate value iteration (AVI) is a family of algorithms for reinforcement learning (RL) that aims to obtain an approximation of the optimal value function. Generally, AVI algorithms implement an iterated procedure where each step consists of (i) an application of the Bellman operator and (ii) a projection step into a considered function space.
Notoriously, the Bellman operator leverages transition samples, which strongly determine its behavior, as uninformative samples can result in negligible updates or long detours, whose detrimental effects are further exacerbated by the computationally intensive projection step. To address these issues, we propose a novel alternative approach based on learning an approximate version of the Bellman operator rather than estimating it through samples as in AVI approaches. This way, we are able to (i) generalize across transition samples and (ii) avoid the computationally intensive projection step. For this reason, we call our novel operator projected Bellman operator (PBO). We formulate an optimization problem to learn PBO for generic sequential decision-making problems, and we theoretically analyze its properties in two representative classes of RL problems. Furthermore, we theoretically study our approach under the lens of AVI and devise algorithmic implementations to learn PBO in offline and online settings by leveraging neural network parameterizations. Finally, we empirically showcase the benefits of PBO w.r.t. the regular Bellman operator on several RL problems. \ No newline at end of file diff --git a/data/2024/aaai/Pareto Front-Diverse Batch Multi-Objective Bayesian Optimization b/data/2024/aaai/Pareto Front-Diverse Batch Multi-Objective Bayesian Optimization new file mode 100644 index 0000000000..4b42cff480 --- /dev/null +++ b/data/2024/aaai/Pareto Front-Diverse Batch Multi-Objective Bayesian Optimization @@ -0,0 +1 @@ +We consider the problem of multi-objective optimization (MOO) of expensive black-box functions with the goal of discovering high-quality and diverse Pareto fronts where we are allowed to evaluate a batch of inputs. This problem arises in many real-world applications including penicillin production where diversity of solutions is critical. We solve this problem in the framework of Bayesian optimization (BO) and propose a novel approach referred to as Pareto front-Diverse Batch Multi-Objective BO (PDBO). PDBO tackles two important challenges: 1) How to automatically select the best acquisition function in each BO iteration, and 2) How to select a diverse batch of inputs by considering multiple objectives. We propose principled solutions to address these two challenges. First, PDBO employs a multi-armed bandit approach to select one acquisition function from a given library. We solve a cheap MOO problem by assigning the selected acquisition function for each expensive objective function to obtain a candidate set of inputs for evaluation. Second, it utilizes Determinantal Point Processes (DPPs) to choose a Pareto-front-diverse batch of inputs for evaluation from the candidate set obtained from the first step. The key parameters for the methods behind these two steps are updated after each round of function evaluations. Experiments on multiple MOO benchmarks demonstrate that PDBO outperforms prior methods in terms of both the quality and diversity of Pareto solutions. 
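The DPP-based batch selection in the PDBO abstract above can be pictured as greedy log-determinant maximization over a similarity kernel, which favors mutually dissimilar batch members. The RBF kernel and greedy loop below are illustrative assumptions rather than the paper's exact kernel, which also encodes Pareto-front quality and diversity.

```python
import numpy as np

def rbf_kernel(X, gamma=10.0):
    # Similarity kernel over candidate inputs; similar points score close to 1.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def greedy_dpp_batch(X, batch_size, gamma=10.0, jitter=1e-6):
    """Greedy MAP-style DPP selection: repeatedly add the candidate that
    maximizes the log-determinant of the selected kernel submatrix."""
    K = rbf_kernel(X, gamma) + jitter * np.eye(len(X))
    selected = []
    for _ in range(batch_size):
        best_i, best_logdet = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_i, best_logdet = i, logdet
        selected.append(best_i)
    return selected

rng = np.random.default_rng(0)
candidates = rng.random((50, 2))   # stand-in for candidates from the acquisition step
print(greedy_dpp_batch(candidates, batch_size=5))
```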
\ No newline at end of file diff --git a/data/2024/aaai/Parsing All Adverse Scenes: Severity-Aware Semantic Segmentation with Mask-Enhanced Cross-Domain Consistency b/data/2024/aaai/Parsing All Adverse Scenes: Severity-Aware Semantic Segmentation with Mask-Enhanced Cross-Domain Consistency new file mode 100644 index 0000000000..4364750894 --- /dev/null +++ b/data/2024/aaai/Parsing All Adverse Scenes: Severity-Aware Semantic Segmentation with Mask-Enhanced Cross-Domain Consistency @@ -0,0 +1,2 @@ +Although recent methods in Unsupervised Domain Adaptation (UDA) have achieved success in segmenting rainy or snowy scenes by improving consistency, they face limitations when dealing with more challenging scenarios like foggy and night scenes. We argue that these prior methods excessively focus on weather-specific features in adverse scenes, which exacerbates the existing domain gaps. +To address this issue, we propose a new metric to evaluate the severity of all adverse scenes and offer a novel perspective that enables task unification across all adverse scenarios. Our method focuses on Severity, allowing our model to learn more consistent features and facilitate domain distribution alignment, thereby alleviating domain gaps. Unlike the vague descriptions of consistency in previous methods, we introduce Cross-domain Consistency, which is quantified using the Structural Similarity Index Measure (SSIM) to measure the distance between the source and target domains. Specifically, our unified model consists of two key modules: the Merging Style Augmentation Module (MSA) and the Severity Perception Mask Module (SPM). The MSA module transforms all adverse scenes into augmented scenes, effectively eliminating weather-specific features and enhancing Cross-domain Consistency. The SPM module incorporates a Severity Perception mechanism, guiding a Mask operation that enables our model to learn highly consistent features from the augmented scenes. Our unified framework, named PASS (Parsing All adverSe Scenes), achieves significant performance improvements over state-of-the-art methods on widely-used benchmarks for all adverse scenes. Notably, the performance of PASS is superior to Semi-Unified models and even surpasses weather-specific models. \ No newline at end of file diff --git a/data/2024/aaai/Partial Multi-View Clustering via Self-Supervised Network b/data/2024/aaai/Partial Multi-View Clustering via Self-Supervised Network new file mode 100644 index 0000000000..ce63def494 --- /dev/null +++ b/data/2024/aaai/Partial Multi-View Clustering via Self-Supervised Network @@ -0,0 +1,2 @@ +Partial multi-view clustering is a challenging and practical research problem for data analysis in real-world applications, due to the potential missing-data issue in different views. However, most existing methods have not fully explored the correlation information among various incomplete views. In addition, these existing clustering methods tend to overlook discriminative features inside the data itself in this unsupervised task. To tackle these challenges, we propose Partial Multi-View Clustering via Self-Supervised Network (PVC-SSN) in this paper. +Specifically, we employ contrastive learning to obtain a more discriminative and consistent subspace representation, which is guided by a self-supervised module. Self-supervised learning can exploit effective cluster information through the data itself to guide the learning process of clustering tasks.
Thus, it can pull together embedding features from the same cluster and push apart those from different clusters. Extensive experiments on several benchmark datasets show that the proposed PVC-SSN method outperforms several state-of-the-art clustering methods. \ No newline at end of file diff --git a/data/2024/aaai/Partially Observable Hierarchical Reinforcement Learning with AI Planning (Student Abstract) b/data/2024/aaai/Partially Observable Hierarchical Reinforcement Learning with AI Planning (Student Abstract) new file mode 100644 index 0000000000..b4046fba00 --- /dev/null +++ b/data/2024/aaai/Partially Observable Hierarchical Reinforcement Learning with AI Planning (Student Abstract) @@ -0,0 +1 @@ +Partially observable Markov decision processes (POMDPs) challenge reinforcement learning agents due to incomplete knowledge of the environment. Even assuming monotonicity in uncertainty, it is difficult for an agent to know how and when to stop exploring for a given task. In this abstract, we discuss how to use hierarchical reinforcement learning (HRL) and AI Planning (AIP) to improve exploration when the agent knows possible valuations of unknown predicates and how to discover them. By encoding the uncertainty in an abstract planning model, the agent can derive a high-level plan which is then used to decompose the overall POMDP into a tree of semi-POMDPs for training. We evaluate our agent's performance on the MiniGrid domain and show how guided exploration may improve agent performance. \ No newline at end of file diff --git a/data/2024/aaai/Participation Incentives in Approval-Based Committee Elections b/data/2024/aaai/Participation Incentives in Approval-Based Committee Elections new file mode 100644 index 0000000000..262fa552c2 --- /dev/null +++ b/data/2024/aaai/Participation Incentives in Approval-Based Committee Elections @@ -0,0 +1,16 @@ +In approval-based committee (ABC) voting, the goal is to +choose a subset of predefined size of the candidates based on +the voters’ approval preferences over the candidates. While +this problem has attracted significant attention in recent years, +the incentives for voters to participate in an election for a +given ABC voting rule have been neglected so far. This paper +is thus the first to explicitly study this property, typically called +participation, for ABC voting rules. In particular, we show +that all ABC scoring rules even satisfy group participation, +whereas most sequential rules severely fail participation. We +furthermore explore several escape routes to the impossibility +for sequential ABC voting rules: we prove for many sequential +rules that (i) they satisfy participation on laminar profiles, (ii) +voters who approve none of the elected candidates cannot +benefit by abstaining, and (iii) it is NP-hard for a voter to +decide whether she benefits from abstaining. \ No newline at end of file diff --git a/data/2024/aaai/Pass-Efficient Algorithms for Graph Spectral Clustering (Student Abstract) b/data/2024/aaai/Pass-Efficient Algorithms for Graph Spectral Clustering (Student Abstract) new file mode 100644 index 0000000000..d527bca59c --- /dev/null +++ b/data/2024/aaai/Pass-Efficient Algorithms for Graph Spectral Clustering (Student Abstract) @@ -0,0 +1,2 @@ +Graph spectral clustering is a fundamental technique in data analysis, which utilizes eigenpairs of the Laplacian matrix to partition graph vertices into clusters.
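For context, the classical pipeline that the Pass-Efficient Algorithms abstract refers to (eigendecomposition of the graph Laplacian followed by a split on the leading eigenvectors) can be sketched as follows; the toy two-clique graph and the two-way sign split are illustrative only.

```python
import numpy as np

def two_way_spectral_cut(adj):
    """Classical spectral bipartitioning: eigendecompose the normalized graph
    Laplacian and split vertices by the sign of the Fiedler (2nd) eigenvector."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)                   # the cubic-time step
    fiedler = eigvecs[:, 1]                                # 2nd smallest eigenvalue
    return (fiedler > 0).astype(int)

# Two 4-cliques connected by a single edge should split into their cliques.
adj = np.zeros((8, 8))
adj[:4, :4] = 1
adj[4:, 4:] = 1
np.fill_diagonal(adj, 0)
adj[3, 4] = adj[4, 3] = 1
print(two_way_spectral_cut(adj))
```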
However, classical spectral clustering algorithms require eigendecomposition of the Laplacian matrix, which has cubic time complexity. In +this work, we describe pass-efficient spectral clustering algorithms that leverage recent advances in randomized eigendecomposition and the structure of the graph vertex-edge matrix. Furthermore, we derive formulas for their efficient implementation. The resulting algorithms have a linear time complexity with respect to the number of vertices and edges and pass over the graph a constant number of times, making them suitable for processing large graphs stored on slow memory. Experiments validate the accuracy and efficiency of the algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Patch-Aware Sample Selection for Efficient Masked Image Modeling b/data/2024/aaai/Patch-Aware Sample Selection for Efficient Masked Image Modeling new file mode 100644 index 0000000000..587ca20583 --- /dev/null +++ b/data/2024/aaai/Patch-Aware Sample Selection for Efficient Masked Image Modeling @@ -0,0 +1 @@ +Nowadays, sample selection is drawing increasing attention. By extracting and training only on the most informative subset, sample selection can effectively reduce the training cost. Although sample selection is effective in conventional supervised learning, applying it to Masked Image Modeling (MIM) still poses challenges due to the gap between sample-level selection and patch-level pre-training. In this paper, we inspect sample selection in MIM pre-training and find that basic selection suffers from performance degradation. We attribute this degradation primarily to two factors: the random mask strategy and the simple averaging function. We then propose Patch-Aware Sample Selection (PASS), including a low-cost Dynamic Trained Mask Predictor (DTMP) and Weighted Selection Score (WSS). DTMP consistently masks the informative patches in samples, ensuring a relatively accurate representation of the selection score. WSS enhances the selection score using patch-level disparity. Extensive experiments show the effectiveness of PASS in selecting the most informative subset and accelerating pretraining. PASS exhibits superior performance across various datasets, MIM methods, and downstream tasks. Particularly, PASS improves MAE by 0.7% on ImageNet-1K while utilizing only 37% of the data budget, and achieves a ~1.7x speedup. \ No newline at end of file diff --git a/data/2024/aaai/Patch-Wise Graph Contrastive Learning for Image Translation b/data/2024/aaai/Patch-Wise Graph Contrastive Learning for Image Translation new file mode 100644 index 0000000000..66e0b987ac --- /dev/null +++ b/data/2024/aaai/Patch-Wise Graph Contrastive Learning for Image Translation @@ -0,0 +1 @@ +Recently, patch-wise contrastive learning is drawing attention for image translation by exploring the semantic correspondence between the input image and the output image. To further explore the patch-wise topology for high-level semantic understanding, here we exploit the graph neural network to capture the topology-aware features. Specifically, we construct the graph based on the patch-wise similarity from a pretrained encoder, whose adjacency matrix is shared to enhance the consistency of patch-wise relation between the input and the output. Then, we obtain the node feature from the graph neural network, and enhance the correspondence between the nodes by increasing mutual information using the contrastive loss. In order to capture the hierarchical semantic structure, we further propose the graph pooling.
Experimental results demonstrate state-of-the-art results for image translation thanks to the semantic encoding provided by the constructed graphs. \ No newline at end of file diff --git a/data/2024/aaai/PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology b/data/2024/aaai/PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology new file mode 100644 index 0000000000..6c1a988880 --- /dev/null +++ b/data/2024/aaai/PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology @@ -0,0 +1 @@ +As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, offering significant applications in interpreting natural images. However, the field of pathology has largely remained untapped, particularly in gathering high-quality data and designing comprehensive model frameworks. To bridge the gap in pathology MLLMs, we present PathAsst, a multimodal generative foundation AI assistant to revolutionize diagnostic and predictive analytics in pathology. The development of PathAsst involves three pivotal steps: data acquisition, CLIP model adaptation, and the training of PathAsst's multimodal generative capabilities. Firstly, we collect over 207K high-quality pathology image-text pairs from authoritative sources. Leveraging the advanced power of ChatGPT, we generate over 180K instruction-following samples. Furthermore, we devise additional instruction-following data specifically tailored for invoking eight pathology-specific sub-models we prepared, allowing PathAsst to effectively collaborate with these models, enhancing its diagnostic ability. Secondly, by leveraging the collected data, we construct PathCLIP, a pathology-dedicated CLIP, to enhance PathAsst's capabilities in interpreting pathology images. Finally, we integrate PathCLIP with Vicuna-13b and utilize pathology-specific instruction-tuning data to enhance the multimodal generation capacity of PathAsst and bolster its synergistic interactions with sub-models. The experimental results of PathAsst show the potential of harnessing an AI-powered generative foundation model to improve pathology diagnosis and treatment processes. We open-source our dataset, as well as a comprehensive toolkit for extensive pathology data collection and preprocessing, at https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology. \ No newline at end of file diff --git a/data/2024/aaai/Paths, Proofs, and Perfection: Developing a Human-Interpretable Proof System for Constrained Shortest Paths b/data/2024/aaai/Paths, Proofs, and Perfection: Developing a Human-Interpretable Proof System for Constrained Shortest Paths new file mode 100644 index 0000000000..86d81cb70f --- /dev/null +++ b/data/2024/aaai/Paths, Proofs, and Perfection: Developing a Human-Interpretable Proof System for Constrained Shortest Paths @@ -0,0 +1 @@ +People want to rely on optimization algorithms for complex decisions, but verifying the optimality of the solutions can then become a valid concern, particularly for critical decisions taken by non-experts in optimization. One example is the shortest-path problem on a network, occurring in many contexts from transportation to logistics to telecommunications.
While the standard shortest-path problem is both solvable in polynomial time and certifiable by duality, introducing side constraints makes solving and certifying the solutions much harder. We propose a proof system for constrained shortest-path problems, which gives a set of logical rules to derive new facts about feasible solutions. The key trait of the proposed proof system is that it specifically includes high-level graph concepts within its reasoning steps (such as connectivity or path structure), in contrast to, e.g., using linear combinations of model constraints. Thus, using our proof system, we can provide a step-by-step, human-auditable explanation showing that the path given by an external solver cannot be improved. Additionally, to maximize the advantages of this setup, we propose a proof search procedure that specifically aims to find small proofs of this form using a procedure similar to A* search. We evaluate our proof system on constrained shortest path instances generated from real-world road networks and experimentally show that we may indeed derive more interpretable proofs compared to an integer programming approach, in some cases leading to much smaller proofs. \ No newline at end of file diff --git a/data/2024/aaai/Pay Attention to Target: Relation-Aware Temporal Consistency for Domain Adaptive Video Semantic Segmentation b/data/2024/aaai/Pay Attention to Target: Relation-Aware Temporal Consistency for Domain Adaptive Video Semantic Segmentation new file mode 100644 index 0000000000..0afbe772c9 --- /dev/null +++ b/data/2024/aaai/Pay Attention to Target: Relation-Aware Temporal Consistency for Domain Adaptive Video Semantic Segmentation @@ -0,0 +1 @@ +Video semantic segmentation has made conspicuous achievements thanks to the development of deep learning, but suffers from the labor-intensive gathering of annotated training data. To alleviate the data-hunger issue, domain adaptation approaches are developed in the hope of adapting the model trained on the labeled synthetic videos to the real videos in the absence of annotations. By analyzing the dominant paradigm, consistency regularization, in the domain adaptation task, we find that the bottlenecks exist in previous methods from the perspective of pseudo-labels. To take full advantage of the information contained in the pseudo-labels and empower more effective supervision signals, we propose a coherent PAT network including a target domain focalizer and relation-aware temporal consistency. The proposed PAT network enjoys several merits. First, the target domain focalizer is responsible for paying attention to the target domain, and increasing the accessibility of pseudo-labels in consistency training. Second, the relation-aware temporal consistency aims at modeling the inter-class consistent relationship across frames to equip the model with effective supervision signals. Extensive experiments on two challenging benchmarks demonstrate that our method performs favorably against state-of-the-art domain adaptive video semantic segmentation methods.
\ No newline at end of file diff --git a/data/2024/aaai/Peer Learning: Learning Complex Policies in Groups from Scratch via Action Recommendations b/data/2024/aaai/Peer Learning: Learning Complex Policies in Groups from Scratch via Action Recommendations new file mode 100644 index 0000000000..56c88ef0bb --- /dev/null +++ b/data/2024/aaai/Peer Learning: Learning Complex Policies in Groups from Scratch via Action Recommendations @@ -0,0 +1,2 @@ +Peer learning is a novel high-level reinforcement learning framework for agents learning in groups. While standard reinforcement learning trains an individual agent in a trial-and-error fashion, all on its own, peer learning addresses a related setting in which a group of agents, i.e., peers, learns to master a task together from scratch. Peers are allowed to communicate only about their own states and actions recommended by others: "What would you do in my situation?". Our motivation is to study the learning behavior of these agents. +We formalize the teacher selection process in the action advice setting as a multi-armed bandit problem and thereby highlight the need for exploration. Finally, we analyze the learning behavior of the peers and observe their ability to rank the agents' performance within the study group and to understand which agents give reliable advice. Further, we compare peer learning with single-agent learning and a state-of-the-art action advice baseline. We show that peer learning is able to outperform single-agent learning and the baseline in several challenging discrete and continuous OpenAI Gym domains. In doing so, we also show that, within such a framework, complex policies can evolve from action recommendations even beyond discrete action spaces. \ No newline at end of file diff --git a/data/2024/aaai/PerFedRLNAS: One-for-All Personalized Federated Neural Architecture Search b/data/2024/aaai/PerFedRLNAS: One-for-All Personalized Federated Neural Architecture Search new file mode 100644 index 0000000000..8056337972 --- /dev/null +++ b/data/2024/aaai/PerFedRLNAS: One-for-All Personalized Federated Neural Architecture Search @@ -0,0 +1 @@ +Personalized federated learning is a new paradigm to address heterogeneous problems (e.g., issues with non-i.i.d. data) in federated learning. However, existing personalized federated learning methods lack standards for how the personalized and shared parts of the models are designed. Sometimes, manual design can even lead to worse performance than non-personalization. As a result, we propose a new algorithm for personalized federated neural architecture search, called PerFedRLNAS, to automatically personalize the architectures and weights of models on each client. With such an algorithm, we can solve the issues of low efficiency and failure to adapt to new search spaces in previous federated neural architecture search work. We further show that automatically assigning different architectures to clients can address heterogeneity in data distribution, efficiency, and memory in federated learning. In our experiments, we empirically show that our framework achieves much better personalized accuracy and overall time than state-of-the-art methods. Furthermore, PerFedRLNAS generalizes well to new clients and is easy to deploy in practice.
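The peer-learning abstract above frames teacher selection for action advice as a multi-armed bandit; a minimal UCB1 sketch of that selection step is given below. The class name, exploration constant, and synthetic advice-quality values are illustrative assumptions, not the paper's algorithm.

```python
import math
import random

class UCBTeacherSelector:
    """Treat each peer as an arm of a multi-armed bandit.

    When an agent asks "What would you do in my situation?", it picks
    the peer to follow via UCB1, trading off peers whose past advice
    yielded high return against rarely consulted peers (exploration).
    """

    def __init__(self, n_peers, c=2.0):
        self.counts = [0] * n_peers
        self.values = [0.0] * n_peers   # running mean of advice quality
        self.c = c

    def select(self):
        # consult every peer at least once before applying UCB
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts) + 1
        ucb = [v + self.c * math.sqrt(math.log(total) / n)
               for v, n in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, peer, reward):
        self.counts[peer] += 1
        self.values[peer] += (reward - self.values[peer]) / self.counts[peer]

# toy usage: peer 2 gives the best advice on average
selector = UCBTeacherSelector(n_peers=3)
true_quality = [0.2, 0.5, 0.8]
for _ in range(500):
    peer = selector.select()
    selector.update(peer, random.random() < true_quality[peer])
print(selector.counts)  # most queries should go to peer 2
```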
\ No newline at end of file diff --git a/data/2024/aaai/Percentile Risk-Constrained Budget Pacing for Guaranteed Display Advertising in Online Optimization b/data/2024/aaai/Percentile Risk-Constrained Budget Pacing for Guaranteed Display Advertising in Online Optimization new file mode 100644 index 0000000000..7dde3b25be --- /dev/null +++ b/data/2024/aaai/Percentile Risk-Constrained Budget Pacing for Guaranteed Display Advertising in Online Optimization @@ -0,0 +1 @@ +Guaranteed display (GD) advertising is a critical component of advertising since it provides publishers with stable revenue and enables advertisers to target specific audiences with guaranteed impressions. However, smooth pacing control for online ad delivery presents a challenge due to significant budget disparities, user arrival distribution drift, and dynamic change between supply and demand. This paper presents robust risk-constrained pacing (RCPacing) that utilizes Lagrangian dual multipliers to fine-tune probabilistic throttling through monotonic mapping functions within the percentile space of impression performance distribution. RCPacing combines distribution drift resilience and compatibility with guaranteed allocation mechanism, enabling us to provide near-optimal online services. We also show that RCPacing achieves O(sqrt(T)) dynamic regret where T is the length of the horizon. RCPacing's effectiveness is validated through offline evaluations and online A/B testing conducted on Taobao brand advertising platform. \ No newline at end of file diff --git a/data/2024/aaai/Performative Federated Learning: A Solution to Model-Dependent and Heterogeneous Distribution Shifts b/data/2024/aaai/Performative Federated Learning: A Solution to Model-Dependent and Heterogeneous Distribution Shifts new file mode 100644 index 0000000000..494946466d --- /dev/null +++ b/data/2024/aaai/Performative Federated Learning: A Solution to Model-Dependent and Heterogeneous Distribution Shifts @@ -0,0 +1,3 @@ +We consider a federated learning (FL) system consisting of multiple clients and a server, where the clients aim to collaboratively learn a common decision model from their distributed data. Unlike the conventional FL framework that assumes the client's data is static, we consider scenarios where the clients' data distributions may be reshaped by the deployed decision model. In this work, we leverage the idea of distribution shift mappings in performative prediction to formalize this model-dependent data distribution shift and propose a performative FL framework. +We first introduce necessary and sufficient conditions for the existence of a unique performative stable solution and characterize its distance to the performative optimal solution. Then we propose the performative FedAvg algorithm and show that it converges to the performative stable solution at a rate of O(1/T) under both full and partial participation schemes. +In particular, we use novel proof techniques and show how the clients' heterogeneity influences the convergence. Numerical results validate our analysis and provide valuable insights into real-world applications. 
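As a toy illustration of the performative FedAvg idea described above, the sketch below runs FedAvg on a one-dimensional mean-estimation problem where each client's data distribution shifts linearly with the deployed model. The linear shift, step sizes, and client means are illustrative assumptions; the loop converges to the performative stable point of this toy problem and does not reproduce the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def client_data(theta, base_mean, eps, n=256):
    """Client data whose distribution shifts with the deployed model:
    the mean moves by eps * theta (a simple performative response)."""
    return rng.normal(base_mean + eps * theta, 1.0, size=n)

def local_update(theta, data, lr=0.1, steps=10):
    """A few local gradient steps on the squared loss (theta - x)^2."""
    for _ in range(steps):
        theta -= lr * 2.0 * np.mean(theta - data)
    return theta

base_means = [1.0, 2.0, 3.0]   # heterogeneous clients
eps = 0.3                      # strength of the performative shift
theta = 0.0                    # deployed global model
for rnd in range(50):
    # each round, clients draw fresh data that reacts to the deployed model
    locals_ = [local_update(theta, client_data(theta, m, eps)) for m in base_means]
    theta = float(np.mean(locals_))  # FedAvg aggregation

# the performative stable point of this toy problem solves
# theta = mean(base_means) + eps * theta
print(theta, np.mean(base_means) / (1 - eps))
```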
\ No newline at end of file diff --git a/data/2024/aaai/Permutation-Based Hypothesis Testing for Neural Networks b/data/2024/aaai/Permutation-Based Hypothesis Testing for Neural Networks new file mode 100644 index 0000000000..413cda6fd2 --- /dev/null +++ b/data/2024/aaai/Permutation-Based Hypothesis Testing for Neural Networks @@ -0,0 +1 @@ +Neural networks are powerful predictive models, but they provide little insight into the nature of relationships between predictors and outcomes. Although numerous methods have been proposed to quantify the relative contributions of input features, statistical inference and hypothesis testing of feature associations remain largely unexplored. We propose a permutation-based approach to testing that uses the partial derivatives of the network output with respect to specific inputs to assess both the significance of input features and whether significant features are linearly associated with the network output. These tests, which can be flexibly applied to a variety of network architectures, enhance the explanatory power of neural networks, and combined with powerful predictive capability, extend the applicability of these models. \ No newline at end of file diff --git a/data/2024/aaai/Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models b/data/2024/aaai/Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models new file mode 100644 index 0000000000..356700252f --- /dev/null +++ b/data/2024/aaai/Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models @@ -0,0 +1 @@ +Although recent personalization methods have democratized high-resolution image synthesis by enabling swift concept acquisition with minimal examples and lightweight computation, they also present an exploitable avenue for highly accessible backdoor attacks. This paper investigates a critical and unexplored aspect of text-to-image (T2I) diffusion models - their potential vulnerability to backdoor attacks via personalization. By studying the prompt processing of popular personalization methods (epitomized by Textual Inversion and DreamBooth), we have devised dedicated personalization-based backdoor attacks according to the different ways of dealing with unseen tokens and divide them into two families: nouveau-token and legacy-token backdoor attacks. In comparison to conventional backdoor attacks involving the fine-tuning of the entire text-to-image diffusion model, our proposed personalization-based backdoor attack method can facilitate more tailored, efficient, and few-shot attacks. Through comprehensive empirical study, we endorse the utilization of the nouveau-token backdoor attack due to its impressive effectiveness, stealthiness, and integrity, markedly outperforming the legacy-token backdoor attack. \ No newline at end of file diff --git a/data/2024/aaai/Personalized LoRA for Human-Centered Text Understanding b/data/2024/aaai/Personalized LoRA for Human-Centered Text Understanding new file mode 100644 index 0000000000..db1e8898c0 --- /dev/null +++ b/data/2024/aaai/Personalized LoRA for Human-Centered Text Understanding @@ -0,0 +1 @@ +Effectively and efficiently adapting a pre-trained language model (PLM) for human-centered text understanding (HCTU) is challenging since user tokens are million-level in most personalized applications and do not have concrete explicit semantics. 
A standard and parameter-efficient approach (e.g., LoRA) necessitates memorizing numerous suites of adapters for each user. In this work, we introduce a personalized LoRA (PLoRA) with a plug-and-play (PnP) framework for the HCTU task. PLoRA is effective, parameter-efficient, and dynamically deployable in PLMs. Moreover, personalized dropout and mutual-information-maximizing strategies are adopted, and hence the proposed PLoRA can be well adapted to few/zero-shot learning scenarios to address the cold-start issue. Experiments conducted on four benchmark datasets show that the proposed method outperforms existing methods in full/few/zero-shot learning scenarios for the HCTU task, even though it has fewer trainable parameters. For reproducibility, the code for this paper is available at: https://github.com/yoyo-yun/PLoRA. \ No newline at end of file diff --git a/data/2024/aaai/Personalized Reinforcement Learning with a Budget of Policies b/data/2024/aaai/Personalized Reinforcement Learning with a Budget of Policies new file mode 100644 index 0000000000..26012c9ed6 --- /dev/null +++ b/data/2024/aaai/Personalized Reinforcement Learning with a Budget of Policies @@ -0,0 +1 @@ +Personalization in machine learning (ML) tailors models' decisions to the individual characteristics of users. While this approach has seen success in areas like recommender systems, its expansion into high-stakes fields such as healthcare and autonomous driving is hindered by the extensive regulatory approval processes involved. To address this challenge, we propose a novel framework termed represented Markov Decision Processes (r-MDPs) that is designed to balance the need for personalization with the regulatory constraints. In an r-MDP, we cater to a diverse user population, each with unique preferences, through interaction with a small set of representative policies. Our objective is twofold: efficiently match each user to an appropriate representative policy and simultaneously optimize these policies to maximize overall social welfare. We develop two deep reinforcement learning algorithms that efficiently solve r-MDPs. These algorithms draw inspiration from the principles of classic K-means clustering and are underpinned by robust theoretical foundations. Our empirical investigations, conducted across a variety of simulated environments, showcase the algorithms' ability to facilitate meaningful personalization even under constrained policy budgets. Furthermore, they demonstrate scalability, efficiently adapting to larger policy budgets. \ No newline at end of file diff --git a/data/2024/aaai/Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off b/data/2024/aaai/Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off new file mode 100644 index 0000000000..47e3a5965b --- /dev/null +++ b/data/2024/aaai/Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off @@ -0,0 +1 @@ +Neural ranking models (NRMs) have shown great success in information retrieval (IR). However, their predictions can easily be manipulated using adversarial examples, which are crafted by adding imperceptible perturbations to legitimate documents. This vulnerability raises significant concerns about their reliability and hinders the widespread deployment of NRMs.
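For the personalized LoRA abstract above, the sketch below shows one common way to attach per-user low-rank updates to a frozen linear layer; the class, shapes, and initialization are illustrative assumptions and do not reproduce PLoRA's plug-and-play framework, personalized dropout, or mutual-information objective.

```python
import torch
import torch.nn as nn

class PersonalizedLoRALinear(nn.Module):
    """A frozen linear layer plus per-user low-rank updates W + B_u A_u."""

    def __init__(self, d_in, d_out, n_users, rank=4, alpha=8.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # shared PLM weight stays frozen
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(n_users, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_users, d_out, rank))
        self.scale = alpha / rank

    def forward(self, x, user_id):
        # x: (batch, d_in); user_id: (batch,) long tensor
        delta = torch.einsum("bor,bri,bi->bo", self.B[user_id], self.A[user_id], x)
        return self.base(x) + self.scale * delta

layer = PersonalizedLoRALinear(d_in=16, d_out=8, n_users=100)
x = torch.randn(4, 16)
uid = torch.tensor([0, 3, 3, 42])
print(layer(x, uid).shape)  # torch.Size([4, 8])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)            # only the low-rank user adapters are trainable
```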
By incorporating adversarial examples into training data, adversarial training has become the de facto defense approach to adversarial attacks against NRMs. However, this defense mechanism is subject to a trade-off between effectiveness and adversarial robustness. In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs. We decompose the robust ranking error into two components, i.e., a natural ranking error for effectiveness evaluation and a boundary ranking error for assessing adversarial robustness. Then, we define the perturbation invariance of a ranking model and prove it to be a differentiable, computationally attainable upper bound on the boundary ranking error. Informed by our theoretical analysis, we design a novel perturbation-invariant adversarial training (PIAT) method for ranking models to achieve a better effectiveness-robustness trade-off. We design a regularized surrogate loss, in which one term encourages the effectiveness to be maximized while the regularization term encourages the output to be smooth, so as to improve adversarial robustness. Experimental results on several ranking models demonstrate the superiority of PIAT compared to existing adversarial defenses. \ No newline at end of file diff --git a/data/2024/aaai/Pharmacokinetics-Informed Neural Network for Predicting Opioid Administration Moments with Wearable Sensors b/data/2024/aaai/Pharmacokinetics-Informed Neural Network for Predicting Opioid Administration Moments with Wearable Sensors new file mode 100644 index 0000000000..3d520959dd --- /dev/null +++ b/data/2024/aaai/Pharmacokinetics-Informed Neural Network for Predicting Opioid Administration Moments with Wearable Sensors @@ -0,0 +1 @@ +Long-term and high-dose prescription opioid use places individuals at risk for opioid misuse, opioid use disorder (OUD), and overdose. Existing methods for monitoring opioid use and detecting misuse rely on self-reports, which are prone to reporting bias, and toxicology testing, which may be infeasible in outpatient settings. Although wearable technologies for monitoring day-to-day health metrics have gained significant traction in recent years due to their ease of use, flexibility, and advancements in sensor technology, their application within the opioid use space remains underexplored. In the current work, we demonstrate that oral opioid administrations can be detected using physiological signals collected from a wrist sensor. More importantly, we show that models informed by opioid pharmacokinetics increase reliability in predicting the timing of opioid administrations. Forty-two individuals who were prescribed opioids as a part of their medical treatment in-hospital and after discharge were enrolled. Participants wore a wrist sensor throughout the study, while opioid administrations were tracked using electronic medical records and self-reports. We collected 1,983 hours of sensor data containing 187 opioid administrations from the inpatient setting and 927 hours of sensor data containing 40 opioid administrations from the outpatient setting. We demonstrate that a self-supervised pre-trained model, capable of learning the canonical time series of plasma concentration of the drug derived from opioid pharmacokinetics, can reliably detect opioid administration in both settings. Our work suggests the potential of pharmacokinetic-informed, data-driven models to objectively detect opioid use in daily life.
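The pharmacokinetics-informed model above relies on a canonical plasma-concentration time series; a standard one-compartment oral-absorption curve of the kind such models use is sketched below (dose, rate constants, and volume of distribution are illustrative assumptions, not values from the study).

```python
import numpy as np

def plasma_concentration(t, dose=10.0, ka=1.2, ke=0.25, vd=50.0):
    """Standard one-compartment oral-absorption pharmacokinetic curve.

    C(t) = (D * ka) / (Vd * (ka - ke)) * (exp(-ke * t) - exp(-ka * t))
    with absorption rate ka, elimination rate ke, and volume of
    distribution Vd; t is hours since administration.
    """
    coeff = dose * ka / (vd * (ka - ke))
    return coeff * (np.exp(-ke * t) - np.exp(-ka * t))

# canonical target time series for the hours after a (hypothetical) dose
hours = np.linspace(0.0, 12.0, 49)
curve = plasma_concentration(hours)
print(float(hours[np.argmax(curve)]))  # time of the concentration peak
```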
\ No newline at end of file diff --git a/data/2024/aaai/Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion b/data/2024/aaai/Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion new file mode 100644 index 0000000000..239bf541ff --- /dev/null +++ b/data/2024/aaai/Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion @@ -0,0 +1 @@ +Voice conversion (VC) aims at altering a person's voice to make it sound similar to the voice of another person while preserving linguistic content. Existing methods suffer from a dilemma between content intelligibility and speaker similarity; i.e., methods with higher intelligibility usually have a lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose a novel method Phoneme Hallucinator that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target speaker phonemes based just on a short target speaker voice (e.g. 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model that requires no text annotations and supports conversion to any unseen speaker. Quantitative and qualitative evaluations show that Phoneme Hallucinator outperforms existing VC methods for both intelligibility and speaker similarity. \ No newline at end of file diff --git a/data/2024/aaai/Physics-Informed Graph Neural Networks for Water Distribution Systems b/data/2024/aaai/Physics-Informed Graph Neural Networks for Water Distribution Systems new file mode 100644 index 0000000000..1950e43fba --- /dev/null +++ b/data/2024/aaai/Physics-Informed Graph Neural Networks for Water Distribution Systems @@ -0,0 +1 @@ +Water distribution systems (WDS) are an integral part of critical infrastructure which is pivotal to urban development. As 70% of the world's population will likely live in urban environments in 2050, efficient simulation and planning tools for WDS play a crucial role in reaching UN's sustainable developmental goal (SDG) 6 - "Clean water and sanitation for all". In this realm, we propose a novel and efficient machine learning emulator, more precisely, a physics-informed deep learning (DL) model, for hydraulic state estimation in WDS. Using a recursive approach, our model only needs a few graph convolutional neural network (GCN) layers and employs an innovative algorithm based on message passing. Unlike conventional machine learning tasks, the model uses hydraulic principles to infer two additional hydraulic state features in the process of reconstructing the available ground truth feature in an unsupervised manner. To the best of our knowledge, this is the first DL approach to emulate the popular hydraulic simulator EPANET, utilizing no additional information. Like most DL models and unlike the hydraulic simulator, our model demonstrates vastly faster emulation times that do not increase drastically with the size of the WDS. Moreover, we achieve high accuracy on the ground truth and very similar results compared to the hydraulic simulator as demonstrated through experiments on five real-world WDS datasets. 
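The water-distribution abstract above describes recursive message passing over the pipe network with only a few graph-convolution layers; the sketch below shows a generic normalized graph-convolution round on a toy network. The graph, feature choices, and weights are illustrative assumptions, and the physics-informed reconstruction of the hydraulic features is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy water-distribution network: junctions as nodes, pipes as edges.
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
n_nodes = 5
A = np.zeros((n_nodes, n_nodes))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_hat = A + np.eye(n_nodes)                # self-loops
d = A_hat.sum(axis=1)
A_norm = A_hat / np.sqrt(np.outer(d, d))   # symmetric normalization

def gcn_layer(h, w):
    """One message-passing round: average neighbor states, mix, ReLU."""
    return np.maximum(A_norm @ h @ w, 0.0)

# Node inputs, e.g. demand plus a rough head estimate at each junction.
h = rng.normal(size=(n_nodes, 2))
w1 = 0.1 * rng.normal(size=(2, 16))
w2 = 0.1 * rng.normal(size=(16, 16))
w_out = 0.1 * rng.normal(size=(16, 1))

# A few recursive rounds let information travel several pipes away,
# after which a linear readout predicts one hydraulic state per node.
h = gcn_layer(gcn_layer(h, w1), w2)
pred = h @ w_out
print(pred.shape)   # (5, 1), e.g. one predicted head per junction
```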
\ No newline at end of file diff --git a/data/2024/aaai/Physics-Informed Representation and Learning: Control and Risk Quantification b/data/2024/aaai/Physics-Informed Representation and Learning: Control and Risk Quantification new file mode 100644 index 0000000000..6833bb943a --- /dev/null +++ b/data/2024/aaai/Physics-Informed Representation and Learning: Control and Risk Quantification @@ -0,0 +1,2 @@ +Optimal and safety-critical control are fundamental problems for stochastic systems, and are widely considered in real-world scenarios such as robotic manipulation and autonomous driving. In this paper, we consider the problem of efficiently finding optimal and safe control for high-dimensional systems. Specifically, we propose to use dimensionality reduction techniques from a comparison theorem for stochastic differential equations together with a generalizable physics-informed neural network to estimate the optimal value function and the safety probability of the system. The proposed framework results in substantial sample efficiency improvement compared to existing methods. We further develop an autoencoder-like neural network to automatically identify the low-dimensional features in the system to enhance the ease of design for system integration. We also provide experiments and quantitative analysis to validate the efficacy of the proposed method. +Source code is available at https://github.com/jacobwang925/path-integral-PINN. \ No newline at end of file diff --git a/data/2024/aaai/Piecewise Linear Transformation - Propagating Aleatoric Uncertainty in Neural Networks b/data/2024/aaai/Piecewise Linear Transformation - Propagating Aleatoric Uncertainty in Neural Networks new file mode 100644 index 0000000000..32ff4861b1 --- /dev/null +++ b/data/2024/aaai/Piecewise Linear Transformation - Propagating Aleatoric Uncertainty in Neural Networks @@ -0,0 +1 @@ +Real-world data typically exhibit aleatoric uncertainty which has to be considered during data-driven decision-making to assess the confidence of the decision provided by machine learning models. To propagate aleatoric uncertainty represented by probability distributions (PDs) through neural networks (NNs), both sampling-based and function approximation-based methods have been proposed. However, these methods suffer from significant approximation errors and are not able to accurately represent predictive uncertainty in the NN output. In this paper, we present a novel method, Piecewise Linear Transformation (PLT), for propagating PDs through NNs with piecewise linear activation functions (e.g., ReLU NNs). PLT does not require sampling or specific assumptions about the PDs. Instead, it harnesses the piecewise linear structure of such NNs to determine the propagated PD in the output space. In this way, PLT supports the accurate quantification of predictive uncertainty based on the criterion exactness of the propagated PD. We assess this exactness in theory by showing error bounds for our propagated PD. Further, our experimental evaluation validates that PLT outperforms competing methods on publicly available real-world classification and regression datasets regarding exactness. Thus, the PDs propagated by PLT allow to assess the uncertainty of the provided decisions, offering valuable support. 
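The PLT abstract above contrasts its exact propagation with sampling-based propagation of aleatoric uncertainty; a minimal Monte-Carlo baseline of that kind for a small ReLU network is sketched below, assuming a Gaussian input distribution (the network sizes and input covariance are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_net(x, w1, b1, w2, b2):
    """A small ReLU network: the kind of piecewise-linear map PLT targets."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

w1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

# Aleatoric input uncertainty represented as a Gaussian over the 2-D input.
mu, cov = np.array([0.5, -0.2]), np.diag([0.05, 0.02])

# Sampling-based propagation (the baseline PLT is compared against):
# draw inputs, push them through the network, summarize the output PD.
samples = rng.multivariate_normal(mu, cov, size=100_000)
out = relu_net(samples, w1, b1, w2, b2)
print(out.mean(), out.std())   # Monte-Carlo estimate of the propagated PD
```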
\ No newline at end of file diff --git a/data/2024/aaai/Plug-In Diffusion Model for Sequential Recommendation b/data/2024/aaai/Plug-In Diffusion Model for Sequential Recommendation new file mode 100644 index 0000000000..f314da9d18 --- /dev/null +++ b/data/2024/aaai/Plug-In Diffusion Model for Sequential Recommendation @@ -0,0 +1 @@ +Pioneering efforts have verified the effectiveness of diffusion models in exploring the informative uncertainty for recommendation. Considering the difference between recommendation and image synthesis tasks, existing methods have undertaken tailored refinements to the diffusion and reverse process. However, these approaches typically use the highest-scoring item in the corpus for user interest prediction, neglecting the user's generalized preferences contained within other items and thereby remaining constrained by the data sparsity issue. To address this issue, this paper presents a novel Plug-in Diffusion Model for Recommendation (PDRec) framework, which employs the diffusion model as a flexible plugin to jointly take full advantage of the diffusion-generated user preferences on all items. Specifically, PDRec first infers the users' dynamic preferences on all items via a time-interval diffusion model and proposes a Historical Behavior Reweighting (HBR) mechanism to identify the high-quality behaviors and suppress noisy behaviors. In addition to the observed items, PDRec proposes a Diffusion-based Positive Augmentation (DPA) strategy to leverage the top-ranked unobserved items as the potential positive samples, bringing in informative and diverse soft signals to alleviate data sparsity. To alleviate the false negative sampling issue, PDRec employs Noise-free Negative Sampling (NNS) to select stable negative samples for ensuring effective model optimization. Extensive experiments and analyses on four datasets have verified the superiority of the proposed PDRec over the state-of-the-art baselines and showcased the universality of PDRec as a flexible plugin for commonly-used sequential encoders in different recommendation scenarios. The code is available at https://github.com/hulkima/PDRec. \ No newline at end of file diff --git a/data/2024/aaai/PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in Poetry Generation b/data/2024/aaai/PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in Poetry Generation new file mode 100644 index 0000000000..7c88e61bea --- /dev/null +++ b/data/2024/aaai/PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in Poetry Generation @@ -0,0 +1,2 @@ +Controllable text generation is a challenging and meaningful field in natural language generation (NLG). In particular, poetry generation is a typical task with well-defined and strict conditions on the generated text, which makes it an ideal playground for the assessment of current methodologies. While prior works succeeded in controlling either semantic or metrical aspects of poetry generation, simultaneously addressing both remains a challenge. In this paper, we pioneer the use of the Diffusion model for generating sonnets and Chinese SongCi poetry to tackle such challenges. In terms of semantics, our PoetryDiffusion model, built upon the Diffusion model, generates entire sentences or poetry by comprehensively considering the entirety of sentence information. This approach enhances semantic expression, distinguishing it from autoregressive and large language models (LLMs).
For metrical control, its constraint control module, which can be trained individually, enables us to flexibly incorporate a novel metrical controller to manipulate and evaluate metrics (format and rhythm). +The denoising process in PoetryDiffusion allows for the gradual enhancement of semantics and flexible integration of the metrical controller, which can calculate and impose penalties on states that stray significantly from the target control distribution. Experimental results on two datasets demonstrate that our model outperforms existing models in terms of automatic evaluation of semantic, metrical, and overall performance as well as human evaluation. Code is released at https://github.com/ChorlingLau/PoetryDiffusion. \ No newline at end of file diff --git "a/data/2024/aaai/Poincar\303\251 Differential Privacy for Hierarchy-Aware Graph Embedding" "b/data/2024/aaai/Poincar\303\251 Differential Privacy for Hierarchy-Aware Graph Embedding" new file mode 100644 index 0000000000..57cc5e4b34 --- /dev/null +++ "b/data/2024/aaai/Poincar\303\251 Differential Privacy for Hierarchy-Aware Graph Embedding" @@ -0,0 +1 @@ +Hierarchy is an important and commonly observed topological property in real-world graphs that indicates the relationships between supervisors and subordinates or the organizational behavior of human groups. As hierarchy is introduced as a new inductive bias into Graph Neural Networks (GNNs) in various tasks, it implies latent topological relations that attackers can exploit to improve their inference attack performance, leading to serious privacy leakage issues. In addition, existing privacy-preserving frameworks suffer from reduced protection ability in hierarchical propagation due to the deficiency of adaptive upper-bound estimation of the hierarchical perturbation boundary. It is of great urgency to effectively leverage the hierarchical property of data while satisfying privacy guarantees. To solve this problem, we propose the Poincar\'e Differential Privacy framework, named PoinDP, to protect the hierarchy-aware graph embedding based on hyperbolic geometry. Specifically, PoinDP first learns the hierarchy weights for each entity based on the Poincar\'e model in hyperbolic space. Then, the Personalized Hierarchy-aware Sensitivity is designed to measure the sensitivity of the hierarchical structure and adaptively allocate the privacy protection strength. Besides, the Hyperbolic Gaussian Mechanism (HGM) is proposed to extend the Gaussian mechanism in Euclidean space to hyperbolic space to realize random perturbations that satisfy differential privacy under the hyperbolic space metric. Extensive experimental results on five real-world datasets demonstrate the proposed PoinDP’s advantages of effective privacy protection while maintaining good performance on the node classification task. \ No newline at end of file diff --git a/data/2024/aaai/Point Cloud Part Editing: Segmentation, Generation, Assembly, and Selection b/data/2024/aaai/Point Cloud Part Editing: Segmentation, Generation, Assembly, and Selection new file mode 100644 index 0000000000..7887be3690 --- /dev/null +++ b/data/2024/aaai/Point Cloud Part Editing: Segmentation, Generation, Assembly, and Selection @@ -0,0 +1 @@ +Ideal part editing should guarantee the diversity of edited parts, the fidelity to the remaining parts, and the quality of the results. However, previous methods do not disentangle each part completely, which means the edited parts will affect the others, resulting in poor diversity and fidelity.
In addition, some methods lack constraints between parts, which necessitates manual selection of edited results to ensure quality. Therefore, we propose a four-stage process for point cloud part editing: Segmentation, Generation, Assembly, and Selection. Based on this process, we introduce SGAS, a model for part editing that employs two strategies: feature disentanglement and constraint. By independently fitting part-level feature distributions, we realize the feature disentanglement. By explicitly modeling the transformation from object-level distribution to part-level distributions, we realize the feature constraint. Extensive experiments on different datasets demonstrate the efficiency and effectiveness of SGAS on point cloud part editing. In addition, SGAS can be pruned to realize unsupervised part-aware point cloud generation and achieves state-of-the-art results. \ No newline at end of file diff --git a/data/2024/aaai/Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images b/data/2024/aaai/Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images new file mode 100644 index 0000000000..6378e668b6 --- /dev/null +++ b/data/2024/aaai/Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images @@ -0,0 +1 @@ +Directly predicting human epidermal growth factor receptor 2 (HER2) status from widely available hematoxylin and eosin (HE)-stained whole slide images (WSIs) can reduce technical costs and expedite treatment selection. Accurately predicting HER2 requires large collections of multi-site WSIs. Federated learning enables collaborative training on these WSIs without transporting gigabyte-sized WSIs or raising data privacy concerns. However, federated learning encounters challenges in addressing label imbalance in multi-site WSIs from the real world. Moreover, existing WSI classification methods cannot simultaneously exploit local context information and long-range dependencies in the site-end feature representation of federated learning. To address these issues, we present a point transformer with federated learning for multi-site HER2 status prediction from HE-stained WSIs. Our approach incorporates two novel designs. We propose a dynamic label distribution strategy and an auxiliary classifier, which helps to establish a well-initialized model and mitigate label distribution variations across sites. Additionally, we propose a farthest cosine sampling based on cosine distance. It can sample the most distinctive features and capture the long-range dependencies. Extensive experiments and analysis show that our method achieves state-of-the-art performance at four sites with a total of 2687 WSIs. Furthermore, we demonstrate that our model can generalize to two unseen sites with 229 WSIs.
Code is available at: https://github.com/boyden/PointTransformerFL \ No newline at end of file diff --git a/data/2024/aaai/Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models b/data/2024/aaai/Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models new file mode 100644 index 0000000000..ceb9017c4a --- /dev/null +++ b/data/2024/aaai/Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models @@ -0,0 +1 @@ +The popularity of pre-trained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaption cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques are proposed for language and 2D image pre-trained models. However, the specialized PEFT method for 3D pre-trained models is still under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters, and only tune the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we propose to construct a memory bank with domain-specific knowledge, and utilize a parameter-free attention to enhance the prompt tokens. The Geometry-aware Adapter aims to aggregate point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that our Point-PEFT can achieve better performance than the full fine-tuning on various downstream tasks, while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code is released at https://github.com/Ivan-Tang-3D/Point-PEFT. \ No newline at end of file diff --git a/data/2024/aaai/Point-to-Spike Residual Learning for Energy-Efficient 3D Point Cloud Classification b/data/2024/aaai/Point-to-Spike Residual Learning for Energy-Efficient 3D Point Cloud Classification new file mode 100644 index 0000000000..c1acaf271e --- /dev/null +++ b/data/2024/aaai/Point-to-Spike Residual Learning for Energy-Efficient 3D Point Cloud Classification @@ -0,0 +1 @@ +Spiking neural networks (SNNs) have revolutionized neural learning and are making remarkable strides in image analysis and robot control tasks with ultra-low power consumption advantages. Inspired by this success, we investigate the application of spiking neural networks to 3D point cloud processing. We present a point-to-spike residual learning network for point cloud classification, which operates on points with binary spikes rather than floating-point numbers. Specifically, we first design a spatial-aware kernel point spiking neuron to relate spiking generation to point position in 3D space. On this basis, we then design a 3D spiking residual block for effective feature learning based on spike sequences. By stacking the 3D spiking residual blocks, we build the point-to-spike residual classification network, which achieves low computation cost and low accuracy loss on two benchmark datasets, ModelNet40 and ScanObjectNN. Moreover, the classifier strikes a good balance between classification accuracy and biological characteristics, allowing us to explore the deployment of 3D processing to neuromorphic chips for developing energy-efficient 3D robotic perception systems. 
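As a minimal illustration of the freeze-the-backbone, tune-small-modules recipe that the Point-PEFT abstract above follows, the sketch below freezes a stand-in backbone and trains only a bottleneck adapter and task head; the module names, sizes, and the simple residual adapter are illustrative assumptions, not the paper's Point-prior Prompt or Geometry-aware Adapter.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck adapter inserted after a frozen block."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual update

# Stand-in for a pre-trained point-cloud backbone (frozen during tuning).
backbone = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, 256))
for p in backbone.parameters():
    p.requires_grad_(False)

adapter = Adapter(256)           # only these parameters are updated
head = nn.Linear(256, 40)        # task head, e.g. 40 ModelNet classes

x = torch.randn(8, 1024, 3)      # a batch of point clouds
logits = head(adapter(backbone(x)).mean(dim=1))
print(logits.shape)              # torch.Size([8, 40])

trainable = sum(p.numel() for p in list(adapter.parameters()) + list(head.parameters()))
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```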
\ No newline at end of file diff --git a/data/2024/aaai/Point2Real: Bridging the Gap between Point Cloud and Realistic Image for Open-World 3D Recognition b/data/2024/aaai/Point2Real: Bridging the Gap between Point Cloud and Realistic Image for Open-World 3D Recognition new file mode 100644 index 0000000000..01cc1e2cf4 --- /dev/null +++ b/data/2024/aaai/Point2Real: Bridging the Gap between Point Cloud and Realistic Image for Open-World 3D Recognition @@ -0,0 +1 @@ +Recognition in open-world scenarios is an important and challenging field, where Vision-Language Pre-training paradigms have greatly impacted the 2D domain. This inspires a growing interest in introducing 2D pre-trained models, such as CLIP, into the 3D domain to enhance point cloud understanding. Considering the difference between discrete 3D point clouds and real-world 2D images, reducing the domain gap is crucial. Some recent works project point clouds onto a 2D plane to enable 3D zero-shot capabilities without training. However, this simplistic approach leads to an unclear or even distorted geometric structure, limiting the potential of 2D pre-trained models in 3D. To address the domain gap, we propose Point2Real, a training-free framework based on the realistic rendering technique to automate the transformation of the 3D point cloud domain into the Vision-Language domain. Specifically, Point2Real leverages a shape recovery module that devises an iterative ball-pivoting algorithm to convert point clouds into meshes, narrowing the gap in shape at first. To simulate photo-realistic images, a set of refined textures as candidates is applied for rendering, where the CLIP confidence is utilized to select the suitable one. Moreover, to tackle the viewpoint challenge, a heuristic multi-view adapter is implemented for feature aggregation, which exploits the depth surface as an effective indicator of view-specific discriminability for recognition. We conduct experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets, and the results demonstrate that Point2Real outperforms other approaches in zero-shot and few-shot tasks by a large margin. \ No newline at end of file diff --git a/data/2024/aaai/PointAttN: You Only Need Attention for Point Cloud Completion b/data/2024/aaai/PointAttN: You Only Need Attention for Point Cloud Completion new file mode 100644 index 0000000000..30bac25adb --- /dev/null +++ b/data/2024/aaai/PointAttN: You Only Need Attention for Point Cloud Completion @@ -0,0 +1 @@ +Point cloud completion, which refers to completing 3D shapes from partial 3D point clouds, is a fundamental problem for 3D point cloud analysis tasks. Benefiting from the development of deep neural networks, research on point cloud completion has made great progress in recent years. However, the explicit local region partition (e.g., via kNN) involved in existing methods makes them sensitive to the density distribution of point clouds. Moreover, it provides only limited receptive fields, preventing the capture of long-range contextual features. To solve these problems, we leverage cross-attention and self-attention mechanisms to design a novel neural network for point cloud completion with implicit local region partition. Two basic units, Geometric Details Perception (GDP) and Self-Feature Augment (SFA), are proposed to establish the structural relationships directly among points in a simple yet effective way via the attention mechanism.
Then based on GDP and SFA, we construct a new framework with popular encoder-decoder architecture for point cloud completion. The proposed framework, namely PointAttN, is simple, neat and effective, which can precisely capture the structural information of 3D shapes and predict complete point clouds with detailed geometry. Experimental results demonstrate that our PointAttN outperforms state-of-the-art methods on multiple challenging benchmarks. Code is available at: https://github.com/ohhhyeahhh/PointAttN \ No newline at end of file diff --git a/data/2024/aaai/PointCVaR: Risk-Optimized Outlier Removal for Robust 3D Point Cloud Classification b/data/2024/aaai/PointCVaR: Risk-Optimized Outlier Removal for Robust 3D Point Cloud Classification new file mode 100644 index 0000000000..55c8f825d5 --- /dev/null +++ b/data/2024/aaai/PointCVaR: Risk-Optimized Outlier Removal for Robust 3D Point Cloud Classification @@ -0,0 +1 @@ +With the growth of 3D sensing technology, the deep learning system for 3D point clouds has become increasingly important, especially in applications such as autonomous vehicles where safety is a primary concern. However, there are growing concerns about the reliability of these systems when they encounter noisy point clouds, either occurring naturally or introduced with malicious intent. This paper highlights the challenges of point cloud classification posed by various forms of noise, from simple background noise to malicious adversarial/backdoor attacks that can intentionally skew model predictions. While there's an urgent need for optimized point cloud denoising, current point outlier removal approaches, an essential step for denoising, rely heavily on handcrafted strategies and are not adapted for higher-level tasks, such as classification. To address this issue, we introduce an innovative point outlier cleansing method that harnesses the power of downstream classification models. Using gradient-based attribution analysis, we define a novel concept: point risk. Drawing inspiration from tail risk minimization in finance, we recast the outlier removal process as an optimization problem, named PointCVaR. Extensive experiments show that our proposed technique not only robustly filters diverse point cloud outliers but also consistently and significantly enhances existing robust methods for point cloud classification. A notable feature of our approach is its effectiveness in defending against the latest threat of backdoor attacks in point clouds. \ No newline at end of file diff --git a/data/2024/aaai/PointPatchMix: Point Cloud Mixing with Patch Scoring b/data/2024/aaai/PointPatchMix: Point Cloud Mixing with Patch Scoring new file mode 100644 index 0000000000..24c0065493 --- /dev/null +++ b/data/2024/aaai/PointPatchMix: Point Cloud Mixing with Patch Scoring @@ -0,0 +1 @@ +Data augmentation is an effective regularization strategy for mitigating overfitting in deep neural networks, and it plays a crucial role in 3D vision tasks, where the point cloud data is relatively limited. While mixing-based augmentation has shown promise for point clouds, previous methods mix point clouds either on block level or point level, which has constrained their ability to strike a balance between generating diverse training samples and preserving the local characteristics of point clouds. 
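The PointCVaR abstract above recasts outlier removal as tail-risk minimization over per-point risk scores; a minimal sketch of CVaR-style filtering given such scores is shown below (the risk scores are synthetic here, whereas the paper derives them from gradient-based attribution).

```python
import numpy as np

def cvar(risks, alpha=0.1):
    """Conditional value-at-risk: the mean of the worst alpha fraction."""
    var = np.quantile(risks, 1.0 - alpha)          # value-at-risk threshold
    tail = risks[risks >= var]
    return var, tail.mean()

def filter_by_risk(points, risks, alpha=0.1):
    """Drop the alpha fraction of points with the highest risk scores."""
    var, _ = cvar(risks, alpha)
    keep = risks < var
    return points[keep]

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))     # a point cloud
risks = rng.exponential(size=1024)      # per-point risk (synthetic stand-in)
clean = filter_by_risk(points, risks, alpha=0.05)
print(points.shape, clean.shape)        # roughly 5% of points removed
```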
The significance of each component of the point clouds has not been fully considered, as not all parts contribute equally to the classification task, and some parts may contain unimportant or redundant information. To overcome these challenges, we propose PointPatchMix, a novel approach that mixes point clouds at the patch level and integrates a patch scoring module to generate content-based targets for mixed point clouds. Our approach preserves local features at the patch level, while the patch scoring module assigns targets based on the content-based significance score from a pre-trained teacher model. We evaluate PointPatchMix on two benchmark datasets, ModelNet40 and ScanObjectNN, and demonstrate significant improvements over various baselines on both synthetic and real-world datasets, as well as in few-shot settings. With Point-MAE as our baseline, our model surpasses previous methods by a significant margin. Furthermore, our approach shows strong generalization across various point cloud methods and enhances the robustness of the baseline model. Code is available at https://jiazewang.com/projects/pointpatchmix.html. \ No newline at end of file diff --git a/data/2024/aaai/Polyper: Boundary Sensitive Polyp Segmentation b/data/2024/aaai/Polyper: Boundary Sensitive Polyp Segmentation new file mode 100644 index 0000000000..4c7aeac869 --- /dev/null +++ b/data/2024/aaai/Polyper: Boundary Sensitive Polyp Segmentation @@ -0,0 +1 @@ +We present a new boundary sensitive framework for polyp segmentation, termed Polyper. Our method is motivated by a clinical practice: seasoned medical practitioners often leverage the inherent features of interior polyp regions to tackle blurred boundaries. Inspired by this, we propose to explicitly leverage boundary regions to bolster the model's boundary discrimination capability while minimizing computational resource wastage. Our approach first extracts low-confidence boundary regions and high-confidence prediction regions from an initial segmentation map through differentiable morphological operators. Then, we design the boundary sensitive attention that concentrates on augmenting the features near the boundary regions using the high-confidence prediction region's characteristics to generate good segmentation results. Our proposed method can be seamlessly integrated with classical encoder networks, like ResNet-50, MiT-B1, and Swin Transformer. To evaluate the effectiveness of Polyper, we conduct experiments on five publicly available challenging datasets, and achieve state-of-the-art performance on all of them. Code is available at https://github.com/haoshao-nku/medical_seg.git. \ No newline at end of file diff --git a/data/2024/aaai/PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF b/data/2024/aaai/PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF new file mode 100644 index 0000000000..6964011f6b --- /dev/null +++ b/data/2024/aaai/PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF @@ -0,0 +1 @@ +This paper proposes an end-to-end framework for generating 3D human pose datasets using Neural Radiance Fields (NeRF). Public datasets generally have limited diversity in terms of human poses and camera viewpoints, largely due to the resource-intensive nature of collecting 3D human pose data. As a result, pose estimators trained on public datasets significantly underperform when applied to unseen out-of-distribution samples.
Previous works proposed augmenting public datasets by generating 2D-3D pose pairs or rendering a large amount of random data. Such approaches either overlook image rendering or result in suboptimal datasets for pre-trained models. Here we propose PoseGen, which learns to generate a dataset (human 3D poses and images) with a feedback loss from a given pre-trained pose estimator. In contrast to prior art, our generated data is optimized to improve the robustness of the pre-trained model. The objective of PoseGen is to learn a distribution of data that maximizes the prediction error of a given pre-trained model. As the learned data distribution contains OOD samples of the pre-trained model, sampling data from such a distribution to further fine-tune the pre-trained model improves its generalizability. This is the first work that proposes NeRFs for 3D human data generation. NeRFs are data-driven and do not require 3D scans of humans. Therefore, using NeRF for data generation is a new direction for convenient user-specific data generation. Our extensive experiments show that the proposed PoseGen improves two baseline models (SPIN and HybrIK) on four datasets with an average 6% relative improvement. \ No newline at end of file diff --git a/data/2024/aaai/Post-trained Convolution Networks for Single Image Super-resolution (Abstract Reprint) b/data/2024/aaai/Post-trained Convolution Networks for Single Image Super-resolution (Abstract Reprint) new file mode 100644 index 0000000000..0e7b6861cf --- /dev/null +++ b/data/2024/aaai/Post-trained Convolution Networks for Single Image Super-resolution (Abstract Reprint) @@ -0,0 +1 @@ +A new method is proposed to increase the accuracy of state-of-the-art single image super-resolution (SISR) using a novel training procedure. The proposed method, named post-trained convolutional neural network (CNN), applies a stochastic dual simplex algorithm (SDSA) in the last reconstruction layer. The method utilizes contextual information to update the last reconstruction layer of the CNN. The extracted contextual information is projected to the last reconstruction layer by optimized weights, and the bias is managed through the SDSA. The post-trained CNN is applied to the very deep super-resolution (VDSR) method to show its performance. The quantitative and visual results demonstrate that the proposed post-trained VDSR (PTVDSR) exhibits excellent and competitive performance when compared with the VDSR and other super-resolution methods. \ No newline at end of file diff --git a/data/2024/aaai/Potential-Based Reward Shaping for Intrinsic Motivation (Student Abstract) b/data/2024/aaai/Potential-Based Reward Shaping for Intrinsic Motivation (Student Abstract) new file mode 100644 index 0000000000..12fcfa65e7 --- /dev/null +++ b/data/2024/aaai/Potential-Based Reward Shaping for Intrinsic Motivation (Student Abstract) @@ -0,0 +1 @@ +Recently, there has been a proliferation of intrinsic motivation (IM) reward shaping methods to learn in complex and sparse-reward environments. These methods can often inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior. Previous work on mitigating the risks of reward shaping, particularly through potential-based reward shaping (PBRS), has not been applicable to many IM methods, as they are often complex, trainable functions themselves, and therefore dependent on a wider set of variables than the traditional reward functions that PBRS was developed for.
We present an extension to PBRS that we show preserves the set of optimal policies under a more general set of functions than has been previously demonstrated. We also present Potential-Based Intrinsic Motivation (PBIM), a method for converting IM rewards into a potential-based form that is usable without altering the set of optimal policies. Testing in the MiniGrid DoorKey environment, we demonstrate that PBIM successfully prevents the agent from converging to a suboptimal policy and can speed up training. \ No newline at end of file diff --git a/data/2024/aaai/Power Grid Anomaly Detection via Hybrid LSTM-GIN Model (Student Abstract) b/data/2024/aaai/Power Grid Anomaly Detection via Hybrid LSTM-GIN Model (Student Abstract) new file mode 100644 index 0000000000..abe3fe19b1 --- /dev/null +++ b/data/2024/aaai/Power Grid Anomaly Detection via Hybrid LSTM-GIN Model (Student Abstract) @@ -0,0 +1 @@ +Cyberattacks on power grids pose significant risks to national security. Power grid attacks typically lead to abnormal readings in power output, frequency, current, and voltage. Due to the interconnected structure of power grids, abnormalities can spread throughout the system and cause widespread power outages if not detected and dealt with promptly. Our research proposes a novel anomaly detection system for power grids that prevents overfitting. We created a network graph to represent the structure of the power grid, where nodes represent power grid components like generators and edges represent connections between nodes such as overhead power lines. We combine the capabilities of Long Short-Term Memory (LSTM) models with a Graph Isomorphism Network (GIN) in a hybrid model to pinpoint anomalies in the grid. We train our model on each category of nodes that serves a similar structural purpose to prevent overfitting of the model. We then assign each node in the graph a unique signature using a GIN. Our model achieved a 99.92% accuracy rate, which is significantly higher than a version of our model without structural encoding, which had an accuracy level of 97.30%. Our model allows us to capture structural and temporal components of power grids and develop an attack detection system with high accuracy without overfitting. \ No newline at end of file diff --git a/data/2024/aaai/Power-Aware Inverse-Search Machine Learning for Low Resource Multi-Objective Unmanned Underwater Vehicle Control (Student Abstract) b/data/2024/aaai/Power-Aware Inverse-Search Machine Learning for Low Resource Multi-Objective Unmanned Underwater Vehicle Control (Student Abstract) new file mode 100644 index 0000000000..acf956ad61 --- /dev/null +++ b/data/2024/aaai/Power-Aware Inverse-Search Machine Learning for Low Resource Multi-Objective Unmanned Underwater Vehicle Control (Student Abstract) @@ -0,0 +1 @@ +Flapping-fin unmanned underwater vehicle (UUV) propulsion systems enable high maneuverability for tasks ranging from station-keeping to surveillance but are often constrained by their limited computational power and battery capacity. Previous research has demonstrated that time-series neural network models can accurately predict the thrust and power of certain fin kinematics based on the specified gait coupled with the fin configuration, but cannot fit an inverse neural network that takes a thrust request and tunes the kinematics by weighting thrust generation, smooth movement transitions, and power attributes.
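The PBRS/PBIM abstract above relies on the classic potential-based shaping form; the sketch below states that form directly (the goal-distance potential and discount factor are illustrative assumptions, and converting a trainable intrinsic-motivation reward into this form, as PBIM does, is not reproduced).

```python
def shaped_reward(reward, s, s_next, potential, gamma=0.99, done=False):
    """Classic potential-based reward shaping:
        r' = r + gamma * Phi(s') - Phi(s)
    which is the form that provably leaves the set of optimal policies
    unchanged; terminal states use a zero potential."""
    phi_next = 0.0 if done else potential(s_next)
    return reward + gamma * phi_next - potential(s)

# toy usage: a potential that rewards being close to a goal position
goal = 10.0
potential = lambda s: -abs(goal - s)
print(shaped_reward(0.0, s=3.0, s_next=4.0, potential=potential))  # 1.06
```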
We study various combinations of the three weights and fin materials to create different ‘modes’ of movement for a multi-objective UUV, based on controller intent using an inverse neural network. Finally, we implement and validate an enhanced power-aware inverse model by benchmarking on the Raspberry Pi Model 4B system and testing through generated simulated movements. \ No newline at end of file diff --git a/data/2024/aaai/Practical Privacy-Preserving MLaaS: When Compressive Sensing Meets Generative Networks b/data/2024/aaai/Practical Privacy-Preserving MLaaS: When Compressive Sensing Meets Generative Networks new file mode 100644 index 0000000000..3635eec22a --- /dev/null +++ b/data/2024/aaai/Practical Privacy-Preserving MLaaS: When Compressive Sensing Meets Generative Networks @@ -0,0 +1 @@ +The Machine-Learning-as-a-Service (MLaaS) framework allows one to grab low-hanging fruit of machine learning techniques and data science, without either much expertise for this sophisticated sphere or provision of specific infrastructures. However, the requirement of revealing all training data to the service provider raises new concerns in terms of privacy leakage, storage consumption, efficiency, bandwidth, etc. In this paper, we propose a lightweight privacy-preserving MLaaS framework by combining Compressive Sensing (CS) and Generative Networks. It’s constructed on the favorable facts observed in recent works that general inference tasks could be fulfilled with generative networks and classifier trained on compressed measurements, since the generator could model the data distribution and capture discriminative information which are useful for classification. To improve the performance of the MLaaS framework, the supervised generative models of the server are trained and optimized with prior knowledge provided by the client. In order to prevent the service provider from recovering the original data as well as identifying the queried results, a noise-addition mechanism is designed and adopted into the compressed data domain. Empirical results confirmed its performance superiority in accuracy and resource consumption against the state-of-the-art privacy preserving MLaaS frameworks. \ No newline at end of file diff --git a/data/2024/aaai/Practical Sentiment Analysis for Education: The Power of Student Crowdsourcing b/data/2024/aaai/Practical Sentiment Analysis for Education: The Power of Student Crowdsourcing new file mode 100644 index 0000000000..e2a281629a --- /dev/null +++ b/data/2024/aaai/Practical Sentiment Analysis for Education: The Power of Student Crowdsourcing @@ -0,0 +1 @@ +Sentiment analysis provides a promising tool to automatically assess the emotions voiced in written student feedback such as periodically collected unit-of-study reflections. The commonly used dictionary-based approaches are limited to major languages and fail to capture contextual differences. Pretrained large language models have been shown to be biased and online versions raise privacy concerns. Hence, we resort to traditional supervised machine learning (ML) approaches which are designed to overcome these issues by learning from domain-specific labeled data. However, these labels are hard to come by -- in our case manually annotating student feedback is prone to bias and time-consuming, especially in high-enrollment courses. In this work, we investigate the use of student crowdsourced labels for supervised sentiment analysis for education. 
Specifically, we compare crowdsourced and student self-reported labels with human expert annotations and use them in various ML approaches to evaluate the performance on predicting emotions of written student feedback collected from large computer science classes. We find that the random forest model trained with student-crowdsourced labels tremendously improves the identification of reflections with negative sentiment. In addition to our quantitative study, we describe our crowdsourcing experiment which was intentionally designed to be an educational activity in an introduction to data science course. \ No newline at end of file diff --git a/data/2024/aaai/Pre-trained Online Contrastive Learning for Insurance Fraud Detection b/data/2024/aaai/Pre-trained Online Contrastive Learning for Insurance Fraud Detection new file mode 100644 index 0000000000..0530627052 --- /dev/null +++ b/data/2024/aaai/Pre-trained Online Contrastive Learning for Insurance Fraud Detection @@ -0,0 +1 @@ +Medical insurance fraud has always been a crucial challenge in the field of healthcare industry. Existing fraud detection models mostly focus on offline learning scenes. However, fraud patterns are constantly evolving, making it difficult for models trained on past data to detect newly emerging fraud patterns, posing a severe challenge in medical fraud detection. Moreover, current incremental learning models are mostly designed to address catastrophic forgetting, but often exhibit suboptimal performance in fraud detection. To address this challenge, this paper proposes an innovative online learning method for medical insurance fraud detection, named POCL. This method combines contrastive learning pre-training with online updating strategies. In the pre-training stage, we leverage contrastive learning pre-training to learn on historical data, enabling deep feature learning and obtaining rich risk representations. In the online learning stage, we adopt a Temporal Memory Aware Synapses online updating strategy, allowing the model to perform incremental learning and optimization based on continuously emerging new data. This ensures timely adaptation to fraud patterns and reduces forgetting of past knowledge. Our model undergoes extensive experiments and evaluations on real-world insurance fraud datasets. The results demonstrate our model has significant advantages in accuracy compared to the state-of-the-art baseline methods, while also exhibiting lower running time and space consumption. Our sources are released at https://github.com/finint/POCL. \ No newline at end of file diff --git a/data/2024/aaai/PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling b/data/2024/aaai/PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling new file mode 100644 index 0000000000..e7bb6a1ebb --- /dev/null +++ b/data/2024/aaai/PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling @@ -0,0 +1 @@ +Pre-routing timing prediction has been recently studied for evaluating the quality of a candidate cell placement in chip design. It involves directly estimating the timing metrics for both pin-level (slack, slew) and edge-level (net delay, cell delay), without time-consuming routing. 
However, it often suffers from signal decay and error accumulation due to the long timing paths in large-scale industrial circuits. To address these challenges, we propose a two-stage approach. First, we propose global circuit training to pre-train a graph auto-encoder that learns the global graph embedding from the circuit netlist. Second, we use a novel node updating scheme for message passing on the GCN, following the topological sorting sequence of the learned graph embedding and circuit graph. This scheme residually models the local time delay between two adjacent pins in the updating sequence, and extracts the lookup table information inside each cell via a new attention mechanism. To handle large-scale circuits efficiently, we introduce an order preserving partition scheme that reduces memory consumption while maintaining the topological dependencies. Experiments on 21 real-world circuits achieve a new SOTA R2 of 0.93 for slack prediction, significantly surpassing the 0.59 of the previous SOTA method. Code will be available at: https://github.com/Thinklab-SJTU/EDA-AI. \ No newline at end of file diff --git a/data/2024/aaai/Predicting Real-World Penny Auction Durations by Integrating Game Theory and Machine Learning b/data/2024/aaai/Predicting Real-World Penny Auction Durations by Integrating Game Theory and Machine Learning new file mode 100644 index 0000000000..605202a69c --- /dev/null +++ b/data/2024/aaai/Predicting Real-World Penny Auction Durations by Integrating Game Theory and Machine Learning @@ -0,0 +1 @@ +Game theory and machine learning are two widely used techniques for predicting the outcomes of strategic interactions among humans. However, the game theory-based approach often relies on strong rationality and informational assumptions, while the machine learning-based approach typically requires the testing data to come from the same distribution as the training data. Our work studies how to integrate the two techniques to address these weaknesses. We focus on the interactions among real bidders in penny auctions, and develop a three-stage framework to predict the distributions of auction durations, which indicate the numbers of bids and auctioneer revenues. Specifically, we first leverage a pre-trained neural network to encode the descriptions of products in auctions into embeddings. Second, we apply game theory models to make preliminary predictions of auction durations. In particular, we tackle the challenge of accurately inferring parameters in game theory models. Third, we develop a Multi-Branch Mixture Density Network to learn the mapping from product embeddings and game-theoretic predictions to the distributions of actual auction durations. Experiments on real-world penny auction data demonstrate that our framework outperforms both game theory-based and machine learning-based prediction approaches. \ No newline at end of file diff --git a/data/2024/aaai/PrefAce: Face-Centric Pretraining with Self-Structure Aware Distillation b/data/2024/aaai/PrefAce: Face-Centric Pretraining with Self-Structure Aware Distillation new file mode 100644 index 0000000000..78b49c01ec --- /dev/null +++ b/data/2024/aaai/PrefAce: Face-Centric Pretraining with Self-Structure Aware Distillation @@ -0,0 +1 @@ +Video-based facial analysis is important for autonomous agents to understand human expressions and sentiments. However, limited labeled data is available to learn effective facial representations.
This paper proposes a novel self-supervised face-centric pretraining framework, called PrefAce, which learns transferable video facial representations without labels. The self-supervised learning is performed with an effective landmark-guided global-local tube distillation. Meanwhile, a novel instance-wise update FaceFeat Cache is built to enforce more discriminative and diverse representations for downstream tasks. Extensive experiments demonstrate that the proposed framework learns universal instance-aware facial representations with fine-grained landmark details from videos. Notably, it can transfer across various facial analysis tasks, e.g., Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our framework also outperforms the state-of-the-art on various downstream tasks, even in low-data regimes. Code is available at https://github.com/siyuan-h/PrefAce. \ No newline at end of file diff --git a/data/2024/aaai/Preference Aware Dual Contrastive Learning for Item Cold-Start Recommendation b/data/2024/aaai/Preference Aware Dual Contrastive Learning for Item Cold-Start Recommendation new file mode 100644 index 0000000000..400a540e1a --- /dev/null +++ b/data/2024/aaai/Preference Aware Dual Contrastive Learning for Item Cold-Start Recommendation @@ -0,0 +1 @@ +Existing cold-start recommendation methods often adopt item-level alignment strategies to align the content feature and the collaborative feature of warm items for model training; however, cold items in the test stage have no historical interactions with users to obtain the collaborative feature. These existing models ignore the aforementioned condition of cold items in the training stage, resulting in limited performance. In this paper, we propose a preference aware dual contrastive learning based recommendation model (PAD-CLRec), where the user preference is explored to take into account the condition of cold items for feature alignment. Here, the user preference is obtained by aggregating a group of collaborative features of the warm items in the user's purchase records. Then, a group-level alignment between the user preference and the item's content feature can be realized via a proposed preference aware contrastive function for enhancing cold-item recommendation. In addition, a joint objective function is introduced to achieve a better trade-off between the recommendation performance of warm items and cold items from both item-level and group-level perspectives, yielding better overall recommendation performance. Extensive experiments are conducted to demonstrate the effectiveness of the proposed method, and the results show the superiority of our method, as compared with the state of the art. \ No newline at end of file diff --git a/data/2024/aaai/Preference-Aware Constrained Multi-Objective Bayesian Optimization (Student Abstract) b/data/2024/aaai/Preference-Aware Constrained Multi-Objective Bayesian Optimization (Student Abstract) new file mode 100644 index 0000000000..ddd4a60b35 --- /dev/null +++ b/data/2024/aaai/Preference-Aware Constrained Multi-Objective Bayesian Optimization (Student Abstract) @@ -0,0 +1 @@ +This paper addresses the problem of constrained multi-objective optimization over black-box objective functions with practitioner-specified preferences over the objectives when a large fraction of the input space is infeasible (i.e., violates constraints).
This problem arises in many engineering design problems, including analog circuits and electric power system design. We aim to approximate the optimal Pareto set over the small fraction of feasible input designs. The key challenges include the massive size of the design space, multiple objectives, a large number of constraints, and the small fraction of feasible input designs, which can be identified only after performing expensive experiments/simulations. We propose a novel and efficient preference-aware constrained multi-objective Bayesian optimization approach referred to as PAC-MOO to address these challenges. The key idea is to learn surrogate models for both output objectives and constraints, and select the candidate input for evaluation in each iteration that maximizes the information gained about the optimal constrained Pareto front while factoring in the preferences over objectives. Our experiments on synthetic and challenging real-world analog circuit design optimization problems demonstrate the efficacy of PAC-MOO over baseline methods. \ No newline at end of file diff --git a/data/2024/aaai/Preparing Lessons for Progressive Training on Language Models b/data/2024/aaai/Preparing Lessons for Progressive Training on Language Models new file mode 100644 index 0000000000..fb4f68511a --- /dev/null +++ b/data/2024/aaai/Preparing Lessons for Progressive Training on Language Models @@ -0,0 +1 @@ +The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prepares lessons for expanding operations by learning high-layer functionality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs. \ No newline at end of file diff --git a/data/2024/aaai/Preventing Eviction-Caused Homelessness through ML-Informed Distribution of Rental Assistance b/data/2024/aaai/Preventing Eviction-Caused Homelessness through ML-Informed Distribution of Rental Assistance new file mode 100644 index 0000000000..0fbefc0fb4 --- /dev/null +++ b/data/2024/aaai/Preventing Eviction-Caused Homelessness through ML-Informed Distribution of Rental Assistance @@ -0,0 +1 @@ +Rental assistance programs provide individuals with financial assistance to prevent housing instabilities caused by evictions and avert homelessness. Since these programs operate under resource constraints, they must decide who to prioritize. Typically, funding is distributed by a reactive allocation process that does not systematically consider risk of future homelessness. 
We partnered with Anonymous County (PA) to explore a proactive and preventative allocation approach that prioritizes individuals facing eviction based on their risk of future homelessness. Our ML models, trained on state and county administrative data accurately identify at-risk individuals, outperforming simpler prioritization approaches by at least 20% while meeting our equity and fairness goals across race and gender. Furthermore, our approach would reach 28% of individuals who are overlooked by the current process and end up homeless. Beyond improvements to the rental assistance program in Anonymous County, this study can inform the development of evidence-based decision support tools in similar contexts, including lessons about data needs, model design, evaluation, and field validation. \ No newline at end of file diff --git a/data/2024/aaai/Primitive-Based 3D Human-Object Interaction Modelling and Programming b/data/2024/aaai/Primitive-Based 3D Human-Object Interaction Modelling and Programming new file mode 100644 index 0000000000..6c218bf2fb --- /dev/null +++ b/data/2024/aaai/Primitive-Based 3D Human-Object Interaction Modelling and Programming @@ -0,0 +1,3 @@ +Embedding Human and Articulated Object Interaction (HAOI) in 3D is an important direction for a deeper human activity understanding. Different from previous works that use parametric and CAD models to represent humans and objects, in this work, we propose a novel 3D geometric primitive-based language to encode both humans and objects. Given our new paradigm, humans and objects are all compositions of primitives instead of heterogeneous entities. Thus, mutual information learning may be achieved between the limited 3D data of humans and different object categories. Moreover, considering the simplicity of the expression and the richness of the information it contains, we choose the superquadric as the primitive representation. +To explore an effective embedding of HAOI for the machine, we build a new benchmark on 3D HAOI consisting of primitives together with their images and propose a task requiring machines to recover 3D HAOI using primitives from images. +Moreover, we propose a baseline of single-view 3D reconstruction on HAOI. We believe this primitive-based 3D HAOI representation would pave the way for 3D HAOI studies. Our code and data are available at https://mvig-rhos.com/p3haoi. \ No newline at end of file diff --git a/data/2024/aaai/Principal-Agent Reward Shaping in MDPs b/data/2024/aaai/Principal-Agent Reward Shaping in MDPs new file mode 100644 index 0000000000..b3f2072b39 --- /dev/null +++ b/data/2024/aaai/Principal-Agent Reward Shaping in MDPs @@ -0,0 +1 @@ +Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. 
Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: Stochastic trees and deterministic decision processes with a finite horizon. \ No newline at end of file diff --git a/data/2024/aaai/Principle Component Trees and Their Persistent Homology b/data/2024/aaai/Principle Component Trees and Their Persistent Homology new file mode 100644 index 0000000000..a1ab512a49 --- /dev/null +++ b/data/2024/aaai/Principle Component Trees and Their Persistent Homology @@ -0,0 +1 @@ +Low dimensional models like PCA are often used to simplify complex datasets by learning a single approximating subspace. This paradigm has expanded to union of subspaces models, like those learned by subspace clustering. In this paper, we present Principal Component Trees (PCTs), a graph structure that generalizes these ideas to identify mixtures of components that together describe the subspace structure of high-dimensional datasets. Each node in a PCT corresponds to a principal component of the data, and the edges between nodes indicate the components that must be mixed to produce a subspace that approximates a portion of the data. In order to construct PCTs, we propose two angle-distribution hypothesis tests to detect subspace clusters in the data. To analyze, compare, and select the best PCT model, we define two persistent homology measures that describe their shape. We show our construction yields two key properties of PCTs, namely ancestral orthogonality and non-decreasing singular values. Our main theoretical results show that learning PCTs reduces to PCA under multivariate normality, and that PCTs are efficient parameterizations of intersecting union of subspaces. Finally, we use PCTs to analyze neural network latent space, word embeddings, and reference image datasets. \ No newline at end of file diff --git a/data/2024/aaai/Prior and Prediction Inverse Kernel Transformer for Single Image Defocus Deblurring b/data/2024/aaai/Prior and Prediction Inverse Kernel Transformer for Single Image Defocus Deblurring new file mode 100644 index 0000000000..0bd57f0bb2 --- /dev/null +++ b/data/2024/aaai/Prior and Prediction Inverse Kernel Transformer for Single Image Defocus Deblurring @@ -0,0 +1,2 @@ +Defocus blur, due to spatially-varying sizes and shapes, is hard to remove. Existing methods either are unable to effectively handle irregular defocus blur or fail to generalize well on other datasets. +In this work, we propose a divide-and-conquer approach to tackling this issue, which gives rise to a novel end-to-end deep learning method, called prior-and-prediction inverse kernel transformer (P2IKT), for single image defocus deblurring. Since most defocus blur can be approximated as Gaussian blur or its variants, we construct an inverse Gaussian kernel module in our method to enhance its generalization ability. At the same time, an inverse kernel prediction module is introduced in order to flexibly address the irregular blur that cannot be approximated by Gaussian blur. We further design a scale recurrent transformer, which estimates mixing coefficients for adaptively combining the results from the two modules and runs the scale recurrent ``coarse-to-fine" procedure for progressive defocus deblurring. Extensive experimental results demonstrate that our P2IKT outperforms previous methods in terms of PSNR on multiple defocus deblurring datasets. 
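For intuition, the classical frequency-domain inverse of a Gaussian blur (a Wiener-style textbook baseline that assumes a known, uniform blur sigma; not P2IKT's learned inverse kernel modules) can be sketched as:

```python
# Textbook Wiener-style inverse Gaussian filtering; assumes a known, uniform blur
# sigma and a single-channel float image. Not the paper's learned modules.
import numpy as np

def gaussian_kernel(size, sigma):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def wiener_deblur(blurred, sigma=2.0, ksize=21, eps=1e-2):
    pad = np.zeros_like(blurred, dtype=float)
    pad[:ksize, :ksize] = gaussian_kernel(ksize, sigma)
    pad = np.roll(pad, -(ksize // 2), axis=(0, 1))   # center the kernel at the origin
    H = np.fft.fft2(pad)
    Y = np.fft.fft2(blurred)
    X = Y * np.conj(H) / (np.abs(H) ** 2 + eps)      # regularized inverse kernel
    return np.real(np.fft.ifft2(X))
```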
\ No newline at end of file diff --git a/data/2024/aaai/Privacy Amplification by Iteration for ADMM with (Strongly) Convex Objective Functions b/data/2024/aaai/Privacy Amplification by Iteration for ADMM with (Strongly) Convex Objective Functions new file mode 100644 index 0000000000..2d8c686517 --- /dev/null +++ b/data/2024/aaai/Privacy Amplification by Iteration for ADMM with (Strongly) Convex Objective Functions @@ -0,0 +1,7 @@ +We examine a private ADMM variant for (strongly) convex objectives which is a primal-dual iterative method. Each iteration has a user with a private function used to update the primal variable, masked by Gaussian noise for local privacy, without directly adding noise to the dual variable. Privacy amplification by iteration explores if noises from later iterations can enhance the privacy guarantee when releasing final variables after the last iteration. + +Cyffers et al. explored privacy amplification by iteration for the proximal ADMM variant, where a user's entire private function is accessed and noise is added to the primal variable. In contrast, we examine a private ADMM variant requiring just one gradient access to a user's function, but both primal and dual variables must be passed between successive iterations. + +To apply Balle et al.'s coupling framework to the gradient ADMM variant, we tackle technical challenges with novel ideas. First, we address the non-expansive mapping issue in ADMM iterations by using a customized norm. Second, because the dual variables are not masked with any noise directly, their privacy guarantees are achieved by treating two consecutive noisy ADMM iterations as a Markov operator. + +Our main result is that the privacy guarantee for the gradient ADMM variant can be amplified proportionally to the number of iterations. For strongly convex objective functions, this amplification exponentially increases with the number of iterations. These amplification results align with the previously studied special case of stochastic gradient descent. \ No newline at end of file diff --git a/data/2024/aaai/Privileged Prior Information Distillation for Image Matting b/data/2024/aaai/Privileged Prior Information Distillation for Image Matting new file mode 100644 index 0000000000..b68e9a8fe2 --- /dev/null +++ b/data/2024/aaai/Privileged Prior Information Distillation for Image Matting @@ -0,0 +1 @@ +Performance of trimap-free image matting methods is limited when trying to decouple the deterministic and undetermined regions, especially in the scenes where foregrounds are semantically ambiguous, chromaless, or high transmittance. In this paper, we propose a novel framework named Privileged Prior Information Distillation for Image Matting (PPID-IM) that can effectively transfer privileged prior environment-aware information to improve the performance of trimap-free students in solving hard foregrounds. The prior information of trimap regulates only the teacher model during the training stage, while not being fed into the student network during actual inference. To achieve effective privileged cross-modality (i.e. trimap and RGB) information distillation, we introduce a Cross-Level Semantic Distillation (CLSD) module that reinforces the students with more knowledgeable semantic representations and environment-aware information. We also propose an Attention-Guided Local Distillation module that efficiently transfers privileged local attributes from the trimap-based teacher to trimap-free students for the guidance of local-region optimization. 
Extensive experiments demonstrate the effectiveness and superiority of our PPID on image matting. The code will be released soon. \ No newline at end of file diff --git a/data/2024/aaai/ProAgent: Building Proactive Cooperative Agents with Large Language Models b/data/2024/aaai/ProAgent: Building Proactive Cooperative Agents with Large Language Models new file mode 100644 index 0000000000..e4be643cf7 --- /dev/null +++ b/data/2024/aaai/ProAgent: Building Proactive Cooperative Agents with Large Language Models @@ -0,0 +1 @@ +Building agents with adaptive behavior in cooperative tasks stands as a paramount goal in the realm of multi-agent systems. Current approaches to developing cooperative agents rely primarily on learning-based methods, whose policy generalization depends heavily on the diversity of teammates they interact with during the training phase. Such reliance, however, constrains the agents' capacity for strategic adaptation when cooperating with unfamiliar teammates, which becomes a significant challenge in zero-shot coordination scenarios. To address this challenge, we propose ProAgent, a novel framework that harnesses large language models (LLMs) to create proactive agents capable of dynamically adapting their behavior to enhance cooperation with teammates. ProAgent can analyze the present state and infer the intentions of teammates from observations. It then updates its beliefs in alignment with the teammates' subsequent actual behaviors. Moreover, ProAgent exhibits a high degree of modularity and interpretability, making it easily integrated into various coordination scenarios. Experimental evaluations conducted within the Overcooked-AI environment unveil the remarkable performance superiority of ProAgent, outperforming five methods based on self-play and population-based training when cooperating with AI agents. Furthermore, when partnered with human proxy models, its performance exhibits an average improvement exceeding 10% compared to the current state-of-the-art method. For more information about our project, please visit https://pku-proagent.github.io. \ No newline at end of file diff --git a/data/2024/aaai/ProCC: Progressive Cross-Primitive Compatibility for Open-World Compositional Zero-Shot Learning b/data/2024/aaai/ProCC: Progressive Cross-Primitive Compatibility for Open-World Compositional Zero-Shot Learning new file mode 100644 index 0000000000..9f32b29e1c --- /dev/null +++ b/data/2024/aaai/ProCC: Progressive Cross-Primitive Compatibility for Open-World Compositional Zero-Shot Learning @@ -0,0 +1 @@ +Open-World Compositional Zero-shot Learning (OW-CZSL) aims to recognize novel compositions of state and object primitives in images with no priors on the compositional space, which induces a tremendously large output space containing all possible state-object compositions. Existing works either learn the joint compositional state-object embedding or predict simple primitives with separate classifiers. However, the former method heavily relies on external word embedding methods, while the latter ignores the interactions of interdependent primitives. In this paper, we revisit the primitive prediction approach and propose a novel method, termed Progressive Cross-primitive Compatibility (ProCC), to mimic the human learning process for OW-CZSL tasks.
Specifically, the cross-primitive compatibility module explicitly learns to model the interactions of state and object features with the trainable memory units, which efficiently acquires cross-primitive visual attention to reason high-feasibility compositions, without the aid of external knowledge. Moreover, to alleviate the invalid cross-primitive interactions, especially for partial-supervision conditions (pCZSL), we design a progressive training paradigm to optimize the primitive classifiers conditioned on pre-trained features in an easy-to-hard manner. Extensive experiments on three widely used benchmark datasets demonstrate that our method outperforms other representative methods on both OW-CZSL and pCZSL settings by large margins. \ No newline at end of file diff --git a/data/2024/aaai/Probabilistic Neural Circuits b/data/2024/aaai/Probabilistic Neural Circuits new file mode 100644 index 0000000000..7d7012c7ec --- /dev/null +++ b/data/2024/aaai/Probabilistic Neural Circuits @@ -0,0 +1 @@ +Probabilistic circuits (PCs) have gained prominence in recent years as a versatile framework for discussing probabilistic models that support tractable queries and are yet expressive enough to model complex probability distributions. Nevertheless, tractability comes at a cost: PCs are less expressive than neural networks. In this paper we introduce probabilistic neural circuits (PNCs), which strike a balance between PCs and neural nets in terms of tractability and expressive power. Theoretically, we show that PNCs can be interpreted as deep mixtures of Bayesian networks. Experimentally, we demonstrate that PNCs constitute powerful function approximators. \ No newline at end of file diff --git a/data/2024/aaai/Probabilistic Offline Policy Ranking with Approximate Bayesian Computation b/data/2024/aaai/Probabilistic Offline Policy Ranking with Approximate Bayesian Computation new file mode 100644 index 0000000000..958a08f433 --- /dev/null +++ b/data/2024/aaai/Probabilistic Offline Policy Ranking with Approximate Bayesian Computation @@ -0,0 +1 @@ +In practice, it is essential to compare and rank candidate policies offline before real-world deployment for safety and reliability. Prior work seeks to solve this offline policy ranking (OPR) problem through value-based methods, such as Off-policy evaluation (OPE). However, they fail to analyze special case performance (e.g., worst or best cases), due to the lack of holistic characterization of policies’ performance. It is even more difficult to estimate precise policy values when the reward is not fully accessible under sparse settings. In this paper, we present Probabilistic Offline Policy Ranking (POPR), a framework to address OPR problems by leveraging expert data to characterize the probability of a candidate policy behaving like experts, and approximating its entire performance posterior distribution to help with ranking. POPR does not rely on value estimation, and the derived performance posterior can be used to distinguish candidates in worst-, best-, and average-cases. To estimate the posterior, we propose POPR-EABC, an Energy-based Approximate Bayesian Computation (ABC) method conducting likelihood-free inference. POPR-EABC reduces the heuristic nature of ABC by a smooth energy function, and improves the sampling efficiency by a pseudo-likelihood. 
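For reference, the plain rejection-ABC loop that such energy-based variants refine looks roughly like this (an illustrative sketch; the prior, simulator, and distance function are hypothetical placeholders, not POPR-EABC's components):

```python
# Plain rejection ABC: keep parameter draws whose simulated statistics land close to
# the observed ones. The prior, simulator, and distance are hypothetical placeholders.
import numpy as np

def rejection_abc(observed_stat, sample_prior, simulate, distance,
                  n_draws=10_000, tolerance=0.1, seed=0):
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)           # draw candidate parameters from the prior
        sim_stat = simulate(theta, rng)     # forward-simulate summary statistics
        if distance(sim_stat, observed_stat) < tolerance:
            accepted.append(theta)          # keep draws that roughly match the data
    return np.asarray(accepted)             # samples approximating the posterior
```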
We empirically demonstrate that POPR-EABC is adequate for evaluating policies in both discrete and continuous action spaces across various experiment environments, and facilitates probabilistic comparisons of candidate policies before deployment. \ No newline at end of file diff --git a/data/2024/aaai/Probabilities of Causation with Nonbinary Treatment and Effect b/data/2024/aaai/Probabilities of Causation with Nonbinary Treatment and Effect new file mode 100644 index 0000000000..8bf5cd4776 --- /dev/null +++ b/data/2024/aaai/Probabilities of Causation with Nonbinary Treatment and Effect @@ -0,0 +1 @@ +Probabilities of causation are proven to be critical in modern decision-making. This paper deals with the problem of estimating the probabilities of causation when treatment and effect are not binary. Pearl defined the binary probabilities of causation, such as the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). Tian and Pearl then derived sharp bounds for these probabilities of causation using experimental and observational data. In this paper, we define and provide theoretical bounds for all types of probabilities of causation with multivalued treatments and effects. We further discuss examples where our bounds guide practical decisions and use simulation studies to evaluate how informative the bounds are for various data combinations. \ No newline at end of file diff --git a/data/2024/aaai/Probability-Polarized Optimal Transport for Unsupervised Domain Adaptation b/data/2024/aaai/Probability-Polarized Optimal Transport for Unsupervised Domain Adaptation new file mode 100644 index 0000000000..4ab800aff6 --- /dev/null +++ b/data/2024/aaai/Probability-Polarized Optimal Transport for Unsupervised Domain Adaptation @@ -0,0 +1 @@ +Optimal transport (OT) is an important methodology to measure distribution discrepancy, which has achieved promising performance in artificial intelligence applications, e.g., unsupervised domain adaptation. However, from the view of transportation, there are still limitations: 1) the local discriminative structures for downstream tasks, e.g., cluster structure for classification, cannot be explicitly admitted by the learned OT plan; 2) the entropy regularization induces a dense OT plan with increasing uncertainty. To tackle these issues, we propose a novel Probability-Polarized OT (PPOT) framework, which can characterize the structure of OT plan explicitly. Specifically, the probability polarization mechanism is proposed to guide the optimization direction of OT plan, which generates a clear margin between similar and dissimilar transport pairs and reduces the uncertainty. Further, a dynamic mechanism for margin is developed by incorporating task-related information into the polarization, which directly captures the intra/inter class correspondence for knowledge transportation. A mathematical understanding for PPOT is provided from the view of gradient, which ensures interpretability. Extensive experiments on several datasets validate the effectiveness and empirical efficiency of PPOT. 
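For context, the entropy-regularized OT baseline, whose increasingly dense and uncertain plans motivate probability polarization, can be sketched with standard Sinkhorn scaling (illustrative only, not the PPOT algorithm):

```python
# Standard Sinkhorn iterations for entropy-regularized OT between two discrete
# marginals a and b with cost matrix C; larger reg yields denser, more uncertain plans.
import numpy as np

def sinkhorn_plan(a, b, C, reg=0.05, n_iter=200):
    K = np.exp(-C / reg)                   # Gibbs kernel
    u = np.ones_like(a, dtype=float)
    for _ in range(n_iter):
        v = b / (K.T @ u)                  # alternate scaling of the two marginals
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]     # transport plan with marginals (a, b)
```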
\ No newline at end of file diff --git a/data/2024/aaai/Procedural Level Generation with Diffusion Models from a Single Example b/data/2024/aaai/Procedural Level Generation with Diffusion Models from a Single Example new file mode 100644 index 0000000000..6cb843c5d6 --- /dev/null +++ b/data/2024/aaai/Procedural Level Generation with Diffusion Models from a Single Example @@ -0,0 +1 @@ +Level generation is a central focus of Procedural Content Generation (PCG), yet deep learning-based approaches are limited by scarce training data, i.e., human-designed levels. Despite being a dominant framework, Generative Adversarial Networks (GANs) exhibit a substantial quality gap between generated and human-authored levels, alongside rising training costs, particularly with increasing token complexity. In this paper, we introduce a diffusion-based generative model that learns from just one example. Our approach involves two core components: 1) an efficient yet expressive level representation, and 2) a latent denoising network with constrained receptive fields. To start with, our method utilizes token semantic labels, similar to word embeddings, to provide dense representations. This strategy not only surpasses one-hot encoding in representing larger game levels but also improves stability and accelerates convergence in latent diffusion. In addition, we adapt the denoising network architecture to confine the receptive field to localized patches of the data, aiming to facilitate single-example learning. Extensive experiments demonstrate that our model is capable of generating stylistically congruent samples of arbitrary sizes compared to manually designed levels. It suits a wide range of level structures with fewer artifacts than GAN-based approaches. The source code is available at https://github.com/shiqi-dai/diffusioncraft. \ No newline at end of file diff --git a/data/2024/aaai/Program Synthesis with Best-First Bottom-Up Search (Abstract Reprint) b/data/2024/aaai/Program Synthesis with Best-First Bottom-Up Search (Abstract Reprint) new file mode 100644 index 0000000000..e7b33ad68d --- /dev/null +++ b/data/2024/aaai/Program Synthesis with Best-First Bottom-Up Search (Abstract Reprint) @@ -0,0 +1 @@ +Cost-guided bottom-up search (BUS) algorithms use a cost function to guide the search to solve program synthesis tasks. In this paper, we show that current state-of-the-art cost-guided BUS algorithms suffer from a common problem: they can lose useful information given by the model and fail to perform the search in a best-first order according to a cost function. We introduce a novel best-first bottom-up search algorithm, which we call Bee Search, that does not suffer information loss and is able to perform cost-guided bottom-up synthesis in a best-first manner. Importantly, Bee Search performs best-first search with respect to the generation of programs, i.e., it does not even create in memory programs that are more expensive than the solution program. It attains best-first ordering with respect to generation by performing a search in an abstract space of program costs. We also introduce a new cost function that better uses the information provided by an existing cost model. Empirical results on string manipulation and bit-vector tasks show that Bee Search can outperform existing cost-guided BUS approaches when employing more complex domain-specific languages (DSLs); Bee Search and previous approaches perform equally well with simpler DSLs. 
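As a generic point of reference, a cost-ordered best-first enumeration over programs can be written with a priority queue (a plain cost-guided BUS skeleton; unlike Bee Search it does materialize programs costlier than the solution, and `expand`, `cost`, and `is_solution` are hypothetical callables):

```python
# Generic best-first program enumeration ordered by a cost function. expand(p) is
# assumed to yield larger programs built from p; programs must be hashable.
import heapq

def best_first_synthesis(initial_programs, expand, cost, is_solution, budget=100_000):
    frontier = [(cost(p), i, p) for i, p in enumerate(initial_programs)]
    heapq.heapify(frontier)
    tie = len(frontier)                 # tie-breaker so heapq never compares programs
    seen = set()
    while frontier and budget > 0:
        budget -= 1
        _, _, program = heapq.heappop(frontier)   # cheapest program first
        if program in seen:
            continue
        seen.add(program)
        if is_solution(program):
            return program
        for child in expand(program):
            tie += 1
            heapq.heappush(frontier, (cost(child), tie, child))
    return None
```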
Furthermore, our new cost function with Bee Search outperforms previous cost functions on string manipulation tasks. \ No newline at end of file diff --git a/data/2024/aaai/Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion b/data/2024/aaai/Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion new file mode 100644 index 0000000000..e2bf5c17c9 --- /dev/null +++ b/data/2024/aaai/Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion @@ -0,0 +1 @@ +In recent years, knowledge graph completion (KGC) models based on pre-trained language models (PLMs) have shown promising results. However, the large number of parameters and high computational cost of PLMs pose challenges for their application in downstream tasks. This paper proposes a progressive distillation method based on masked generation features for the KGC task, aiming to significantly reduce the complexity of pre-trained models. Specifically, we perform pre-distillation on the PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models. However, traditional feature distillation suffers from the limitation of having a single representation of information in teacher models. To solve this problem, we propose masked generation of teacher-student features, which contain richer representation information. Furthermore, there is a significant gap in representation ability between teacher and student. Therefore, we design a progressive distillation method to distill student models at each grade level, enabling efficient knowledge transfer from teachers to students. The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods. Furthermore, in the progressive distillation stage, the model significantly reduces the model parameters while maintaining a certain level of performance. Specifically, the model parameters of the lower-grade student model are reduced by 56.7% compared to the baseline. \ No newline at end of file diff --git a/data/2024/aaai/Progressive Feature Self-Reinforcement for Weakly Supervised Semantic Segmentation b/data/2024/aaai/Progressive Feature Self-Reinforcement for Weakly Supervised Semantic Segmentation new file mode 100644 index 0000000000..3e6f06446a --- /dev/null +++ b/data/2024/aaai/Progressive Feature Self-Reinforcement for Weakly Supervised Semantic Segmentation @@ -0,0 +1 @@ +Compared to conventional semantic segmentation with pixel-level supervision, weakly supervised semantic segmentation (WSSS) with image-level labels poses the challenge that it commonly focuses on the most discriminative regions, resulting in a disparity between weakly and fully supervised scenarios. A typical manifestation is the diminished precision on object boundaries, leading to deteriorated accuracy of WSSS. To alleviate this issue, we propose to adaptively partition the image content into certain regions (e.g., confident foreground and background) and uncertain regions (e.g., object boundaries and misclassified categories) for separate processing. For uncertain cues, we propose an adaptive masking strategy and seek to recover the local information with self-distilled knowledge.
We further assume that confident regions should be robust enough to preserve the global semantics, and introduce a complementary self-distillation method that constrains semantic consistency between confident regions and an augmented view with the same class labels. Extensive experiments conducted on PASCAL VOC 2012 and MS COCO 2014 demonstrate that our proposed single-stage approach for WSSS not only outperforms state-of-the-art counterparts but also surpasses multi-stage methods that trade complexity for accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Progressive High-Frequency Reconstruction for Pan-Sharpening with Implicit Neural Representation b/data/2024/aaai/Progressive High-Frequency Reconstruction for Pan-Sharpening with Implicit Neural Representation new file mode 100644 index 0000000000..e9454369f1 --- /dev/null +++ b/data/2024/aaai/Progressive High-Frequency Reconstruction for Pan-Sharpening with Implicit Neural Representation @@ -0,0 +1 @@ +Pan-sharpening aims to leverage the high-frequency signal of the panchromatic (PAN) image to enhance the resolution of its corresponding multi-spectral (MS) image. However, deep neural networks (DNNs) tend to prioritize learning the low-frequency components during the training process, which limits the restoration of high-frequency edge details in MS images. To overcome this limitation, we treat pan-sharpening as a coarse-to-fine high-frequency restoration problem and propose a novel method for achieving high-quality restoration of edge information in MS images. Specifically, to effectively obtain fine-grained multi-scale contextual features, we design a Band-limited Multi-scale High-frequency Generator (BMHG) that generates high-frequency signals from the PAN image within different bandwidths. During training, higher-frequency signals are progressively injected into the MS image, and corresponding residual blocks are introduced into the network simultaneously. This design enables gradients to flow from later to earlier blocks smoothly, encouraging intermediate blocks to concentrate on missing details. Furthermore, to address the issue of pixel position misalignment arising from multi-scale features fusion, we propose a Spatial-spectral Implicit Image Function (SIIF) that employs implicit neural representation to effectively represent and fuse spatial and spectral features in the continuous domain. Extensive experiments on different datasets demonstrate that our method outperforms existing approaches in terms of quantitative and visual measurements for high-frequency detail recovery. \ No newline at end of file diff --git a/data/2024/aaai/Progressive Painterly Image Harmonization from Low-Level Styles to High-Level Styles b/data/2024/aaai/Progressive Painterly Image Harmonization from Low-Level Styles to High-Level Styles new file mode 100644 index 0000000000..704e337b6e --- /dev/null +++ b/data/2024/aaai/Progressive Painterly Image Harmonization from Low-Level Styles to High-Level Styles @@ -0,0 +1 @@ +Painterly image harmonization aims to harmonize a photographic foreground object on the painterly background. Different from previous auto-encoder based harmonization networks, we develop a progressive multi-stage harmonization network, which harmonizes the composite foreground from low-level styles (e.g., color, simple texture) to high-level styles (e.g., complex texture). Our network has better interpretability and harmonization performance. 
Moreover, we design an early-exit strategy to automatically decide the proper stage to exit, which can skip the unnecessary and even harmful late stages. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our progressive harmonization network. \ No newline at end of file diff --git a/data/2024/aaai/Progressive Text-to-Image Diffusion with Soft Latent Direction b/data/2024/aaai/Progressive Text-to-Image Diffusion with Soft Latent Direction new file mode 100644 index 0000000000..501afe3161 --- /dev/null +++ b/data/2024/aaai/Progressive Text-to-Image Diffusion with Soft Latent Direction @@ -0,0 +1 @@ +In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations—namely insertion, editing, and erasing—we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field's performance standards. \ No newline at end of file diff --git a/data/2024/aaai/Progressively Knowledge Distillation via Re-parameterizing Diffusion Reverse Process b/data/2024/aaai/Progressively Knowledge Distillation via Re-parameterizing Diffusion Reverse Process new file mode 100644 index 0000000000..f78ffff1d9 --- /dev/null +++ b/data/2024/aaai/Progressively Knowledge Distillation via Re-parameterizing Diffusion Reverse Process @@ -0,0 +1,9 @@ +Knowledge distillation aims at transferring knowledge from the teacher model to the student one by aligning their distributions. +Feature-level distillation often uses L2 distance or its variants as the loss function, based on the assumption that outputs follow normal distributions. +This poses a significant challenge when distribution gaps are substantial since this loss function ignores the variance term. +To address the problem, we propose to decompose the transfer objective into small parts and optimize it progressively. +This process is inspired by diffusion models from which the noise distribution is mapped to the target distribution step by step. +However, directly employing diffusion models is impractical in the distillation scenario due to its heavy reverse process. +To overcome this challenge, we adopt the structural re-parameterization technique to generate multiple student features to approximate the teacher features sequentially. 
+The multiple student features are combined linearly in inference time without extra cost. +We present extensive experiments performed on various transfer scenarios, such as CNN-to-CNN and Transformer-to-CNN, that validate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Project-Fair and Truthful Mechanisms for Budget Aggregation b/data/2024/aaai/Project-Fair and Truthful Mechanisms for Budget Aggregation new file mode 100644 index 0000000000..6e55f7ef83 --- /dev/null +++ b/data/2024/aaai/Project-Fair and Truthful Mechanisms for Budget Aggregation @@ -0,0 +1 @@ +We study the budget aggregation problem in which a set of strategic voters must split a finite divisible resource (such as money or time) among a set of competing projects. Our goal is twofold: We seek truthful mechanisms that provide fairness guarantees to the projects. For the first objective, we focus on the class of moving phantom mechanisms, which are -- to this day -- essentially the only known truthful mechanisms in this setting. For project fairness, we consider the mean division as a fair baseline, and bound the maximum difference between the funding received by any project and this baseline. We propose a novel and simple moving phantom mechanism that provides optimal project fairness guarantees. As a corollary of our results, we show that our new mechanism minimizes the L1 distance to the mean for three projects and gives the first non-trivial bounds on this quantity for more than three projects. \ No newline at end of file diff --git a/data/2024/aaai/Promoting Counterfactual Robustness through Diversity b/data/2024/aaai/Promoting Counterfactual Robustness through Diversity new file mode 100644 index 0000000000..519fc8a93e --- /dev/null +++ b/data/2024/aaai/Promoting Counterfactual Robustness through Diversity @@ -0,0 +1,12 @@ +Counterfactual explanations shed light on the decisions of black-box models by explaining +how an input can be altered to obtain a favourable decision from the model (e.g., when a loan application has been rejected). +However, as noted recently, counterfactual explainers may lack robustness in the sense that a minor change +in the input can cause a major change in the explanation. This can cause confusion on the user side and +open the door for adversarial attacks. In this paper, we study some sources of non-robustness. +While there are fundamental reasons for why an explainer that returns a single counterfactual cannot be +robust in all instances, we show that some interesting robustness guarantees can be given by reporting +multiple rather than a single counterfactual. Unfortunately, the number of counterfactuals that need to +be reported for the theoretical guarantees to hold can be prohibitively large. We therefore propose an approximation +algorithm that uses a diversity criterion to select a feasible number of most relevant explanations and study its robustness empirically. Our experiments indicate that our method improves the +state-of-the-art in generating robust explanations, while maintaining other desirable properties +and providing competitive computational performance. 
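One simple way to instantiate a diversity criterion over candidate counterfactuals is greedy max-min selection (a generic heuristic sketch, not the paper's approximation algorithm; candidates are assumed to be numeric feature vectors):

```python
# Greedy max-min (farthest-point) selection of k diverse candidates from a pool of
# counterfactual feature vectors; a generic heuristic, not the paper's algorithm.
import numpy as np

def select_diverse(candidates, k):
    candidates = np.asarray(candidates, dtype=float)
    chosen = [0]                                     # seed with the first candidate
    while len(chosen) < min(k, len(candidates)):
        diff = candidates[:, None, :] - candidates[chosen][None, :, :]
        dists = np.min(np.linalg.norm(diff, axis=-1), axis=1)
        dists[chosen] = -np.inf                      # never re-pick a chosen point
        chosen.append(int(np.argmax(dists)))         # add the farthest remaining one
    return candidates[chosen]
```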
\ No newline at end of file diff --git a/data/2024/aaai/Promoting Fair Vaccination Strategies through Influence Maximization: A Case Study on COVID-19 Spread b/data/2024/aaai/Promoting Fair Vaccination Strategies through Influence Maximization: A Case Study on COVID-19 Spread new file mode 100644 index 0000000000..cc129fb57d --- /dev/null +++ b/data/2024/aaai/Promoting Fair Vaccination Strategies through Influence Maximization: A Case Study on COVID-19 Spread @@ -0,0 +1 @@ +The aftermath of the Covid-19 pandemic saw more severe outcomes for racial minority groups and economically-deprived communities. Such disparities can be explained by several factors, including unequal access to healthcare, as well as the inability of low-income groups to reduce their mobility due to work or social obligations. Moreover, senior citizens were found to be more susceptible to severe symptoms, largely due to age-related health reasons. Adapting vaccine distribution strategies to consider a range of demographics is therefore essential to address these disparities. In this study, we propose a novel approach that utilizes influence maximization (IM) on mobility networks to develop vaccination strategies which incorporate demographic fairness. By considering factors such as race, social status, age, and associated risk factors, we aim to optimize vaccine distribution to achieve various fairness definitions for one or more protected attributes at a time. Through extensive experiments conducted on Covid-19 spread in three major metropolitan areas across the United States, we demonstrate the effectiveness of our proposed approach in reducing disease transmission and promoting fairness in vaccination distribution. \ No newline at end of file diff --git a/data/2024/aaai/Promoting Research Collaboration with Open Data Driven Team Recommendation in Response to Call for Proposals b/data/2024/aaai/Promoting Research Collaboration with Open Data Driven Team Recommendation in Response to Call for Proposals new file mode 100644 index 0000000000..9a9e9b7010 --- /dev/null +++ b/data/2024/aaai/Promoting Research Collaboration with Open Data Driven Team Recommendation in Response to Call for Proposals @@ -0,0 +1 @@ +Building teams and promoting collaboration are two very common business activities. An example of these is seen in the TeamingForFunding problem, where research institutions and researchers are interested in identifying collaborative opportunities when applying to funding agencies in response to the latter's calls for proposals. We describe a novel deployed system to recommend teams using a variety of AI methods, such that (1) each team achieves the highest possible skill coverage that is demanded by the opportunity, and (2) the workload of distributing the opportunities is balanced amongst the candidate members. We address these questions by extracting skills latent in open data of proposal calls (demand) and researcher profiles (supply), normalizing them using taxonomies, and creating efficient algorithms that match demand to supply. We create teams to maximize goodness along a novel metric balancing short- and long-term objectives. We validate the success of our algorithms (1) quantitatively, by evaluating the recommended teams using a goodness score and find that more informed methods lead to recommendations of a smaller number of teams but higher goodness, and (2) qualitatively, by conducting a large-scale user study at a college-wide level, and demonstrate that users overall found the tool very useful and relevant.
Lastly, we evaluate our system in two diverse settings in US and India (of researchers and proposal calls) to establish generality of our approach, and deploy it at a major US university for routine use. \ No newline at end of file diff --git a/data/2024/aaai/Prompt-Based Distribution Alignment for Unsupervised Domain Adaptation b/data/2024/aaai/Prompt-Based Distribution Alignment for Unsupervised Domain Adaptation new file mode 100644 index 0000000000..df2ce8dc40 --- /dev/null +++ b/data/2024/aaai/Prompt-Based Distribution Alignment for Unsupervised Domain Adaptation @@ -0,0 +1 @@ +Recently, despite the unprecedented success of large pre-trained visual-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that the unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target domains, thereby improving the performance of UDA. However, a major challenge for directly deploying such models on downstream UDA tasks is prompt engineering, which requires aligning the domain knowledge of source and target domains, since the performance of UDA is severely influenced by a good domain-invariant representation. We further propose a Prompt-based Distribution Alignment (PDA) method to incorporate the domain knowledge into prompt learning. Specifically, PDA employs a two-branch prompt-tuning paradigm, namely base branch and alignment branch. The base branch focuses on integrating class-related representation into prompts, ensuring discrimination among different classes. To further minimize domain discrepancy, for the alignment branch, we construct feature banks for both the source and target domains and propose image-guided feature tuning (IFT) to make the input attend to feature banks, which effectively integrates self-enhanced and cross-domain features into the model. In this way, these two branches can be mutually promoted to enhance the adaptation of VLMs for UDA. We conduct extensive experiments on three benchmarks to demonstrate that our proposed PDA achieves state-of-the-art performance. The code is available at https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment. \ No newline at end of file diff --git a/data/2024/aaai/PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation b/data/2024/aaai/PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation new file mode 100644 index 0000000000..91caaf5dd9 --- /dev/null +++ b/data/2024/aaai/PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation @@ -0,0 +1 @@ +Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and disease identification. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnosis unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on encoder-decoder architecture with an extra disease classification branch. 
When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalance issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of the text decoder's inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets. \ No newline at end of file diff --git a/data/2024/aaai/Prompting Multi-Modal Image Segmentation with Semantic Grouping b/data/2024/aaai/Prompting Multi-Modal Image Segmentation with Semantic Grouping new file mode 100644 index 0000000000..a03096d770 --- /dev/null +++ b/data/2024/aaai/Prompting Multi-Modal Image Segmentation with Semantic Grouping @@ -0,0 +1 @@ +Multi-modal image segmentation is one of the core issues in computer vision. The main challenge lies in integrating common information between modalities while retaining specific patterns for each modality. Existing methods typically perform full fine-tuning on RGB-based pre-trained parameters to inherit the powerful representation of the foundation model. Although effective, such a paradigm is not optimal due to weak transferability and scarce downstream data. Inspired by the recent success of prompt learning in language models, we propose the Grouping Prompt Tuning Framework (GoPT), which introduces explicit semantic grouping to learn modal-related prompts, adapting the frozen pre-trained foundation model to various downstream multi-modal segmentation tasks. Specifically, a class-aware uni-modal prompter is designed to balance intra- and inter-modal semantic propagation by grouping modality-specific class tokens, thereby improving the adaptability of spatial information. Furthermore, an alignment-induced cross-modal prompter is introduced to aggregate class-aware representations and share prompt parameters among different modalities to assist in modeling common statistics. Extensive experiments show the superiority of our GoPT, which achieves SOTA performance on various downstream multi-modal image segmentation tasks by training only < 1% of the model parameters. \ No newline at end of file diff --git a/data/2024/aaai/Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer b/data/2024/aaai/Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer new file mode 100644 index 0000000000..0378af69f8 --- /dev/null +++ b/data/2024/aaai/Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer @@ -0,0 +1 @@ +Never having seen an object and heard its sound simultaneously, can a model still accurately localize the object's visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks, but under the demanding zero-shot and few-shot scenarios.
To achieve this goal, different from existing approaches that mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, aiming to better fit the data scarcity and varying data distribution dilemmas with the help of abundant knowledge from pre-trained models. Specifically, we first propose to construct a Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects; meanwhile, the semantic gap between the visual and audio modalities is also encouraged to shrink. Then, we develop a Correlation Adapter (ColA) to keep training effort minimal while maintaining adequate knowledge of the visual foundation model. Equipped with these components, the new paradigm outperforms other fusion-based methods in extensive experiments in both the unseen-class and cross-dataset settings. We hope that our work can further promote the generalization study of Audio-Visual Localization and Segmentation in practical application scenarios. Project page: https://github.com/GeWu-Lab/Generalizable-Audio-Visual-Segmentation \ No newline at end of file diff --git a/data/2024/aaai/Proportional Aggregation of Preferences for Sequential Decision Making b/data/2024/aaai/Proportional Aggregation of Preferences for Sequential Decision Making new file mode 100644 index 0000000000..663b00ba1a --- /dev/null +++ b/data/2024/aaai/Proportional Aggregation of Preferences for Sequential Decision Making @@ -0,0 +1 @@ +We study the problem of fair sequential decision making given voter preferences. In each round, a decision rule must choose a decision from a set of alternatives where each voter reports which of these alternatives they approve. Instead of going with the most popular choice in each round, we aim for proportional representation, using axioms inspired by the multi-winner voting literature. The axioms require that if a group of α% of the voters agrees in every round (i.e., approves a common alternative), then those voters must approve at least α% of the decisions. A stronger version of the axioms requires that every group of α% of the voters that agrees in a β fraction of rounds must approve β⋅α% of the decisions. We show that three attractive voting rules satisfy axioms of this style. One of them (Sequential Phragmén) makes its decisions online, and the other two satisfy strengthened versions of the axioms but make decisions semi-online (Method of Equal Shares) or fully offline (Proportional Approval Voting). We present empirical results for these rules based on synthetic data and U.S. political elections. We also run experiments using the Moral Machine dataset about ethical dilemmas. We train preference models on user responses from different countries and let the models cast votes. We find that aggregating these votes using our rules leads to a more equal utility distribution across demographics than making decisions using a single global preference model.
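[Editorial note] To make the basic proportionality axiom above concrete, the following is a minimal brute-force sketch for toy instances. The function name, the data layout, and the reading of "those voters must approve" (counted here as decisions approved by every member of the group) are our own assumptions, not the paper's code; the stronger β-version is not checked.

from itertools import combinations

def satisfies_basic_axiom(ballots, decisions, n_voters):
    # ballots[r][v]: set of alternatives voter v approves in round r
    # decisions[r]: alternative chosen in round r
    T = len(decisions)
    for size in range(1, n_voters + 1):
        for group in combinations(range(n_voters), size):
            # the group "agrees" if some common alternative exists in every round
            if not all(set.intersection(*(ballots[r][v] for v in group)) for r in range(T)):
                continue
            # decisions approved by every member of the group (one plausible reading)
            approved = sum(1 for r in range(T)
                           if all(decisions[r] in ballots[r][v] for v in group))
            # a group making up size/n_voters of the electorate must approve at least that share
            if approved < (size / n_voters) * T:
                return False
    return True

The enumeration over all voter groups is exponential and only meant to illustrate the axiom on small examples, not to evaluate the paper's rules.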
\ No newline at end of file diff --git a/data/2024/aaai/Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers b/data/2024/aaai/Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers new file mode 100644 index 0000000000..e0cf246e89 --- /dev/null +++ b/data/2024/aaai/Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers @@ -0,0 +1,7 @@ +In recent years, significant progress has been made in the field of protein function prediction with the development of various machine-learning approaches. +However, most existing methods formulate the task as a multi-classification problem, i.e., assigning predefined labels to proteins. +In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free text style, moving beyond the conventional binary or categorical classifications. +By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types including protein sequence, structure, and textual annotation and description. +This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions. +To evaluate our model, we extract a multimodal protein dataset from SwissProt and demonstrate empirically the effectiveness of Prot2Text. +These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as first-to-see proteins. \ No newline at end of file diff --git a/data/2024/aaai/Protect Your Score: Contact-Tracing with Differential Privacy Guarantees b/data/2024/aaai/Protect Your Score: Contact-Tracing with Differential Privacy Guarantees new file mode 100644 index 0000000000..fe43d5c761 --- /dev/null +++ b/data/2024/aaai/Protect Your Score: Contact-Tracing with Differential Privacy Guarantees @@ -0,0 +1 @@ +The pandemic in 2020 and 2021 had enormous economic and societal consequences, and studies show that contact tracing algorithms can be key in the early containment of the virus. While large strides have been made towards more effective contact tracing algorithms, we argue that privacy concerns currently hold deployment back. The essence of a contact tracing algorithm is the communication of a risk score. Yet, it is precisely the communication and release of this score to a user that an adversary can leverage to gauge the private health status of an individual. We pinpoint a realistic attack scenario and propose a contact tracing algorithm with differential privacy guarantees against this attack. The algorithm is tested on the two most widely used agent-based COVID19 simulators and demonstrates superior performance in a wide range of settings. Especially for realistic test scenarios and while releasing each risk score with epsilon=1 differential privacy, we achieve a two- to ten-fold reduction in the infection rate of the virus. To the best of our knowledge, this presents the first contact tracing algorithm with differential privacy guarantees when revealing risk scores for COVID19.
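[Editorial note] For intuition about what "releasing each risk score with epsilon=1 differential privacy" entails, here is a generic Laplace-mechanism sketch. It is not the paper's actual mechanism, and the sensitivity bound and [0, 1] score range are stated assumptions.

import numpy as np

def release_risk_score(true_score, epsilon=1.0, sensitivity=1.0, rng=None):
    # Laplace mechanism: adding Laplace(sensitivity / epsilon) noise gives epsilon-DP
    # for any quantity whose value changes by at most `sensitivity` when one
    # individual's data changes. Clipping keeps the released score in [0, 1].
    rng = rng or np.random.default_rng()
    noisy = true_score + rng.laplace(scale=sensitivity / epsilon)
    return float(np.clip(noisy, 0.0, 1.0))

Smaller epsilon means more noise and stronger privacy; the paper's contribution lies in making contact tracing effective despite this noise, which the sketch does not capture.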
\ No newline at end of file diff --git a/data/2024/aaai/Provable Robustness against a Union of L_0 Adversarial Attacks b/data/2024/aaai/Provable Robustness against a Union of L_0 Adversarial Attacks new file mode 100644 index 0000000000..3c20995f29 --- /dev/null +++ b/data/2024/aaai/Provable Robustness against a Union of L_0 Adversarial Attacks @@ -0,0 +1 @@ +Sparse or L0 adversarial attacks arbitrarily perturb an unknown subset of the features. L0 robustness analysis is particularly well-suited for heterogeneous (tabular) data where features have different types or scales. State-of-the-art L0 certified defenses are based on randomized smoothing and apply to evasion attacks only. This paper proposes feature partition aggregation (FPA) -- a certified defense against the union of L0 evasion, backdoor, and poisoning attacks. FPA generates its stronger robustness guarantees via an ensemble whose submodels are trained on disjoint feature sets. Compared to state-of-the-art L0 defenses, FPA is up to 3,000x faster and provides larger median robustness guarantees (e.g., median certificates of 13 pixels over 10 for CIFAR10, 12 pixels over 10 for MNIST, 4 features over 1 for Weather, and 3 features over 1 for Ames), meaning FPA provides the additional dimensions of robustness essentially for free. \ No newline at end of file diff --git a/data/2024/aaai/Provably Convergent Federated Trilevel Learning b/data/2024/aaai/Provably Convergent Federated Trilevel Learning new file mode 100644 index 0000000000..d62f871c23 --- /dev/null +++ b/data/2024/aaai/Provably Convergent Federated Trilevel Learning @@ -0,0 +1 @@ +Trilevel learning, also called trilevel optimization (TLO), has been recognized as a powerful modelling tool for hierarchical decision process and widely applied in many machine learning applications, such as robust neural architecture search, hyperparameter optimization, and domain adaptation. Tackling TLO problems has presented a great challenge due to their nested decision-making structure. In addition, existing works on TLO face the following key challenges: 1) they all focus on the non-distributed setting, which may lead to privacy breach; 2) they do not offer any non-asymptotic convergence analysis which characterizes how fast an algorithm converges. To address the aforementioned challenges, this paper proposes an asynchronous federated trilevel optimization method to solve TLO problems. The proposed method utilizes u-cuts to construct a hyper-polyhedral approximation for the TLO problem and solve it in an asynchronous manner. We demonstrate that the proposed u-cuts are applicable to not only convex functions but also a wide range of non-convex functions that meet the u-weakly convex assumption. Furthermore, we theoretically analyze the non-asymptotic convergence rate for the proposed method by showing its iteration complexity to obtain ϵ-stationary point is upper bounded by O(1/ϵ²). Extensive experiments on real-world datasets have been conducted to elucidate the superiority of the proposed method, e.g., it has a faster convergence rate with a maximum acceleration of approximately 80%. 
\ No newline at end of file diff --git a/data/2024/aaai/Provably Powerful Graph Neural Networks for Directed Multigraphs b/data/2024/aaai/Provably Powerful Graph Neural Networks for Directed Multigraphs new file mode 100644 index 0000000000..e6d6342d7a --- /dev/null +++ b/data/2024/aaai/Provably Powerful Graph Neural Networks for Directed Multigraphs @@ -0,0 +1,2 @@ +This paper analyses a set of simple adaptations that transform standard message-passing Graph Neural Networks (GNN) into provably powerful directed multigraph neural networks. The adaptations include multigraph port numbering, ego IDs, and reverse message passing. We prove that the combination of these theoretically enables the detection of any directed subgraph pattern. To validate the effectiveness of our proposed adaptations in practice, we conduct experiments on synthetic subgraph detection tasks, which demonstrate outstanding performance with almost perfect results. +Moreover, we apply our proposed adaptations to two financial crime analysis tasks. We observe dramatic improvements in detecting money laundering transactions, improving the minority-class F1 score of a standard message-passing GNN by up to 30%, and closely matching or outperforming tree-based and GNN baselines. Similarly impressive results are observed on a real-world phishing detection dataset, boosting three standard GNNs’ F1 scores by around 15% and outperforming all baselines. An extended version with appendices can be found on arXiv: https://arxiv.org/abs/2306.11586. \ No newline at end of file diff --git a/data/2024/aaai/Providing Fair Recourse over Plausible Groups b/data/2024/aaai/Providing Fair Recourse over Plausible Groups new file mode 100644 index 0000000000..6efee70b36 --- /dev/null +++ b/data/2024/aaai/Providing Fair Recourse over Plausible Groups @@ -0,0 +1 @@ +Machine learning models now automate decisions in applications where we may wish to provide recourse to adversely affected individuals. In practice, existing methods to provide recourse return actions that fail to account for latent characteristics that are not captured in the model (e.g., age, sex, marital status). In this paper, we study how the cost and feasibility of recourse can change across these latent groups. We introduce a notion of group-level plausibility to identify groups of individuals with a shared set of latent characteristics. We develop a general-purpose clustering procedure to identify groups from samples. Further, we propose a constrained optimization approach to learn models that equalize the cost of recourse over latent groups. We evaluate our approach through an empirical study on simulated and real-world datasets, showing that it can produce models that have better performance in terms of overall costs and feasibility at a group level. \ No newline at end of file diff --git a/data/2024/aaai/ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection b/data/2024/aaai/ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection new file mode 100644 index 0000000000..3e71823393 --- /dev/null +++ b/data/2024/aaai/ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection @@ -0,0 +1 @@ +Open-vocabulary object detection (OVOD) aims to recognize novel objects whose categories are not included in the training set. 
In order to classify these unseen classes during training, many OVOD frameworks leverage the zero-shot capability of large-scale pretrained vision and language models, such as CLIP. To further improve generalization on the unseen novel classes, several approaches have proposed to additionally train with pseudo region labeling on external data sources that contain a substantial number of novel category labels beyond the existing training data. Despite their simplicity, these pseudo-labeling methods still exhibit limited improvement with regard to the truly unseen novel classes that were not pseudo-labeled. In this paper, we present a novel, yet simple technique that helps generalization on the overall distribution of novel classes. Inspired by our observation that numerous novel classes reside within the convex hull constructed by the base (seen) classes in the CLIP embedding space, we propose to synthesize proxy-novel classes approximating novel classes via linear mixup between a pair of base classes. By training our detector with these synthetic proxy-novel classes, we effectively explore the embedding space of novel classes. The experimental results on various OVOD benchmarks such as LVIS and COCO demonstrate superior performance on novel classes compared to the other state-of-the-art methods. Code is available at https://github.com/clovaai/ProxyDet. \ No newline at end of file diff --git "a/data/2024/aaai/Proxyformer: Nystr\303\266m-Based Linear Transformer with Trainable Proxy Tokens" "b/data/2024/aaai/Proxyformer: Nystr\303\266m-Based Linear Transformer with Trainable Proxy Tokens" new file mode 100644 index 0000000000..6345eef54c --- /dev/null +++ "b/data/2024/aaai/Proxyformer: Nystr\303\266m-Based Linear Transformer with Trainable Proxy Tokens" @@ -0,0 +1 @@ +Transformer-based models have demonstrated remarkable performance in various domains, including natural language processing, image processing and generative modeling. The most significant contributor to the successful performance of Transformer models is the self-attention mechanism, which allows for a comprehensive understanding of the interactions between tokens in the input sequence. However, there is a well-known scalability issue, the quadratic dependency (i.e., O(n^2)) of self-attention operations on the input sequence length n, making the handling of lengthy sequences challenging. To address this limitation, there has been a surge of research on efficient transformers, aiming to alleviate the quadratic dependency on the input sequence length. Among these, the Nyströmformer, which utilizes the Nyström method to decompose the attention matrix, achieves superior performance in both accuracy and throughput. However, its landmark selection exhibits redundancy, and the model incurs computational overhead when calculating the pseudo-inverse matrix. We propose a novel Nyström method-based transformer, called Proxyformer. Unlike the traditional approach of selecting landmarks from input tokens, the Proxyformer utilizes trainable neural memory, called proxy tokens, for landmarks. By integrating contrastive learning, input injection, and a specialized dropout for the decomposed matrix, Proxyformer achieves top-tier performance for long sequence tasks in the Long Range Arena benchmark.
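[Editorial note] The following is a minimal PyTorch sketch of Nyström-style attention in which the landmarks are trainable proxy tokens rather than tokens selected from the input. The class name, shapes, and the use of an exact pseudo-inverse are our own illustrative choices, not the authors' implementation (which, per the abstract, also involves contrastive learning, input injection, and a specialized dropout).

import torch
import torch.nn as nn

class ProxyAttention(nn.Module):
    # Nystrom-style attention with m trainable proxy tokens as landmarks.
    def __init__(self, dim, num_proxies=32):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim) * dim ** -0.5)
        self.scale = dim ** -0.5

    def forward(self, q, k, v):                                # q, k, v: (batch, n, dim)
        p = self.proxies.unsqueeze(0).expand(q.size(0), -1, -1)
        kernel1 = torch.softmax(q @ p.transpose(1, 2) * self.scale, dim=-1)  # (b, n, m)
        kernel2 = torch.softmax(p @ p.transpose(1, 2) * self.scale, dim=-1)  # (b, m, m)
        kernel3 = torch.softmax(p @ k.transpose(1, 2) * self.scale, dim=-1)  # (b, m, n)
        # softmax(QK^T)V is approximated by kernel1 @ pinv(kernel2) @ (kernel3 @ V),
        # reducing the cost from O(n^2) to O(n * m) for m << n tokens.
        return kernel1 @ torch.linalg.pinv(kernel2) @ (kernel3 @ v)

Because the m landmarks are learned parameters shared across inputs, redundancy among them can be controlled by training rather than by a selection heuristic.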
\ No newline at end of file diff --git a/data/2024/aaai/Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment b/data/2024/aaai/Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment new file mode 100644 index 0000000000..ab95463ea3 --- /dev/null +++ b/data/2024/aaai/Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment @@ -0,0 +1 @@ +Multi-modal entity alignment (MMEA) aims to identify equivalent entities between two multi-modal knowledge graphs for integration. Unfortunately, prior works have focused on improving the interaction and fusion of multi-modal information while overlooking the influence of modal-specific noise and the usage of labeled and unlabeled data in semi-supervised settings. In this work, we introduce Pseudo-label Calibration Multi-modal Entity Alignment (PCMEA), a semi-supervised approach. Specifically, in order to generate holistic entity representations, we first devise various embedding modules and attention mechanisms to extract visual, structural, relational, and attribute features. Different from the prior direct fusion methods, we next propose to exploit mutual information maximization to filter the modal-specific noise and to augment modal-invariant commonality. Then, we combine pseudo-label calibration with momentum-based contrastive learning to make full use of the labeled and unlabeled data, which improves the quality of pseudo-labels and pulls aligned entities closer. Finally, extensive experiments on two MMEA datasets demonstrate the effectiveness of our PCMEA, which yields state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/Pure-Past Action Masking b/data/2024/aaai/Pure-Past Action Masking new file mode 100644 index 0000000000..df3f98d0b4 --- /dev/null +++ b/data/2024/aaai/Pure-Past Action Masking @@ -0,0 +1 @@ +We present Pure-Past Action Masking (PPAM), a lightweight approach to action masking for safe reinforcement learning. In PPAM, actions are disallowed (“masked”) according to specifications expressed in Pure-Past Linear Temporal Logic (PPLTL). PPAM can enforce non-Markovian constraints, i.e., constraints based on the history of the system, rather than just the current state of the (possibly hidden) MDP. The features used in the safety constraint need not be the same as those used by the learning agent, allowing a clear separation of concerns between the safety constraints and reward specifications of the (learning) agent. We prove formally that an agent trained with PPAM can learn any optimal policy that satisfies the safety constraints, and that PPAMs are as expressive as shields, another approach to enforce non-Markovian constraints in RL. Finally, we provide empirical results showing how PPAM can guarantee constraint satisfaction in practice. \ No newline at end of file diff --git a/data/2024/aaai/Pushing the Limit of Fine-Tuning for Few-Shot Learning: Where Feature Reusing Meets Cross-Scale Attention b/data/2024/aaai/Pushing the Limit of Fine-Tuning for Few-Shot Learning: Where Feature Reusing Meets Cross-Scale Attention new file mode 100644 index 0000000000..9708ee260e --- /dev/null +++ b/data/2024/aaai/Pushing the Limit of Fine-Tuning for Few-Shot Learning: Where Feature Reusing Meets Cross-Scale Attention @@ -0,0 +1 @@ +Due to the scarcity of training samples, Few-Shot Learning (FSL) poses a significant challenge to capture discriminative object features effectively.
The combination of transfer learning and meta-learning has recently been explored by pre-training the backbone features using labeled base data and subsequently fine-tuning the model with target data. However, existing meta-learning methods, which use embedding networks, suffer from scaling limitations when dealing with a few labeled samples, resulting in suboptimal results. Inspired by the latest advances in FSL, we further advance the approach of fine-tuning a pre-trained architecture by a strengthened hierarchical feature representation. The technical contributions of this work include: 1) a hybrid design named Intra-Block Fusion (IBF) to strengthen the extracted features within each convolution block; and 2) a novel Cross-Scale Attention (CSA) module to mitigate the scaling inconsistencies arising from the limited training samples, especially for cross-domain tasks. We conducted comprehensive evaluations on standard benchmarks, including three in-domain tasks (miniImageNet, CIFAR-FS, and FC100), as well as two cross-domain tasks (CDFSL and Meta-Dataset). The results have improved significantly over existing state-of-the-art approaches on all benchmark datasets. In particular, the FSL performance on the in-domain FC100 dataset is more than three points better than the latest PMF (Hu et al. 2022). \ No newline at end of file diff --git a/data/2024/aaai/Q-SENN: Quantized Self-Explaining Neural Networks b/data/2024/aaai/Q-SENN: Quantized Self-Explaining Neural Networks new file mode 100644 index 0000000000..3fd5e0b566 --- /dev/null +++ b/data/2024/aaai/Q-SENN: Quantized Self-Explaining Neural Networks @@ -0,0 +1 @@ +Explanations in Computer Vision are often desired, but most Deep Neural Networks can only provide saliency maps with questionable faithfulness. Self-Explaining Neural Networks (SENN) extract interpretable concepts with fidelity, diversity, and grounding to combine them linearly for decision-making. While they can explain what was recognized, initial realizations lack accuracy and general applicability. We propose the Quantized-Self-Explaining Neural Network “Q-SENN”. Q-SENN satisfies or exceeds the desiderata of SENN while being applicable to more complex datasets and maintaining most or all of the accuracy of an uninterpretable baseline model, outperforming previous work in all considered metrics. Q-SENN describes the relationship between every class and feature as either positive, negative or neutral instead of an arbitrary number of possible relations, enforcing more binary human-friendly features. Since every class is assigned just 5 interpretable features on average, Q-SENN shows convincing local and global interpretability. Additionally, we propose a feature alignment method, capable of aligning learned features with human language-based concepts without additional supervision. Thus, what is learned can be more easily verbalized. The code is published: https://github.com/ThomasNorr/Q-SENN \ No newline at end of file diff --git a/data/2024/aaai/QCS-SGM+: Improved Quantized Compressed Sensing with Score-Based Generative Models b/data/2024/aaai/QCS-SGM+: Improved Quantized Compressed Sensing with Score-Based Generative Models new file mode 100644 index 0000000000..95633d2ef7 --- /dev/null +++ b/data/2024/aaai/QCS-SGM+: Improved Quantized Compressed Sensing with Score-Based Generative Models @@ -0,0 +1 @@ +In practical compressed sensing (CS), the obtained measurements typically necessitate quantization to a limited number of bits prior to transmission or storage. 
This nonlinear quantization process poses significant recovery challenges, particularly with extremely coarse quantization such as 1-bit. Recently, an efficient algorithm called QCS-SGM was proposed for quantized CS (QCS), which utilizes score-based generative models (SGM) as an implicit prior. Due to the adeptness of SGM in capturing the intricate structures of natural signals, QCS-SGM substantially outperforms previous QCS methods. However, QCS-SGM is constrained to (approximately) row-orthogonal sensing matrices as the computation of the likelihood score becomes intractable otherwise. To address this limitation, we introduce an advanced variant of QCS-SGM, termed QCS-SGM+, capable of handling general matrices effectively. The key idea is a Bayesian inference perspective on the likelihood score computation, wherein expectation propagation is employed for its approximate computation. Extensive experiments are conducted, demonstrating the substantial superiority of QCS-SGM+ over QCS-SGM for general sensing matrices beyond mere row-orthogonality. \ No newline at end of file diff --git a/data/2024/aaai/QDETRv: Query-Guided DETR for One-Shot Object Localization in Videos b/data/2024/aaai/QDETRv: Query-Guided DETR for One-Shot Object Localization in Videos new file mode 100644 index 0000000000..5fb85b35db --- /dev/null +++ b/data/2024/aaai/QDETRv: Query-Guided DETR for One-Shot Object Localization in Videos @@ -0,0 +1 @@ +In this work, we study the one-shot video object localization problem, which aims to localize instances of unseen objects in the target video using a single query image of the object. Toward addressing this challenging problem, we extend a popular and successful object detection method, namely DETR (Detection Transformer), and introduce a novel approach, the query-guided detection transformer for videos (QDETRv). A distinctive feature of QDETRv is its capacity to exploit information from the query image and spatio-temporal context of the target video, which significantly aids in precisely pinpointing the desired object in the video. We incorporate cross-attention mechanisms that capture temporal relationships across adjacent frames to handle the dynamic context in videos effectively. Further, to ensure strong initialization for QDETRv, we also introduce a novel unsupervised pretraining technique tailored to videos. This involves training our model on synthetic object trajectories with an objective analogous to the query-guided localization task. During this pretraining phase, we incorporate recurrent object queries and loss functions that encourage accurate patch feature reconstruction. These additions enable better temporal understanding and robust representation learning. Our experiments show that the proposed model significantly outperforms the competitive baselines on two public benchmarks, VidOR and ImageNet-VidVRD, extended for one-shot open-set localization tasks.
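[Editorial note] As a generic illustration of query-guided cross-attention over video context (not the QDETRv architecture itself; the class and argument names are our own), object queries can jointly attend to the query-image embedding and to tokens from the current and adjacent frames:

import torch
import torch.nn as nn

class QueryGuidedCrossAttention(nn.Module):
    # Object queries attend to a context built from the query-image embedding
    # and flattened features of adjacent frames.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, object_queries, query_img_feat, frame_feats):
        # object_queries: (b, q, d); query_img_feat: (b, 1, d); frame_feats: (b, t*hw, d)
        context = torch.cat([query_img_feat, frame_feats], dim=1)
        out, _ = self.attn(object_queries, context, context)
        return out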
\ No newline at end of file diff --git a/data/2024/aaai/QI-IRA: Quantum-Inspired Interactive Ranking Aggregation for Person Re-identification b/data/2024/aaai/QI-IRA: Quantum-Inspired Interactive Ranking Aggregation for Person Re-identification new file mode 100644 index 0000000000..a326f7ce26 --- /dev/null +++ b/data/2024/aaai/QI-IRA: Quantum-Inspired Interactive Ranking Aggregation for Person Re-identification @@ -0,0 +1 @@ +Ranking aggregation (RA), the process of aggregating multiple rankings derived from multiple search strategies, has been proven effective in person re-identification (re-ID) because a single re-ID method cannot always achieve consistent superiority across different scenarios. Existing RA research mainly focuses on unsupervised and fully-supervised methods. The former lack external supervision to optimize performance, while the latter are costly because of the expensive labeling effort required for training. To address the above challenges, this paper proposes a quantum-inspired interactive ranking aggregation (QI-IRA) method, which (1) utilizes quantum theory to interpret and model the generation and aggregation of multiple basic rankings, (2) approximates or even exceeds the performance of fully-supervised RA methods with much less labeling cost, even as low as only two feedbacks per query on Market1501, MARS and DukeMTMC-VideoReID datasets. Comparative experiments conducted on six public re-ID datasets validate the superiority of the proposed QI-IRA method over existing unsupervised, interactive, and fully-supervised RA approaches. \ No newline at end of file diff --git a/data/2024/aaai/QLABGrad: A Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning b/data/2024/aaai/QLABGrad: A Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning new file mode 100644 index 0000000000..197e841dff --- /dev/null +++ b/data/2024/aaai/QLABGrad: A Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning @@ -0,0 +1,18 @@ +The learning rate is a critical hyperparameter for deep learning +tasks since it determines the extent to which the model +parameters are adjusted during the learning course. However, +the choice of learning rates typically depends on empirical +judgment, which may not result in satisfactory outcomes +without intensive trial-and-error experiments. In this +study, we propose a novel learning rate adaptation scheme +called QLABGrad. Without any user-specified hyperparameter, +QLABGrad automatically determines the learning rate by +optimizing the quadratic loss approximation-based (QLAB) +function for a given gradient descent direction, where only +one extra forward propagation is required. We theoretically +prove the convergence of QLABGrad under the smooth Lipschitz +condition on the loss function. Experimental results on +multiple architectures, including MLP, CNN, and ResNet, on +MNIST, CIFAR10, and ImageNet datasets, demonstrate that +QLABGrad outperforms widely adopted schemes for deep +learning.
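[Editorial note] One plausible reading of a quadratic-loss-approximation line search that needs only one extra forward propagation is sketched below; the formula, the probe step, and the fallback are our own assumptions for illustration, not the authors' exact QLAB derivation.

import torch

def qlab_learning_rate(loss_fn, params, grads, probe_lr=1e-3):
    # Along the negative gradient direction, model phi(eta) = L(theta - eta * g)
    # by a quadratic phi(eta) ~= L0 - g2*eta + a*eta^2 using phi(0) = L0,
    # phi'(0) = -g2 (squared gradient norm), and one extra forward pass at eta = probe_lr,
    # then return the minimizer of the fitted quadratic.
    with torch.no_grad():
        L0 = loss_fn()                              # loss at the current parameters
        g2 = sum((g * g).sum() for g in grads)      # squared gradient norm
        for p, g in zip(params, grads):             # trial step along -g
            p -= probe_lr * g
        L1 = loss_fn()                              # the single extra forward pass
        for p, g in zip(params, grads):             # undo the trial step
            p += probe_lr * g
        a = (L1 - L0 + g2 * probe_lr) / probe_lr ** 2
        if a <= 0:                                  # no usable curvature; fall back
            return probe_lr
        return (g2 / (2 * a)).item()

Here loss_fn is a closure that recomputes the loss on the current minibatch, and params/grads are the model parameters and their gradients from the preceding backward pass.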
\ No newline at end of file diff --git a/data/2024/aaai/QPEN: Quantum Projection and Quantum Entanglement Enhanced Network for Cross-Lingual Aspect-Based Sentiment Analysis b/data/2024/aaai/QPEN: Quantum Projection and Quantum Entanglement Enhanced Network for Cross-Lingual Aspect-Based Sentiment Analysis new file mode 100644 index 0000000000..016d8264eb --- /dev/null +++ b/data/2024/aaai/QPEN: Quantum Projection and Quantum Entanglement Enhanced Network for Cross-Lingual Aspect-Based Sentiment Analysis @@ -0,0 +1 @@ +Aspect-based sentiment analysis (ABSA) has attracted much attention due to its wide application scenarios. Most previous studies have focused solely on monolingual ABSA, posing a formidable challenge when extending ABSA applications to multilingual scenarios. In this paper, we study upgrading monolingual ABSA to cross-lingual ABSA. Existing methods usually exploit pre-trained cross-lingual language models to model cross-lingual ABSA, and enhance the model with translation data. However, the low-resource languages might be under-represented during the pre-training phase, and the translation-enhanced methods heavily rely on the quality of the translation and label projection. Inspired by the observation that quantum entanglement can correlate multiple single systems, we map the monolingual expression to the quantum Hilbert space as a single quantum system, and then utilize quantum entanglement and quantum measurement to achieve cross-lingual ABSA. Specifically, we propose a novel quantum neural model named QPEN (short for quantum projection and quantum entanglement enhanced network). It is equipped with a proposed quantum projection module that projects aspects as a quantum superposition on a complex-valued Hilbert space. Furthermore, a quantum entanglement module is proposed in QPEN to share language-specific features between different languages without transmission. We conducted simulation experiments on a classical computer, and experimental results on the SemEval-2016 dataset demonstrate that our method achieves state-of-the-art performance in terms of F1-scores for five languages. \ No newline at end of file diff --git a/data/2024/aaai/Quad Bayer Joint Demosaicing and Denoising Based on Dual Encoder Network with Joint Residual Learning b/data/2024/aaai/Quad Bayer Joint Demosaicing and Denoising Based on Dual Encoder Network with Joint Residual Learning new file mode 100644 index 0000000000..0488dcb972 --- /dev/null +++ b/data/2024/aaai/Quad Bayer Joint Demosaicing and Denoising Based on Dual Encoder Network with Joint Residual Learning @@ -0,0 +1 @@ +The recent imaging technology Quad Bayer CFA brings better imaging PSNR and higher visual quality compared to traditional Bayer CFA, but also poses serious challenges for demosaicing and denoising during the ISP pipeline. In this paper, we propose a novel dual encoder network, namely DRNet, to achieve joint demosaicing and denoising for Quad Bayer CFA. The dual encoders are carefully designed in that one is mainly constructed with a joint residual block to jointly estimate the residuals for demosaicing and denoising separately. In contrast, the other one starts with a pixel modulation block that is specially designed to match the characteristics of the Quad Bayer pattern for better feature extraction. We demonstrate the effectiveness of each proposed component through detailed ablation investigations.
The comparison results on public benchmarks illustrate that our DRNet achieves a clear performance gain over the state-of-the-art method (0.38 dB over the 2nd best) and balances performance and efficiency well. The experiments on real-world images show that the proposed method could enhance the reconstruction quality relative to the native ISP algorithm. \ No newline at end of file diff --git a/data/2024/aaai/Quality-Diversity Generative Sampling for Learning with Synthetic Data b/data/2024/aaai/Quality-Diversity Generative Sampling for Learning with Synthetic Data new file mode 100644 index 0000000000..6a9ea87ae1 --- /dev/null +++ b/data/2024/aaai/Quality-Diversity Generative Sampling for Learning with Synthetic Data @@ -0,0 +1 @@ +Generative models can serve as surrogates for some real data sources by creating synthetic training datasets, but in doing so they may transfer biases to downstream tasks. We focus on protecting quality and diversity when generating synthetic training datasets. We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space, despite the data coming from a biased generator. QDGS is a model-agnostic framework that uses prompt guidance to optimize a quality objective across measures of diversity for synthetically generated data, without fine-tuning the generative model. Using balanced synthetic datasets generated by QDGS, we first debias classifiers trained on color-biased shape datasets as a proof-of-concept. By applying QDGS to facial data synthesis, we prompt for desired semantic concepts, such as skin tone and age, to create an intersectional dataset with a combined blend of visual features. Leveraging this balanced data for training classifiers improves fairness while maintaining accuracy on facial recognition benchmarks. Code available at: https://github.com/Cylumn/qd-generative-sampling. \ No newline at end of file diff --git a/data/2024/aaai/Quantifying Political Polarization through the Lens of Machine Translation and Vicarious Offense b/data/2024/aaai/Quantifying Political Polarization through the Lens of Machine Translation and Vicarious Offense new file mode 100644 index 0000000000..dbcb06eb88 --- /dev/null +++ b/data/2024/aaai/Quantifying Political Polarization through the Lens of Machine Translation and Vicarious Offense @@ -0,0 +1,5 @@ +This talk surveys three related research contributions that shed light on the current US political divide: + +1. a novel machine-translation-based framework to quantify political polarization; +2. an analysis of disparate media portrayal of US policing in major cable news outlets; and +3. a novel perspective of vicarious offense that examines a timely and important question -- how well do Democratic-leaning users perceive what content would be deemed as offensive by their Republican-leaning counterparts or vice-versa? \ No newline at end of file diff --git a/data/2024/aaai/Quantifying and Analyzing Entity-Level Memorization in Large Language Models b/data/2024/aaai/Quantifying and Analyzing Entity-Level Memorization in Large Language Models new file mode 100644 index 0000000000..4adaa06230 --- /dev/null +++ b/data/2024/aaai/Quantifying and Analyzing Entity-Level Memorization in Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) have been proven capable of memorizing their training data, which can be extracted through specifically designed prompts.
As the scale of datasets continues to grow, privacy risks arising from memorization have attracted increasing attention. Quantifying language model memorization helps evaluate potential privacy risks. However, prior works on quantifying memorization require access to the precise original data or incur substantial computational overhead, making it difficult to apply them to real-world language models. To this end, we propose a fine-grained, entity-level definition to quantify memorization with conditions and metrics closer to real-world scenarios. In addition, we also present an approach for efficiently extracting sensitive entities from autoregressive language models. We conduct extensive experiments based on the proposed definition and approach, probing language models' ability to reconstruct sensitive entities under different settings. We find that language models have strong memorization at the entity level and are able to reproduce the training data even with partial leakages. The results demonstrate that LLMs not only memorize their training data but also understand associations between entities. These findings necessitate that trainers of LLMs exercise greater prudence regarding model memorization, adopting memorization mitigation techniques to preclude privacy violations. \ No newline at end of file diff --git a/data/2024/aaai/Quantile-Based Maximum Likelihood Training for Outlier Detection b/data/2024/aaai/Quantile-Based Maximum Likelihood Training for Outlier Detection new file mode 100644 index 0000000000..c8f9994bf4 --- /dev/null +++ b/data/2024/aaai/Quantile-Based Maximum Likelihood Training for Outlier Detection @@ -0,0 +1 @@ +Discriminative learning effectively predicts the true object class for image classification. However, it often results in false positives for outliers, posing critical concerns in applications like autonomous driving and video surveillance systems. Previous attempts to address this challenge involved training image classifiers through contrastive learning using actual outlier data or synthesizing outliers for self-supervised learning. Furthermore, unsupervised generative modeling of inliers in pixel space has shown limited success for outlier detection. In this work, we introduce a quantile-based maximum likelihood objective for learning the inlier distribution to improve the outlier separation during inference. Our approach fits a normalizing flow to pre-trained discriminative features and detects the outliers according to the evaluated log-likelihood. The experimental evaluation demonstrates the effectiveness of our method as it surpasses the performance of the state-of-the-art unsupervised methods for outlier detection. The results are also competitive compared with a recent self-supervised approach for outlier detection. Our work reduces the dependency on well-sampled negative training data, which is especially important for domains like medical diagnostics or remote sensing. \ No newline at end of file diff --git a/data/2024/aaai/Quantile-Regression-Ensemble: A Deep Learning Algorithm for Downscaling Extreme Precipitation b/data/2024/aaai/Quantile-Regression-Ensemble: A Deep Learning Algorithm for Downscaling Extreme Precipitation new file mode 100644 index 0000000000..43a17f3add --- /dev/null +++ b/data/2024/aaai/Quantile-Regression-Ensemble: A Deep Learning Algorithm for Downscaling Extreme Precipitation @@ -0,0 +1 @@ +Global Climate Models (GCMs) simulate low-resolution climate projections on a global scale.
The native resolution of GCMs is generally too low for societal-level decision-making. To enhance the spatial resolution, downscaling is often applied to GCM output. Statistical downscaling techniques, in particular, are well-established as a cost-effective approach. They require significantly less computational time than physics-based dynamical downscaling. In recent years, deep learning has gained prominence in statistical downscaling, demonstrating significantly lower error rates compared to traditional statistical methods. However, a drawback of regression-based deep learning techniques is their tendency to overfit to the mean sample intensity. Extreme values as a result are often underestimated. Problematically, extreme events have the largest societal impact. We propose Quantile-Regression-Ensemble (QRE), an innovative deep learning algorithm inspired by boosting methods. Its primary objective is to avoid trade-offs between fitting to sample means and extreme values by training independent models on a partitioned dataset. Our QRE is robust to redundant models and not susceptible to explosive ensemble weights, ensuring a reliable training process. QRE achieves lower Mean Squared Error (MSE) compared to various baseline models. In particular, our algorithm has a lower error for high-intensity precipitation events over New Zealand, highlighting the ability to represent extreme events accurately. \ No newline at end of file diff --git a/data/2024/aaai/Quantum Interference Model for Semantic Biases of Glosses in Word Sense Disambiguation b/data/2024/aaai/Quantum Interference Model for Semantic Biases of Glosses in Word Sense Disambiguation new file mode 100644 index 0000000000..ccc47249c3 --- /dev/null +++ b/data/2024/aaai/Quantum Interference Model for Semantic Biases of Glosses in Word Sense Disambiguation @@ -0,0 +1 @@ +Word Sense Disambiguation (WSD) aims to determine the meaning of the target word according to the given context. Currently, a single representation enhanced by glosses from different dictionaries or languages is used to characterize each word sense. By analyzing the similarity between glosses of the same word sense, we find semantic biases among them, revealing that the glosses have their own descriptive perspectives. Therefore, the traditional approach of integrating all glosses by a single representation results in failing to present the unique semantics revealed by the individual glosses. In this paper, a quantum superposition state is employed to formalize the representations of multiple glosses of the same word sense to reveal their distributions. Furthermore, the quantum interference model is leveraged to calculate the probability that the target word belongs to this superposition state. The advantage is that the interference term can be regarded as a confidence level to guide word sense recognition. Finally, experiments are performed under standard WSD evaluation framework and the latest cross-lingual datasets, and the results verify the effectiveness of our model. 
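[Editorial note] For reference, the interference term mentioned in the word sense disambiguation abstract above is the standard cross term of quantum probability; in our notation (not the paper's), if two gloss states $g_1, g_2$ of a sense are superposed with amplitudes $a_1, a_2$ and $w$ denotes the target-word context state, then

P(w) = \left| a_1\langle w|g_1\rangle + a_2\langle w|g_2\rangle \right|^2 = |a_1\langle w|g_1\rangle|^2 + |a_2\langle w|g_2\rangle|^2 + 2\,\mathrm{Re}\!\left( a_1 a_2^{*} \langle w|g_1\rangle \langle w|g_2\rangle^{*} \right),

where the last term is the interference contribution that the abstract interprets as a confidence signal for word sense recognition.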
\ No newline at end of file diff --git a/data/2024/aaai/Quantum-Inspired Neural Network with Runge-Kutta Method b/data/2024/aaai/Quantum-Inspired Neural Network with Runge-Kutta Method new file mode 100644 index 0000000000..146a5b943b --- /dev/null +++ b/data/2024/aaai/Quantum-Inspired Neural Network with Runge-Kutta Method @@ -0,0 +1 @@ +In recent years, researchers have developed novel Quantum-Inspired Neural Network (QINN) frameworks for Natural Language Processing (NLP) tasks, inspired by the theoretical investigations of quantum cognition. However, we have found that the training efficiency of QINNs is significantly lower than that of classical networks. We analyze the unitary transformation modules of existing QINNs based on the time displacement symmetry of quantum mechanics and discover that they take a mathematical form similar to the first-order Euler method. The high truncation error associated with the Euler method affects the training efficiency of QINNs. In order to enhance the training efficiency of QINNs, we generalize QINNs' unitary transformation modules to Quantum-like high-order Runge-Kutta methods (QRKs). Moreover, we present the results of experiments on conversation emotion recognition and text classification tasks to validate the effectiveness of the proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/QuerySum: A Multi-Document Query-Focused Summarization Dataset Augmented with Similar Query Clusters b/data/2024/aaai/QuerySum: A Multi-Document Query-Focused Summarization Dataset Augmented with Similar Query Clusters new file mode 100644 index 0000000000..f71436a148 --- /dev/null +++ b/data/2024/aaai/QuerySum: A Multi-Document Query-Focused Summarization Dataset Augmented with Similar Query Clusters @@ -0,0 +1 @@ +Query-focused summarization (QFS) aims to summarize the source document(s) with regard to a specific aspect of information given in a query. It plays an important role in presenting users with a concise answer summary from a set of query-relevant documents retrieved by the information retrieval system. Nonetheless, QFS research has long been hampered by the lack of adequate datasets in terms of both quality and quantity. In this paper, we introduce a large-scale multi-document query-focused summarization dataset, called QuerySum, which contains 27,041 data samples covering diverse topics, with quality guaranteed through human verification. Unlike some previous QFS datasets constructed directly from question answering datasets, 74% of the queries in our dataset are challenging non-factoid What-, Why-, and How- questions. More importantly, we also provide a set of similar queries together with their corresponding summaries for each query as the retrieved context, presenting a new feature of QuerySum. We aim to encourage research efforts in query intention understanding in the context of QFS. Leveraging QuerySum's depth, we propose a model for query-aware multi-document summarization and set a new QFS benchmark.
\ No newline at end of file diff --git a/data/2024/aaai/Question Calibration and Multi-Hop Modeling for Temporal Question Answering b/data/2024/aaai/Question Calibration and Multi-Hop Modeling for Temporal Question Answering new file mode 100644 index 0000000000..4f6da87f30 --- /dev/null +++ b/data/2024/aaai/Question Calibration and Multi-Hop Modeling for Temporal Question Answering @@ -0,0 +1 @@ +Many models that leverage knowledge graphs (KGs) have recently demonstrated remarkable success in question answering (QA) tasks. In the real world, many facts contained in KGs are time-constrained thus temporal KGQA has received increasing attention. Despite the fruitful efforts of previous models in temporal KGQA, they still have several limitations. (I) They adopt pre-trained language models (PLMs) to obtain question representations, while PLMs tend to focus on entity information and ignore entity transfer caused by temporal constraints, and finally fail to learn specific temporal representations of entities. (II) They neither emphasize the graph structure between entities nor explicitly model the multi-hop relationship in the graph, which will make it difficult to solve complex multi-hop question answering. To alleviate this problem, we propose a novel Question Calibration and Multi-Hop Modeling (QC-MHM) network. Specifically, We first calibrate the question representation by fusing the question and the time-constrained concepts in KG. Then, we construct the GNN layer to complete multi-hop message passing. Finally, the question representation is combined with the embedding output by the GNN to generate the final prediction. Empirical results verify that the proposed model achieves better performance than the state-of-the-art models in the benchmark dataset. Notably, the Hits@1 and Hits@10 results of QC-MHM on the CronQuestions dataset's complex questions are absolutely improved by 5.1% and 1.2% compared to the best-performing baseline. Moreover, QC-MHM can generate interpretable and trustworthy predictions. \ No newline at end of file diff --git a/data/2024/aaai/QuickRender: A Photorealistic Procedurally Generated Dataset with Applications to Super Resolution (Student Abstract) b/data/2024/aaai/QuickRender: A Photorealistic Procedurally Generated Dataset with Applications to Super Resolution (Student Abstract) new file mode 100644 index 0000000000..d422b5bc95 --- /dev/null +++ b/data/2024/aaai/QuickRender: A Photorealistic Procedurally Generated Dataset with Applications to Super Resolution (Student Abstract) @@ -0,0 +1,5 @@ +Rendering of complex scenes from software such as Blender is time consuming, but corresponding auxiliary data such as depth or object segmentation maps are relatively fast to generate. The auxiliary data also provides a wealth of information for tasks such as optical flow prediction. + +In this paper we present the QuickRender dataset, a collection of procedurally generated scenes rendered into over 5,000 sequential image triplets along with accompanying auxiliary data. The goal of this dataset is to provide a diversity of scenes and motion while maintaining realistic behaviours. A sample application using this dataset to perform single image super resolution is also presented. + +The dataset and related source code can be found at https://github.com/MP-mtroyal/MetaSRGAN. 
\ No newline at end of file diff --git a/data/2024/aaai/Quilt: Robust Data Segment Selection against Concept Drifts b/data/2024/aaai/Quilt: Robust Data Segment Selection against Concept Drifts new file mode 100644 index 0000000000..1ceaac5925 --- /dev/null +++ b/data/2024/aaai/Quilt: Robust Data Segment Selection against Concept Drifts @@ -0,0 +1 @@ +Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams. Unfortunately, concept drifts may occur in data streams where the joint distribution of the data X and label y, P(X, y), changes over time, possibly degrading model accuracy. Existing concept drift adaptation approaches mostly focus on updating the model to the new data, possibly using ensembles of previous models, and tend to discard the drifted historical data. However, we contend that explicitly utilizing the drifted data together leads to much better model accuracy and propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy. To address the potential downside in efficiency, Quilt extends existing data subset selection techniques, which can be used to reduce the training data without compromising model accuracy. These techniques cannot be used as-is because they only consider virtual drifts, where the posterior probabilities P(y|X) are assumed not to change. In contrast, a key challenge in our setup is to also discard undesirable data segments with concept drifts. Quilt thus discards drifted data segments and selects data segment subsets holistically for accurate and efficient model training. The two operations use gradient-based scores, which have little computation overhead. In our experiments, we show that Quilt outperforms state-of-the-art drift adaptation and data selection baselines on synthetic and real datasets. \ No newline at end of file diff --git a/data/2024/aaai/R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion b/data/2024/aaai/R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion new file mode 100644 index 0000000000..6b5376477c --- /dev/null +++ b/data/2024/aaai/R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion @@ -0,0 +1 @@ +Image generation tasks have achieved remarkable performance using large-scale diffusion models. However, these models are limited in capturing the abstract relations (viz., interactions excluding positional relations) among multiple entities of complex scene graphs. Two main problems exist: 1) they fail to depict more concise and accurate interactions via abstract relations; and 2) they fail to generate complete entities. To address that, we propose a novel Relation-aware Compositional Contrastive Control Diffusion method, dubbed R3CD, that leverages large-scale diffusion models to learn abstract interactions from scene graphs. Herein, a scene graph transformer based on node and edge encoding is first designed to perceive both local and global information from input scene graphs, whose embeddings are initialized by a T5 model. Then a joint contrastive loss based on attention maps and denoising steps is developed to control the diffusion model to understand and further generate images, whose spatial structures and interaction features are consistent with the a priori relations.
Extensive experiments conducted on two datasets, Visual Genome and COCO-Stuff, demonstrate that the proposed method outperforms existing models on both quantitative and qualitative metrics, generating more realistic and diverse images according to different scene graph specifications. \ No newline at end of file diff --git a/data/2024/aaai/READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling b/data/2024/aaai/READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling new file mode 100644 index 0000000000..384a11b9f1 --- /dev/null +++ b/data/2024/aaai/READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling @@ -0,0 +1 @@ +Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such a full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter’s low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability. Second, we propose a Partial Video-Language Alignment (PVLA) objective via the use of partial optimal transport to maintain task-related information flowing into our READ modules. We validate our READ-PVLA framework through extensive experiments where READ-PVLA significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/REGLO: Provable Neural Network Repair for Global Robustness Properties b/data/2024/aaai/REGLO: Provable Neural Network Repair for Global Robustness Properties new file mode 100644 index 0000000000..96e390a82b --- /dev/null +++ b/data/2024/aaai/REGLO: Provable Neural Network Repair for Global Robustness Properties @@ -0,0 +1 @@ +We present REGLO, a novel methodology for repairing pretrained neural networks to satisfy global robustness and individual fairness properties. A neural network is said to be globally robust with respect to a given input region if and only if all the input points in the region are locally robust. This notion of global robustness also captures the notion of individual fairness as a special case. We prove that any counterexample to a global robustness property must exhibit a corresponding large gradient. For ReLU networks, this result allows us to efficiently identify the linear regions that violate a given global robustness property. By formulating and solving a suitable robust convex optimization problem, REGLO then computes a minimal weight change that will provably repair these violating linear regions.
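[Editorial note] The observation behind the linear-region analysis can be illustrated as follows: within a fixed ReLU activation pattern the network is affine, so its Jacobian is a constant matrix product that can be inspected directly. The sketch below is our own illustration of this observation for a two-layer network, not REGLO's actual certificate or repair procedure, and the robustness notion (bounded output change under bounded input change within one region) is a stated assumption.

import numpy as np

def region_jacobian(W1, W2, activation_pattern):
    # Inside one linear region of f(x) = W2 @ relu(W1 @ x), the network is affine
    # and its Jacobian is constant: J = W2 @ diag(pattern) @ W1,
    # where pattern[i] = 1 iff hidden unit i is active in that region.
    return W2 @ (activation_pattern[:, None] * W1)

def region_may_violate(W1, W2, pattern, eps, gamma):
    # Within a single region, if the spectral norm of J is at most eps / gamma,
    # then moving the input by at most gamma changes the output by at most eps;
    # a counterexample contained in the region therefore requires a larger gain.
    J = region_jacobian(W1, W2, pattern)
    return np.linalg.norm(J, 2) > eps / gamma

Regions flagged this way are candidates for repair; REGLO's contribution is then to compute a provably minimal weight change over those regions via robust convex optimization, which this sketch does not attempt.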
\ No newline at end of file diff --git a/data/2024/aaai/REPrune: Channel Pruning via Kernel Representative Selection b/data/2024/aaai/REPrune: Channel Pruning via Kernel Representative Selection new file mode 100644 index 0000000000..16826bb9fd --- /dev/null +++ b/data/2024/aaai/REPrune: Channel Pruning via Kernel Representative Selection @@ -0,0 +1 @@ +Channel pruning is widely adopted to accelerate modern convolutional neural networks (CNNs). The resulting pruned model benefits from immediate deployment on general-purpose software and hardware resources. However, its large pruning granularity, specifically at the unit of a convolution filter, often leads to undesirable accuracy drops due to the inflexibility of deciding how and where to introduce sparsity into the CNNs. In this paper, we propose REPrune, a novel channel pruning technique that emulates kernel pruning, fully exploiting the finer but structured granularity. REPrune identifies similar kernels within each channel using agglomerative clustering. Then, it selects filters that maximize the incorporation of kernel representatives while optimizing the maximum cluster coverage problem. By integrating with a simultaneous training-pruning paradigm, REPrune promotes efficient, progressive pruning throughout training CNNs, avoiding the conventional train-prune-finetune sequence. Experimental results highlight that REPrune performs better in computer vision tasks than existing methods, effectively achieving a balance between acceleration ratio and performance retention. \ No newline at end of file diff --git a/data/2024/aaai/RG-GAN: Dynamic Regenerative Pruning for Data-Efficient Generative Adversarial Networks b/data/2024/aaai/RG-GAN: Dynamic Regenerative Pruning for Data-Efficient Generative Adversarial Networks new file mode 100644 index 0000000000..444dcbdb78 --- /dev/null +++ b/data/2024/aaai/RG-GAN: Dynamic Regenerative Pruning for Data-Efficient Generative Adversarial Networks @@ -0,0 +1 @@ +Training Generative Adversarial Networks (GANs) to generate high-quality images typically requires large datasets. Network pruning during training has recently emerged as a significant advancement for data-efficient GANs. However, simple and straightforward pruning carries the risk of losing key information, resulting in suboptimal results due to the GAN’s competitive dynamics between generator (G) and discriminator (D). Addressing this, we present RG-GAN, a novel approach that marks the first incorporation of dynamic weight regeneration and pruning in GAN training to improve the quality of the generated samples, even with limited data. Specifically, RG-GAN initiates layer-wise dynamic pruning by removing weights that are less important to the quality of the generated images. While pruning enhances efficiency, excessive sparsity within layers can pose a risk of model collapse. To mitigate this issue, RG-GAN applies a dynamic regeneration method to reintroduce specific weights when they become important, ensuring a balance between sparsity and image quality. Though effective, the sparse network achieved through this process might eliminate some weights important to the combined G and D performance, a crucial aspect for achieving stable and effective GAN training. RG-GAN addresses this loss of weights by integrating learned sparse network weights back into the dense network at the previous stage during a follow-up regeneration step.
Our results consistently demonstrate RG-GAN’s robust performance across a variety of scenarios, including different GAN architectures, datasets, and degrees of data scarcity, reinforcing its value as a generic training methodology. Results also show that data augmentation exhibits improved performance in conjunction with RG-GAN. Furthermore, RG-GAN can achieve fewer parameters without compromising, and even enhancing, the quality of the generated samples. Code can be found at this link: https://github.com/IntellicentAI-Lab/RG-GAN \ No newline at end of file diff --git a/data/2024/aaai/RGMComm: Return Gap Minimization via Discrete Communications in Multi-Agent Reinforcement Learning b/data/2024/aaai/RGMComm: Return Gap Minimization via Discrete Communications in Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..c84f89d60b --- /dev/null +++ b/data/2024/aaai/RGMComm: Return Gap Minimization via Discrete Communications in Multi-Agent Reinforcement Learning @@ -0,0 +1,2 @@ +Communication is crucial for solving cooperative Multi-Agent Reinforcement Learning tasks in partially observable Markov Decision Processes. Existing works often rely on black-box methods to encode local information/features into messages shared with other agents, leading to the generation of continuous messages with high communication overhead and poor interpretability. Prior attempts at discrete communication methods generate one-hot vectors trained as part of agents' actions and use the Gumbel softmax operation for calculating message gradients, which are all heuristic designs that do not provide any quantitative guarantees on the expected return. +This paper establishes an upper bound on the return gap between an ideal policy with full observability and an optimal partially observable policy with discrete communication. This result enables us to recast multi-agent communication into a novel online clustering problem over the local observations at each agent, with messages as cluster labels and the upper bound on the return gap as clustering loss. To minimize the return gap, we propose the Return-Gap-Minimization Communication (RGMComm) algorithm, which is a surprisingly simple design of discrete message generation functions and is integrated with reinforcement learning through the utilization of a novel Regularized Information Maximization loss function, which incorporates cosine-distance as the clustering metric. Evaluations show that RGMComm significantly outperforms state-of-the-art multi-agent communication baselines and can achieve nearly optimal returns with few-bit messages that are naturally interpretable. \ No newline at end of file diff --git a/data/2024/aaai/RL-SeqISP: Reinforcement Learning-Based Sequential Optimization for Image Signal Processing b/data/2024/aaai/RL-SeqISP: Reinforcement Learning-Based Sequential Optimization for Image Signal Processing new file mode 100644 index 0000000000..022b79a027 --- /dev/null +++ b/data/2024/aaai/RL-SeqISP: Reinforcement Learning-Based Sequential Optimization for Image Signal Processing @@ -0,0 +1 @@ +Hardware image signal processing (ISP), aiming at converting RAW inputs to RGB images, consists of a series of processing blocks, each with multiple parameters. Traditionally, ISP parameters are manually tuned in isolation by imaging experts according to application-specific quality and performance metrics, which is time-consuming and biased towards human perception due to complex interaction with the output image. 
Since the relationship between any single parameter’s variation and the output performance metric is a complex, non-linear function, optimizing such a large number of ISP parameters is challenging. To address this challenge, we propose a novel Sequential ISP parameter optimization model, called the RL-SeqISP model, which utilizes deep reinforcement learning to jointly optimize all ISP parameters for a variety of imaging applications. Concretely, inspired by the sequential tuning process of human experts, the proposed model can progressively enhance image quality by seamlessly integrating information from both the image feature space and the parameter space. Furthermore, a dynamic parameter optimization module is introduced to avoid ISP parameters getting stuck into local optima, which is able to more effectively guarantee the optimal parameters resulting from the sequential learning strategy. These merits of the RL-SeqISP model as well as its high efficiency are substantiated by comprehensive experiments on a wide range of downstream tasks, including two visual analysis tasks (instance segmentation and object detection), and image quality assessment (IQA), as compared with representative methods both quantitatively and qualitatively. In particular, even using only 10% of the training data, our model outperforms other SOTA methods by an average of 7% mAP on two visual analysis tasks. \ No newline at end of file diff --git a/data/2024/aaai/RLPeri: Accelerating Visual Perimetry Test with Reinforcement Learning and Convolutional Feature Extraction b/data/2024/aaai/RLPeri: Accelerating Visual Perimetry Test with Reinforcement Learning and Convolutional Feature Extraction new file mode 100644 index 0000000000..58a9073be0 --- /dev/null +++ b/data/2024/aaai/RLPeri: Accelerating Visual Perimetry Test with Reinforcement Learning and Convolutional Feature Extraction @@ -0,0 +1,3 @@ +Visual perimetry is an important eye examination that helps detect vision problems caused by ocular or neurological conditions. During the test, a patient's gaze is fixed at a specific location while light stimuli of varying intensities are presented in central and peripheral vision. Based on the patient's responses to the stimuli, the visual field mapping and sensitivity are determined. However, maintaining high levels of concentration throughout the test can be challenging for patients, leading to increased examination times and decreased accuracy. + +In this work, we present RLPeri, a reinforcement learning-based approach to optimize visual perimetry testing. By determining the optimal sequence of locations and initial stimulus values, we aim to reduce the examination time without compromising accuracy. Additionally, we incorporate reward shaping techniques to further improve the testing performance. To monitor the patient's responses over time during testing, we represent the test's state as a pair of 3D matrices. We apply two different convolutional kernels to extract spatial features across locations as well as features across different stimulus values for each location. Through experiments, we demonstrate that our approach results in a 10-20% reduction in examination time while maintaining the accuracy as compared to state-of-the-art methods. With the presented approach, we aim to make visual perimetry testing more efficient and patient-friendly, while still providing accurate results. 
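As a rough illustration of the state encoding described in the RLPeri abstract above, the sketch below assumes the test state is a tensor of shape (batch, 2, S, H, W): a pair of 3D matrices over S stimulus values and an H x W grid of visual-field locations. The channel counts and kernel sizes are assumptions chosen for illustration, not the paper's configuration.

import torch
import torch.nn as nn

class PerimetryStateEncoder(nn.Module):
    # Two convolutional branches over a pair of 3D state matrices:
    # one kernel looks across neighbouring locations, the other across stimulus values.
    def __init__(self, hidden=16):
        super().__init__()
        self.spatial = nn.Conv3d(2, hidden, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.stimulus = nn.Conv3d(2, hidden, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, state):
        feats = torch.cat([self.spatial(state), self.stimulus(state)], dim=1)
        return torch.relu(feats).flatten(1)   # flat feature vector for the RL policy head

state = torch.zeros(1, 2, 4, 8, 9)            # dummy state: 4 stimulus values, 8 x 9 locations
print(PerimetryStateEncoder()(state).shape)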
\ No newline at end of file diff --git a/data/2024/aaai/RLfOLD: Reinforcement Learning from Online Demonstrations in Urban Autonomous Driving b/data/2024/aaai/RLfOLD: Reinforcement Learning from Online Demonstrations in Urban Autonomous Driving new file mode 100644 index 0000000000..f543f96420 --- /dev/null +++ b/data/2024/aaai/RLfOLD: Reinforcement Learning from Online Demonstrations in Urban Autonomous Driving @@ -0,0 +1 @@ +Reinforcement Learning from Demonstrations (RLfD) has emerged as an effective method that fuses expert demonstrations into Reinforcement Learning (RL) training, harnessing the strengths of both Imitation Learning (IL) and RL. However, existing algorithms rely on offline demonstrations, which can introduce a distribution gap between the demonstrations and the actual training environment, limiting their performance. In this paper, we propose a novel approach, Reinforcement Learning from Online Demonstrations (RLfOLD), that leverages online demonstrations to address this limitation, ensuring the agent learns from relevant and up-to-date scenarios, thus effectively bridging the distribution gap. Unlike conventional policy networks used in typical actor-critic algorithms, RLfOLD introduces a policy network that outputs two standard deviations: one for exploration and the other for IL training. This novel design allows the agent to adapt to varying levels of uncertainty inherent in both RL and IL. Furthermore, we introduce an exploration process guided by an online expert, incorporating an uncertainty-based technique. Our experiments on the CARLA NoCrash benchmark demonstrate the effectiveness and efficiency of RLfOLD. Notably, even with a significantly smaller encoder and a single camera setup, RLfOLD surpasses state-of-the-art methods in this evaluation. These results, achieved with limited resources, highlight RLfOLD as a highly promising solution for real-world applications. \ No newline at end of file diff --git a/data/2024/aaai/ROG_PL: Robust Open-Set Graph Learning via Region-Based Prototype Learning b/data/2024/aaai/ROG_PL: Robust Open-Set Graph Learning via Region-Based Prototype Learning new file mode 100644 index 0000000000..88a6390e14 --- /dev/null +++ b/data/2024/aaai/ROG_PL: Robust Open-Set Graph Learning via Region-Based Prototype Learning @@ -0,0 +1 @@ +Open-set graph learning is a practical task that aims to classify the known class nodes and to identify unknown class samples as unknowns. Conventional node classification methods usually perform unsatisfactorily in open-set scenarios due to the complex data they encounter, such as out-of-distribution (OOD) data and in-distribution (IND) noise. OOD data are samples that do not belong to any known classes. They are outliers if they occur in training (OOD noise), and open-set samples if they occur in testing. IND noise consists of training samples that are assigned incorrect labels. IND noise and OOD noise are prevalent and usually cause the ambiguity problem, including the intra-class variety problem and the inter-class confusion problem. Thus, exploring robust open-set learning methods is both necessary and difficult, and it becomes even more difficult for non-IID graph data. To this end, we propose a unified framework named ROG_PL to achieve robust open-set learning on complex noisy graph data by introducing prototype learning. Specifically, ROG_PL consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions.
The first module corrects noisy labels through similarity-based label propagation and removes low-confidence samples to solve the intra-class variety problem caused by noise. The second module learns open-set prototypes for each known class via non-overlapping regions and retains both interior and border prototypes to remedy the inter-class confusion problem. The two modules are iteratively updated under the constraints of classification loss and prototype diversity loss. To the best of our knowledge, the proposed ROG_PL is the first robust open-set node classification method for graph data with complex noise. Experimental evaluations of ROG_PL on several benchmark graph datasets demonstrate that it achieves good performance. \ No newline at end of file diff --git a/data/2024/aaai/RPSC: Robust Pseudo-Labeling for Semantic Clustering b/data/2024/aaai/RPSC: Robust Pseudo-Labeling for Semantic Clustering new file mode 100644 index 0000000000..7307d47456 --- /dev/null +++ b/data/2024/aaai/RPSC: Robust Pseudo-Labeling for Semantic Clustering @@ -0,0 +1 @@ +Clustering methods achieve performance improvement by jointly learning representation and cluster assignment. However, they do not consider the confidence of pseudo-labels, which are not optimal as supervised information, resulting in error accumulation. To address this issue, we propose a Robust Pseudo-labeling for Semantic Clustering (RPSC) approach, which includes two stages. In the first stage (RPSC-Self), we design a semantic pseudo-labeling scheme by using the consistency of samples, i.e., samples with the same semantics should be close to each other in the embedding space. To exploit robust semantic pseudo-labels for self-supervised learning, we propose a soft contrastive loss (SCL) which encourages the model to trust high-confidence semantic pseudo-labels and be less driven by low-confidence pseudo-labels. In the second stage (RPSC-Semi), we first determine the semantic pseudo-label of a sample based on the distance between itself and cluster centers, followed by selecting reliable semantic pseudo-labels by exploiting consistency. These reliable pseudo-labels are used as supervised information in the pseudo-semi-supervised learning algorithm to further improve the performance. Experimental results show that RPSC significantly outperforms 18 competitive clustering algorithms on six challenging image benchmarks. In particular, RPSC achieves an accuracy of 0.688 on ImageNet-Dogs, which is up to a 24% improvement compared with the second-best method. Meanwhile, we conduct ablation studies to investigate the effects of different augmentation strategies on RPSC as well as the contributions of terms in SCL to clustering performance. Besides, experimental results indicate that SCL can be easily integrated into existing clustering methods and bring performance improvements. \ No newline at end of file diff --git a/data/2024/aaai/RR-PU: A Synergistic Two-Stage Positive and Unlabeled Learning Framework for Robust Tax Evasion Detection b/data/2024/aaai/RR-PU: A Synergistic Two-Stage Positive and Unlabeled Learning Framework for Robust Tax Evasion Detection new file mode 100644 index 0000000000..c7d5b4dff3 --- /dev/null +++ b/data/2024/aaai/RR-PU: A Synergistic Two-Stage Positive and Unlabeled Learning Framework for Robust Tax Evasion Detection @@ -0,0 +1 @@ +Tax evasion, an unlawful practice in which taxpayers deliberately conceal information to avoid paying tax liabilities, poses significant challenges for tax authorities.
Effective tax evasion detection is critical for assisting tax authorities in mitigating tax revenue loss. Recently, machine-learning-based methods, particularly those employing positive and unlabeled (PU) learning, have been adopted for tax evasion detection, achieving notable success. However, these methods exhibit two major practical limitations. First, their success heavily relies on the strong assumption that the label frequency (the fraction of identified taxpayers among tax evaders) is known in advance. Second, although some methods attempt to estimate label frequency using approaches like Mixture Proportion Estimation (MPE) without making any assumptions, they subsequently construct a classifier based on the error-prone label frequency obtained from the previous estimation. This two-stage approach may not be optimal, as it neglects error accumulation in classifier training resulting from the estimation bias in the first stage. To address these limitations, we propose a novel PU learning-based tax evasion detection framework called RR-PU, which can revise the bias in a two-stage synergistic manner. Specifically, RR-PU refines the label frequency initialization by leveraging a regrouping technique to fortify the MPE perspective. Subsequently, we integrate a trainable slack variable to fine-tune the initial label frequency, concurrently optimizing this variable and the classifier to eliminate latent bias in the initial stage. Experimental results on three real-world tax datasets demonstrate that RR-PU outperforms state-of-the-art methods in tax evasion detection tasks. \ No newline at end of file diff --git a/data/2024/aaai/RRL: Recommendation Reverse Learning b/data/2024/aaai/RRL: Recommendation Reverse Learning new file mode 100644 index 0000000000..3e24879c39 --- /dev/null +++ b/data/2024/aaai/RRL: Recommendation Reverse Learning @@ -0,0 +1 @@ +As societies become increasingly aware of data privacy, regulations require that private information about users must be removed from both databases and ML models, which is more colloquially called `the right to be forgotten`. Such privacy problems of recommendation systems, which hold large amounts of private data, are drawing increasing attention. Recent research suggests dividing the preference data into multiple shards, training submodels with these shards, and forgetting users' personal preference data by retraining the submodels of the marked shards. Despite the gain in computational efficiency compared with retraining from scratch, the overall recommendation performance deteriorates after dividing the shards because the collaborative information contained in the training data is broken. In this paper, we propose a forgetting framework for recommendation models, named Recommendation Reverse Learning (RRL), that neither separates the training data nor jeopardizes the recommendation performance. Given the trained recommendation model and marked preference data, we devise a Reverse BPR Objective (RBPR Objective) to fine-tune the recommendation model to force it to forget the marked data. Nevertheless, as the recommendation model encodes complex collaborative information among users, we propose to utilize the Fisher Information Matrix (FIM) to estimate the influence of reverse learning on other users' collaborative information and guide the updates of representations. We conduct experiments on two representative recommendation models and three public benchmark datasets to verify the efficiency of RRL.
To verify the completeness of forgetting, we use RRL to make a recommendation model poisoned by shilling attacks forget the malicious users. \ No newline at end of file diff --git a/data/2024/aaai/RWMS: Reliable Weighted Multi-Phase for Semi-supervised Segmentation b/data/2024/aaai/RWMS: Reliable Weighted Multi-Phase for Semi-supervised Segmentation new file mode 100644 index 0000000000..40b1fbab6a --- /dev/null +++ b/data/2024/aaai/RWMS: Reliable Weighted Multi-Phase for Semi-supervised Segmentation @@ -0,0 +1 @@ +Semantic segmentation is one of the central tasks in computer vision. However, capturing large numbers of pixel-level annotations is expensive. Semi-supervised learning can utilize labeled and unlabeled data, providing new ideas for solving the problem of insufficient labeled data. In this work, we propose a data-reliability weighted multi-phase learning method for semi-supervised segmentation (RWMS). Under the framework of self-training, we train two different teacher models to evaluate the reliability of pseudo labels. By selecting reliable data at the image level and reweighting pseudo labels at the pixel level, multi-phase training is guided to focus on more reliable knowledge. In addition, we inject strong data augmentations on unlabeled images during training. Through extensive experiments, we demonstrate that our method performs remarkably well compared to baseline methods, substantially outperforming them by more than 3% on VOC and Cityscapes. \ No newline at end of file diff --git a/data/2024/aaai/Racing Control Variable Genetic Programming for Symbolic Regression b/data/2024/aaai/Racing Control Variable Genetic Programming for Symbolic Regression new file mode 100644 index 0000000000..0d8902c5d0 --- /dev/null +++ b/data/2024/aaai/Racing Control Variable Genetic Programming for Symbolic Regression @@ -0,0 +1 @@ +Symbolic regression, as one of the most crucial tasks in AI for science, discovers governing equations from experimental data. Popular approaches based on genetic programming, Monte Carlo tree search, or deep reinforcement learning learn symbolic regression from a fixed dataset. These methods require massive datasets and long training times, especially when learning complex equations involving many variables. Recently, Control Variable Genetic Programming (CVGP) has been introduced, which accelerates the regression process by discovering equations from designed control variable experiments. However, the set of experiments is fixed a priori in CVGP, and we observe that sub-optimal selection of experiment schedules delays the discovery process significantly. To overcome this limitation, we propose Racing Control Variable Genetic Programming (Racing-CVGP), which carries out multiple experiment schedules simultaneously. A selection scheme similar to that used for selecting good symbolic equations in the genetic programming process is implemented to ensure that promising experiment schedules eventually win over the average ones. The unfavorable schedules are terminated early to save time for the promising ones. We evaluate Racing-CVGP on several synthetic and real-world datasets corresponding to true physics laws. We demonstrate that Racing-CVGP outperforms CVGP and a series of symbolic regressors which discover equations from fixed datasets.
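The racing idea in the Racing-CVGP abstract above can be sketched generically: several experiment schedules are evaluated in parallel, and the worst-scoring ones are terminated early so that the budget concentrates on promising schedules. The evaluate function and the schedule labels below are placeholders, not the authors' implementation.

import random

def race(schedules, evaluate, rounds=10, keep_fraction=0.5):
    # Generic racing loop: accumulate scores and periodically cull the weakest schedules.
    survivors = list(schedules)
    scores = {s: 0.0 for s in survivors}
    for _ in range(rounds):
        for s in survivors:
            scores[s] += evaluate(s)                 # e.g. fitness of the best equation found so far
        survivors.sort(key=lambda s: scores[s], reverse=True)
        survivors = survivors[:max(1, int(len(survivors) * keep_fraction))]
    return survivors[0]

best = race(["schedule-A", "schedule-B", "schedule-C", "schedule-D"],
            evaluate=lambda s: random.random())      # stand-in for a real schedule evaluation
print("winning schedule:", best)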
\ No newline at end of file diff --git a/data/2024/aaai/RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation b/data/2024/aaai/RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation new file mode 100644 index 0000000000..28e8a7a3ab --- /dev/null +++ b/data/2024/aaai/RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation @@ -0,0 +1 @@ +3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images. However, image-based scene perception encounters significant challenges in achieving accurate prediction due to the absence of geometric priors. In this paper, we address this issue by exploring cross-modal knowledge distillation in this task, i.e., we leverage a stronger multi-modal model to guide the visual model during training. In practice, we observe that directly applying features or logits alignment, proposed and widely used in bird's-eye-view (BEV) perception, does not yield satisfactory results. To overcome this problem, we introduce RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction. By employing differentiable volume rendering, we generate depth and semantic maps in perspective views and propose two novel consistency criteria between the rendered outputs of teacher and student models. Specifically, the depth consistency loss aligns the termination distributions of the rendered rays, while the semantic consistency loss mimics the intra-segment similarity guided by vision foundation models (VLMs). Experimental results on the nuScenes dataset demonstrate the effectiveness of our proposed method in improving various 3D occupancy prediction approaches, e.g., our proposed methodology enhances our baseline by 2.2% in the metric of mIoU and achieves 50% in Occ3D benchmark. \ No newline at end of file diff --git a/data/2024/aaai/RadarMOSEVE: A Spatial-Temporal Transformer Network for Radar-Only Moving Object Segmentation and Ego-Velocity Estimation b/data/2024/aaai/RadarMOSEVE: A Spatial-Temporal Transformer Network for Radar-Only Moving Object Segmentation and Ego-Velocity Estimation new file mode 100644 index 0000000000..62d3164992 --- /dev/null +++ b/data/2024/aaai/RadarMOSEVE: A Spatial-Temporal Transformer Network for Radar-Only Moving Object Segmentation and Ego-Velocity Estimation @@ -0,0 +1 @@ +Moving object segmentation (MOS) and Ego velocity estimation (EVE) are vital capabilities for mobile systems to achieve full autonomy. Several approaches have attempted to achieve MOSEVE using a LiDAR sensor. However, LiDAR sensors are typically expensive and susceptible to adverse weather conditions. Instead, millimeter-wave radar (MWR) has gained popularity in robotics and autonomous driving for real applications due to its cost-effectiveness and resilience to bad weather. Nonetheless, publicly available MOSEVE datasets and approaches using radar data are limited. Some existing methods adopt point convolutional networks from LiDAR-based approaches, ignoring the specific artifacts and the valuable radial velocity information of radar measurements, leading to suboptimal performance. In this paper, we propose a novel transformer network that effectively addresses the sparsity and noise issues and leverages the radial velocity measurements of radar points using our devised radar self- and cross-attention mechanisms. 
Based on that, our method achieves accurate EVE of the robot and performs MOS using only radar data simultaneously. To thoroughly evaluate the MOSEVE performance of our method, we annotated the radar points in the public View-of-Delft (VoD) dataset and additionally constructed a new radar dataset in various environments. The experimental results demonstrate the superiority of our approach over existing state-of-the-art methods. The code is available at https://github.com/ORCAUboat/RadarMOSEVE. \ No newline at end of file diff --git a/data/2024/aaai/ReGCL: Rethinking Message Passing in Graph Contrastive Learning b/data/2024/aaai/ReGCL: Rethinking Message Passing in Graph Contrastive Learning new file mode 100644 index 0000000000..a3fd87e2e1 --- /dev/null +++ b/data/2024/aaai/ReGCL: Rethinking Message Passing in Graph Contrastive Learning @@ -0,0 +1 @@ +Graph contrastive learning (GCL) has demonstrated remarkable efficacy in graph representation learning. However, previous studies have overlooked the inherent conflict that arises when employing graph neural networks (GNNs) as encoders for node-level contrastive learning. This conflict pertains to the partial incongruity between the feature aggregation mechanism of graph neural networks and the embedding distinction characteristic of contrastive learning. Theoretically, to investigate the location and extent of the conflict, we analyze the participation of message-passing from the gradient perspective of InfoNCE loss. Different from contrastive learning in other domains, the conflict in GCL arises due to the presence of certain samples that contribute to both the gradients of positive and negative simultaneously under the manner of message passing, which are opposite optimization directions. To further address the conflict issue, we propose a practical framework called ReGCL, which utilizes theoretical findings of GCL gradients to effectively improve graph contrastive learning. Specifically, two gradient-based strategies are devised in terms of both message passing and loss function to mitigate the conflict. Firstly, a gradient-guided structure learning method is proposed in order to acquire a structure that is adapted to contrastive learning principles. Secondly, a gradient-weighted InfoNCE loss function is designed to reduce the impact of false negative samples with high probabilities, specifically from the standpoint of the graph encoder. Extensive experiments demonstrate the superiority of the proposed method in comparison to state-of-the-art baselines across various node classification benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Reachability of Fair Allocations via Sequential Exchanges b/data/2024/aaai/Reachability of Fair Allocations via Sequential Exchanges new file mode 100644 index 0000000000..6ceeb0f411 --- /dev/null +++ b/data/2024/aaai/Reachability of Fair Allocations via Sequential Exchanges @@ -0,0 +1 @@ +In the allocation of indivisible goods, a prominent fairness notion is envy-freeness up to one good (EF1). We initiate the study of reachability problems in fair division by investigating the problem of whether one EF1 allocation can be reached from another EF1 allocation via a sequence of exchanges such that every intermediate allocation is also EF1. We show that two EF1 allocations may not be reachable from each other even in the case of two agents, and deciding their reachability is PSPACE-complete in general. 
On the other hand, we prove that reachability is guaranteed for two agents with identical or binary utilities as well as for any number of agents with identical binary utilities. We also examine the complexity of deciding whether there is an EF1 exchange sequence that is optimal in the number of exchanges required. \ No newline at end of file diff --git a/data/2024/aaai/Reading between the Lines: Image-Based Order Detection in OCR for Chinese Historical Documents b/data/2024/aaai/Reading between the Lines: Image-Based Order Detection in OCR for Chinese Historical Documents new file mode 100644 index 0000000000..68f3dcfa04 --- /dev/null +++ b/data/2024/aaai/Reading between the Lines: Image-Based Order Detection in OCR for Chinese Historical Documents @@ -0,0 +1 @@ +Cursive written text still represents a challenge for researchers. Latin and Chinese Optical Character Recognition (OCR) systems have been studied extensively in the literature, yet little work has been done on Arabic character recognition, and powerful, stable text segmentation is still needed. In this paper, a segmentation technique capable of processing vowelized Arabic text is introduced. The technique is size and font independent and does not require detection of the centerline. It can also process typeset text and can segment a cursive text line even if the line suffers from skewness. I. INTRODUCTION One of the most important characteristics of Arabic text written on a horizontal line is that characters are connected by a connection line called the "centerline", as shown in fig. 1. Most previous and existing techniques [1-6] depend heavily on the detection of this line, since by deleting the centerline the primitives forming the cursive text are separated. Existing segmentation techniques for cursive written text are font and size dependent, and they differ according to whether the font is typewritten, typeset or handwritten. However, a new technique is presented in this paper that does not require detection of the centerline. Moreover, it is size and font independent, and it can segment the cursive text line even if the line suffers from skewness. II. ARABIC TEXT SEGMENTATION Due to its cursive nature, Arabic text needs to be segmented before the recognition phase in most Arabic OCR systems. This segmentation needs to be accurate and stable for different sizes and fonts, since any segmentation error will propagate into the recognition phase. Some research on Arabic text segmentation used thresholding of the word histogram to detect and eliminate the connection part between two consecutive characters [7, 8]. Others [9, 10] used thresholding on the word outer contour rather than the histogram. However, the use of thresholding in text segmentation needs a priori information about the average character size in the page in order to determine a threshold value. This will not work for omni-size character recognition, where different character sizes may exist on the same text line. It has also been found [4] that the connection parts between characters have different widths for different Arabic fonts. In the following subsections some details about such techniques are given. A. Second Moment Segmentation Technique. The first step in this technique is to detect and isolate the different lines in a given text. Once a line is isolated, a moment histogram is generated. By choosing the right threshold, the histogram is partitioned into segments; for Arabic written text, the resulting segments correspond to the primitives in the Arabic word. Calculation of the centerline: in a region that contains text only, the centerline is normally the row that contains the maximum number of black pixels, as shown in fig. 2. However, the row containing the maximum number of black pixels does not always correspond to the centerline, especially if the line is skewed. The second moment histogram: the centerline generally divides the Arabic word into two parts, and the second moment of each pixel above or below the centerline is computed, as shown in fig. 3. The segmentation threshold: segmenting the Arabic word using the second moment histogram depends on choosing the right threshold. In general, it is not possible to choose a fixed threshold to segment all Arabic words because the threshold is font and size dependent. Figure 3 shows an example of a word in different sizes and the corresponding second moment histograms, and figure 4 shows another example of three words in different fonts, where the difference in the threshold is again clear. B. Contour Segmentation Technique. The contour segmentation algorithm [4] starts with word contour tracing. Figure 5 shows an Arabic word after the elimination of all internal black pixels; it also shows that the connection lines are formed of only two lines (i.e., the columns of the connection line are formed of only two pixels). However, other columns inside the primitives that are not part of the connection line may also contain only two pixels after the elimination of internal pixels. The problem of detecting only those two-pixel columns that belong to the connection line is considered in the sequel. The contour tracing operation leaves the connection line columns with only two pixels, and it is then required to remove this part. To achieve that, the thickness of the text line is divided into three regions, as shown in fig. 6, where the height of each region depends on the font size. The connection parts in the text should lie in the middle region, and hence any column containing two pixels outside this region is not removed. The second moment technique suffers from the fact that it needs prior information about the font size and type: the threshold value must be adjusted to match the size and type of the text to be segmented, or else a trial-and-error procedure is needed to find a suitable threshold. This technique will also fail when different font types and sizes appear on the same line, or if the line is skewed. The contour segmentation technique, in turn, suffers from its dependence on the detection of the centerline and needs special enhancement to handle vowelized text. These factors justify the need for another segmentation technique that is type and size independent (i.e., omni) and independent of the centerline. Such a technique is described in the next section. III. THE CENTERLINE INDEPENDENT TECHNIQUE The new technique is based on the detection of upward spikes present in the written text. It scans each line from right to left and segments it into isolated regions, giving each region an index, as shown in fig. 7. In Arabic cursive text, the resulting regions can be clustered into four types: isolated characters; isolated diacritics and Hamza; isolated vowelization marks; and isolated sub-words (a whole word or a part of it). A specific filter applied during the scanning detects the upward spikes in the sub-words, which further divides these sub-words into primitives. Each primitive is given a different index, as shown in figs. 7 and 8. IV. CONCLUSIONS In this work, a new segmentation technique was presented that does not depend on centerline detection. Preliminary experiments showed promising results when processing either vowelized Arabic or typeset text (figs. 9 and 10). The technique is able to segment text lines into the corresponding primitives independently of the text font and size. It works even if more than one character set is present in the text line and can also tolerate line skewness. Figure 11 shows promising results when applied to handwritten Arabic text. The work presented in this paper can be integrated with different recognition techniques, whether based on group classification [11], neural networks [12], HMMs [13] or any other recognition system, to build a complete Arabic OCR. V. REFERENCES [1] S. Mori, C. Y. Suen, and K. Yamamoto, "Historical review of OCR research and development," Proceedings of the IEEE 80(7), pp. 1029-1058, 1992. [2] A. Amin, "Off-line Arabic character recognition: The state of the art," Pattern Recognition 31(5), pp. 517-530, 1998. [3] S. M. Yamany, "A complete analysis of an Arabic text reader," Master's thesis, Systems and Biomedical Dept., Faculty of Eng., Cairo Univ., Egypt, Mar. 1995. [4] M. A. Hashsish, A. R. El-Bialy, A. H. Kandil, and S. M. Yamany, "A novel segmentation technique for cursive written text," Proc. Al-Azhar Eng. 4th Int. Conf., 1995. [5] M. A. Hashsish, A. R. El-Bialy, A. H. Kandil, and S. M. Yamany, "Topological features: Towards an omni Arabic text reader," Proc. Al-Azhar Eng. 5th Int. Conf., 1997. [6] H. Abdelazim and M. A. Hashish, "Arabic reading machine," Proc. 10th National Computer Conf., pp. 733-740, 1988, Riyadh, Saudi Arabia. [7] H. Abdelazim and M. A. Hashish, "Automatic recognition of Arabic text," 10th Image/ITL Conf. in IBM Toronto Lab., 1987, Canada. [8] A. Amin and S. Al-Fedaghi, "Machine recognition of printed Arabic text utilizing a natural language morphology," Int. J. Man-Machine Stud. 35, pp. 769-788, 1991. [9] T. El-Sheikh and R. Guindi, "Computer recognition of Arabic cursive script," Pattern Recognition 21, pp. 293-302, 1988. [10] V. Margner, "SARAT - a system for the recognition of Arabic printed text," Proc. 11th Int. Conf. on Pattern Recognition, pp. 561-564, 1992. [11] A. El-Bialy, A. H. Kandil, M. Hashish and S. Yamany, "Arabic OCR: Toward a Complete System," Document Recognition and Retrieval VII, SPIE Vol. 3967, pp. 42-51, Jan. 2000. [12] A. Amin and H. Al-Sadoun, "Handprinted Arabic character recognition system using an artificial neural network," Pattern Recognition 29, pp. 663-675, 1996. [13] L. Zhidong, I. Bazzi, A. Kornai, J. Makhoul, P. Natarajan and R. Schwartz, "A Robust, Language Independent OCR," AIPR 98, SPIE Vol. 3584, pp. 96-104, Oct. 1998.
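A minimal numpy sketch of the second-moment idea described above, assuming a binary line image with 1 marking ink pixels: the centerline is taken as the row with the most ink, the per-column second moment is computed about that row, and columns below a fixed threshold are treated as connection parts between primitives. The image and threshold are illustrative; as the text notes, such a fixed threshold is exactly what makes the technique font- and size-dependent.

import numpy as np

def second_moment_segments(line_img, threshold):
    # Illustrative sketch of the second-moment segmentation technique described above.
    centerline = int(np.argmax(line_img.sum(axis=1)))            # row with the most black pixels
    rows = np.arange(line_img.shape[0])[:, None]
    moment = ((rows - centerline) ** 2 * line_img).sum(axis=0)    # per-column second moment
    is_primitive = moment > threshold                             # low-moment columns ~ connection parts
    segments, start = [], None
    for col, flag in enumerate(is_primitive):                     # collect runs of primitive columns
        if flag and start is None:
            start = col
        elif not flag and start is not None:
            segments.append((start, col - 1)); start = None
    if start is not None:
        segments.append((start, len(is_primitive) - 1))
    return segments

# Tiny synthetic example: two blobs joined by a thin connection stroke along the centerline.
img = np.zeros((10, 20), dtype=int)
img[3:8, 2:7] = 1; img[3:8, 13:18] = 1; img[5, 7:13] = 1
print(second_moment_segments(img, threshold=3))                   # expected: [(2, 6), (13, 17)]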
\ No newline at end of file diff --git a/data/2024/aaai/Real3D: The Curious Case of Neural Scene Degeneration b/data/2024/aaai/Real3D: The Curious Case of Neural Scene Degeneration new file mode 100644 index 0000000000..a8807518f2 --- /dev/null +++ b/data/2024/aaai/Real3D: The Curious Case of Neural Scene Degeneration @@ -0,0 +1,5 @@ +Despite significant progress in utilizing pre-trained text-to-image diffusion models to guide the creation of 3D scenes, these methods often struggle to generate scenes that are sufficiently realistic, leading to "neural scene degeneration". +In this work, we propose a new 3D scene generation model called Real3D. +Specifically, Real3D designs a pipeline from a NeRF-like implicit renderer to a tetrahedrons-based explicit renderer, greatly improving the neural network's ability to generate various neural scenes. +Moreover, Real3D introduces an additional discriminator to prevent neural scenes from falling into undesirable local optima, thus avoiding the degeneration phenomenon. +Our experimental results demonstrate that Real3D outperforms all existing state-of-the-art text-to-3D generation methods, providing valuable insights to facilitate the development of learning-based 3D scene generation approaches. \ No newline at end of file diff --git a/data/2024/aaai/Reasoning about Causality in Games (Abstract Reprint) b/data/2024/aaai/Reasoning about Causality in Games (Abstract Reprint) new file mode 100644 index 0000000000..69e60ab3a3 --- /dev/null +++ b/data/2024/aaai/Reasoning about Causality in Games (Abstract Reprint) @@ -0,0 +1,11 @@ +Causal reasoning and game-theoretic reasoning are fundamental topics in artificial intelligence, among many other disciplines: this paper is concerned with their intersection. Despite their importance, a formal framework that supports both these forms of reasoning has, until now, been lacking. We offer a solution in the form of (structural) causal games, which can be seen as extending Pearl's causal hierarchy to the game-theoretic domain, or as extending Koller and Milch's multi-agent influence diagrams to the causal domain. We then consider three key questions: +i) +How can the (causal) dependencies in games – either between variables, or between strategies – be modelled in a uniform, principled manner? + +ii) +How may causal queries be computed in causal games, and what assumptions does this require? + +iii) +How do causal games compare to existing formalisms? + +To address question i), we introduce mechanised games, which encode dependencies between agents' decision rules and the distributions governing the game. In response to question ii), we present definitions of predictions, interventions, and counterfactuals, and discuss the assumptions required for each. Regarding question iii), we describe correspondences between causal games and other formalisms, and explain how causal games can be used to answer queries that other causal or game-theoretic models do not support. Finally, we highlight possible applications of causal games, aided by an extensive open-source Python library. 
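As a generic illustration of the interventional queries mentioned in the causal-games abstract above, the toy structural causal model below contrasts an observational estimate of P(Y=1) with estimates under do(X=1) and do(X=0). It is plain Python with made-up mechanisms and does not use or reflect the paper's causal-games formalism or its accompanying library.

import random

def sample_scm(do_x=None, n=10000):
    # Tiny structural causal model: X := fair coin, Y := X xor noise.
    # Passing do_x overrides X's mechanism, mimicking an intervention do(X = x).
    ys = []
    for _ in range(n):
        x = do_x if do_x is not None else random.random() < 0.5
        noise = random.random() < 0.1
        ys.append(x != noise)
    return sum(ys) / n

print("P(Y=1)           =", sample_scm())
print("P(Y=1 | do(X=1)) =", sample_scm(do_x=True))
print("P(Y=1 | do(X=0)) =", sample_scm(do_x=False))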
\ No newline at end of file diff --git a/data/2024/aaai/RecWizard: A Toolkit for Conversational Recommendation with Modular, Portable Models and Interactive User Interface b/data/2024/aaai/RecWizard: A Toolkit for Conversational Recommendation with Modular, Portable Models and Interactive User Interface new file mode 100644 index 0000000000..544f78602c --- /dev/null +++ b/data/2024/aaai/RecWizard: A Toolkit for Conversational Recommendation with Modular, Portable Models and Interactive User Interface @@ -0,0 +1 @@ +We present a new Python toolkit called RecWizard for Conversational Recommender Systems (CRS). RecWizard offers support for the development of models and an interactive user interface, drawing from the best practices of the Huggingface ecosystem. CRS built with RecWizard are modular, portable, interactive and Large Language Model (LLM)-friendly, streamlining the learning process and reducing the additional effort required for CRS research. For more comprehensive information about RecWizard, please check our GitHub https://github.com/McAuley-Lab/RecWizard. \ No newline at end of file diff --git a/data/2024/aaai/Recall-Oriented Continual Learning with Generative Adversarial Meta-Model b/data/2024/aaai/Recall-Oriented Continual Learning with Generative Adversarial Meta-Model new file mode 100644 index 0000000000..1f45caeb6d --- /dev/null +++ b/data/2024/aaai/Recall-Oriented Continual Learning with Generative Adversarial Meta-Model @@ -0,0 +1 @@ +The stability-plasticity dilemma is a major challenge in continual learning, as it involves balancing the conflicting objectives of maintaining performance on previous tasks while learning new tasks. In this paper, we propose the recall-oriented continual learning framework to address this challenge. Inspired by the human brain’s ability to separate the mechanisms responsible for stability and plasticity, our framework consists of a two-level architecture where an inference network effectively acquires new knowledge and a generative network recalls past knowledge when necessary. In particular, to maximize the stability of past knowledge, we investigate the complexity of knowledge under different representations, thereby introducing the generative adversarial meta-model (GAMM), which incrementally learns task-specific parameters instead of input data samples of the task. Through our experiments, we show that our framework not only effectively learns new knowledge without any disruption but also achieves high stability of previous knowledge in both task-aware and task-agnostic learning scenarios. Our code is available at: https://github.com/bigdata-inha/recall-orientedcl-framework. \ No newline at end of file diff --git a/data/2024/aaai/Recasting Regional Lighting for Shadow Removal b/data/2024/aaai/Recasting Regional Lighting for Shadow Removal new file mode 100644 index 0000000000..ec7d5f786d --- /dev/null +++ b/data/2024/aaai/Recasting Regional Lighting for Shadow Removal @@ -0,0 +1,2 @@ +Removing shadows requires an understanding of both lighting conditions and object textures in a scene. Existing methods typically learn pixel-level color mappings between +shadow and non-shadow images, in which the joint modeling of lighting and object textures is implicit and inadequate. We observe that in a shadow region, the degradation degree of object textures depends on the local illumination, while simply enhancing the local illumination cannot fully recover the attenuated textures.
Based on this observation, we propose to condition the restoration of attenuated textures on the corrected local lighting in the shadow region. Specifically, we first design a shadow-aware decomposition network to estimate the illumination and reflectance layers of shadow regions explicitly. We then propose a novel bilateral correction network to recast the lighting of shadow regions in the illumination layer via a local lighting correction module, and to restore the textures conditioned on the corrected illumination layer via an illumination-guided texture restoration module. We further annotate pixel-wise shadow masks for the public SRD dataset, which originally contains only image pairs. Experiments on three benchmarks show that our method outperforms existing state-of-the-art shadow removal methods. Project page: yuhaoliu7456.github.io/RRL-Net. \ No newline at end of file diff --git a/data/2024/aaai/Recent Advancements in Inverse Reinforcement Learning b/data/2024/aaai/Recent Advancements in Inverse Reinforcement Learning new file mode 100644 index 0000000000..5ce7fabd49 --- /dev/null +++ b/data/2024/aaai/Recent Advancements in Inverse Reinforcement Learning @@ -0,0 +1,9 @@ +Inverse reinforcement learning (IRL) has seen significant advancements in recent years. This class of approaches aims to efficiently learn the underlying reward function that rationalizes the behavior exhibited by expert agents, often represented by humans. In contrast to mere behavioral cloning, the reconstruction of a reward function yields appealing implications, as it allows for more effective interpretability of the expert’s decisions and provides a transferable specification of the expert’s objectives for application even in different environments. Unlike reinforcement learning (RL), which is well understood from a theoretical perspective, IRL still grapples with limited theoretical understanding, significantly constraining its applicability. A fundamental challenge in IRL is the inherent ambiguity in selecting a reward function, given the existence of multiple candidate functions, all explaining the expert’s behavior. + +In this talk, I will survey three of my papers that have made notable contributions to the IRL field: “Provably Efficient Learning of Transferable Rewards”, “Towards Theoretical Understanding of Inverse Reinforcement Learning”, and “Inverse Reinforcement Learning with Sub-optimal Experts”. + +The central innovation introduced by the first paper is a novel formulation of the IRL problem that overcomes the issue of ambiguity. IRL is reframed as the problem of learning the feasible reward set, which is the set of all rewards that can explain the expert’s behavior. This approach postpones the selection of the reward function, thereby circumventing the ambiguity issues. Furthermore, the feasible reward set exhibits convenient geometric properties that enable the development of efficient algorithms for its computation. + +Building on this novel formulation of IRL, the second paper addresses the problem of efficiently learning the feasible reward set when the environment and the expert’s policy are not known in advance. It introduces a novel way to assess the dissimilarity between feasible reward sets based on the Hausdorff distance and presents a new PAC (probably approximately correct) framework. The most significant contribution of this paper is the introduction of the first sample complexity lower bound, which highlights the challenges inherent in the IRL problem.
Deriving this lower bound necessitated the development of novel technical tools. The paper also demonstrates that when a generative model of the environment is available, a uniform sampling strategy achieves a sample complexity that matches the lower bound, up to logarithmic factors. + +Finally, in the third paper, the IRL problem in the presence of sub-optimal experts is investigated. Specifically, the paper assumes the availability of multiple sub-optimal experts, in addition to the expert agent, which provides additional demonstrations, associated with a known quantification of the maximum amount of sub-optimality. The paper shows that this richer information mitigates the ambiguity problem, significantly reducing the size of the feasible reward set while retaining its favorable geometric properties. Furthermore, the paper explores the associated statistical problem and derives novel lower bounds for sample complexity, along with almost matching algorithms. These selected papers represent notable advancements in IRL, contributing to the establishment of a solid theoretical foundation for IRL and extending the framework to accommodate scenarios with sub-optimal experts. \ No newline at end of file diff --git a/data/2024/aaai/Recognizing Ultra-High-Speed Moving Objects with Bio-Inspired Spike Camera b/data/2024/aaai/Recognizing Ultra-High-Speed Moving Objects with Bio-Inspired Spike Camera new file mode 100644 index 0000000000..3e2f419716 --- /dev/null +++ b/data/2024/aaai/Recognizing Ultra-High-Speed Moving Objects with Bio-Inspired Spike Camera @@ -0,0 +1 @@ +Bio-inspired spike camera mimics the sampling principle of primate fovea. It presents high temporal resolution and dynamic range, showing great promise in fast-moving object recognition. However, the physical limit of CMOS technology in spike cameras still hinders their capability of recognizing ultra-high-speed moving objects, e.g., extremely fast motions cause blur during the imaging process of spike cameras. This paper presents the first theoretical analysis for the causes of spiking motion blur and proposes a robust representation that addresses this issue through temporal-spatial context learning. The proposed method leverages multi-span feature aggregation to capture temporal cues and employs residual deformable convolution to model spatial correlation among neighbouring pixels. Additionally, this paper contributes an original real-captured spiking recognition dataset consisting of 12,000 ultra-high-speed (equivalent speed > 500 km/h) moving objects. Experimental results show that the proposed method achieves 73.2% accuracy in recognizing 10 classes of ultra-high-speed moving objects, outperforming all existing spike-based recognition methods. Resources will be available at https://github.com/Evin-X/UHSR. \ No newline at end of file diff --git a/data/2024/aaai/Recommender Ecosystems: A Mechanism Design Perspective on Holistic Modeling and Optimization b/data/2024/aaai/Recommender Ecosystems: A Mechanism Design Perspective on Holistic Modeling and Optimization new file mode 100644 index 0000000000..eea62a1fad --- /dev/null +++ b/data/2024/aaai/Recommender Ecosystems: A Mechanism Design Perspective on Holistic Modeling and Optimization @@ -0,0 +1 @@ +Modern recommender systems lie at the heart of complex recommender ecosystems that couple the behavior of users, content providers, vendors, advertisers, and other actors. 
Despite this, the focus of much recommender systems research and deployment is on the local, myopic optimization of the recommendations made to individual users. This comes at a significant cost to the long-term utility that recommender systems generate for their users. We argue that modeling the incentives and behaviors of these actors, and the interactions among them induced by the recommender systems, is needed to maximize value and improve overall ecosystem health. Moreover, we propose the use of economic mechanism design, an area largely overlooked in recommender systems research, as a framework for developing such models. That said, one cannot apply “vanilla” mechanism design to recommender ecosystem modeling optimization out of the box—the use of mechanism design raises a number of subtle and interesting research challenges. We outline a number of these in this talk (and paper), emphasizing the need to develop nonstandard approaches to mechanism design that intersect with numerous areas of research, including preference modeling, reinforcement learning and exploration, behavioral economics, and generative AI, among others. \ No newline at end of file diff --git a/data/2024/aaai/Reconciling Predictive and Statistical Parity: A Causal Approach b/data/2024/aaai/Reconciling Predictive and Statistical Parity: A Causal Approach new file mode 100644 index 0000000000..4b3b23b0f6 --- /dev/null +++ b/data/2024/aaai/Reconciling Predictive and Statistical Parity: A Causal Approach @@ -0,0 +1 @@ +Since the rise of fair machine learning as a critical field of inquiry, many different notions on how to quantify and measure discrimination have been proposed in the literature. Some of these notions, however, were shown to be mutually incompatible. Such findings make it appear that numerous different kinds of fairness exist, thereby making a consensus on the appropriate measure of fairness harder to reach, hindering the applications of these tools in practice. In this paper, we investigate one of these key impossibility results that relates the notions of statistical and predictive parity. Specifically, we derive a new causal decomposition formula for the fairness measures associated with predictive parity, and obtain a novel insight into how this criterion is related to statistical parity through the legal doctrines of disparate treatment, disparate impact, and the notion of business necessity. Our results show that through a more careful causal analysis, the notions of statistical and predictive parity are not really mutually exclusive, but complementary and spanning a spectrum of fairness notions through the concept of business necessity. Finally, we demonstrate the importance of our findings on a real-world example. \ No newline at end of file diff --git a/data/2024/aaai/Rectangle Search: An Anytime Beam Search b/data/2024/aaai/Rectangle Search: An Anytime Beam Search new file mode 100644 index 0000000000..6e1b7c21e1 --- /dev/null +++ b/data/2024/aaai/Rectangle Search: An Anytime Beam Search @@ -0,0 +1 @@ +Anytime heuristic search algorithms try to find a (potentially suboptimal) solution as quickly as possible and then work to find better and better solutions until an optimal solution is obtained or time is exhausted. The most widely-known anytime search algorithms are based on best-first search. In this paper, we propose a new algorithm, rectangle search, that is instead based on beam search, a variant of breadth-first search. 
It repeatedly explores alternatives at all depth levels and is thus best-suited to problems featuring deep local minima. Experiments using a variety of popular search benchmarks suggest that rectangle search is competitive with fixed-width beam search and often performs better than the previous best anytime search algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Recurrent Graph Neural Networks and Their Connections to Bisimulation and Logic b/data/2024/aaai/Recurrent Graph Neural Networks and Their Connections to Bisimulation and Logic new file mode 100644 index 0000000000..6326a622e0 --- /dev/null +++ b/data/2024/aaai/Recurrent Graph Neural Networks and Their Connections to Bisimulation and Logic @@ -0,0 +1 @@ +The success of Graph Neural Networks (GNNs) in practice has motivated extensive research on their theoretical properties. This includes recent results that characterise node classifiers expressible by GNNs in terms of first order logic. Most of the analysis, however, has been focused on GNNs with fixed number of message-passing iterations (i.e., layers), which cannot realise many simple classifiers such as reachability of a node with a given label. In this paper, we start to fill this gap and study the foundations of GNNs that can perform more than a fixed number of message-passing iterations. We first formalise two generalisations of the basic GNNs: recurrent GNNs (RecGNNs), which repeatedly apply message-passing iterations until the node classifications become stable, and graph-size GNNs (GSGNNs), which exploit a built-in function of the input graph size to decide the number of message-passings. We then formally prove that GNN classifiers are strictly less expressive than RecGNN ones, and RecGNN classifiers are strictly less expressive than GSGNN ones. To get this result, we identify novel semantic characterisations of the three formalisms in terms of suitable variants of bisimulation, which we believe have their own value for our understanding of GNNs. Finally, we prove syntactic logical characterisations of RecGNNs and GSGNNs analogous to the logical characterisation of plain GNNs, where we connect the two formalisms to monadic monotone fixpoint logic---a generalisation of first-order logic that supports recursion. \ No newline at end of file diff --git a/data/2024/aaai/Recurrent Partial Kernel Network for Efficient Optical Flow Estimation b/data/2024/aaai/Recurrent Partial Kernel Network for Efficient Optical Flow Estimation new file mode 100644 index 0000000000..dfced8c69c --- /dev/null +++ b/data/2024/aaai/Recurrent Partial Kernel Network for Efficient Optical Flow Estimation @@ -0,0 +1 @@ +Optical flow estimation is a challenging task consisting of predicting per-pixel motion vectors between images. Recent methods have employed larger and more complex models to improve the estimation accuracy. However, this impacts the widespread adoption of optical flow methods and makes it harder to train more general models since the optical flow data is hard to obtain. This paper proposes a small and efficient model for optical flow estimation. We design a new spatial recurrent encoder that extracts discriminative features at a significantly reduced size. Unlike standard recurrent units, we utilize Partial Kernel Convolution (PKConv) layers to produce variable multi-scale features with a single shared block. We also design efficient Separable Large Kernels (SLK) to capture large context information with low computational cost. 
Experiments on public benchmarks show that we achieve state-of-the-art generalization performance while requiring significantly fewer parameters and less memory than competing methods. Our model ranks first in the Spring benchmark without finetuning, improving the results by over 10% while requiring an order of magnitude fewer FLOPs and over four times less memory than the next best published method without finetuning. The code is available at github.com/hmorimitsu/ptlflow/tree/main/ptlflow/models/rpknet. \ No newline at end of file diff --git a/data/2024/aaai/RedCore: Relative Advantage Aware Cross-Modal Representation Learning for Missing Modalities with Imbalanced Missing Rates b/data/2024/aaai/RedCore: Relative Advantage Aware Cross-Modal Representation Learning for Missing Modalities with Imbalanced Missing Rates new file mode 100644 index 0000000000..805eb8fd49 --- /dev/null +++ b/data/2024/aaai/RedCore: Relative Advantage Aware Cross-Modal Representation Learning for Missing Modalities with Imbalanced Missing Rates @@ -0,0 +1 @@ +Multimodal learning is susceptible to modality missing, which poses a major obstacle to its practical applications and, thus, invigorates increasing research interest. In this paper, we investigate two challenging problems: 1) when modality missing exists in the training data, how to exploit the incomplete samples while guaranteeing that they are properly supervised? 2) when the missing rates of different modalities vary, causing or exacerbating the imbalance among modalities, how to address the imbalance and ensure that all modalities are well-trained? To tackle these two challenges, we first introduce the variational information bottleneck (VIB) method for the cross-modal representation learning of missing modalities, which capitalizes on the available modalities and the labels as supervision. Then, accounting for the imbalanced missing rates, we define relative advantage to quantify the advantage of each modality over others. Accordingly, a bi-level optimization problem is formulated to adaptively regulate the supervision of all modalities during training. As a whole, the proposed approach features Relative advantage aware Cross-modal representation learning (abbreviated as RedCore) for missing modalities with imbalanced missing rates. Extensive empirical results demonstrate that RedCore outperforms competing models in that it exhibits superior robustness against either large or imbalanced missing rates. The code is available at: https://github.com/sunjunaimer/RedCore. \ No newline at end of file diff --git a/data/2024/aaai/Redefining ABA+ Semantics via Abstract Set-to-Set Attacks b/data/2024/aaai/Redefining ABA+ Semantics via Abstract Set-to-Set Attacks new file mode 100644 index 0000000000..f33466b81b --- /dev/null +++ b/data/2024/aaai/Redefining ABA+ Semantics via Abstract Set-to-Set Attacks @@ -0,0 +1 @@ +Assumption-based argumentation (ABA) is a powerful defeasible reasoning formalism which is based on the interplay of assumptions, their contraries, and inference rules. ABA with preferences (ABA+) generalizes the basic model by allowing qualitative comparison between assumptions. The integration of preferences however comes with a cost. In ABA+, the evaluation under two central and well-established semantics---grounded and complete semantics---is not guaranteed to yield an outcome.
Moreover, while ABA frameworks without preferences allow for a graph-based representation in Dung-style frameworks, a corresponding instantiation for general ABA+ frameworks has not been established so far. In this work, we tackle both issues: First, we develop a novel abstract argumentation formalism based on set-to-set attacks. We show that our so-called Hyper Argumentation Frameworks (HYPAFs) capture ABA+. Second, we propose relaxed variants of complete and grounded semantics for HYPAFs that yield an extension for all frameworks by design, while still faithfully generalizing the established semantics of Dung-style Argumentation Frameworks. We exploit the newly established correspondence between ABA+ and HYPAFs to obtain variants for grounded and complete ABA+ semantics that are guaranteed to yield an outcome. Finally, we discuss basic properties and provide a complexity analysis. Along the way, we settle the computational complexity of several ABA+ semantics. \ No newline at end of file diff --git a/data/2024/aaai/Redefining the Laparoscopic Spatial Sense: AI-Based Intra- and Postoperative Measurement from Stereoimages b/data/2024/aaai/Redefining the Laparoscopic Spatial Sense: AI-Based Intra- and Postoperative Measurement from Stereoimages new file mode 100644 index 0000000000..ddc2ccc112 --- /dev/null +++ b/data/2024/aaai/Redefining the Laparoscopic Spatial Sense: AI-Based Intra- and Postoperative Measurement from Stereoimages @@ -0,0 +1 @@ +A significant challenge in image-guided surgery is the accurate measurement of relevant structures such as vessel segments, resection margins, or bowel lengths. While this task is an essential component of many surgeries, it involves substantial human effort and is prone to inaccuracies. In this paper, we develop a novel human-AI-based method for laparoscopic measurements utilizing stereo vision, whose design has been guided by practicing surgeons. Based on a holistic qualitative requirements analysis, this work proposes a comprehensive measurement method, which comprises state-of-the-art machine learning architectures, such as RAFT-Stereo and YOLOv8. The developed method is assessed in various realistic experimental evaluation environments. Our results highlight the potential of our method, which achieves high accuracy in distance measurements with errors below 1 mm. Furthermore, on-surface measurements demonstrate robustness when applied in challenging environments with textureless regions. Overall, by addressing the inherent challenges of image-guided surgery, we lay the foundation for a more robust and accurate solution for intra- and postoperative measurements, enabling more precise, safe, and efficient surgical procedures. \ No newline at end of file diff --git a/data/2024/aaai/Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models b/data/2024/aaai/Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models new file mode 100644 index 0000000000..1999851dfa --- /dev/null +++ b/data/2024/aaai/Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models @@ -0,0 +1 @@ +Denoising Diffusion models have exhibited remarkable capabilities in image generation. However, generating high-quality samples requires a large number of iterations. Knowledge distillation for diffusion models is an effective method to address this limitation with a shortened sampling process but causes degraded generative quality.
Based on our analysis with bias-variance decomposition and experimental observations, we attribute the degradation to the spatial fitting error occurring in the training of both the teacher and student model in the distillation. Accordingly, we propose the Spatial Fitting-Error Reduction Distillation model (SFERD). SFERD utilizes attention guidance from the teacher model and a designed semantic gradient predictor to reduce the student's fitting error. Empirically, our proposed model facilitates high-quality sample generation in a few function evaluations. We achieve an FID of 5.31 on CIFAR-10 and 9.39 on ImageNet 64x64 with only one step, outperforming existing diffusion methods. Our study provides a new perspective on diffusion distillation by highlighting the intrinsic denoising ability of models. \ No newline at end of file diff --git a/data/2024/aaai/Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation b/data/2024/aaai/Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation new file mode 100644 index 0000000000..d711c9700c --- /dev/null +++ b/data/2024/aaai/Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation @@ -0,0 +1,5 @@ +Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. It is challenging to explore the semantic alignment within modalities and the visual correspondence across frames. +However, existing methods adopt separate network architectures for different modalities, and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. +Firstly, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. +Secondly, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. +On Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating the significance of our method for unified multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR. \ No newline at end of file diff --git a/data/2024/aaai/Refined Characterizations of Approval-Based Committee Scoring Rules b/data/2024/aaai/Refined Characterizations of Approval-Based Committee Scoring Rules new file mode 100644 index 0000000000..7730a253f3 --- /dev/null +++ b/data/2024/aaai/Refined Characterizations of Approval-Based Committee Scoring Rules @@ -0,0 +1 @@ +In approval-based committee (ABC) elections, the goal is to select a fixed-size subset of the candidates, a so-called committee, based on the voters' approval ballots over the candidates.
One of the most popular classes of ABC voting rules is ABC scoring rules, for which voters give points to each committee and the committees with maximal total points are chosen. While the set of ABC scoring rules has recently been characterized in a model where the output is a ranking of all committees, no full characterization of these rules exists in the standard model where a set of winning committees is returned. We address this issue by characterizing two important subclasses of ABC scoring rules in the standard ABC election model, thereby both extending the result for ABC ranking rules to the standard setting and refining it to subclasses. In more detail, by relying on a consistency axiom for variable electorates, we characterize (i) the prominent class of Thiele rules and (ii) a new class of ABC voting rules called ballot size weighted approval voting. Based on these theorems, we also infer characterizations of three well-known ABC voting rules, namely multi-winner approval voting, proportional approval voting, and satisfaction approval voting. \ No newline at end of file diff --git a/data/2024/aaai/Refining Latent Homophilic Structures over Heterophilic Graphs for Robust Graph Convolution Networks b/data/2024/aaai/Refining Latent Homophilic Structures over Heterophilic Graphs for Robust Graph Convolution Networks new file mode 100644 index 0000000000..8d9e86d84f --- /dev/null +++ b/data/2024/aaai/Refining Latent Homophilic Structures over Heterophilic Graphs for Robust Graph Convolution Networks @@ -0,0 +1 @@ +Graph convolution networks (GCNs) are extensively utilized in various graph tasks to mine knowledge from spatial data. Our study marks the first attempt to quantitatively investigate GCN robustness over omnipresent heterophilic graphs for node classification. We uncover that the predominant vulnerability is caused by the structural out-of-distribution (OOD) issue. This finding motivates us to present a novel method that aims to harden GCNs by automatically learning Latent Homophilic Structures over heterophilic graphs. We term this methodology LHS. To elaborate, our initial step involves learning a latent structure by employing a novel self-expressive technique based on multi-node interactions. Subsequently, the structure is refined using a pairwise-constrained dual-view contrastive learning approach. We iteratively perform the above procedure, enabling a GCN model to aggregate information in a homophilic way on heterophilic graphs. Armed with such an adaptable structure, we can properly mitigate the structural OOD threats over heterophilic graphs. Experiments on various benchmarks show the effectiveness of the proposed LHS approach for robust GCNs. \ No newline at end of file diff --git a/data/2024/aaai/Region-Aware Exposure Consistency Network for Mixed Exposure Correction b/data/2024/aaai/Region-Aware Exposure Consistency Network for Mixed Exposure Correction new file mode 100644 index 0000000000..de7158473a --- /dev/null +++ b/data/2024/aaai/Region-Aware Exposure Consistency Network for Mixed Exposure Correction @@ -0,0 +1 @@ +Exposure correction aims to enhance images suffering from improper exposure to achieve satisfactory visual effects. Despite recent progress, existing methods generally mitigate either overexposure or underexposure in input images, and they still struggle to handle images with mixed exposure, i.e., one image incorporates both overexposed and underexposed regions.
The mixed exposure distribution is non-uniform and leads to varying representation, which makes it challenging to address in a unified process. In this paper, we introduce an effective Region-aware Exposure Correction Network (RECNet) that can handle mixed exposure by adaptively learning and bridging different regional exposure representations. Specifically, to address the challenge posed by mixed exposure disparities, we develop a region-aware de-exposure module that effectively translates regional features of mixed exposure scenarios into an exposure-invariant feature space. Simultaneously, as de-exposure operation inevitably reduces discriminative information, we introduce a mixed-scale restoration unit that integrates exposure-invariant features and unprocessed features to recover local information. To further achieve a uniform exposure distribution in the global image, we propose an exposure contrastive regularization strategy under the constraints of intra-regional exposure consistency and inter-regional exposure continuity. Extensive experiments are conducted on various datasets, and the experimental results demonstrate the superiority and generalization of our proposed method. The code is released at: https://github.com/kravrolens/RECNet. \ No newline at end of file diff --git a/data/2024/aaai/Region-Disentangled Diffusion Model for High-Fidelity PPG-to-ECG Translation b/data/2024/aaai/Region-Disentangled Diffusion Model for High-Fidelity PPG-to-ECG Translation new file mode 100644 index 0000000000..5c66eedb10 --- /dev/null +++ b/data/2024/aaai/Region-Disentangled Diffusion Model for High-Fidelity PPG-to-ECG Translation @@ -0,0 +1 @@ +The high prevalence of cardiovascular diseases (CVDs) calls for accessible and cost-effective continuous cardiac monitoring tools. Despite Electrocardiography (ECG) being the gold standard, continuous monitoring remains a challenge, leading to the exploration of Photoplethysmography (PPG), a promising but more basic alternative available in consumer wearables. This notion has recently spurred interest in translating PPG to ECG signals. In this work, we introduce Region-Disentangled Diffusion Model (RDDM), a novel diffusion model designed to capture the complex temporal dynamics of ECG. Traditional Diffusion models like Denoising Diffusion Probabilistic Models (DDPM) face challenges in capturing such nuances due to the indiscriminate noise addition process across the entire signal. Our proposed RDDM overcomes such limitations by incorporating a novel forward process that selectively adds noise to specific regions of interest (ROI) such as QRS complex in ECG signals, and a reverse process that disentangles the denoising of ROI and non-ROI regions. Quantitative experiments demonstrate that RDDM can generate high-fidelity ECG from PPG in as few as 10 diffusion steps, making it highly effective and computationally efficient. Additionally, to rigorously validate the usefulness of the generated ECG signals, we introduce CardioBench, a comprehensive evaluation benchmark for a variety of cardiac-related tasks including heart rate and blood pressure estimation, stress classification, and the detection of atrial fibrillation and diabetes. Our thorough experiments show that RDDM achieves state-of-the-art performance on CardioBench. To the best of our knowledge, RDDM is the first diffusion model for cross-modal signal-to-signal translation in the bio-signal domain. 
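To make the region-selective forward process described in the Region-Disentangled Diffusion Model (RDDM) abstract above easier to picture, the following Python sketch shows one way a forward noising step could be gated by a region-of-interest mask, so that noise is injected around the QRS complex while the rest of the signal stays clean. The schedule, the mask, and all function and variable names here are illustrative assumptions for exposition, not the authors' implementation, and the actual RDDM formulation may gate the noise differently.

import numpy as np

def cosine_alpha_bar(t, T):
    # Cumulative signal-retention coefficient from a cosine schedule
    # (any monotone schedule would do for this illustration).
    s = 0.008
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def region_selective_forward(x0, roi_mask, t, T, rng):
    # One possible reading of a region-selective forward step: diffuse the
    # whole signal as usual, but keep the noised values only inside the ROI
    # mask; outside the ROI the signal stays clean. Illustrative only.
    a_bar = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    x_noisy = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return roi_mask * x_noisy + (1.0 - roi_mask) * x0, eps

rng = np.random.default_rng(0)
ecg = rng.standard_normal(512)      # stand-in for a 1-D ECG segment
qrs = np.zeros(512)
qrs[240:272] = 1.0                  # toy ROI mask around a "QRS complex"
x_t, eps = region_selective_forward(ecg, qrs, t=5, T=10, rng=rng)

A matching reverse process would then treat the masked and unmasked regions separately, which is the disentangled denoising the abstract refers to.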
\ No newline at end of file diff --git a/data/2024/aaai/Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes b/data/2024/aaai/Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes new file mode 100644 index 0000000000..8f6b4a2b83 --- /dev/null +++ b/data/2024/aaai/Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes @@ -0,0 +1 @@ +In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Distinguishing itself from existing works within this context, our approach harnesses the power of the general policy gradient-based algorithm, liberating it from the constraints of assuming a linear MDP structure. We propose a vanilla policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has O(T^3/4) regret. Remarkably, this paper marks a pioneering effort by presenting the first exploration into regret bound computation for the general parameterized policy gradient algorithm in the context of average reward scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Regret Analysis of Repeated Delegated Choice b/data/2024/aaai/Regret Analysis of Repeated Delegated Choice new file mode 100644 index 0000000000..f178449ee1 --- /dev/null +++ b/data/2024/aaai/Regret Analysis of Repeated Delegated Choice @@ -0,0 +1 @@ +We present a study on a repeated delegated choice problem, which is the first to consider an online learning variant of Kleinberg and Kleinberg, EC'18. In this model, a principal interacts repeatedly with an agent who possesses an exogenous set of solutions to search for efficient ones. Each solution can yield varying utility for both the principal and the agent, and the agent may propose a solution to maximize its own utility in a selfish manner. To mitigate this behavior, the principal announces an eligible set which screens out a certain set of solutions. The principal, however, does not have any information on the distribution of solutions nor the number of solutions in advance. Therefore, the principal dynamically announces various eligible sets to efficiently learn the distribution. The principal's objective is to minimize cumulative regret compared to the optimal eligible set in hindsight. We explore two dimensions of the problem setup, whether the agent behaves myopically or strategizes across the rounds, and whether the solutions yield deterministic or stochastic utility. We obtain sublinear regret upper bounds in various regimes, and derive corresponding lower bounds which implies the tightness of the results. Overall, we bridge a well-known problem in economics to the evolving area of online learning, and present a comprehensive study in this problem. \ No newline at end of file diff --git a/data/2024/aaai/Regroup Median Loss for Combating Label Noise b/data/2024/aaai/Regroup Median Loss for Combating Label Noise new file mode 100644 index 0000000000..f2869a1d58 --- /dev/null +++ b/data/2024/aaai/Regroup Median Loss for Combating Label Noise @@ -0,0 +1 @@ +The deep model training procedure requires large-scale datasets of annotated data. Due to the difficulty of annotating a large number of samples, label noise caused by incorrect annotations is inevitable, resulting in low model performance and poor model generalization. 
To combat label noise, current methods usually select clean samples based on the small-loss criterion and use these samples for training. Because some noisy samples are similar to clean ones, these small-loss criterion-based methods are still affected by label noise. To address this issue, in this work, we propose Regroup Median Loss (RML) to reduce the probability of selecting noisy samples and to correct the losses of noisy samples. RML randomly selects samples with the same label as the training samples based on a new loss processing method. Then, we combine the stable mean loss and the robust median loss through a proposed regrouping strategy to obtain robust loss estimation for noisy samples. To further improve the model performance against label noise, we propose a new sample selection strategy and build a semi-supervised method based on RML. Compared to state-of-the-art methods, for both the traditionally trained and semi-supervised models, RML achieves a significant improvement on synthetic and complex real-world datasets. The source code is available at https://github.com/Feng-peng-Li/Regroup-Loss-Median-to-Combat-Label-Noise. \ No newline at end of file diff --git a/data/2024/aaai/Regulating AI: Applying Insights from Behavioural Economics and Psychology to the Application of Article 5 of the EU AI Act b/data/2024/aaai/Regulating AI: Applying Insights from Behavioural Economics and Psychology to the Application of Article 5 of the EU AI Act new file mode 100644 index 0000000000..568c47c21b --- /dev/null +++ b/data/2024/aaai/Regulating AI: Applying Insights from Behavioural Economics and Psychology to the Application of Article 5 of the EU AI Act @@ -0,0 +1 @@ +Article 5 of the European Union’s Artificial Intelligence Act is intended to regulate AI use to prevent potentially harmful consequences. Nevertheless, applying this legislation practically is likely to be challenging because of ambiguously used terminologies and because it fails to specify which manipulation techniques may be invoked by AI, potentially leading to significant harm. This paper aims to bridge this gap by defining key terms and demonstrating how AI may invoke these techniques, drawing from insights in psychology and behavioural economics. First, this paper provides definitions of the terms “subliminal techniques”, “manipulative techniques” and “deceptive techniques”. Second, we identify from the literature in cognitive psychology and behavioural economics three subliminal and five manipulative techniques and exemplify how AI might implement these techniques to manipulate users in real-world case scenarios. These illustrations may serve as a practical guide for stakeholders to detect cases of AI manipulation and consequently devise preventive measures. Article 5 has also been criticised for offering inadequate protection. We critically assess the protection offered by Article 5, proposing specific revisions to paragraph 1, points (a) and (b) of Article 5 to increase its protective effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving b/data/2024/aaai/Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving new file mode 100644 index 0000000000..306ee98472 --- /dev/null +++ b/data/2024/aaai/Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving @@ -0,0 +1 @@ +Multi-camera perception tasks have gained significant attention in the field of autonomous driving.
However, existing frameworks based on Lift-Splat-Shoot (LSS) in the multi-camera setting cannot produce suitable dense 3D features due to the projection nature and uncontrollable densification process. To resolve this problem, we propose to regulate intermediate dense 3D features with the help of volume rendering. Specifically, we employ volume rendering to process the dense 3D features to obtain corresponding 2D features (e.g., depth maps, semantic maps), which are supervised by associated labels in the training. In this manner, we regulate the generation of dense 3D features at the feature level, providing appropriate dense and unified features for multiple perception tasks. Therefore, our approach is termed Vampire, which stands for ``Volume rendering As Multi-camera Perception Intermediate feature REgulator''. Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features, and is competitive with existing SOTA methods across diverse downstream perception tasks like 3D occupancy prediction, LiDAR segmentation and 3D object detection, while utilizing moderate GPU resources. We provide a video demonstration in the supplementary materials, and code is available at github.com/cskkxjk/Vampire. \ No newline at end of file diff --git a/data/2024/aaai/Reinforced Adaptive Knowledge Learning for Multimodal Fake News Detection b/data/2024/aaai/Reinforced Adaptive Knowledge Learning for Multimodal Fake News Detection new file mode 100644 index 0000000000..ece771fd6c --- /dev/null +++ b/data/2024/aaai/Reinforced Adaptive Knowledge Learning for Multimodal Fake News Detection @@ -0,0 +1 @@ +Nowadays, detecting multimodal fake news has emerged as a foremost concern since the widespread dissemination of fake news may incur adverse societal impact. Conventional methods generally focus on capturing the linguistic and visual semantics within the multimodal content, which fall short in effectively distinguishing the heightened level of meticulous fabrications. Recently, external knowledge has been introduced to provide valuable background facts as a complement to facilitate news detection. Nevertheless, existing knowledge-enhanced endeavors directly incorporate all knowledge contexts through static entity embeddings, resulting in potentially noisy and content-irrelevant knowledge. Moreover, the integration of knowledge entities makes it intractable to model the sophisticated correlations between multimodal semantics and knowledge entities. In light of these limitations, we propose a novel Adaptive Knowledge-Aware Fake News Detection model, dubbed AKA-Fake. For each news item, AKA-Fake learns a compact knowledge subgraph under a reinforcement learning paradigm, which consists of a subset of entities and contextual neighbors in the knowledge graph, restoring the most informative knowledge facts. A novel heterogeneous graph learning module is further proposed to capture the reliable cross-modality correlations via topology refinement and modality-attentive pooling. Our proposal is extensively evaluated over three popular datasets, and experimental results demonstrate the superiority of AKA-Fake.
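As a concrete reference point for the volume-rendering regulation idea in the "Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving" abstract above, the Python sketch below shows the generic NeRF-style alpha compositing that turns per-sample densities along a camera ray into a rendered depth value; rendered 2D quantities of this kind are what can then be supervised by labels. The function names, shapes, and toy density are assumptions for illustration, not the Vampire implementation.

import numpy as np

def render_depth_along_ray(sigmas, depths):
    # NeRF-style alpha compositing of per-sample densities into an expected
    # depth along one ray; rendered depth (and, analogously, rendered
    # semantics) can be supervised by 2D labels to shape the 3D features.
    deltas = np.diff(depths, append=depths[-1] + 1e10)   # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)              # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas))[:-1])  # transmittance
    weights = trans * alphas                             # compositing weights
    return np.sum(weights * depths)                      # expected ray depth

depths = np.linspace(0.5, 50.0, 64)                          # sample positions along one ray
sigmas = 4.0 * np.exp(-0.5 * ((depths - 12.0) / 0.8) ** 2)   # toy density bump near 12 m
print(round(float(render_depth_along_ray(sigmas, depths)), 2))

Because the compositing weights are differentiable in the densities, a loss on the rendered depth propagates back into the dense 3D feature volume, which is the sense in which rendering "regulates" the intermediate features.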
\ No newline at end of file diff --git a/data/2024/aaai/Reinforcement Learning and Data-Generation for Syntax-Guided Synthesis b/data/2024/aaai/Reinforcement Learning and Data-Generation for Syntax-Guided Synthesis new file mode 100644 index 0000000000..91661f6c6b --- /dev/null +++ b/data/2024/aaai/Reinforcement Learning and Data-Generation for Syntax-Guided Synthesis @@ -0,0 +1 @@ +Program synthesis is the task of automatically generating code based on a specification. In Syntax-Guided Synthesis (SyGuS) this specification is a combination of a syntactic template and a logical formula, and the result is guaranteed to satisfy both. We present a reinforcement-learning guided algorithm for SyGuS which uses Monte-Carlo Tree Search (MCTS) to search the space of candidate solutions. Our algorithm learns policy and value functions which, combined with the upper confidence bound for trees, allow it to balance exploration and exploitation. A common challenge in applying machine learning approaches to syntax-guided synthesis is the scarcity of training data. To address this, we present a method for automatically generating training data for SyGuS based on anti-unification of existing first-order satisfiability problems, which we use to train our MCTS policy. We implement and evaluate this setup and demonstrate that learned policy and value improve the synthesis performance over a baseline by over 26 percentage points in the training and testing sets. Our tool outperforms state-of-the-art tool cvc5 on the training set and performs comparably in terms of the total number of problems solved on the testing set (solving 23% of the benchmarks on which cvc5 fails). We make our data set publicly available, to enable further application of machine learning methods to the SyGuS problem. \ No newline at end of file diff --git a/data/2024/aaai/Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation b/data/2024/aaai/Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation new file mode 100644 index 0000000000..cb91f76d6c --- /dev/null +++ b/data/2024/aaai/Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation @@ -0,0 +1 @@ +Deep learning architectures have achieved state-of-the-art (SOTA) performance on computer vision tasks such as object detection and image segmentation. This may be attributed to the use of over-parameterized, monolithic deep learning architectures executed on large datasets. Although such large architectures lead to increased accuracy, this is usually accompanied by a larger increase in computation and memory requirements during inference. While this is a non-issue in traditional machine learning (ML) pipelines, the recent confluence of machine learning and fields like the Internet of Things (IoT) has rendered such large architectures infeasible for execution in low-resource settings. For some datasets, large monolithic pipelines may be overkill for simpler inputs. To address this problem, previous efforts have proposed decision cascades where inputs are passed through models of increasing complexity until desired performance is achieved. However, we argue that cascaded prediction leads to sub-optimal throughput and increased computational cost due to wasteful intermediate computations. 
To address this, we propose PaSeR (Parsimonious Segmentation with Reinforcement Learning), a non-cascading, cost-aware learning pipeline as an efficient alternative to cascaded decision architectures. Through experimental evaluation on both real-world and standard datasets, we demonstrate that PaSeR achieves better accuracy while minimizing computational cost relative to cascaded models. Further, we introduce a new metric, IoU/GigaFlop, to evaluate the balance between cost and performance. On the real-world task of battery material phase segmentation, PaSeR yields a minimum performance improvement of 174% on the IoU/GigaFlop metric with respect to baselines. We also demonstrate PaSeR's adaptability to complementary models trained on a noisy MNIST dataset, where it achieved a minimum performance improvement on IoU/GigaFlop of 13.4% over SOTA models. Code and data are available at github.com/scailab/paser. \ No newline at end of file diff --git a/data/2024/aaai/Relational Distant Supervision for Image Captioning without Image-Text Pairs b/data/2024/aaai/Relational Distant Supervision for Image Captioning without Image-Text Pairs new file mode 100644 index 0000000000..d8ec28e20e --- /dev/null +++ b/data/2024/aaai/Relational Distant Supervision for Image Captioning without Image-Text Pairs @@ -0,0 +1 @@ +Unsupervised image captioning aims to generate descriptions of images without relying on any image-sentence pairs for training. Most existing works use detected visual objects or concepts as a bridge to connect images and texts. Considering that the relationship between objects carries more information, we use the object relationship as a more accurate connection between images and texts. In this paper, we adapt the idea of distant supervision that extracts knowledge about object relationships from an external corpus and imparts it to images to facilitate inferring visual object relationships, without introducing any extra pre-trained relationship detectors. Based on these learned informative relationships, we construct pseudo image-sentence pairs for captioning model training. Specifically, our method consists of three modules: (1) a relationship learning module that learns to infer relationships from images under distant supervision; (2) a relationship-to-sentence module that transforms the inferred relationships into sentences to generate pseudo image-sentence pairs; (3) an image captioning module that is trained by using the generated image-sentence pairs. Promising results on three datasets show that our method outperforms the state-of-the-art methods of unsupervised image captioning. \ No newline at end of file diff --git a/data/2024/aaai/Relational Programming with Foundational Models b/data/2024/aaai/Relational Programming with Foundational Models new file mode 100644 index 0000000000..ba6b3e72a6 --- /dev/null +++ b/data/2024/aaai/Relational Programming with Foundational Models @@ -0,0 +1 @@ +Foundation models have vast potential to enable diverse AI applications. The powerful yet incomplete nature of these models has spurred a wide range of mechanisms to augment them with capabilities such as in-context learning, information retrieval, and code interpreting. We propose Vieira, a declarative framework that unifies these mechanisms in a general solution for programming with foundation models. Vieira follows a probabilistic relational paradigm and treats foundation models as stateless functions with relational inputs and outputs.
It supports neuro-symbolic applications by enabling the seamless combination of such models with logic programs, as well as complex, multi-modal applications by streamlining the composition of diverse sub-models. We implement Vieira by extending the Scallop compiler with a foreign interface that supports foundation models as plugins. We implement plugins for 12 foundation models including GPT, CLIP, and SAM. We evaluate Vieira on 9 challenging tasks that span language, vision, and structured and vector databases. Our evaluation shows that programs in Vieira are concise, can incorporate modern foundation models, and have comparable or better accuracy than competitive baselines. \ No newline at end of file diff --git a/data/2024/aaai/Relative Policy-Transition Optimization for Fast Policy Transfer b/data/2024/aaai/Relative Policy-Transition Optimization for Fast Policy Transfer new file mode 100644 index 0000000000..b1678da602 --- /dev/null +++ b/data/2024/aaai/Relative Policy-Transition Optimization for Fast Policy Transfer @@ -0,0 +1 @@ +We consider the problem of policy transfer between two Markov Decision Processes (MDPs). We introduce a lemma based on existing theoretical results in reinforcement learning to measure the relativity gap between two arbitrary MDPs, that is the difference between any two cumulative expected returns defined on different policies and environment dynamics. Based on this lemma, we propose two new algorithms referred to as Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modelling, respectively. RPO transfers the policy evaluated in one environment to maximize the return in another, while RTO updates the parameterized dynamics model to reduce the gap between the dynamics of the two environments. Integrating the two algorithms results in the complete Relative Policy-Transition Optimization (RPTO) algorithm, in which the policy interacts with the two environments simultaneously, such that data collections from two environments, policy and transition updates are completed in one closed loop to form a principled learning framework for policy transfer. We demonstrate the effectiveness of RPTO on a set of MuJoCo continuous control tasks by creating policy transfer problems via variant dynamics. \ No newline at end of file diff --git a/data/2024/aaai/Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization b/data/2024/aaai/Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization new file mode 100644 index 0000000000..34a8cc3acf --- /dev/null +++ b/data/2024/aaai/Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization @@ -0,0 +1 @@ +One of the major challenges of offline reinforcement learning (RL) is dealing with distribution shifts that stem from the mismatch between the trained policy and the data collection policy. Stationary distribution correction estimation algorithms (DICE) have addressed this issue by regularizing the policy optimization with f-divergence between the state-action visitation distributions of the data collection policy and the optimized policy. While such regularization naturally integrates to derive an objective to get optimal state-action visitation, such an implicit policy optimization framework has shown limited performance in practice. 
We observe that the reduced performance is attributed to the biased estimate and the properties of conjugate functions of f-divergence regularization. In this paper, we improve the regularized implicit policy optimization framework by relieving the bias and reshaping the conjugate function by relaxing the constraints. We show that the relaxation adjusts the degree of involvement of the sub-optimal samples in optimization, and we derive a new offline RL algorithm that benefits from the relaxed framework, improving from a previous implicit policy optimization algorithm by a large margin. \ No newline at end of file diff --git a/data/2024/aaai/Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation b/data/2024/aaai/Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation new file mode 100644 index 0000000000..38d0bc3f1f --- /dev/null +++ b/data/2024/aaai/Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation @@ -0,0 +1 @@ +For few-shot semantic segmentation, the primary task is to extract class-specific intrinsic information from limited labeled data. However, the semantic ambiguity and inter-class similarity of previous methods limit the accuracy of pixel-level foreground-background classification. To alleviate these issues, we propose the Relevant Intrinsic Feature Enhancement Network (RiFeNet). To improve the semantic consistency of foreground instances, we propose an unlabeled branch as an efficient data utilization method, which teaches the model how to extract intrinsic features robust to intra-class differences. Notably, during testing, the proposed unlabeled branch is excluded without extra unlabeled data and computation. Furthermore, we extend the inter-class variability between foreground and background by proposing a novel multi-level prototype generation and interaction module. The different-grained complementarity between global and local prototypes allows for better distinction between similar categories. The qualitative and quantitative performance of RiFeNet surpasses the state-of-the-art methods on PASCAL-5i and COCO benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Reliable Conflictive Multi-View Learning b/data/2024/aaai/Reliable Conflictive Multi-View Learning new file mode 100644 index 0000000000..b0c9235c6d --- /dev/null +++ b/data/2024/aaai/Reliable Conflictive Multi-View Learning @@ -0,0 +1 @@ +Multi-view learning aims to combine multiple features to achieve more comprehensive descriptions of data. Most previous works assume that multiple views are strictly aligned. However, real-world multi-view data may contain low-quality conflictive instances, which show conflictive information in different views. Previous methods for this problem mainly focus on eliminating the conflictive data instances by removing them or replacing conflictive views. Nevertheless, real-world applications usually require making decisions for conflictive instances rather than only eliminating them. To solve this, we point out a new Reliable Conflictive Multi-view Learning (RCML) problem, which requires the model to provide decision results and attached reliabilities for conflictive multi-view data. We develop an Evidential Conflictive Multi-view Learning (ECML) method for this problem. ECML first learns view-specific evidence, which could be termed as the amount of support to each category collected from data. Then, we can construct view-specific opinions consisting of decision results and reliability. 
In the multi-view fusion stage, we propose a conflictive opinion aggregation strategy and theoretically prove this strategy can exactly model the relation of multi-view common and view-specific reliabilities. Experiments performed on 6 datasets verify the effectiveness of ECML. The code is released at https://github.com/jiajunsi/RCML. \ No newline at end of file diff --git a/data/2024/aaai/Reliable Data Generation and Selection for Low-Resource Relation Extraction b/data/2024/aaai/Reliable Data Generation and Selection for Low-Resource Relation Extraction new file mode 100644 index 0000000000..10b83ada60 --- /dev/null +++ b/data/2024/aaai/Reliable Data Generation and Selection for Low-Resource Relation Extraction @@ -0,0 +1 @@ +Automated construction of annotated data holds significant importance in Relation Extraction (RE) tasks due to the hardness and cost of human annotation. In this work, we propose Self-RDGS, a method for Self-supervised Reliable Data Generation and Selection in low-resource RE tasks. At first, we fully utilize the knowledge of triplets as prompts to generate sentences by employing the Large Language Models (LLMs). Since the auto-generated data contains noise, we then propose a ranking-based data selection method to select reliable sentences. Finally, we integrate the data selection and RE model training within a self-supervised iterative framework. Through experimentation on three datasets with low-resource settings, we demonstrate the effectiveness of our proposed approach in constructing annotated data and achieving noteworthy improvements in comparison to multiple baselines. Code, data and models are available at https://github.com/jjyunlp/GenerationRE. \ No newline at end of file diff --git a/data/2024/aaai/Relightable and Animatable Neural Avatars from Videos b/data/2024/aaai/Relightable and Animatable Neural Avatars from Videos new file mode 100644 index 0000000000..b7686e6c11 --- /dev/null +++ b/data/2024/aaai/Relightable and Animatable Neural Avatars from Videos @@ -0,0 +1 @@ +Lightweight creation of 3D digital avatars is a highly desirable but challenging task. With only sparse videos of a person under unknown illumination, we propose a method to create relightable and animatable neural avatars, which can be used to synthesize photorealistic images of humans under novel viewpoints, body poses, and lighting. The key challenge here is to disentangle the geometry, material of the clothed body, and lighting, which becomes more difficult due to the complex geometry and shadow changes caused by body motions. To solve this ill-posed problem, we propose novel techniques to better model the geometry and shadow changes. For geometry change modeling, we propose an invertible deformation field, which helps to solve the inverse skinning problem and leads to better geometry quality. To model the spatial and temporal varying shading cues, we propose a pose-aware part-wise light visibility network to estimate light occlusion. Extensive experiments on synthetic and real datasets show that our approach reconstructs high-quality geometry and generates realistic shadows under different body poses. Code and data are available at https://wenbin-lin.github.io/RelightableAvatar-page. 
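For readers unfamiliar with evidential opinions, the Python sketch below illustrates the kind of objects the "Reliable Conflictive Multi-View Learning" abstract above works with: per-view evidence turned into class beliefs plus an uncertainty mass, and two views fused into one opinion. The fusion shown is the standard cumulative fusion rule from subjective logic, used here only as a generic baseline; it is not the conflictive opinion aggregation strategy proposed by ECML, and all names and numbers are assumptions.

import numpy as np

def opinion(evidence):
    # Turn non-negative per-class evidence into a subjective-logic opinion:
    # a belief mass per class plus one overall uncertainty mass.
    k = evidence.shape[-1]
    strength = evidence.sum() + k            # Dirichlet strength (alpha = evidence + 1)
    return evidence / strength, k / strength

def cumulative_fusion(b1, u1, b2, u2):
    # Standard cumulative belief fusion of two opinions (subjective logic).
    # A generic baseline only, not ECML's conflictive aggregation strategy.
    denom = u1 + u2 - u1 * u2
    return (b1 * u2 + b2 * u1) / denom, (u1 * u2) / denom

# Two views of the same instance that disagree on a 3-class problem.
b_a, u_a = opinion(np.array([9.0, 1.0, 0.0]))    # view A: confident in class 0
b_b, u_b = opinion(np.array([1.0, 6.0, 1.0]))    # view B: leans towards class 1
b, u = cumulative_fusion(b_a, u_a, b_b, u_b)
print(np.round(b, 3), round(float(u), 3))

The decision result corresponds to the largest belief mass and the attached reliability to the (complement of the) uncertainty mass, which is the kind of output the RCML problem statement asks for.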
\ No newline at end of file diff --git a/data/2024/aaai/Removing Interference and Recovering Content Imaginatively for Visible Watermark Removal b/data/2024/aaai/Removing Interference and Recovering Content Imaginatively for Visible Watermark Removal new file mode 100644 index 0000000000..f8384cd4fe --- /dev/null +++ b/data/2024/aaai/Removing Interference and Recovering Content Imaginatively for Visible Watermark Removal @@ -0,0 +1 @@ +Visible watermarks, while instrumental in protecting image copyrights, frequently distort the underlying content, complicating tasks like scene interpretation and image editing. Visible watermark removal aims to eliminate the interference of watermarks and restore the background content. However, existing methods often implement watermark component removal and background restoration tasks within a singular branch, leading to residual watermarks in the predictions and ignoring cases where watermarks heavily obscure the background. To address these limitations, this study introduces the Removing Interference and Recovering Content Imaginatively (RIRCI) framework. RIRCI embodies a two-stage approach: the initial phase centers on discerning and segregating the watermark component, while the subsequent phase focuses on background content restoration. To achieve meticulous background restoration, our proposed model employs a dual-path network capable of fully exploring the intrinsic background information beneath semi-transparent watermarks and peripheral contextual information from unaffected regions. Moreover, a Global and Local Context Interaction module is built upon multi-layer perceptrons and bidirectional feature transformation for comprehensive representation modeling in the background restoration phase. The efficacy of our approach is empirically validated across two large-scale datasets, and our findings reveal a marked enhancement over existing watermark removal techniques. \ No newline at end of file diff --git a/data/2024/aaai/Representation-Based Robustness in Goal-Conditioned Reinforcement Learning b/data/2024/aaai/Representation-Based Robustness in Goal-Conditioned Reinforcement Learning new file mode 100644 index 0000000000..398bc0b238 --- /dev/null +++ b/data/2024/aaai/Representation-Based Robustness in Goal-Conditioned Reinforcement Learning @@ -0,0 +1 @@ +While Goal-Conditioned Reinforcement Learning (GCRL) has gained attention, its algorithmic robustness against adversarial perturbations remains unexplored. The attacks and robust representation training methods that are designed for traditional RL become less effective when applied to GCRL. To address this challenge, we first propose the Semi-Contrastive Representation attack, a novel approach inspired by the adversarial contrastive attack. Unlike existing attacks in RL, it only necessitates information from the policy function and can be seamlessly implemented during deployment. Then, to mitigate the vulnerability of existing GCRL algorithms, we introduce Adversarial Representation Tactics, which combines Semi-Contrastive Adversarial Augmentation with Sensitivity-Aware Regularizer to improve the adversarial robustness of the underlying RL agent against various types of perturbations. Extensive experiments validate the superior performance of our attack and defence methods across multiple state-of-the-art GCRL algorithms. Our code is available at https://github.com/TrustAI/ReRoGCRL. \ No newline at end of file diff --git a/data/2024/aaai/Reproduce, Replicate, Reevaluate. 
The Long but Safe Way to Extend Machine Learning Methods b/data/2024/aaai/Reproduce, Replicate, Reevaluate. The Long but Safe Way to Extend Machine Learning Methods new file mode 100644 index 0000000000..3e0a56c7d2 --- /dev/null +++ b/data/2024/aaai/Reproduce, Replicate, Reevaluate. The Long but Safe Way to Extend Machine Learning Methods @@ -0,0 +1 @@ +Reproducibility is a desirable property of scientific research. On the one hand, it increases confidence in results. On the other hand, reproducible results can be extended on a solid basis. In rapidly developing fields such as machine learning, the latter is particularly important to ensure the reliability of research. In this paper, we present a systematic approach to reproducing (using the available implementation), replicating (using an alternative implementation) and reevaluating (using different datasets) state-of-the-art experiments. This approach enables the early detection and correction of deficiencies and thus the development of more robust and transparent machine learning methods. We detail the independent reproduction, replication, and reevaluation of the initially published experiments with a method that we want to extend. For each step, we identify issues and draw lessons learned. We further discuss solutions that have proven effective in overcoming the encountered problems. This work can serve as a guide for further reproducibility studies and generally improve reproducibility in machine learning. \ No newline at end of file diff --git a/data/2024/aaai/ResDiff: Combining CNN and Diffusion Model for Image Super-resolution b/data/2024/aaai/ResDiff: Combining CNN and Diffusion Model for Image Super-resolution new file mode 100644 index 0000000000..4084f6a8f4 --- /dev/null +++ b/data/2024/aaai/ResDiff: Combining CNN and Diffusion Model for Image Super-resolution @@ -0,0 +1 @@ +Adapting the Diffusion Probabilistic Model (DPM) for direct image super-resolution is wasteful, given that a simple Convolutional Neural Network (CNN) can recover the main low-frequency content. Therefore, we present ResDiff, a novel Diffusion Probabilistic Model based on Residual structure for Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN, which restores primary low-frequency components, and a DPM, which predicts the residual between the ground-truth image and the CNN-predicted image. In contrast to the common diffusion-based methods that directly use LR space to guide the noise towards HR space, ResDiff utilizes the CNN’s initial prediction to direct the noise towards the residual space between HR space and CNN-predicted space, which not only accelerates the generation process but also acquires superior sample quality. Additionally, a frequency-domain-based loss function for the CNN is introduced to facilitate its restoration, and a frequency-domain-guided diffusion process is designed to help the DPM predict high-frequency details. Extensive experiments on multiple benchmark datasets demonstrate that ResDiff outperforms previous diffusion-based methods in terms of shorter model convergence time, superior generation quality, and more diverse samples.
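As a rough illustration of the residual idea in the ResDiff abstract above, the Python sketch below performs one simplified training step: a coarse predictor stands in for the low-frequency CNN, and the diffusion loss is computed on the residual between the ground truth and that coarse prediction. The upsampler, the schedule, and the dummy noise predictor are placeholders assumed for exposition; the paper's actual architecture, conditioning, and frequency-domain losses are not reproduced here.

import numpy as np

rng = np.random.default_rng(1)

def coarse_predictor(lr_img):
    # Placeholder for ResDiff's low-frequency CNN; here just 2x nearest-
    # neighbour upsampling of the low-resolution input.
    return lr_img.repeat(2, axis=0).repeat(2, axis=1)

def eps_predictor(x_t, t, cond):
    # Placeholder for the conditional denoising network (returns a dummy estimate).
    return np.zeros_like(x_t)

def residual_diffusion_loss(hr_img, lr_img, t, T=1000):
    # One simplified training step: the diffusion model operates on the
    # residual (ground truth minus coarse prediction), not the full image.
    coarse = coarse_predictor(lr_img)
    residual = hr_img - coarse                               # target lives in residual space
    a_bar = np.prod(1.0 - np.linspace(1e-4, 0.02, T)[:t])    # linear beta schedule
    eps = rng.standard_normal(residual.shape)
    x_t = np.sqrt(a_bar) * residual + np.sqrt(1.0 - a_bar) * eps
    eps_hat = eps_predictor(x_t, t, cond=coarse)
    return float(np.mean((eps - eps_hat) ** 2))              # noise-prediction loss

hr = rng.standard_normal((16, 16))
lr = hr[::2, ::2]                                            # toy low-resolution input
print(round(residual_diffusion_loss(hr, lr, t=250), 3))

At sampling time the denoised residual would simply be added back onto the coarse prediction, which is why the residual formulation can converge faster than diffusing the full image.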
\ No newline at end of file diff --git a/data/2024/aaai/ResMatch: Residual Attention Learning for Feature Matching b/data/2024/aaai/ResMatch: Residual Attention Learning for Feature Matching new file mode 100644 index 0000000000..3935f6219e --- /dev/null +++ b/data/2024/aaai/ResMatch: Residual Attention Learning for Feature Matching @@ -0,0 +1 @@ +Attention-based graph neural networks have made great progress in feature matching. However, the literature lacks a comprehensive understanding of how the attention mechanism operates for feature matching. In this paper, we rethink cross- and self-attention from the viewpoint of traditional feature matching and filtering. To facilitate the learning of matching and filtering, we incorporate the similarity of descriptors into cross-attention and relative positions into self-attention. In this way, the attention can concentrate on learning residual matching and filtering functions with reference to the basic functions of measuring visual and spatial correlation. Moreover, we leverage descriptor similarity and relative positions to extract inter- and intra-neighbors. Sparse attention for each point can then be performed only within its neighborhoods to achieve higher computational efficiency. Extensive experiments, including feature matching, pose estimation and visual localization, confirm the superiority of the proposed method. Our code is available at https://github.com/ACuOoOoO/ResMatch. \ No newline at end of file diff --git a/data/2024/aaai/Research of Event Reconstruct Based on Multi-View Contrastive Learning (Student Abstract) b/data/2024/aaai/Research of Event Reconstruct Based on Multi-View Contrastive Learning (Student Abstract) new file mode 100644 index 0000000000..05927e81aa --- /dev/null +++ b/data/2024/aaai/Research of Event Reconstruct Based on Multi-View Contrastive Learning (Student Abstract) @@ -0,0 +1 @@ +The proliferation of social media exacerbates information fragmentation, posing challenges to understanding public events. We address the problem of event reconstruction with a novel Multi-view Contrast Event Reconstruction (MCER) model. MCER maximizes feature dissimilarity between different views of the same event using contrastive learning, while minimizing mutual information between distinct events. This aggregates fragmented views to reconstruct comprehensive event representations. MCER employs momentum and weight-sharing encoders in a three-tower architecture with supervised contrastive loss for multi-view representation learning. Due to the scarcity of multi-view public datasets, we construct a new Mul-view-data benchmark. Experiments demonstrate MCER’s superior performance on public data and our Mul-view-data, significantly outperforming self-supervised methods by incorporating supervised contrastive techniques. MCER advances multi-view representation learning to counter information fragmentation and enable robust event understanding. \ No newline at end of file diff --git a/data/2024/aaai/Residual Hyperbolic Graph Convolution Networks b/data/2024/aaai/Residual Hyperbolic Graph Convolution Networks new file mode 100644 index 0000000000..e6b6c0b4f7 --- /dev/null +++ b/data/2024/aaai/Residual Hyperbolic Graph Convolution Networks @@ -0,0 +1,2 @@ +Hyperbolic graph convolutional networks (HGCNs) have demonstrated representational capabilities of modeling hierarchical-structured graphs.
However, as in general GCNs, over-smoothing may occur as the number of model layers increases, limiting the representation capabilities of most current HGCN models. In this paper, we propose residual hyperbolic graph convolutional networks (R-HGCNs) to address the over-smoothing problem. We introduce a hyperbolic residual connection function to overcome the over-smoothing problem, and also theoretically prove the effectiveness of the hyperbolic residual function. Moreover, we use product manifolds and HyperDrop to facilitate the R-HGCNs. The distinctive features of the R-HGCNs are as follows: (1) The hyperbolic residual connection preserves the initial node information in each layer and adds a hyperbolic identity mapping to prevent node features from being indistinguishable. (2) Product manifolds in R-HGCNs have been set up with different origin points in different components to facilitate the extraction of feature information from a wider range of perspectives, which enhances the representing capability of R-HGCNs. (3) HyperDrop adds multiplicative Gaussian noise into hyperbolic representations, such that perturbations can be added to alleviate the over-fitting problem without deconstructing the hyperbolic geometry. +Experiment results demonstrate the effectiveness of R-HGCNs under various graph convolution layers and different structures of product manifolds. \ No newline at end of file diff --git a/data/2024/aaai/Resisting Backdoor Attacks in Federated Learning via Bidirectional Elections and Individual Perspective b/data/2024/aaai/Resisting Backdoor Attacks in Federated Learning via Bidirectional Elections and Individual Perspective new file mode 100644 index 0000000000..b8dad70230 --- /dev/null +++ b/data/2024/aaai/Resisting Backdoor Attacks in Federated Learning via Bidirectional Elections and Individual Perspective @@ -0,0 +1 @@ +Existing approaches defend against backdoor attacks in federated learning (FL) mainly through a) mitigating the impact of infected models, or b) excluding infected models. The former negatively impacts model accuracy, while the latter usually relies on globally clear boundaries between benign and infected model updates. However, in reality, model updates can easily become mixed and scattered throughout due to the diverse distributions of local data. This work focuses on excluding infected models in FL. Unlike previous perspectives from a global view, we propose Snowball, a novel anti-backdoor FL framework through bidirectional elections from an individual perspective inspired by one principle deduced by us and two principles in FL and deep learning. It is characterized by a) bottom-up election, where each candidate model update votes to several peer ones such that a few model updates are elected as selectees for aggregation; and b) top-down election, where selectees progressively enlarge themselves through picking up from the candidates. We compare Snowball with state-of-the-art defenses to backdoor attacks in FL on five real-world datasets, demonstrating its superior resistance to backdoor attacks and slight impact on the accuracy of the global model. \ No newline at end of file diff --git a/data/2024/aaai/Resource Democratization: Is Compute the Binding Constraint on AI Research? b/data/2024/aaai/Resource Democratization: Is Compute the Binding Constraint on AI Research? new file mode 100644 index 0000000000..178de21fa6 --- /dev/null +++ b/data/2024/aaai/Resource Democratization: Is Compute the Binding Constraint on AI Research? 
@@ -0,0 +1 @@ +Access to compute is widely viewed as a primary barrier to AI research progress. Compute resource stratification between academic and industry researchers is therefore a source of concern. Yet the experiences of researchers who might encounter resource constraints in their work have received no direct study. We addressed this gap by conducting a large survey of AI researchers that posed questions about project inputs, outcomes, and challenges. Contrary to popular narratives, responses from more than 500 participants revealed more concern about talent and data limitations than compute access. There were few differences between academic and industry researchers in this regard. The exception was researchers who already use large amounts of compute and expressed a need for more. These findings suggest that interventions to subsidize compute without addressing the limitations on talent and data availability reported by our respondents might cause or exacerbate commonly cited resource inequalities, with unknown impact on the future of equitable research. \ No newline at end of file diff --git a/data/2024/aaai/Resource Efficient Deep Learning Hardware Watermarks with Signature Alignment b/data/2024/aaai/Resource Efficient Deep Learning Hardware Watermarks with Signature Alignment new file mode 100644 index 0000000000..f15294aaa5 --- /dev/null +++ b/data/2024/aaai/Resource Efficient Deep Learning Hardware Watermarks with Signature Alignment @@ -0,0 +1 @@ +Deep learning intellectual properties (IPs) are high-value assets that are frequently susceptible to theft. This vulnerability has led to significant interest in defending the field's intellectual properties from theft. Recently, watermarking techniques have been extended to protect deep learning hardware from piracy. These techniques embed modifications that change the hardware's behavior when activated. In this work, we propose the first method for embedding watermarks in deep learning hardware that incorporates the owner's key samples into the embedding methodology. This improves our watermarks' reliability and efficiency in identifying the hardware over those generated using randomly selected key samples. Our experimental results demonstrate that by considering the target key samples when generating the hardware modifications, we can significantly increase the embedding success rate while targeting fewer functional blocks, decreasing the hardware overhead needed to defend it. \ No newline at end of file diff --git a/data/2024/aaai/Responding to the Call: Exploring Automatic Music Composition Using a Knowledge-Enhanced Model b/data/2024/aaai/Responding to the Call: Exploring Automatic Music Composition Using a Knowledge-Enhanced Model new file mode 100644 index 0000000000..d244eb3db5 --- /dev/null +++ b/data/2024/aaai/Responding to the Call: Exploring Automatic Music Composition Using a Knowledge-Enhanced Model @@ -0,0 +1 @@ +Call-and-response is a musical technique that enriches the creativity of music, crafting coherent musical ideas that mirror the back-and-forth nature of human dialogue with distinct musical characteristics. Although this technique is integral to numerous musical compositions, it remains largely uncharted in automatic music composition. To enhance the creativity of machine-composed music, we first introduce the Call-Response Dataset (CRD) containing 19,155 annotated musical pairs and craft comprehensive objective evaluation metrics for musical assessment.
Then, we design a knowledge-enhanced learning-based method to bridge the gap between human and machine creativity. Specifically, we train the composition module using the call-response pairs, supplementing it with musical knowledge in terms of rhythm, melody, and harmony. Our experimental results underscore that our proposed model adeptly produces a wide variety of creative responses for various musical calls. \ No newline at end of file diff --git a/data/2024/aaai/Response Enhanced Semi-supervised Dialogue Query Generation b/data/2024/aaai/Response Enhanced Semi-supervised Dialogue Query Generation new file mode 100644 index 0000000000..74eb4a7517 --- /dev/null +++ b/data/2024/aaai/Response Enhanced Semi-supervised Dialogue Query Generation @@ -0,0 +1,4 @@ +Leveraging vast and continually updated knowledge from the Internet has been considered an important ability for a dialogue system. Therefore, the dialogue query generation task is proposed for generating search queries from dialogue histories, which will be submitted to a search engine for retrieving relevant websites on the Internet. In this regard, previous efforts were devoted to collecting conversations with annotated queries and training a query producer (QP) via standard supervised learning. However, these studies still face the challenges of data scarcity and domain adaptation. +To address these issues, in this paper, we propose a semi-supervised learning framework -- SemiDQG, to improve model performance with unlabeled conversations. Based on the observation that the search query is typically related to the topic of dialogue response, we train a response-augmented query producer (RA) to provide rich and effective training signals for QP. +We first apply a similarity-based query selection strategy to select high-quality RA-generated pseudo queries, which are used to construct pseudo instances for training QP and RA. +Then, we adopt the REINFORCE algorithm to further enhance QP, with RA-provided rewards as fine-grained training signals. Experimental results and in-depth analysis of three benchmarks show the effectiveness of our framework in cross-domain and low-resource scenarios. Particularly, SemiDQG significantly surpasses ChatGPT and competitive baselines. Our code is available at \url{https://github.com/DeepLearnXMU/SemiDQG}. \ No newline at end of file diff --git a/data/2024/aaai/Responsibility in Extensive Form Games b/data/2024/aaai/Responsibility in Extensive Form Games new file mode 100644 index 0000000000..84f42287db --- /dev/null +++ b/data/2024/aaai/Responsibility in Extensive Form Games @@ -0,0 +1,3 @@ +Two different forms of responsibility, counterfactual and seeing-to-it, have been extensively discussed in philosophy and AI in the context of a single agent or multiple agents acting simultaneously. Although the generalisation of counterfactual responsibility to a setting where multiple agents act in some order is relatively straightforward, the same cannot be said about seeing-to-it responsibility. Two versions of seeing-to-it modality applicable to such settings have been proposed in the literature. Neither of them perfectly captures the intuition of responsibility. The paper proposes a definition of seeing-to-it responsibility for such settings that amalgamate the two modalities. + +The paper shows that the newly proposed notion of responsibility and counterfactual responsibility are not definable through each other and studies the responsibility gap for these two forms of responsibility. 
It shows that although these two forms of responsibility are not enough to ascribe responsibility in each possible situation, this gap does not exist if higher-order responsibility is taken into account. \ No newline at end of file diff --git a/data/2024/aaai/Responsible Bandit Learning via Privacy-Protected Mean-Volatility Utility b/data/2024/aaai/Responsible Bandit Learning via Privacy-Protected Mean-Volatility Utility new file mode 100644 index 0000000000..9bbf3b847a --- /dev/null +++ b/data/2024/aaai/Responsible Bandit Learning via Privacy-Protected Mean-Volatility Utility @@ -0,0 +1 @@ +To ensure the safety of users by protecting their privacy, the traditional privacy-preserving bandit algorithm aiming to maximize the mean reward has been widely studied in scenarios such as online ride-hailing, advertising recommendations, and personalized healthcare. However, classical bandit learning is irresponsible in such practical applications as it fails to account for risks in online decision-making and ignores external system information. This paper first proposes the privacy-protected mean-volatility utility as the objective of bandit learning and proves its responsibility, because it aims at achieving the maximum probability of utility by considering the risk. Theoretically, our proposed responsible bandit learning is expected to achieve the fastest convergence rate among current bandit algorithms and generates more statistical power than the classical normality-based test. Finally, simulation studies provide supporting evidence for the theoretical results and demonstrate stronger performance when using stricter privacy budgets. \ No newline at end of file diff --git a/data/2024/aaai/Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition b/data/2024/aaai/Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition new file mode 100644 index 0000000000..974b0111c5 --- /dev/null +++ b/data/2024/aaai/Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition @@ -0,0 +1 @@ +Prior studies on audio-visual speech recognition typically assume the visibility of speaking lips, ignoring the fact that visual occlusion occurs in real-world videos, thus adversely affecting recognition performance. To address this issue, we propose a framework that restores occluded lips in a video by utilizing both the video itself and the corresponding noisy audio. Specifically, the framework aims to achieve these three tasks: detecting occluded frames, masking occluded areas, and reconstructing masked regions. We tackle the first two issues by utilizing the Class Activation Map (CAM) obtained from occluded frame detection to facilitate the masking of occluded areas. Additionally, we introduce a novel synthesis-matching strategy for the reconstruction to ensure the compatibility of audio features with different levels of occlusion. Our framework is evaluated in terms of Word Error Rate (WER) on the original videos, the videos corrupted by concealed lips, and the videos restored using the framework with several existing state-of-the-art audio-visual speech recognition methods. Experimental results substantiate that our framework significantly mitigates performance degradation resulting from lip occlusion. Under -5dB noise conditions, AV-Hubert's WER increases from 10.62% to 13.87% due to lip occlusion, but recovers to 11.87% in conjunction with the proposed framework.
Furthermore, the framework also demonstrates its capacity to produce natural synthesized images in qualitative assessments. \ No newline at end of file diff --git a/data/2024/aaai/RetLLM-E: Retrieval-Prompt Strategy for Question-Answering on Student Discussion Forums b/data/2024/aaai/RetLLM-E: Retrieval-Prompt Strategy for Question-Answering on Student Discussion Forums new file mode 100644 index 0000000000..659b8a9c06 --- /dev/null +++ b/data/2024/aaai/RetLLM-E: Retrieval-Prompt Strategy for Question-Answering on Student Discussion Forums @@ -0,0 +1,5 @@ +This paper focuses on using Large Language Models to support teaching assistants in answering questions on large student forums such as Piazza and EdSTEM. Since student questions on these forums are often closely tied to specific aspects of the institution, instructor, and course delivery, general-purpose LLMs do not directly do well on this task. +We introduce RetLLM-E, a method that combines text-retrieval and prompting approaches to enable LLMs to provide precise and high-quality answers to student questions. When presented with a student question, our system initiates a two-step process. First, it retrieves relevant context from (i) a dataset of student questions addressed by course instructors +(Q&A Retrieval) and (ii) relevant segments of course materials (Document Retrieval). RetLLM-E then prompts LLM using the retrieved text and an engineered prompt structure to +yield an answer optimized for the student question. +We present a set of quantitative and human evaluation experiments, comparing our method to ground truth answers to questions in a test set of actual student questions. Our results demonstrate that our approach provides higher-quality responses to course-related questions than an LLM operating without context or relying solely on retrieval-based context. RetLLM-E can easily be adopted in different courses, providing instructors and students with context-aware automatic responses. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers (Student Abstract) b/data/2024/aaai/Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers (Student Abstract) new file mode 100644 index 0000000000..de8d7431a9 --- /dev/null +++ b/data/2024/aaai/Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers (Student Abstract) @@ -0,0 +1,2 @@ +This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these ”attentionless Transformers” to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. 
This not only sheds light on the adaptability of shallow feed-forward +networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Causal Relationships Learning in Graph Neural Networks b/data/2024/aaai/Rethinking Causal Relationships Learning in Graph Neural Networks new file mode 100644 index 0000000000..2886d51b01 --- /dev/null +++ b/data/2024/aaai/Rethinking Causal Relationships Learning in Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) demonstrate their significance by effectively modeling complex interrelationships within graph-structured data. To enhance the credibility and robustness of GNNs, it becomes exceptionally crucial to bolster their ability to capture causal relationships. However, despite recent advancements that have indeed strengthened GNNs from a causal learning perspective, conducting an in-depth analysis specifically targeting the causal modeling prowess of GNNs remains an unresolved issue. In order to comprehensively analyze various GNN models from a causal learning perspective, we constructed an artificially synthesized dataset with known and controllable causal relationships between data and labels. The rationality of the generated data is further ensured through theoretical foundations. Drawing insights from analyses conducted using our dataset, we introduce a lightweight and highly adaptable GNN module designed to strengthen GNNs' causal learning capabilities across a diverse range of tasks. Through a series of experiments conducted on both synthetic datasets and other real-world datasets, we empirically validate the effectiveness of the proposed module. The codes are available at https://github.com/yaoyao-yaoyao-cell/CRCG. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Dimensional Rationale in Graph Contrastive Learning from Causal Perspective b/data/2024/aaai/Rethinking Dimensional Rationale in Graph Contrastive Learning from Causal Perspective new file mode 100644 index 0000000000..5ea191f5fb --- /dev/null +++ b/data/2024/aaai/Rethinking Dimensional Rationale in Graph Contrastive Learning from Causal Perspective @@ -0,0 +1 @@ +Graph contrastive learning is a general learning paradigm excelling at capturing invariant information from diverse perturbations in graphs. Recent works focus on exploring the structural rationale from graphs, thereby increasing the discriminability of the invariant information. However, such methods may incur in the mis-learning of graph models towards the interpretability of graphs, and thus the learned noisy and task-agnostic information interferes with the prediction of graphs. To this end, with the purpose of exploring the intrinsic rationale of graphs, we accordingly propose to capture the dimensional rationale from graphs, which has not received sufficient attention in the literature. The conducted exploratory experiments attest to the feasibility of the aforementioned roadmap. To elucidate the innate mechanism behind the performance improvement arising from the dimensional rationale, we rethink the dimensional rationale in graph contrastive learning from a causal perspective and further formalize the causality among the variables in the pre-training stage to build the corresponding structural causal model. 
On the basis of the understanding of the structural causal model, we propose the dimensional rationale-aware graph contrastive learning approach, which introduces a learnable dimensional rationale acquiring network and a redundancy reduction constraint. The learnable dimensional rationale acquiring network is updated by leveraging a bi-level meta-learning technique, and the redundancy reduction constraint disentangles the redundant features through a decorrelation process during learning. Empirically, compared with state-of-the-art methods, our method can yield significant performance boosts on various benchmarks with respect to discriminability and transferability. The code implementation of our method is available at https://github.com/ByronJi/DRGCL. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Graph Masked Autoencoders through Alignment and Uniformity b/data/2024/aaai/Rethinking Graph Masked Autoencoders through Alignment and Uniformity new file mode 100644 index 0000000000..36b9e85163 --- /dev/null +++ b/data/2024/aaai/Rethinking Graph Masked Autoencoders through Alignment and Uniformity @@ -0,0 +1 @@ +Self-supervised learning on graphs can be bifurcated into contrastive and generative methods. Contrastive methods, also known as graph contrastive learning (GCL), have dominated graph self-supervised learning in the past few years, but the recent advent of graph masked autoencoder (GraphMAE) rekindles the momentum behind generative methods. Despite the empirical success of GraphMAE, there is still a dearth of theoretical understanding regarding its efficacy. Moreover, while both generative and contrastive methods have been shown to be effective, their connections and differences have yet to be thoroughly investigated. Therefore, we theoretically build a bridge between GraphMAE and GCL, and prove that the node-level reconstruction objective in GraphMAE implicitly performs context-level GCL. Based on our theoretical analysis, we further identify the limitations of the GraphMAE from the perspectives of alignment and uniformity, which have been considered as two key properties of high-quality representations in GCL. We point out that GraphMAE's alignment performance is restricted by the masking strategy, and the uniformity is not strictly guaranteed. To remedy the aforementioned limitations, we propose an Alignment-Uniformity enhanced Graph Masked AutoEncoder, named AUG-MAE. Specifically, we propose an easy-to-hard adversarial masking strategy to provide hard-to-align samples, which improves the alignment performance. Meanwhile, we introduce an explicit uniformity regularizer to ensure the uniformity of the learned representations. Experimental results on benchmark datasets demonstrate the superiority of our model over existing state-of-the-art methods. The code is available at: https://github.com/AzureLeon1/AUG-MAE. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Mesh Watermark: Towards Highly Robust and Adaptable Deep 3D Mesh Watermarking b/data/2024/aaai/Rethinking Mesh Watermark: Towards Highly Robust and Adaptable Deep 3D Mesh Watermarking new file mode 100644 index 0000000000..bb7dc08a74 --- /dev/null +++ b/data/2024/aaai/Rethinking Mesh Watermark: Towards Highly Robust and Adaptable Deep 3D Mesh Watermarking @@ -0,0 +1 @@ +The goal of 3D mesh watermarking is to embed the message in 3D meshes that can withstand various attacks imperceptibly and reconstruct the message accurately from watermarked meshes. 
The watermarking algorithm is supposed to withstand multiple attacks, and the complexity should not grow significantly with the mesh size. Unfortunately, previous methods are less robust against attacks and lack adaptability. In this paper, we propose a robust and adaptable deep 3D mesh watermarking method, Deep3DMark, that leverages attention-based convolutions in watermarking tasks to embed binary messages in vertex distributions without texture assistance. Furthermore, our Deep3DMark exploits the property that simplified meshes inherit similar relations from the original ones, where the relation is the offset vector directed from one vertex to its neighbor. By doing so, our method can be trained on simplified meshes but remains effective on large-size meshes (size adaptable) and unseen categories of meshes (geometry adaptable). Extensive experiments demonstrate our method remains efficient and effective even when the mesh size is increased by 190×. Under mesh attacks, Deep3DMark achieves 10%∼50% higher accuracy than traditional methods, and 2× higher SNR and 8% higher accuracy than previous DNN-based methods. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Multi-Scale Representations in Deep Deraining Transformer b/data/2024/aaai/Rethinking Multi-Scale Representations in Deep Deraining Transformer new file mode 100644 index 0000000000..8742e1fd81 --- /dev/null +++ b/data/2024/aaai/Rethinking Multi-Scale Representations in Deep Deraining Transformer @@ -0,0 +1 @@ +Existing Transformer-based image deraining methods depend mostly on a fixed single-input single-output U-Net architecture. In fact, this not only neglects the potentially explicit information from multiple image scales, but also lacks the capability of exploring the complementary implicit information across different scales. In this work, we rethink the multi-scale representations and design an effective multi-input multi-output framework that constructs intra- and inter-scale hierarchical modulation to better facilitate rain removal and help image restoration. We observe that rain levels reduce dramatically in coarser image scales, thus proposing to restore rain-free results from the coarsest scale to the finest scale in image pyramid inputs, which also alleviates the difficulty of model learning. Specifically, we integrate a sparsity-compensated Transformer block and a frequency-enhanced convolutional block into a coupled representation module, in order to jointly learn the intra-scale content-aware features. To enable representations learned at different scales to communicate with each other, we leverage a gated fusion module to adaptively aggregate the inter-scale spatial-aware features, which are rich in correlated information of rain appearances, leading to high-quality results. Extensive experiments demonstrate that our model achieves consistent gains on five benchmarks.
These breakthroughs have paved the way for constructing datasets via generative artificial intelligence (AI), offering immense potential for various applications. However, two critical challenges hinder the widespread adoption of synthesized data: computational cost and the generation of peculiar images. While computational costs have improved through various approaches, the issue of peculiar image generation remains relatively unexplored. Existing solutions rely on heuristics, extra training, or AI-based post-processing to mitigate this problem. In this paper, we present a novel approach to address both issues simultaneously. We establish that both gradient descent and diffusion sampling are specific cases of the generalized expectation maximization algorithm. We hypothesize and empirically demonstrate that peculiar image generation is akin to the local minima problem in optimization. Inspired by optimization techniques, we apply naive momentum and positive-negative momentum to diffusion sampling. Last, we propose new metrics to evaluate the peculiarity. Experimental results show momentum effectively prevents peculiar image generation without extra computation. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Propagation for Unsupervised Graph Domain Adaptation b/data/2024/aaai/Rethinking Propagation for Unsupervised Graph Domain Adaptation new file mode 100644 index 0000000000..61819a924e --- /dev/null +++ b/data/2024/aaai/Rethinking Propagation for Unsupervised Graph Domain Adaptation @@ -0,0 +1 @@ +Unsupervised Graph Domain Adaptation (UGDA) aims to transfer knowledge from a labelled source graph to an unlabelled target graph in order to address the distribution shifts between graph domains. Previous works have primarily focused on aligning data from the source and target graph in the representation space learned by graph neural networks (GNNs). However, the inherent generalization capability of GNNs has been largely overlooked. Motivated by our empirical analysis, we reevaluate the role of GNNs in graph domain adaptation and uncover the pivotal role of the propagation process in GNNs for adapting to different graph domains. We provide a comprehensive theoretical analysis of UGDA and derive a generalization bound for multi-layer GNNs. By formulating GNN Lipschitz for k-layer GNNs, we show that the target risk bound can be tighter by removing propagation layers in source graph and stacking multiple propagation layers in target graph. Based on the empirical and theoretical analysis mentioned above, we propose a simple yet effective approach called A2GNN for graph domain adaptation. Through extensive experiments on real-world datasets, we demonstrate the effectiveness of our proposed A2GNN framework. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Reverse Distillation for Multi-Modal Anomaly Detection b/data/2024/aaai/Rethinking Reverse Distillation for Multi-Modal Anomaly Detection new file mode 100644 index 0000000000..321800f2a2 --- /dev/null +++ b/data/2024/aaai/Rethinking Reverse Distillation for Multi-Modal Anomaly Detection @@ -0,0 +1 @@ +In recent years, there has been significant progress in employing color images for anomaly detection in industrial scenarios, but it is insufficient for identifying anomalies that are invisible in RGB images alone. As a supplement, introducing extra modalities such as depth and surface normal maps can be helpful to detect these anomalies. 
To this end, we present a novel Multi-Modal Reverse Distillation (MMRD) paradigm that consists of a frozen multi-modal teacher encoder to generate distillation targets and a learnable student decoder that aims to restore multi-modal representations from the teacher. Specifically, the teacher extracts complementary visual features from different modalities via a siamese architecture and then fuses this information from multiple levels in a parameter-free manner as the targets of distillation. The student, in turn, learns modality-related priors from the teacher representations of normal training data and performs interaction between them to form multi-modal representations for target reconstruction. Extensive experiments show that our MMRD outperforms recent state-of-the-art methods on both anomaly detection and localization on MVTec-3D AD and Eyecandies benchmarks. Codes will be available upon acceptance. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Robustness of Model Attributions b/data/2024/aaai/Rethinking Robustness of Model Attributions new file mode 100644 index 0000000000..0427478321 --- /dev/null +++ b/data/2024/aaai/Rethinking Robustness of Model Attributions @@ -0,0 +1 @@ +For machine learning models to be reliable and trustworthy, their decisions must be interpretable. As these models find increasing use in safety-critical applications, it is important that not just the model predictions but also their explanations (as feature attributions) be robust to small human-imperceptible input perturbations. Recent works have shown that many attribution methods are fragile and have proposed improvements in either these methods or the model training. We observe two main causes for fragile attributions: first, the existing metrics of robustness (e.g., top-k intersection) overpenalize even reasonable local shifts in attribution, thereby making random perturbations appear to be a strong attack, and second, the attribution can be concentrated in a small region even when there are multiple important parts in an image. To rectify this, we propose simple ways to strengthen existing metrics and attribution methods that incorporate locality of pixels in robustness metrics and diversity of pixel locations in attributions. Regarding the role of model training in attributional robustness, we empirically observe that adversarially trained models have more robust attributions on smaller datasets; however, this advantage disappears on larger datasets. Code is made available at https://github.com/ksandeshk/LENS. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point b/data/2024/aaai/Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point new file mode 100644 index 0000000000..7c91fbc004 --- /dev/null +++ b/data/2024/aaai/Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point @@ -0,0 +1 @@ +As a fundamental and challenging task in the vision and language domain, Referring Expression Comprehension (REC) has shown impressive improvements recently. However, for a complex task that couples the comprehension of abstract concepts and the localization of concrete instances, one-stage approaches are bottlenecked by computing and data resources.
To obtain a low-cost solution, the prevailing two-stage approaches decouple REC into localization (region proposal) and comprehension (region-expression matching) at the region level, but the solution based on isolated regions cannot sufficiently utilize the context and is usually limited by the quality of proposals. Therefore, it is necessary to rebuild an efficient two-stage solution system. In this paper, we propose a point-based two-stage framework for REC, in which the two stages are redefined as point-based cross-modal comprehension and point-based instance localization. Specifically, we reconstruct the raw bounding box and segmentation mask into center and mass scores as soft ground-truth for measuring point-level cross-modal correlations. With the soft ground-truth, REC can be approximated as a binary classification problem, which fundamentally avoids the impact of isolated regions on the optimization process. Remarkably, the consistent metrics between center and mass scores allow our system to directly optimize grounding and segmentation by utilizing the same architecture. Experiments on multiple benchmarks show the feasibility and potential of our point-based paradigm. Our code is available at https://github.com/VILAN-Lab/PBREC-MT. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking the Development of Large Language Models from the Causal Perspective: A Legal Text Prediction Case Study b/data/2024/aaai/Rethinking the Development of Large Language Models from the Causal Perspective: A Legal Text Prediction Case Study new file mode 100644 index 0000000000..d46228068d --- /dev/null +++ b/data/2024/aaai/Rethinking the Development of Large Language Models from the Causal Perspective: A Legal Text Prediction Case Study @@ -0,0 +1 @@ +While large language models (LLMs) exhibit impressive performance on a wide range of NLP tasks, most of them fail to learn causality from correlation, which prevents them from learning rationales for prediction. Rethinking the whole development process of LLMs is of great urgency as they are adopted in various critical tasks that need rationales, including legal text prediction (e.g., legal judgment prediction). In this paper, we first explain the underlying theoretical mechanism of their failure and argue that both the data imbalance and the omission of causality in model design and selection render the current training-testing paradigm unable to select the unique causality-based model from correlation-based models. Second, we take the legal text prediction task as the testbed and reconstruct the development process of LLMs by simultaneously infusing causality into model architectures and organizing causality-based adversarial attacks for evaluation. Specifically, we base our reconstruction on our theoretical analysis and propose a causality-aware self-attention mechanism (CASAM), which prevents LLMs from entangling causal and non-causal information by restricting the interaction between causal and non-causal words. Meanwhile, we propose eight kinds of legal-specific attacks to form causality-based model selection. Our extensive experimental results demonstrate that our proposed CASAM achieves state-of-the-art (SOTA) performances and the strongest robustness on three commonly used legal text prediction benchmarks. We make our code publicly available at https://github.com/Carrot-Red/Rethink-LLM-development.
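The general idea of restricting interaction between two token groups inside self-attention, as described for CASAM, can be sketched with a plain additive/boolean attention mask. The snippet below is a generic single-head illustration under our own assumptions (token grouping, shapes, and function name are ours); it is not the actual CASAM architecture from the paper:

```python
import torch

def group_restricted_self_attention(x, causal_idx):
    """Single-head dot-product self-attention where tokens marked 'causal' and
    the remaining 'non-causal' tokens cannot attend to each other.
    x: (seq_len, d) token representations; causal_idx: list of token positions.
    Illustrative sketch only, not the paper's CASAM implementation."""
    seq_len, d = x.shape
    scores = x @ x.T / d ** 0.5                     # (seq_len, seq_len) attention scores

    group = torch.zeros(seq_len, dtype=torch.bool)  # False = non-causal, True = causal
    group[causal_idx] = True
    cross_group = group.unsqueeze(0) != group.unsqueeze(1)  # True where tokens differ in group

    scores = scores.masked_fill(cross_group, float("-inf"))
    attn = torch.softmax(scores, dim=-1)            # each token attends only within its group
    return attn @ x

# Toy usage: 6 tokens of dimension 16, positions 0, 2, 3 treated as 'causal'.
out = group_restricted_self_attention(torch.randn(6, 16), causal_idx=[0, 2, 3])
```

Blocking the cross-group entries before the softmax is what keeps causal and non-causal representations from mixing during aggregation, which is the property the abstract attributes to the restricted interaction.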
\ No newline at end of file diff --git a/data/2024/aaai/Rethinking the Paradigm of Content Constraints in Unpaired Image-to-Image Translation b/data/2024/aaai/Rethinking the Paradigm of Content Constraints in Unpaired Image-to-Image Translation new file mode 100644 index 0000000000..6effe05fc9 --- /dev/null +++ b/data/2024/aaai/Rethinking the Paradigm of Content Constraints in Unpaired Image-to-Image Translation @@ -0,0 +1 @@ +In an unpaired setting, lacking sufficient content constraints for image-to-image translation (I2I) tasks, GAN-based approaches are usually prone to model collapse. Current solutions can be divided into two categories, reconstruction-based and Siamese network-based. The former requires that the transformed or transforming image can be perfectly converted back to the original image, which is sometimes too strict and limits the generative performance. The latter involves feeding the original and generated images into a feature extractor and then matching their outputs. This is not efficient enough, and a universal feature extractor is not easily available. In this paper, we propose EnCo, a simple but efficient way to maintain the content by constraining the representational similarity in the latent space of patch-level features from the same stage of the encoder and decoder of the generator. For the similarity function, we use a simple MSE loss instead of contrastive loss, which is currently widely used in I2I tasks. Benefiting from this design, EnCo training is extremely efficient, while the features from the encoder produce a more positive effect on the decoding, leading to more satisfying generations. In addition, we rethink the role played by discriminators in sampling patches and propose a discriminative attention-guided (DAG) patch sampling strategy to replace random sampling. DAG is parameter-free and only requires negligible computational overhead, while significantly improving the performance of the model. Extensive experiments on multiple datasets demonstrate the effectiveness and advantages of EnCo, and we achieve multiple state-of-the-art results compared to previous methods. \ No newline at end of file diff --git a/data/2024/aaai/RetouchFormer: Semi-supervised High-Quality Face Retouching Transformer with Prior-Based Selective Self-Attention b/data/2024/aaai/RetouchFormer: Semi-supervised High-Quality Face Retouching Transformer with Prior-Based Selective Self-Attention new file mode 100644 index 0000000000..5bc5a9c4e8 --- /dev/null +++ b/data/2024/aaai/RetouchFormer: Semi-supervised High-Quality Face Retouching Transformer with Prior-Based Selective Self-Attention @@ -0,0 +1 @@ +Face retouching aims to beautify a face image while preserving the image content as much as possible. It is a promising yet challenging task to remove face imperfections and fill them with normal skin. Generic image enhancement methods are hampered by the lack of imperfection localization, which often results in incomplete removal of blemishes at large scales. To address this issue, we propose a transformer-based approach, RetouchFormer, which simultaneously identifies imperfections and synthesizes realistic content in the corresponding regions. Specifically, we learn a latent dictionary to capture the clean face priors, and predict the imperfection regions via a reconstruction-oriented localization module.
Also based on this, we can realize face retouching by explicitly suppressing imperfections in our selective self-attention computation, such that local content will be synthesized from normal skin. On the other hand, multi-scale feature tokens lead to increased flexibility in dealing with the imperfections at various scales. The design elements bring greater effectiveness and efficiency. RetouchFormer outperforms the advanced face retouching methods and synthesizes clean face images with high fidelity in our list of extensive experiments performed. \ No newline at end of file diff --git a/data/2024/aaai/Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning b/data/2024/aaai/Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning new file mode 100644 index 0000000000..954a0af034 --- /dev/null +++ b/data/2024/aaai/Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning @@ -0,0 +1 @@ +Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. Composing the learned knowledge of seen primitives, i.e., attributes or objects, into novel compositions is critical for CZSL. In this work, we propose to explicitly retrieve knowledge of seen primitives for compositional zero-shot learning. We present a retrieval-augmented method, which augments standard multi-path classification methods with two retrieval modules. Specifically, we construct two databases storing the attribute and object representations of training images, respectively. For an input training/testing image, we use two retrieval modules to retrieve representations of training images with the same attribute and object, respectively. The primitive representations of the input image are augmented by using the retrieved representations, for composition recognition. By referencing semantically similar images, the proposed method is capable of recalling knowledge of seen primitives for compositional generalization. Experiments on three widely-used datasets show the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/RetroOOD: Understanding Out-of-Distribution Generalization in Retrosynthesis Prediction b/data/2024/aaai/RetroOOD: Understanding Out-of-Distribution Generalization in Retrosynthesis Prediction new file mode 100644 index 0000000000..3cc7d6a0ff --- /dev/null +++ b/data/2024/aaai/RetroOOD: Understanding Out-of-Distribution Generalization in Retrosynthesis Prediction @@ -0,0 +1 @@ +Machine learning-assisted retrosynthesis prediction models have been gaining widespread adoption, though their performances oftentimes degrade significantly when deployed in real-world applications embracing out-of-distribution (OOD) molecules or reactions. Despite steady progress on standard benchmarks, our understanding of existing retrosynthesis prediction models under the premise of distribution shifts remains stagnant. To this end, we first formally sort out two types of distribution shifts in retrosynthesis prediction and construct two groups of benchmark datasets. Next, through comprehensive experiments, we systematically compare state-of-the-art retrosynthesis prediction models on the two groups of benchmarks, revealing the limitations of previous in-distribution evaluation and re-examining the advantages of each model. 
More remarkably, we are motivated by the above empirical insights to propose two model-agnostic techniques that can improve the OOD generalization of arbitrary off-the-shelf retrosynthesis prediction algorithms. Our preliminary experiments show their high potential with an average performance improvement of 4.6%, and the established benchmarks serve as a foothold for further retrosynthesis prediction research towards OOD generalization. \ No newline at end of file diff --git a/data/2024/aaai/Revealing the Proximate Long-Tail Distribution in Compositional Zero-Shot Learning b/data/2024/aaai/Revealing the Proximate Long-Tail Distribution in Compositional Zero-Shot Learning new file mode 100644 index 0000000000..3af71fba7e --- /dev/null +++ b/data/2024/aaai/Revealing the Proximate Long-Tail Distribution in Compositional Zero-Shot Learning @@ -0,0 +1 @@ +Compositional Zero-Shot Learning (CZSL) aims to transfer knowledge from seen state-object pairs to novel unseen pairs. In this process, visual bias caused by the diverse interrelationship of state-object combinations blurs their visual features, hindering the learning of distinguishable class prototypes. Prevailing methods concentrate on disentangling states and objects directly from visual features, disregarding potential enhancements that could arise from a data viewpoint. Experimentally, we unveil the results caused by the above problem closely approximate the long-tailed distribution. As a solution, we transform CZSL into a proximate class imbalance problem. We mathematically deduce the role of class prior within the long-tailed distribution in CZSL. Building upon this insight, we incorporate visual bias caused by compositions into the classifier's training and inference by estimating it as a proximate class prior. This enhancement encourages the classifier to acquire more discernible class prototypes for each composition, thereby achieving more balanced predictions. Experimental results demonstrate that our approach elevates the model's performance to the state-of-the-art level, without introducing additional parameters. \ No newline at end of file diff --git a/data/2024/aaai/Reverse Multi-Choice Dialogue Commonsense Inference with Graph-of-Thought b/data/2024/aaai/Reverse Multi-Choice Dialogue Commonsense Inference with Graph-of-Thought new file mode 100644 index 0000000000..f377027290 --- /dev/null +++ b/data/2024/aaai/Reverse Multi-Choice Dialogue Commonsense Inference with Graph-of-Thought @@ -0,0 +1,8 @@ +With the proliferation of dialogic data across the Internet, the Dialogue Commonsense Multi-choice Question Answering (DC-MCQ) task has emerged as a response to the challenge of comprehending user queries and intentions. +Although prevailing methodologies exhibit effectiveness in addressing single-choice questions, they encounter difficulties in handling multi-choice queries due to the heightened intricacy and informational density. +In this paper, inspired by the human cognitive process of progressively excluding options, we propose a three-step Reverse Exclusion Graph-of-Thought (ReX-GoT) framework, including Option Exclusion, Error Analysis, and Combine Information. +Specifically, our ReX-GoT mimics human reasoning by gradually excluding irrelevant options and learning the reasons for option errors to choose the optimal path of the GoT and ultimately infer the correct answer. 
+By progressively integrating intricate clues, our method effectively reduces the difficulty of multi-choice reasoning and provides a novel solution for DC-MCQ. +Extensive experiments on the CICERO and CICERO_v2 datasets validate the significant improvement of our approach on the DC-MCQ task. +In the zero-shot setting, our model outperforms the best baseline by 17.67% in terms of F1 score for the multi-choice task. +Most strikingly, our GPT3.5-based ReX-GoT framework achieves a remarkable 39.44% increase in F1 score. \ No newline at end of file diff --git a/data/2024/aaai/Review-Enhanced Hierarchical Contrastive Learning for Recommendation b/data/2024/aaai/Review-Enhanced Hierarchical Contrastive Learning for Recommendation new file mode 100644 index 0000000000..a1e4fcf9fd --- /dev/null +++ b/data/2024/aaai/Review-Enhanced Hierarchical Contrastive Learning for Recommendation @@ -0,0 +1 @@ +Designed to establish potential relations and distill high-order representations, graph-based recommendation systems continue to reveal promising results by jointly modeling ratings and reviews. However, existing studies capture simple review relations, failing to (1) completely explore hidden connections between users (or items), (2) filter out redundant information derived from reviews, and (3) model the behavioral association between rating and review interactions. To address these challenges, we propose a review-enhanced hierarchical contrastive learning method, namely ReHCL. First, ReHCL constructs topic and semantic graphs to fully mine review relations from different views. Moreover, cross-view graph contrastive learning is used to achieve enhancement of node representations and extract useful review knowledge. Meanwhile, we design a neighbor-based positive sampling strategy to capture the graph-structured similarity between topic and semantic views, further performing efficient contrast and reducing redundant noise. Next, we propose cross-modal contrastive learning to match the rating and review representations, by exploring the association between ratings and reviews. Lastly, these two contrastive learning modes form a hierarchical contrastive learning task, which is applied to enhance the final recommendation task. Extensive experiments verify the superiority of ReHCL compared with state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Reviewing the Forgotten Classes for Domain Adaptation of Black-Box Predictors b/data/2024/aaai/Reviewing the Forgotten Classes for Domain Adaptation of Black-Box Predictors new file mode 100644 index 0000000000..acfaf708c6 --- /dev/null +++ b/data/2024/aaai/Reviewing the Forgotten Classes for Domain Adaptation of Black-Box Predictors @@ -0,0 +1 @@ +To address the data privacy and portability issues of domain adaptation, Domain Adaptation of Black-box Predictors (DABP) aims to adapt a black-box source model to an unlabeled target domain without accessing either the source-domain data or the details of the source model. Although existing DABP approaches based on knowledge distillation (KD) have achieved promising results, we experimentally find that these methods all have the minority class forgetting issue, which means that the trained model completely forgets some minority classes. To address this issue, we propose a method called Reviewing the Forgotten Classes (RFC), which includes two main modules. Firstly, we propose a simple but effective component called selection training (ST).
ST selects classes that the model tends to forget according to the learning status of the model and obtains clean samples of the selected classes with the small-loss criterion for enhanced training. ST is orthogonal to previous methods and can effectively alleviate their minority class forgetting issue. Secondly, we find that neighborhood clustering (NC) can help the model learn in a more balanced way than KD, which further alleviates the minority class forgetting issue. However, NC is based on the fact that target features from the source model already form some semantic structure, while DABP is unable to obtain the source model. Thus, we use KD and ST to warm up the target model to form a certain semantic structure. Overall, our method inherits the merits of both ST and NC, and achieves state-of-the-art results on three DABP benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning b/data/2024/aaai/Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning new file mode 100644 index 0000000000..ff00923b3e --- /dev/null +++ b/data/2024/aaai/Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning @@ -0,0 +1 @@ +In representation learning, a disentangled representation is highly desirable as it encodes generative factors of data in a separable and compact pattern. Researchers have advocated leveraging disentangled representations to complete downstream tasks with encouraging empirical evidence. This paper further investigates the necessity of disentangled representation in downstream applications. Specifically, we show that dimension-wise disentangled representations are unnecessary on a fundamental downstream task, abstract visual reasoning. We provide extensive empirical evidence against the necessity of disentanglement, covering multiple datasets, representation learning methods, and downstream network architectures. Furthermore, our findings suggest that the informativeness of representations is a better indicator of downstream performance than disentanglement. Finally, the positive correlation between informativeness and disentanglement explains the claimed usefulness of disentangled representations in previous works. The source code is available at https://github.com/Richard-coder-Nai/disentanglement-lib-necessity.git \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Document-Level Relation Extraction with Context-Guided Link Prediction b/data/2024/aaai/Revisiting Document-Level Relation Extraction with Context-Guided Link Prediction new file mode 100644 index 0000000000..adaa628171 --- /dev/null +++ b/data/2024/aaai/Revisiting Document-Level Relation Extraction with Context-Guided Link Prediction @@ -0,0 +1 @@ +Document-level relation extraction (DocRE) poses the challenge of identifying relationships between entities within a document. Existing approaches rely on logical reasoning or contextual cues from entities. This paper reframes document-level RE as link prediction over a Knowledge Graph (KG) with distinct benefits: 1) Our approach amalgamates entity context and document-derived logical reasoning, enhancing link prediction quality. 2) Predicted links between entities offer interpretability, elucidating employed reasoning. We evaluate our approach on benchmark datasets - DocRED, ReDocRED, and DWIE.
The results indicate that our proposed method outperforms the state-of-the-art models and suggest that incorporating context-based Knowledge Graph link prediction techniques can enhance the performance of document-level relation extraction models. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Gradient Pruning: A Dual Realization for Defending against Gradient Attacks b/data/2024/aaai/Revisiting Gradient Pruning: A Dual Realization for Defending against Gradient Attacks new file mode 100644 index 0000000000..0f87d78529 --- /dev/null +++ b/data/2024/aaai/Revisiting Gradient Pruning: A Dual Realization for Defending against Gradient Attacks @@ -0,0 +1 @@ +Collaborative learning (CL) is a distributed learning framework that aims to protect user privacy by allowing users to jointly train a model by sharing their gradient updates only. However, gradient inversion attacks (GIAs), which recover users' training data from shared gradients, pose severe privacy threats to CL. Existing defense methods adopt different techniques, e.g., differential privacy, cryptography, and perturbation defenses, to defend against GIAs. Nevertheless, all current defense methods suffer from a poor trade-off between privacy, utility, and efficiency. To mitigate the weaknesses of existing solutions, we propose a novel defense method, Dual Gradient Pruning (DGP), based on gradient pruning, which can improve communication efficiency while preserving the utility and privacy of CL. Specifically, DGP slightly modifies gradient pruning to provide a stronger privacy guarantee. DGP also significantly improves communication efficiency, and we provide a theoretical analysis of its convergence and generalization. Our extensive experiments show that DGP can effectively defend against the most powerful GIAs and reduce the communication cost without sacrificing the model's utility. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Graph-Based Fraud Detection in Sight of Heterophily and Spectrum b/data/2024/aaai/Revisiting Graph-Based Fraud Detection in Sight of Heterophily and Spectrum new file mode 100644 index 0000000000..2bdac404fc --- /dev/null +++ b/data/2024/aaai/Revisiting Graph-Based Fraud Detection in Sight of Heterophily and Spectrum @@ -0,0 +1 @@ +Graph-based fraud detection (GFD) can be regarded as a challenging semi-supervised node binary classification task. In recent years, Graph Neural Networks (GNN) have been widely applied to GFD, characterizing the anomalous possibility of a node by aggregating neighbor information. However, fraud graphs are inherently heterophilic, thus most GNNs perform poorly due to their assumption of homophily. In addition, due to the existence of the heterophily and class imbalance problems, the existing models do not fully utilize the precious node label information. To address the above issues, this paper proposes a semi-supervised GNN-based fraud detector, SEC-GFD. This detector includes a hybrid filtering module and a local environmental constraint module; the two modules are utilized to solve the heterophily and label utilization problems, respectively. The first module starts from the perspective of the spectral domain, and solves the heterophily problem to a certain extent. Specifically, it divides the spectrum into various mixed-frequency bands based on the correlation between spectrum energy distribution and heterophily. Then in order to make full use of the node label information, a local environmental constraint module is adaptively designed.
The comprehensive experimental results on four real-world fraud detection datasets demonstrate that SEC-GFD outperforms other competitive graph-based fraud detectors. We release our code at https://github.com/Sunxkissed/SEC-GFD. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Open-Set Panoptic Segmentation b/data/2024/aaai/Revisiting Open-Set Panoptic Segmentation new file mode 100644 index 0000000000..b5344fd889 --- /dev/null +++ b/data/2024/aaai/Revisiting Open-Set Panoptic Segmentation @@ -0,0 +1 @@ +In this paper, we focus on the open-set panoptic segmentation (OPS) task to circumvent the data explosion problem. Different from the closed-set setting, OPS aims to detect both known and unknown categories, where the latter are not annotated during training. Different from existing work that only selects a few common categories as unknown ones, we move forward to the real-world scenario by considering the various tail categories (~1k). To this end, we first build a new dataset with a long-tail distribution for the OPS task. Based on this dataset, we additionally add a new class type for unknown classes and re-define the training annotations to make the OPS definition more complete and reasonable. Moreover, we analyze the influence of several significant factors in the OPS task and explore the upper bound of performance on unknown classes with different settings. Furthermore, based on the analyses, we design an effective two-phase framework for the OPS task, including thing-agnostic map generation and unknown segment mining. We further adopt semi-supervised learning to improve the OPS performance. Experimental results on different datasets validate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting the Information Capacity of Neural Network Watermarks: Upper Bound Estimation and Beyond b/data/2024/aaai/Revisiting the Information Capacity of Neural Network Watermarks: Upper Bound Estimation and Beyond new file mode 100644 index 0000000000..977d7c00f1 --- /dev/null +++ b/data/2024/aaai/Revisiting the Information Capacity of Neural Network Watermarks: Upper Bound Estimation and Beyond @@ -0,0 +1,7 @@ +To trace the copyright of deep neural networks, an owner can embed its identity information into its model as a watermark. +The capacity of the watermark quantifies the maximal volume of information that can be verified from the watermarked model. +Current studies on capacity focus on the ownership verification accuracy under ordinary removal attacks and fail to capture the relationship between robustness and fidelity. +This paper studies the capacity of deep neural network watermarks from an information-theoretic perspective. +We propose a new definition of deep neural network watermark capacity analogous to channel capacity, analyze its properties, and design an algorithm that yields a tight estimation of its upper bound under adversarial overwriting. +We also propose a universal non-invasive method to secure the transmission of the identity message beyond capacity by multiple rounds of ownership verification. +Our observations provide evidence for neural network owners and defenders who are curious about the tradeoff between the integrity of their ownership and the performance degradation of their products.
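To make the channel-capacity analogy above concrete, here is a toy sketch (not the paper's construction) that treats one round of ownership verification as a binary symmetric channel: an attack-induced bit-flip rate p caps the reliably verifiable information at C = 1 - H(p) bits per embedded bit, and running several verification rounds, as the multi-round scheme suggests, is one way to push a longer identity message through. The flip rate, watermark length, and message length below are made-up numbers.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_rate: float) -> float:
    """Capacity of a binary symmetric channel, C = 1 - H(p)."""
    return 1.0 - binary_entropy(flip_rate)

def verification_rounds(message_bits: int, watermark_bits: int, flip_rate: float) -> int:
    """Rough number of verification rounds needed to convey `message_bits`
    when each round carries `watermark_bits` raw bits through a BSC(flip_rate)."""
    per_round = watermark_bits * bsc_capacity(flip_rate)
    return math.ceil(message_bits / per_round)

# Hypothetical numbers: a 256-bit identity message, a 128-bit watermark,
# and an 11% bit-flip rate after adversarial overwriting.
print(bsc_capacity(0.11))                    # ~0.5 bit per embedded bit
print(verification_rounds(256, 128, 0.11))   # ~4 rounds
```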
\ No newline at end of file diff --git a/data/2024/aaai/Revitalizing Bahnaric Language through Neural Machine Translation: Challenges, Strategies, and Promising Outcomes b/data/2024/aaai/Revitalizing Bahnaric Language through Neural Machine Translation: Challenges, Strategies, and Promising Outcomes new file mode 100644 index 0000000000..c2708656bf --- /dev/null +++ b/data/2024/aaai/Revitalizing Bahnaric Language through Neural Machine Translation: Challenges, Strategies, and Promising Outcomes @@ -0,0 +1,3 @@ +The Bahnar, a minority ethnic group in Vietnam with ancient roots, hold a language of deep cultural and historical significance. The government is prioritizing the preservation and dissemination of the Bahnar language through online availability and cross-generational communication. Recent AI advances, including Neural Machine Translation (NMT), have transformed translation with improved accuracy and fluency, fostering language revitalization through learning, communication, and documentation. In particular, NMT enhances accessibility for Bahnar language speakers, making information and content more available. + +However, translating Vietnamese to Bahnar faces practical hurdles due to resource limitations, since Bahnar is an extremely low-resource language. These challenges encompass data scarcity, vocabulary constraints, and a lack of fine-tuning data. To address these, we propose transfer learning from selected pre-trained models to optimize translation quality and computational efficiency, capitalizing on linguistic similarities between Vietnamese and Bahnar. Concurrently, we apply tailored augmentation strategies to adapt machine translation for the Vietnamese-Bahnar language context. Our approach is validated through superior results on bilingual Vietnamese-Bahnar language datasets when compared to baseline models. By tackling translation challenges, we help revitalize the Bahnar language, ensuring information flows freely and the language thrives. \ No newline at end of file diff --git a/data/2024/aaai/Revolutionizing Education through AI-Powered Inclusive Learning Systems b/data/2024/aaai/Revolutionizing Education through AI-Powered Inclusive Learning Systems new file mode 100644 index 0000000000..3691006798 --- /dev/null +++ b/data/2024/aaai/Revolutionizing Education through AI-Powered Inclusive Learning Systems @@ -0,0 +1,5 @@ +This proposal introduces an innovative AI-powered learning system designed to address educational disparities worldwide. Focused on developing countries, the system seamlessly translates educational content between English and native languages, breaking down language barriers. Leveraging advanced natural language processing and machine learning techniques, including transformer models like BERT and GPT-3, the system ensures inclusivity, effectiveness, and engagement. + +Built on prior research demonstrating AI's efficacy in language translation and personalized learning, the proposed system draws inspiration from successful projects like the Duolingo Language Incubator. By providing inclusive and accessible learning experiences, it empowers individuals to overcome language barriers, fostering global participation. + +The potential impact is significant, with the system poised to accelerate learning, enhance literacy rates, and create a more skilled workforce in developing countries.
This research reflects a commitment to revolutionizing education through technology, aiming for lasting and transformative contributions to global society. Through AI-driven education, a brighter, more inclusive future is envisioned. \ No newline at end of file diff --git a/data/2024/aaai/Reward (Mis)design for Autonomous Driving (Abstract Reprint) b/data/2024/aaai/Reward (Mis)design for Autonomous Driving (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Reward Certification for Policy Smoothed Reinforcement Learning b/data/2024/aaai/Reward Certification for Policy Smoothed Reinforcement Learning new file mode 100644 index 0000000000..b05b05f05b --- /dev/null +++ b/data/2024/aaai/Reward Certification for Policy Smoothed Reinforcement Learning @@ -0,0 +1 @@ +Reinforcement Learning (RL) has achieved remarkable success in safety-critical areas, but it can be weakened by adversarial attacks. Recent studies have introduced "smoothed policies" to enhance its robustness. Yet, it is still challenging to establish a provable guarantee to certify the bound of its total reward. Prior methods relied primarily on computing bounds using Lipschitz continuity or calculating the probability of cumulative reward being above specific thresholds. However, these techniques are only suited for continuous perturbations on the RL agent's observations and are restricted to perturbations bounded by the l2-norm. To address these limitations, this paper proposes a general black-box certification method, called ReCePS, which is capable of directly certifying the cumulative reward of the smoothed policy under various lp-norm bounded perturbations. Furthermore, we extend our methodology to certify perturbations on action spaces. Our approach leverages f-divergence to measure the distinction between the original distribution and the perturbed distribution, subsequently determining the certification bound by solving a convex optimisation problem. We provide a comprehensive theoretical analysis and run experiments in multiple environments. Our results show that our method not only improves the tightness of the certified lower bound of the mean cumulative reward but also demonstrates better efficiency than state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Reward Penalties on Augmented States for Solving Richly Constrained RL Effectively b/data/2024/aaai/Reward Penalties on Augmented States for Solving Richly Constrained RL Effectively new file mode 100644 index 0000000000..700274c431 --- /dev/null +++ b/data/2024/aaai/Reward Penalties on Augmented States for Solving Richly Constrained RL Effectively @@ -0,0 +1 @@ +Constrained Reinforcement Learning employs trajectory-based cost constraints (such as expected cost, Value at Risk, or Conditional VaR cost) to compute safe policies. The challenge lies in handling these constraints effectively while optimizing expected reward. Existing methods convert such trajectory-based constraints into local cost constraints, but they rely on cost estimates, leading to either aggressive or conservative solutions with regard to cost. We propose an unconstrained formulation that employs reward penalties over states augmented with costs to compute safe policies. Unlike standard primal-dual methods, our approach penalizes only infeasible trajectories through state augmentation. This ensures that increasing the penalty parameter always guarantees a feasible policy, a feature lacking in primal-dual methods.
Our approach exhibits strong empirical performance and theoretical properties, offering a fresh paradigm for solving complex Constrained RL problems, including rich constraints like expected cost, Value at Risk, and Conditional Value at Risk. Our experimental results demonstrate superior performance compared to leading approaches across various constraint types on multiple benchmark problems. \ No newline at end of file diff --git a/data/2024/aaai/Reward-Respecting Subtasks for Model-Based Reinforcement Learning (Abstract Reprint) b/data/2024/aaai/Reward-Respecting Subtasks for Model-Based Reinforcement Learning (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Rider Posture-Based Continuous Authentication with Few-Shot Learning for Mobility Scooters (Student Abstract) b/data/2024/aaai/Rider Posture-Based Continuous Authentication with Few-Shot Learning for Mobility Scooters (Student Abstract) new file mode 100644 index 0000000000..98b36a6269 --- /dev/null +++ b/data/2024/aaai/Rider Posture-Based Continuous Authentication with Few-Shot Learning for Mobility Scooters (Student Abstract) @@ -0,0 +1 @@ +The current practice of authenticating mobility scooter users with physical keys and traditional password-based one-time security mechanisms cannot meet the needs of many mobility scooter riders, especially senior citizens with memory recall difficulties. Seamless authentication approaches are therefore needed to provide ongoing protection for mobility scooters against takeovers and unauthorized access. Existing continuous authentication techniques do not work well in a mobility scooter setting due to issues such as user comfort, deployment cost and enrollment time, among others. In that direction, our contributions in this research effort are two-fold: (i) we propose a novel system that incorporates advances in few-shot learning, hierarchical processing, and contextual embedding to establish continuous authentication for mobility scooter riders using only posture data. This security system, trained on data collected from real mobility scooter riders, demonstrates quick enrollment and easy deployability, while successfully serving as an unobtrusive first layer of security. (ii) we provide to the research community the largest publicly available repository of mobility scooter riders' body key-points data to enable further research in this direction. \ No newline at end of file diff --git a/data/2024/aaai/Risk Management in Image Generative Models through Model Fingerprinting b/data/2024/aaai/Risk Management in Image Generative Models through Model Fingerprinting new file mode 100644 index 0000000000..207e63a158 --- /dev/null +++ b/data/2024/aaai/Risk Management in Image Generative Models through Model Fingerprinting @@ -0,0 +1 @@ +My doctoral research delves into the realm of generative model fingerprinting, aiming to assign responsibility for the generated images. I introduce frameworks that modify generative models to incorporate each user's distinct digital fingerprint. This ensures that every piece of generated content carries a traceable identifier linked to its originator. The primary objective of my research is to achieve optimal attribution accuracy while ensuring minimal compromise on the model's performance. Additionally, I present strategies designed to enhance robustness against common adversarial manipulations, which malicious users might employ to obscure or remove these fingerprints.
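As a purely illustrative sketch of the attribution step described in the fingerprinting abstract above (not the dissertation's framework), one can register a random binary fingerprint per user and attribute a possibly corrupted extracted fingerprint to the nearest registered user in Hamming distance. The user count, fingerprint length, and corruption rate below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def issue_fingerprints(num_users: int, length: int = 64) -> np.ndarray:
    """Register one random binary fingerprint per user."""
    return rng.integers(0, 2, size=(num_users, length), dtype=np.uint8)

def attribute(extracted: np.ndarray, registry: np.ndarray) -> tuple:
    """Return (user_id, Hamming distance) of the closest registered fingerprint."""
    dists = (registry ^ extracted).sum(axis=1)
    user = int(dists.argmin())
    return user, int(dists[user])

registry = issue_fingerprints(num_users=100)
true_user = 42
# Simulate an adversary flipping ~10% of the embedded bits before extraction.
flips = (rng.random(registry.shape[1]) < 0.10).astype(np.uint8)
extracted = registry[true_user] ^ flips
print(attribute(extracted, registry))  # should recover user 42 with a small distance
```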
\ No newline at end of file diff --git a/data/2024/aaai/Risk-Aware Continuous Control with Neural Contextual Bandits b/data/2024/aaai/Risk-Aware Continuous Control with Neural Contextual Bandits new file mode 100644 index 0000000000..955119d4ee --- /dev/null +++ b/data/2024/aaai/Risk-Aware Continuous Control with Neural Contextual Bandits @@ -0,0 +1 @@ +Recent advances in learning techniques have garnered attention for their applicability to a diverse range of real-world sequential decision-making problems. Yet, many practical applications have critical constraints for operation in real environments. Most learning solutions neglect the risk of failing to meet these constraints, hindering their implementation in real-world contexts. In this paper, we propose a risk-aware decision-making framework for contextual bandit problems, accommodating constraints and continuous action spaces. Our approach employs an actor multi-critic architecture, with each critic characterizing the distribution of performance and constraint metrics. Our framework is designed to cater to various risk levels, effectively balancing constraint satisfaction against performance. To demonstrate the effectiveness of our approach, we first compare it against state-of-the-art baseline methods in a synthetic environment, highlighting the impact of intrinsic environmental noise across different risk configurations. Finally, we evaluate our framework in a real-world use case involving a 5G mobile network where only our approach consistently satisfies the system constraint (a signal processing reliability target) with a small performance toll (8.5% increase in power consumption). \ No newline at end of file diff --git a/data/2024/aaai/Risk-Conditioned Reinforcement Learning: A Generalized Approach for Adapting to Varying Risk Measures b/data/2024/aaai/Risk-Conditioned Reinforcement Learning: A Generalized Approach for Adapting to Varying Risk Measures new file mode 100644 index 0000000000..b32785952f --- /dev/null +++ b/data/2024/aaai/Risk-Conditioned Reinforcement Learning: A Generalized Approach for Adapting to Varying Risk Measures @@ -0,0 +1 @@ +In application domains requiring mission-critical decision making, such as finance and robotics, the optimal policy derived by reinforcement learning (RL) often hinges on a preference for risk management. Yet, the dynamic nature of risk measures poses considerable challenges to achieving generalization and adaptation of risk-sensitive policies in the context of RL. In this paper, we propose a risk-conditioned RL model that enables rapid policy adaptation to varying risk measures via a unified risk representation, the Weighted Value-at-Risk (WV@R). To sample risk measures that avoid undue optimism, we construct a risk proposal network employing a conditional adversarial auto-encoder and a normalizing flow. This network establishes coherent representations for risk measures, preserving the continuity in terms of the Wasserstein distance on the risk measures. The normalizing flow is used to support non-crossing quantile regression that obtains valid samples for risk measures, and it is also applied to the agent’s critic to ascertain the preservation of monotonicity in quantile estimations. Through experiments with locomotion, finance, and self-driving scenarios, we show that our model is capable of adapting to a range of risk measures, achieving comparable performance to the baseline models individually trained for each measure.
Our model often outperforms the baselines, especially in cases where exploration is required during training but risk-aversion is favored during evaluation. \ No newline at end of file diff --git a/data/2024/aaai/Robust 3D Tracking with Quality-Aware Shape Completion b/data/2024/aaai/Robust 3D Tracking with Quality-Aware Shape Completion new file mode 100644 index 0000000000..860841800e --- /dev/null +++ b/data/2024/aaai/Robust 3D Tracking with Quality-Aware Shape Completion @@ -0,0 +1 @@ +3D single object tracking remains a challenging problem due to the sparsity and incompleteness of the point clouds. Existing algorithms attempt to address these challenges with two strategies. The first strategy is to learn dense geometric features based on the captured sparse point cloud. Nevertheless, it is quite a formidable task since the learned dense geometric features carry high uncertainty when depicting the shape of the target object. The other strategy is to aggregate the sparse geometric features of multiple templates to enrich the shape information, which is a routine solution in 2D tracking. However, aggregating the coarse shape representations can hardly yield a precise shape representation. Different from 2D pixels, 3D points of different frames can be directly fused by coordinate transform, i.e., shape completion. Considering that, we propose to construct a synthetic target representation composed of dense and complete point clouds depicting the target shape precisely by shape completion for robust 3D tracking. Specifically, we design a voxelized 3D tracking framework with shape completion, in which we propose a quality-aware shape completion mechanism to alleviate the adverse effect of noisy historical predictions. It enables us to effectively construct and leverage the synthetic target representation. Besides, we also develop a voxelized relation modeling module and box refinement module to improve tracking performance. Favorable performance against state-of-the-art algorithms on three benchmarks demonstrates the effectiveness and generalization ability of our method. \ No newline at end of file diff --git a/data/2024/aaai/Robust Active Measuring under Model Uncertainty b/data/2024/aaai/Robust Active Measuring under Model Uncertainty new file mode 100644 index 0000000000..c6666bfd02 --- /dev/null +++ b/data/2024/aaai/Robust Active Measuring under Model Uncertainty @@ -0,0 +1 @@ +Partial observability and uncertainty are common problems in sequential decision-making that particularly impede the use of formal models such as Markov decision processes (MDPs). However, in practice, agents may be able to employ costly sensors to measure their environment and resolve partial observability by gathering information. Moreover, imprecise transition functions can capture model uncertainty. We combine these concepts and extend MDPs to robust active-measuring MDPs (RAM-MDPs). We present an active-measure heuristic to solve RAM-MDPs efficiently and show that model uncertainty can, counterintuitively, let agents take fewer measurements. We propose a method to counteract this behavior while only incurring a bounded additional cost. We empirically compare our methods to several baselines and show their superior scalability and performance.
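The idea of paying for a measurement only when it is worth it can be illustrated with a tiny value-of-information calculation (a generic sketch, not the RAM-MDP heuristic from the abstract): given a belief over hidden states and a known reward table, the agent measures only if the expected gain from acting on the revealed state exceeds the sensor cost. The two-state reward table and cost below are arbitrary.

```python
import numpy as np

def value_without_measuring(belief: np.ndarray, rewards: np.ndarray) -> float:
    """Best single action under the current belief over hidden states."""
    return float((belief @ rewards).max())

def value_with_measuring(belief: np.ndarray, rewards: np.ndarray, cost: float) -> float:
    """A (perfect) measurement reveals the state, so the best action can be
    chosen per state; the sensor cost is paid up front."""
    return float((belief * rewards.max(axis=1)).sum() - cost)

rewards = np.array([[10.0, 0.0],   # rewards[state, action]
                    [0.0, 10.0]])
belief = np.array([0.5, 0.5])      # maximal uncertainty about the hidden state
cost = 2.0
print(value_without_measuring(belief, rewards))     # 5.0
print(value_with_measuring(belief, rewards, cost))  # 8.0 -> measuring pays off here
```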
\ No newline at end of file diff --git a/data/2024/aaai/Robust Beamforming for Downlink Multi-Cell Systems: A Bilevel Optimization Perspective b/data/2024/aaai/Robust Beamforming for Downlink Multi-Cell Systems: A Bilevel Optimization Perspective new file mode 100644 index 0000000000..892af454a9 --- /dev/null +++ b/data/2024/aaai/Robust Beamforming for Downlink Multi-Cell Systems: A Bilevel Optimization Perspective @@ -0,0 +1 @@ +Utilization of inter-base station cooperation for information processing has shown great potential in enhancing the overall quality of communication services (QoS) in wireless communication networks. Nevertheless, such cooperation requires channel state information (CSI) at base stations (BSs), which is assumed to be perfectly known. However, CSI errors are inevitable in practice, which necessitates beamforming techniques that can achieve robust performance in the presence of channel estimation errors. Existing approaches relax the robust beamforming design problems into semidefinite programming (SDP), which can only achieve a solution that is far from being optimal. To this end, this paper views robust beamforming design problems from a bilevel optimization perspective. In particular, we focus on maximizing the worst-case weighted sum-rate (WSR) in the downlink multi-cell multi-user multiple-input single-output (MISO) system considering bounded CSI errors. We first reformulate this problem into a bilevel optimization problem and then develop an efficient algorithm based on the cutting plane method. A distributed optimization algorithm has also been developed to facilitate parallel processing in practical settings. Numerical results are provided to confirm the effectiveness of the proposed algorithm in terms of performance and complexity, particularly in the presence of CSI uncertainties. \ No newline at end of file diff --git a/data/2024/aaai/Robust Blind Text Image Deblurring via Maximum Consensus Framework b/data/2024/aaai/Robust Blind Text Image Deblurring via Maximum Consensus Framework new file mode 100644 index 0000000000..056770a729 --- /dev/null +++ b/data/2024/aaai/Robust Blind Text Image Deblurring via Maximum Consensus Framework @@ -0,0 +1 @@ +The blind text image deblurring problem presents a formidable challenge, requiring the recovery of a clean and sharp text image from a blurry version with an unknown blur kernel. Sparsity-based strategies have demonstrated their efficacy by emphasizing the sparse priors of the latent image and kernel. However, these existing strategies have largely neglected the influence of additional noise, imposing limitations on their performance. To overcome this limitation, we propose a novel framework designed to effectively mitigate the impact of extensive noise prevalent in blurred images. Our approach centers around a robust Maximum Consensus Framework, wherein we optimize the quantity of interest from the noisy blurry image based on the maximum consensus criterion. Furthermore, we propose the integration of the Alternating Direction Method of Multipliers (ADMM) and the Half-Quadratic Splitting (HQS) method to address the computationally intractable L0 norm problem. This innovative strategy enables improvements in the deblurring performance of blurry text images with additional synthetic noise. Experimental evaluations conducted on various noisy blurry text images demonstrate the superiority of the proposed approach over existing methods.
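The maximum-consensus criterion itself is easy to illustrate in one dimension (this is only a toy, and deliberately ignores the ADMM/HQS machinery the paper uses to optimize it): a candidate blur kernel is scored by how many pixels it explains to within a tolerance, so a handful of grossly corrupted pixels cannot dominate the fit the way they would in a least-squares score. The signal, kernels, and outlier pattern are synthetic.

```python
import numpy as np

def consensus_score(blurred: np.ndarray, sharp: np.ndarray,
                    kernel: np.ndarray, tol: float = 1e-6) -> int:
    """Count pixels whose re-blurring residual lies within `tol` (the consensus set)."""
    reblurred = np.convolve(sharp, kernel, mode="same")
    return int((np.abs(reblurred - blurred) <= tol).sum())

rng = np.random.default_rng(1)
sharp = (rng.random(200) > 0.7).astype(float)      # toy binary "text-like" signal
true_kernel = np.array([0.25, 0.5, 0.25])
blurred = np.convolve(sharp, true_kernel, mode="same")
blurred[::17] += 5.0                                # sparse, large outliers (heavy noise)

candidates = {
    "identity": np.array([1.0]),
    "box": np.array([1/3, 1/3, 1/3]),
    "true": true_kernel,
}
print({name: consensus_score(blurred, sharp, k) for name, k in candidates.items()})
# The true kernel attains the largest consensus despite the corrupted pixels.
```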
\ No newline at end of file diff --git a/data/2024/aaai/Robust Communicative Multi-Agent Reinforcement Learning with Active Defense b/data/2024/aaai/Robust Communicative Multi-Agent Reinforcement Learning with Active Defense new file mode 100644 index 0000000000..ce07c02136 --- /dev/null +++ b/data/2024/aaai/Robust Communicative Multi-Agent Reinforcement Learning with Active Defense @@ -0,0 +1 @@ +Communication in multi-agent reinforcement learning (MARL) has recently been shown to effectively promote cooperation among agents. Since communication in real-world scenarios is vulnerable to noise and adversarial attacks, it is crucial to develop robust communicative MARL techniques. However, existing research in this domain has predominantly focused on passive defense strategies, where agents receive all messages equally, making it hard to balance performance and robustness. We propose an active defense strategy, where agents automatically reduce the impact of potentially harmful messages on the final decision. Implementing this strategy raises two challenges: defining unreliable messages and properly adjusting their impact on the final decision. To address them, we design an Active Defense Multi-Agent Communication framework (ADMAC), which estimates the reliability of received messages and adjusts their impact on the final decision accordingly with the help of a decomposable decision structure. The superiority of ADMAC over existing methods is validated by experiments in three communication-critical tasks under four types of attacks. \ No newline at end of file diff --git a/data/2024/aaai/Robust Distributed Gradient Aggregation Using Projections onto Gradient Manifolds b/data/2024/aaai/Robust Distributed Gradient Aggregation Using Projections onto Gradient Manifolds new file mode 100644 index 0000000000..85c69d821a --- /dev/null +++ b/data/2024/aaai/Robust Distributed Gradient Aggregation Using Projections onto Gradient Manifolds @@ -0,0 +1 @@ +We study the distributed gradient aggregation problem where individual clients contribute to learning a central model by sharing parameter gradients constructed from local losses. However, errors in some gradients, caused by low-quality data or adversaries, can degrade the learning process when naively combined. Existing robust gradient aggregation approaches assume that local data represent the global data-generating distribution, which may not always apply to heterogeneous (non-i.i.d.) client data. We propose a new algorithm that can robustly aggregate gradients from potentially heterogeneous clients. Our approach leverages the manifold structure inherent in heterogeneous client gradients and evaluates gradient anomaly degrees by projecting them onto this manifold. This algorithm is implemented as a simple and efficient method that accumulates random projections within the subspace defined by the nearest neighbors within a gradient cloud. Our experiments demonstrate consistent performance improvements over state-of-the-art robust aggregation algorithms.
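A simplified (and deterministic) variant of the projection idea sketched in the aggregation abstract above can be written in a few lines: each client gradient is projected onto the affine subspace spanned by its nearest neighbours, the residual norm serves as its anomaly degree, and gradients are down-weighted accordingly before averaging. This is an illustrative reading of the abstract, not the authors' algorithm; the client count, dimensionality, and corruption below are fabricated.

```python
import numpy as np

def anomaly_scores(grads: np.ndarray, k: int = 5) -> np.ndarray:
    """Residual of each gradient after projection onto the affine span of its
    k nearest neighbours; large residuals flag off-manifold (suspicious) updates."""
    n = grads.shape[0]
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    scores = np.empty(n)
    for i in range(n):
        nbrs = np.argsort(dists[i])[1:k + 1]            # skip the gradient itself
        centre = grads[nbrs].mean(axis=0)
        basis, _, _ = np.linalg.svd((grads[nbrs] - centre).T, full_matrices=False)
        dev = grads[i] - centre
        scores[i] = np.linalg.norm(dev - basis @ (basis.T @ dev))
    return scores

def robust_aggregate(grads: np.ndarray, k: int = 5) -> np.ndarray:
    """Down-weight high-residual gradients before averaging."""
    w = 1.0 / (1.0 + anomaly_scores(grads, k))
    return (w[:, None] * grads).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
grads = 1.0 + 0.1 * rng.normal(size=(20, 50))   # 20 benign clients, gradients near 1
grads[0] = -25.0                                 # one corrupted client
print(robust_aggregate(grads).mean())            # stays close to 1 despite the outlier
```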
\ No newline at end of file diff --git a/data/2024/aaai/Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models b/data/2024/aaai/Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models new file mode 100644 index 0000000000..5e152096f4 --- /dev/null +++ b/data/2024/aaai/Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models @@ -0,0 +1 @@ +Many evaluation measures are used to evaluate social biases in masked language models (MLMs). However, we find that these previously proposed evaluation measures lack robustness in scenarios with limited datasets. This is because these measures are obtained by comparing the pseudo-log-likelihood (PLL) scores of the stereotypical and anti-stereotypical samples using an indicator function. The disadvantage is that such comparisons mine the PLL score sets only superficially, without capturing their distributional information. In this paper, we represent a PLL score set as a Gaussian distribution and use Kullback-Leibler (KL) divergence and Jensen–Shannon (JS) divergence to construct evaluation measures for the distributions of stereotypical and anti-stereotypical PLL scores. Experimental results on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures are significantly more robust and interpretable than those proposed previously. \ No newline at end of file diff --git a/data/2024/aaai/Robust Few-Shot Named Entity Recognition with Boundary Discrimination and Correlation Purification b/data/2024/aaai/Robust Few-Shot Named Entity Recognition with Boundary Discrimination and Correlation Purification new file mode 100644 index 0000000000..2ba6b87dc1 --- /dev/null +++ b/data/2024/aaai/Robust Few-Shot Named Entity Recognition with Boundary Discrimination and Correlation Purification @@ -0,0 +1 @@ +Few-shot named entity recognition (NER) aims to recognize novel named entities in low-resource domains utilizing existing knowledge. However, the present few-shot NER models assume that the labeled data are all clean without noise or outliers, and there are few works focusing on the robustness of the cross-domain transfer learning ability to textual adversarial attacks in few-shot NER. In this work, we comprehensively explore and assess the robustness of few-shot NER models under a textual adversarial attack scenario, and find that existing few-shot NER models are vulnerable. Furthermore, we propose a robust two-stage few-shot NER method with Boundary Discrimination and Correlation Purification (BDCP). Specifically, in the span detection stage, the entity boundary discriminative module is introduced to provide a highly distinguishing boundary representation space to detect entity spans. In the entity typing stage, the correlations between entities and contexts are purified by minimizing the interference information and facilitating correlation generalization to alleviate the perturbations caused by textual adversarial attacks. In addition, we construct adversarial examples for few-shot NER based on the public datasets Few-NERD and Cross-Dataset. Comprehensive evaluations on those two groups of few-shot NER datasets containing adversarial examples demonstrate the robustness and superiority of the proposed method.
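The Gaussian-based comparison described in the masked-language-model bias abstract above can be sketched in a few lines, assuming each PLL score set is summarised by a fitted Gaussian: the KL divergence between two Gaussians has a closed form, while JS has none and is estimated numerically here. The PLL arrays below are synthetic placeholders, not StereoSet or CrowS-Pairs scores, and this is an illustrative reading rather than the authors' released code.

```python
import numpy as np

def gaussian_kl(mu1, var1, mu2, var2):
    """Closed-form KL( N(mu1, var1) || N(mu2, var2) ), in nats."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def gaussian_js(mu1, var1, mu2, var2, num=20001):
    """JS divergence between two Gaussians, estimated on a fine grid."""
    sd1, sd2 = np.sqrt(var1), np.sqrt(var2)
    x = np.linspace(min(mu1 - 8 * sd1, mu2 - 8 * sd2),
                    max(mu1 + 8 * sd1, mu2 + 8 * sd2), num)
    p = np.exp(-(x - mu1) ** 2 / (2 * var1)) / (sd1 * np.sqrt(2 * np.pi))
    q = np.exp(-(x - mu2) ** 2 / (2 * var2)) / (sd2 * np.sqrt(2 * np.pi))
    p, q = np.clip(p, 1e-300, None), np.clip(q, 1e-300, None)
    m = 0.5 * (p + q)
    dx = x[1] - x[0]
    return 0.5 * dx * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

# Synthetic pseudo-log-likelihood scores for stereotypical vs. anti-stereotypical samples.
rng = np.random.default_rng(0)
stereo_pll = rng.normal(-42.0, 6.0, 300)
anti_pll = rng.normal(-45.0, 7.0, 300)
m1, v1, m2, v2 = stereo_pll.mean(), stereo_pll.var(), anti_pll.mean(), anti_pll.var()
print(gaussian_kl(m1, v1, m2, v2), gaussian_js(m1, v1, m2, v2))
```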
\ No newline at end of file diff --git a/data/2024/aaai/Robust Loss Functions for Training Decision Trees with Noisy Labels b/data/2024/aaai/Robust Loss Functions for Training Decision Trees with Noisy Labels new file mode 100644 index 0000000000..b3a50bdfe0 --- /dev/null +++ b/data/2024/aaai/Robust Loss Functions for Training Decision Trees with Noisy Labels @@ -0,0 +1 @@ +We consider training decision trees using noisily labeled data, focusing on loss functions that can lead to robust learning algorithms. Our contributions are threefold. First, we offer novel theoretical insights on the robustness of many existing loss functions in the context of decision tree learning. We show that some of the losses belong to a class of what we call conservative losses, and the conservative losses lead to an early stopping behavior during training and noise-tolerant predictions during testing. Second, we introduce a framework for constructing robust loss functions, called distribution losses. These losses apply percentile-based penalties based on an assumed margin distribution, and they naturally allow adapting to different noise rates via a robustness parameter. In particular, we introduce a new loss called the negative exponential loss, which leads to an efficient greedy impurity-reduction learning algorithm. Lastly, our experiments on multiple datasets and noise settings validate our theoretical insight and the effectiveness of our adaptive negative exponential loss. \ No newline at end of file diff --git a/data/2024/aaai/Robust Node Classification on Graph Data with Graph and Label Noise b/data/2024/aaai/Robust Node Classification on Graph Data with Graph and Label Noise new file mode 100644 index 0000000000..7f58c3e87a --- /dev/null +++ b/data/2024/aaai/Robust Node Classification on Graph Data with Graph and Label Noise @@ -0,0 +1 @@ +Current research on node classification focuses on dealing with either graph noise or label noise, but few studies consider both of them. In this paper, we propose a new robust node classification method to simultaneously deal with graph noise and label noise. To do this, we design a graph contrastive loss to conduct local graph learning and employ self-attention to conduct global graph learning. They enable us to improve the expressiveness of node representation by using comprehensive information among nodes. We also utilize pseudo graphs and pseudo labels to deal with graph noise and label noise, respectively. Furthermore, we numerically validate the superiority of our method in terms of robust node classification compared with all competing methods. \ No newline at end of file diff --git a/data/2024/aaai/Robust Nonparametric Regression under Poisoning Attack b/data/2024/aaai/Robust Nonparametric Regression under Poisoning Attack new file mode 100644 index 0000000000..7f52f048ae --- /dev/null +++ b/data/2024/aaai/Robust Nonparametric Regression under Poisoning Attack @@ -0,0 +1 @@ +This paper studies robust nonparametric regression, in which an adversarial attacker can modify the values of up to q samples from a training dataset of size N. Our initial solution is an M-estimator based on Huber loss minimization. Compared with simple kernel regression, i.e., the Nadaraya-Watson estimator, this method can significantly weaken the impact of malicious samples on the regression performance. We provide the convergence rate as well as the corresponding minimax lower bound. The result shows that, with proper bandwidth selection, the supremum error is minimax optimal.
The L2 error is optimal with relatively small q, but is suboptimal with larger q. The reason is that this estimator is vulnerable if there are many attacked samples concentrated in a small region. To address this issue, we propose a correction method by projecting the initial estimate onto the space of Lipschitz functions. The final estimate is nearly minimax optimal for arbitrary q, up to a logarithmic factor. \ No newline at end of file diff --git a/data/2024/aaai/Robust Policy Learning via Offline Skill Diffusion b/data/2024/aaai/Robust Policy Learning via Offline Skill Diffusion new file mode 100644 index 0000000000..e224763d2d --- /dev/null +++ b/data/2024/aaai/Robust Policy Learning via Offline Skill Diffusion @@ -0,0 +1,2 @@ +Skill-based reinforcement learning (RL) approaches have shown considerable promise, especially in solving long-horizon tasks via hierarchical structures. These skills, learned task-agnostically from offline datasets, can accelerate the policy learning process for new tasks. Yet, the application of these skills in different domains remains restricted due to their inherent dependency on the datasets, which poses a challenge when attempting to learn a skill-based policy via RL for a target domain different from the datasets' domains. In this paper, we present a novel offline skill learning framework, DuSkill, which employs a guided Diffusion model to generate versatile skills extended from the limited skills in datasets, thereby enhancing the robustness of policy learning for tasks in different domains. Specifically, we devise a guided diffusion-based skill decoder in conjunction with the hierarchical encoding to disentangle the skill embedding space into two distinct representations, one for encapsulating domain-invariant behaviors and the other for delineating the factors that induce domain variations in the behaviors. Our DuSkill framework enhances the diversity of skills learned offline, thus accelerating the learning procedure of high-level policies for different domains. +Through experiments, we show that DuSkill outperforms other skill-based imitation learning and RL algorithms for several long-horizon tasks, demonstrating its benefits in few-shot imitation and online RL. \ No newline at end of file diff --git a/data/2024/aaai/Robust Stochastic Graph Generator for Counterfactual Explanations b/data/2024/aaai/Robust Stochastic Graph Generator for Counterfactual Explanations new file mode 100644 index 0000000000..75ae62ced3 --- /dev/null +++ b/data/2024/aaai/Robust Stochastic Graph Generator for Counterfactual Explanations @@ -0,0 +1 @@ +Counterfactual Explanation (CE) techniques have garnered attention as a means to provide insights to users engaging with AI systems. While extensively researched in domains such as medical imaging and autonomous vehicles, Graph Counterfactual Explanation (GCE) methods have been comparatively under-explored. GCEs generate a new graph similar to the original one, with a different outcome grounded on the underlying predictive model. Among these GCE techniques, those rooted in generative mechanisms have received relatively limited investigation despite demonstrating impressive accomplishments in other domains, such as artistic styles and natural language modelling. The preference for generative explainers stems from their capacity to generate counterfactual instances during inference, leveraging autonomously acquired perturbations of the input graph.
Motivated by the rationales above, our study introduces RSGG-CE, a novel Robust Stochastic Graph Generator for Counterfactual Explanations able to produce counterfactual examples from the learned latent space considering a partially ordered generation sequence. Furthermore, we undertake quantitative and qualitative analyses to compare RSGG-CE's performance against SoA generative explainers, highlighting its increased ability to engender plausible counterfactual candidates. \ No newline at end of file diff --git a/data/2024/aaai/Robust Test-Time Adaptation for Zero-Shot Prompt Tuning b/data/2024/aaai/Robust Test-Time Adaptation for Zero-Shot Prompt Tuning new file mode 100644 index 0000000000..78f1d76be0 --- /dev/null +++ b/data/2024/aaai/Robust Test-Time Adaptation for Zero-Shot Prompt Tuning @@ -0,0 +1 @@ +CLIP has demonstrated remarkable generalization across diverse downstream tasks. By aligning images and texts in a shared feature space, it enables zero-shot classification via hand-crafted prompts. However, recent studies have shown that hand-crafted prompts may be unsuitable in practical applications. Specifically, choosing an appropriate prompt for a given task requires accurate data and knowledge, which may not be obtainable in practical situations. An inappropriate prompt can result in poor performance. Moreover, if there is no training data, tuning prompts arbitrarily through unlabeled test data may lead to serious performance degradation relative to the given hand-crafted prompts. Our study reveals that the aforementioned problems are mainly due to the biases in testing data (Data Bias) and the pre-trained CLIP model (Model Bias). The Data Bias makes it challenging to choose an appropriate prompt, while Model Bias renders some predictions inaccurate and biased, which leads to error accumulation. To address these biases, we propose robust test-time Adaptation for zeroshot Prompt tuning (ADAPROMPT). Specifically, we ensemble multiple prompts to avoid the worst-case results and dynamically tune prompts to adapt to Data Bias during testing. Furthermore, we adopt a confidence-aware buffer to store balanced and confident unlabeled test data to tune prompts in order to overcome Model Bias. Our extensive experiments on several benchmarks demonstrate that ADAPROMPT alleviates model bias, adapts to data bias and mostly outperforms the state-of-the-art methods at a small time cost. Moreover, our experimental results reveal that ADAPROMPT hardly encounters any performance degradation on these datasets.
We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP, yielding robust prediction sets/intervals. Through comprehensive experiments, we show that MC-CP delivers significant improvements over comparable UQ methods, like MC dropout, RAPS and CQR, both in classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple. The MC-CP code and replication package are available at https://github.com/team-daniel/MC-CP. \ No newline at end of file diff --git a/data/2024/aaai/Robust Visual Imitation Learning with Inverse Dynamics Representations b/data/2024/aaai/Robust Visual Imitation Learning with Inverse Dynamics Representations new file mode 100644 index 0000000000..59a4fff7c8 --- /dev/null +++ b/data/2024/aaai/Robust Visual Imitation Learning with Inverse Dynamics Representations @@ -0,0 +1 @@ +Imitation learning (IL) has achieved considerable success in solving complex sequential decision-making problems. However, current IL methods mainly assume that the environment for learning policies is the same as the environment for collecting expert datasets. Therefore, these methods may fail to work when there are slight differences between the learning and expert environments, especially for challenging problems with high-dimensional image observations. Unfortunately, in real-world scenarios, it is rare to have the chance to collect expert trajectories precisely in the target learning environment. To address this challenge, we propose a novel robust imitation learning approach, where we develop an inverse dynamics state representation learning objective to align the expert environment and the learning environment. With the abstract state representation, we design an effective reward function, which thoroughly measures the similarity between behavior data and expert data not only element-wise, but also at the trajectory level. We conduct extensive experiments to evaluate the proposed approach under various visual perturbations and in diverse visual control tasks. Our approach can achieve near-expert performance in most environments, and significantly outperforms the state-of-the-art visual IL methods and robust IL methods. \ No newline at end of file diff --git a/data/2024/aaai/Robust Visual Recognition with Class-Imbalanced Open-World Noisy Data b/data/2024/aaai/Robust Visual Recognition with Class-Imbalanced Open-World Noisy Data new file mode 100644 index 0000000000..613279df81 --- /dev/null +++ b/data/2024/aaai/Robust Visual Recognition with Class-Imbalanced Open-World Noisy Data @@ -0,0 +1 @@ +Learning from open-world noisy data, where both closed-set and open-set noise co-exist in the dataset, is a realistic but underexplored setting. Only recently, several efforts have been initiated to tackle this problem. However, these works assume the classes are balanced when dealing with open-world noisy data. This assumption often violates the nature of real-world large-scale datasets, where the label distributions are generally long-tailed, i.e., class-imbalanced. In this paper, we study the problem of robust visual recognition with class-imbalanced open-world noisy data.
We propose a probabilistic graphical model-based approach, iMRF, which achieves label noise correction that is robust to class imbalance via an efficient iterative inference of a Markov Random Field (MRF) in each training mini-batch. Furthermore, we design an agreement-based thresholding strategy to adaptively collect clean samples from all classes, including corrected closed-set noisy samples, while rejecting open-set noisy samples. We also introduce a noise-aware balanced cross-entropy loss to explicitly eliminate the bias caused by class-imbalanced data. Extensive experiments on several benchmark datasets including synthetic and real-world noisy datasets demonstrate the superior performance and robustness of our method over existing methods. Our code is available at https://github.com/Na-Z/LIOND. \ No newline at end of file diff --git a/data/2024/aaai/Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach b/data/2024/aaai/Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach new file mode 100644 index 0000000000..2750135c27 --- /dev/null +++ b/data/2024/aaai/Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach @@ -0,0 +1,3 @@ +This paper studies bandit problems where an agent has access to offline data that might be utilized to improve the estimation of each arm’s reward distribution. A major obstacle in this setting is the existence of compound biases from the observational data. Ignoring these biases and blindly fitting a model with the biased data could even negatively affect the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract the causal bound for each arm that is robust towards compound biases from biased observational data. The derived bounds contain the +ground truth mean reward and can effectively guide the bandit agent to learn a nearly-optimal decision policy. We also conduct regret analysis in both contextual and non-contextual bandit settings and show that prior causal bounds could help +consistently reduce the asymptotic regret. \ No newline at end of file diff --git a/data/2024/aaai/Robustly Train Normalizing Flows via KL Divergence Regularization b/data/2024/aaai/Robustly Train Normalizing Flows via KL Divergence Regularization new file mode 100644 index 0000000000..3034ed69ae --- /dev/null +++ b/data/2024/aaai/Robustly Train Normalizing Flows via KL Divergence Regularization @@ -0,0 +1 @@ +In this paper, we find that the training of Normalizing Flows (NFs) is easily affected by outliers and by a small number (or high dimensionality) of training samples. To solve this problem, we propose a Kullback–Leibler (KL) divergence regularization on the Jacobian matrix of NFs. We prove that such regularization is equivalent to adding a set of samples whose covariance matrix is the identity matrix to the training set. Thus, it reduces the negative influence of the outliers and the small sample number on the estimation of the covariance matrix, simultaneously. Therefore, our regularization makes the training of NFs robust. Ultimately, we evaluate the performance of NFs on out-of-distribution (OoD) detection tasks. The excellent results obtained demonstrate the effectiveness of the proposed regularization term.
For example, with the help of the proposed regularization, the OoD detection score increases by up to 30% compared with the model trained without the regularization. \ No newline at end of file diff --git a/data/2024/aaai/Robustness Verification of Multi-Class Tree Ensembles b/data/2024/aaai/Robustness Verification of Multi-Class Tree Ensembles new file mode 100644 index 0000000000..055c177f78 --- /dev/null +++ b/data/2024/aaai/Robustness Verification of Multi-Class Tree Ensembles @@ -0,0 +1,3 @@ +Tree ensembles are one of the most widely used model classes. +However, these models are susceptible to adversarial examples, which are slightly perturbed examples that elicit a misprediction. +There has been significant research on designing approaches to verify the robustness of tree ensembles to such attacks. However, existing verification algorithms for tree ensembles are only able to analyze binary classifiers and hence address multiclass problems by reducing them to binary ones using a one-versus-other strategy. In this paper, we show that naively applying this strategy can yield incorrect results in certain situations. We address this shortcoming by proposing a novel approximate heuristic approach to verification for multiclass tree ensembles. Our approach is based on a novel generalization of the verification task, which we show emits other relevant verification queries. \ No newline at end of file diff --git a/data/2024/aaai/Robustness and Visual Explanation for Black Box Image, Video, and ECG Signal Classification with Reinforcement Learning b/data/2024/aaai/Robustness and Visual Explanation for Black Box Image, Video, and ECG Signal Classification with Reinforcement Learning new file mode 100644 index 0000000000..70cb3ccbc4 --- /dev/null +++ b/data/2024/aaai/Robustness and Visual Explanation for Black Box Image, Video, and ECG Signal Classification with Reinforcement Learning @@ -0,0 +1 @@ +We present a generic Reinforcement Learning (RL) framework optimized for crafting adversarial attacks on different model types spanning ECG signal analysis (1D), image classification (2D), and video classification (3D). The framework focuses on identifying sensitive regions and inducing misclassifications with minimal distortions and various distortion types. The novel RL method outperforms state-of-the-art methods for all three applications, proving its efficiency. Our RL approach produces superior localization masks, enhancing interpretability for image classification and ECG analysis models. For applications such as ECG analysis, our platform highlights critical ECG segments for clinicians while ensuring resilience against prevalent distortions. This comprehensive tool aims to bolster both resilience, through adversarial training, and transparency across varied applications and data types. \ No newline at end of file diff --git a/data/2024/aaai/Robustness-Guided Image Synthesis for Data-Free Quantization b/data/2024/aaai/Robustness-Guided Image Synthesis for Data-Free Quantization new file mode 100644 index 0000000000..b24b87ea05 --- /dev/null +++ b/data/2024/aaai/Robustness-Guided Image Synthesis for Data-Free Quantization @@ -0,0 +1 @@ +Quantization has emerged as a promising direction for model compression. Recently, data-free quantization has been widely studied as a promising method to avoid privacy concerns, which synthesizes images as an alternative to real training data. Existing methods use classification loss to ensure the reliability of the synthesized images.
Unfortunately, even if these images are well-classified by the pre-trained model, they still suffer from low semantics and homogenization issues. Intuitively, these low-semantic images are sensitive to perturbations, and the pre-trained model tends to have inconsistent output when the generator synthesizes an image with low semantics. To this end, we propose Robustness-Guided Image Synthesis (RIS), a simple but effective method to enrich the semantics of synthetic images and improve image diversity, further boosting the performance of data-free compression tasks. Concretely, we first introduce perturbations on the input and model weights, then define the inconsistency metrics at the feature and prediction levels before and after perturbations. On the basis of the inconsistency at these two levels, we design a robustness optimization objective to eliminate low-semantic images. Moreover, we also make our approach diversity-aware by forcing the generator to synthesize images with small correlations. With RIS, we achieve state-of-the-art performance for various settings on data-free quantization, and our method can be extended to other data-free compression tasks. \ No newline at end of file diff --git a/data/2024/aaai/Roll with the Punches: Expansion and Shrinkage of Soft Label Selection for Semi-supervised Fine-Grained Learning b/data/2024/aaai/Roll with the Punches: Expansion and Shrinkage of Soft Label Selection for Semi-supervised Fine-Grained Learning new file mode 100644 index 0000000000..5238a7ce2f --- /dev/null +++ b/data/2024/aaai/Roll with the Punches: Expansion and Shrinkage of Soft Label Selection for Semi-supervised Fine-Grained Learning @@ -0,0 +1 @@ +While semi-supervised learning (SSL) has yielded promising results, the more realistic SSL scenario remains to be explored, in which the unlabeled data exhibits extremely high recognition difficulty, e.g., fine-grained visual classification in the context of SSL (SS-FGVC). The increased recognition difficulty on fine-grained unlabeled data spells disaster for pseudo-labeling accuracy, resulting in poor performance of the SSL model. To tackle this challenge, we propose Soft Label Selection with Confidence-Aware Clustering based on Class Transition Tracking (SoC), which reconstructs the pseudo-label selection process by jointly optimizing an Expansion Objective and a Shrinkage Objective in a soft-label manner. The former objective encourages soft labels to absorb more candidate classes to ensure the attendance of the ground-truth class, while the latter encourages soft labels to reject more noisy classes, which is theoretically proved to be equivalent to entropy minimization. In comparisons with various state-of-the-art methods, our approach demonstrates its superior performance in SS-FGVC. Checkpoints and source code are available at https://github.com/NJUyued/SoC4SS-FGVC. \ No newline at end of file diff --git a/data/2024/aaai/Rolling-Unet: Revitalizing MLP's Ability to Efficiently Extract Long-Distance Dependencies for Medical Image Segmentation b/data/2024/aaai/Rolling-Unet: Revitalizing MLP's Ability to Efficiently Extract Long-Distance Dependencies for Medical Image Segmentation new file mode 100644 index 0000000000..f6850e7a81 --- /dev/null +++ b/data/2024/aaai/Rolling-Unet: Revitalizing MLP's Ability to Efficiently Extract Long-Distance Dependencies for Medical Image Segmentation @@ -0,0 +1 @@ +Medical image segmentation methods based on deep learning networks are mainly divided into CNN-based and Transformer-based approaches.
However, CNNs struggle to capture long-distance dependencies, while Transformers suffer from high computational complexity and poor local feature learning. To efficiently extract and fuse local features and long-range dependencies, this paper proposes Rolling-Unet, which is a CNN model combined with MLP. Specifically, we propose the core R-MLP module, which is responsible for learning the long-distance dependency in a single direction of the whole image. By controlling and combining R-MLP modules in different directions, OR-MLP and DOR-MLP modules are formed to capture long-distance dependencies in multiple directions. Further, the Lo2 block is proposed to encode both local context information and long-distance dependencies without excessive computational burden. The Lo2 block has the same parameter size and computational complexity as a 3×3 convolution. The experimental results on four public datasets show that Rolling-Unet achieves superior performance compared to the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Root Cause Explanation of Outliers under Noisy Mechanisms b/data/2024/aaai/Root Cause Explanation of Outliers under Noisy Mechanisms new file mode 100644 index 0000000000..40b67fd768 --- /dev/null +++ b/data/2024/aaai/Root Cause Explanation of Outliers under Noisy Mechanisms @@ -0,0 +1 @@ +Identifying root causes of anomalies in causal processes is vital across disciplines. Once identified, one can isolate the root causes and implement necessary measures to restore the normal operation. Causal processes are often modelled as graphs with entities being nodes and their paths/interconnections as edges. Existing work only considers the contribution of nodes in the generative process, and thus cannot attribute the outlier score to the edges of the mechanism if the anomaly occurs in the connections. In this paper, we consider both the individual edges and nodes of each mechanism when identifying the root causes. We introduce a noisy functional causal model for this purpose. Then, we employ Bayesian learning and inference methods to infer the noises of the nodes and edges. We then represent the functional form of a target outlier leaf as a function of the node and edge noises. Finally, we propose an efficient gradient-based attribution method to compute the anomaly attribution scores, which scales linearly with the number of nodes and edges. Experiments on simulated datasets and two real-world scenario datasets show better anomaly attribution performance of the proposed method compared to the baselines. Our method scales to larger graphs with more nodes and edges. \ No newline at end of file diff --git "a/data/2024/aaai/Runtime Analysis of the (\316\274 + 1) GA: Provable Speed-Ups from Strong Drift towards Diverse Populations" "b/data/2024/aaai/Runtime Analysis of the (\316\274 + 1) GA: Provable Speed-Ups from Strong Drift towards Diverse Populations" new file mode 100644 index 0000000000..bbba4cb3db --- /dev/null +++ "b/data/2024/aaai/Runtime Analysis of the (\316\274 + 1) GA: Provable Speed-Ups from Strong Drift towards Diverse Populations" @@ -0,0 +1,3 @@ +Most evolutionary algorithms used in practice heavily employ crossover. In contrast, the rigorous understanding of how crossover is beneficial is largely lagging behind. In this work, we make a considerable step forward by analyzing the population dynamics of the (µ+1) genetic algorithm when optimizing the Jump benchmark.
We observe (and prove via mathematical means) that once the population contains two different individuals on the local optimum, the diversity in the population increases in expectation. From this drift towards more diverse states, we show that a diversity suitable for crossover to be effective is reached quickly and, more importantly, then persists for a time that is at least exponential in the population size µ. This drastically improves over the previously best known guarantee, which is only quadratic in µ. + +Our new understanding of the population dynamics easily gives stronger performance guarantees. In particular, we derive that population sizes logarithmic in the problem size n suffice to gain an Ω(n)-factor runtime improvement from crossover (previous works achieved comparable bounds only with µ = Θ(n) or a non-standard mutation rate). \ No newline at end of file diff --git a/data/2024/aaai/Runtime Analysis of the SMS-EMOA for Many-Objective Optimization b/data/2024/aaai/Runtime Analysis of the SMS-EMOA for Many-Objective Optimization new file mode 100644 index 0000000000..11543b7e3b --- /dev/null +++ b/data/2024/aaai/Runtime Analysis of the SMS-EMOA for Many-Objective Optimization @@ -0,0 +1,5 @@ +The widely used multiobjective optimizer NSGA-II was recently proven to have considerable difficulties in many-objective optimization. In contrast, experimental results in the literature show a good performance of the SMS-EMOA, which can be seen as a steady-state NSGA-II that uses the hypervolume contribution instead of the crowding distance as the second selection criterion. + +This paper conducts the first rigorous runtime analysis of the SMS-EMOA for many-objective optimization. To this aim, we first propose a many-objective counterpart, the m-objective mOJZJ problem, of the bi-objective OJZJ benchmark, which is the first many-objective multimodal benchmark used in a mathematical runtime analysis. We prove that SMS-EMOA computes the full Pareto front of this benchmark in an expected number of O(M^2 n^k) iterations, where n denotes the problem size (length of the bit-string representation), k the gap size (a difficulty parameter of the problem), and M=(2n/m-2k+3)^(m/2) the size of the Pareto front. This result together with the existing negative result on the original NSGA-II shows that in principle, the general approach of the NSGA-II is suitable for many-objective optimization, but the crowding distance as tie-breaker has deficiencies. + +We obtain three additional insights on the SMS-EMOA. Different from a recent result for the bi-objective OJZJ benchmark, the stochastic population update often does not help for mOJZJ. It results in a 1/Θ(min(Mk^(1/2)/2^(k/2),1)) speed-up, which is Θ(1) for large m such as m>k. On the positive side, we prove that heavy-tailed mutation still results in a speed-up of order k^(0.5+k-β). Finally, we conduct the first runtime analyses of the SMS-EMOA on the bi-objective OneMinMax and LOTZ benchmarks and show that it has a performance comparable to the GSEMO and the NSGA-II. \ No newline at end of file diff --git a/data/2024/aaai/Runtime vs. Extracted Proof Size: An Exponential Gap for CDCL on QBFs b/data/2024/aaai/Runtime vs. Extracted Proof Size: An Exponential Gap for CDCL on QBFs new file mode 100644 index 0000000000..c772f72128 --- /dev/null +++ b/data/2024/aaai/Runtime vs. 
Extracted Proof Size: An Exponential Gap for CDCL on QBFs @@ -0,0 +1,3 @@ +Conflict-driven clause learning (CDCL) is the dominating algorithmic paradigm for SAT solving and hugely successful in practice. In its lifted version QCDCL, it is one of the main approaches for solving quantified Boolean formulas (QBF). + +In both SAT and QBF, proofs can be efficiently extracted from runs of (Q)CDCL solvers. While for CDCL, it is known that the proof size in the underlying proof system propositional resolution matches the CDCL runtime up to a polynomial factor, we show that in QBF there is an exponential gap between QCDCL runtime and the size of the extracted proofs in QBF resolution systems. We demonstrate that this is not just a gap between QCDCL runtime and the size of any QBF resolution proof, but even the extracted proofs are exponentially smaller for some instances. Hence searching for a small proof via QCDCL (even with non-deterministic decision policies) will provably incur an exponential overhead for some instances. \ No newline at end of file diff --git a/data/2024/aaai/S2CycleDiff: Spatial-Spectral-Bilateral Cycle-Diffusion Framework for Hyperspectral Image Super-resolution b/data/2024/aaai/S2CycleDiff: Spatial-Spectral-Bilateral Cycle-Diffusion Framework for Hyperspectral Image Super-resolution new file mode 100644 index 0000000000..8e44522d09 --- /dev/null +++ b/data/2024/aaai/S2CycleDiff: Spatial-Spectral-Bilateral Cycle-Diffusion Framework for Hyperspectral Image Super-resolution @@ -0,0 +1 @@ +Hyperspectral image super-resolution (HISR) is a technique that can break through the limitations of the imaging mechanism to obtain a hyperspectral image (HSI) with high spatial resolution. Although some progress has been achieved by existing methods, most of them directly learn the spatial-spectral joint mapping between the observed images and the target high-resolution HSI (HrHSI), failing to fully preserve the spectral distribution of the low-resolution HSI (LrHSI) and the spatial distribution of the high-resolution multispectral imagery (HrMSI). To this end, we propose a spatial-spectral-bilateral cycle-diffusion framework (S2CycleDiff) for HISR, which can step-wise generate the HrHSI with high spatial-spectral fidelity by learning the conditional distribution of the spatial and spectral super-resolution processes bilaterally. Specifically, a customized conditional cycle-diffusion framework is designed as the backbone to achieve the spatial-spectral-bilateral super-resolution by repeated refinement, wherein the spatial/spectral guided pyramid denoising (SGPD) module separately takes HrMSI and LrHSI as the guiding factors to achieve spatial detail injection and spectral correction. The outputs of the conditional cycle-diffusion framework are fed into a complementary fusion block to integrate the spatial and spectral details to generate the desired HrHSI. Experiments have been conducted on three widely used datasets to demonstrate the superiority of the proposed method over state-of-the-art HISR methods. The code is available at https://github.com/Jiahuiqu/S2CycleDiff.
\ No newline at end of file diff --git a/data/2024/aaai/S2WAT: Image Style Transfer via Hierarchical Vision Transformer Using Strips Window Attention b/data/2024/aaai/S2WAT: Image Style Transfer via Hierarchical Vision Transformer Using Strips Window Attention new file mode 100644 index 0000000000..6069dc624c --- /dev/null +++ b/data/2024/aaai/S2WAT: Image Style Transfer via Hierarchical Vision Transformer Using Strips Window Attention @@ -0,0 +1 @@ +Transformer's recent integration into style transfer leverages its proficiency in establishing long-range dependencies, albeit at the expense of attenuated local modeling. This paper introduces Strips Window Attention Transformer (S2WAT), a novel hierarchical vision transformer designed for style transfer. S2WAT employs attention computation in diverse window shapes to capture both short- and long-range dependencies. The merged dependencies utilize the "Attn Merge" strategy, which adaptively determines spatial weights based on their relevance to the target. Extensive experiments on representative datasets show the proposed method's effectiveness compared to state-of-the-art (SOTA) transformer-based and other approaches. The code and pre-trained models are available at https://github.com/AlienZhang1996/S2WAT. \ No newline at end of file diff --git a/data/2024/aaai/S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment b/data/2024/aaai/S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment new file mode 100644 index 0000000000..b496b31d22 --- /dev/null +++ b/data/2024/aaai/S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment @@ -0,0 +1 @@ +Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite the success, most traditional VLMs-based methods are restricted by the assumption of partial source supervision or ideal target vocabularies, which rarely satisfy the open-world scenario. In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. To address the new problem, we propose the Self Structural Semantic Alignment (S3A) framework, which extracts the structural semantic information from unlabeled data while simultaneously self-learning. Our S3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. Our CVPR algorithm includes iterative clustering on images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to discern confusing candidates, and realigning images and the vocabulary as structural semantic alignment. Finally, we propose to self-train the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that the S3A method substantially improves over existing VLMs-based approaches, achieving a more than 15% accuracy improvement over CLIP on average. Our codes, models, and prompts are publicly released at https://github.com/sheng-eatamath/S3A. 
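To make the cluster-and-vote step of the CVPR loop described above concrete, here is a minimal sketch under simplifying assumptions: it presumes precomputed, L2-normalized CLIP image and vocabulary text embeddings, and uses plain k-means plus majority voting. All names are illustrative; this is not the authors' released implementation (see their repository for that).

```python
# Minimal cluster-then-vote sketch (illustrative, not the S3A code).
# Assumes `image_feats` (N x D) and `vocab_feats` (V x D) are L2-normalized
# CLIP image / text embeddings computed elsewhere.
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_vote(image_feats, vocab_feats, n_clusters):
    """Group images and let each cluster vote for a candidate class from the vocabulary."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(image_feats)
    candidates = {}
    for c in range(n_clusters):
        members = image_feats[labels == c]
        if len(members) == 0:
            continue
        # Each member votes for its most similar vocabulary entry (cosine similarity).
        votes = (members @ vocab_feats.T).argmax(axis=1)
        candidates[c] = int(np.bincount(votes).argmax())  # majority-voted class index
    return labels, candidates
```

In the full framework these voted candidates would then be refined with LLM-generated prompts and realigned with the images, which the sketch does not attempt to reproduce.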
\ No newline at end of file diff --git a/data/2024/aaai/SALSA: Semantically-Aware Latent Space Autoencoder b/data/2024/aaai/SALSA: Semantically-Aware Latent Space Autoencoder new file mode 100644 index 0000000000..18f9dc75b6 --- /dev/null +++ b/data/2024/aaai/SALSA: Semantically-Aware Latent Space Autoencoder @@ -0,0 +1 @@ +In deep learning for drug discovery, molecular representations are often based on sequences, known as SMILES, which allow for straightforward implementation of natural language processing methodologies, one being the sequence-to-sequence autoencoder. However, we observe that training an autoencoder solely on SMILES is insufficient to learn molecular representations that are semantically meaningful, where semantics are specified by the structural (graph-to-graph) similarities between molecules. We demonstrate by example that SMILES-based autoencoders may map structurally similar molecules to distant codes, resulting in an incoherent latent space that does not necessarily respect the semantic similarities between molecules. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA) for molecular representations: a SMILES-based transformer autoencoder modified with a contrastive task aimed at learning graph-to-graph similarities between molecules. To accomplish this, we develop a novel dataset comprising sets of structurally similar molecules and opt for a supervised contrastive loss that is able to incorporate full sets of positive samples. We evaluate the semantic awareness of SALSA representations by comparing to its ablated counterparts, and show empirically that SALSA learns representations that maintain 1) structural awareness, 2) physicochemical awareness, 3) biological awareness, and 4) semantic continuity. \ No newline at end of file diff --git a/data/2024/aaai/SAM-PARSER: Fine-Tuning SAM Efficiently by Parameter Space Reconstruction b/data/2024/aaai/SAM-PARSER: Fine-Tuning SAM Efficiently by Parameter Space Reconstruction new file mode 100644 index 0000000000..cadfb8d365 --- /dev/null +++ b/data/2024/aaai/SAM-PARSER: Fine-Tuning SAM Efficiently by Parameter Space Reconstruction @@ -0,0 +1 @@ +The Segment Anything Model (SAM) has received remarkable attention as it offers a powerful and versatile solution for object segmentation in images. However, fine-tuning SAM for downstream segmentation tasks under different scenarios remains a challenge, as the varied characteristics of different scenarios naturally require diverse model parameter spaces. Most existing fine-tuning methods attempt to bridge the gaps among different scenarios by introducing a set of new parameters to modify SAM's original parameter space. Unlike these works, in this paper, we propose fine-tuning SAM efficiently by parameter space reconstruction (SAM-PARSER), which introduces nearly zero trainable parameters during fine-tuning. In SAM-PARSER, we assume that SAM's original parameter space is relatively complete, so that its bases are able to reconstruct the parameter space of a new scenario. We obtain the bases by matrix decomposition and fine-tune the coefficients to reconstruct the parameter space tailored to the new scenario through an optimal linear combination of the bases. Experimental results show that SAM-PARSER exhibits superior segmentation performance across various scenarios, while reducing the number of trainable parameters by approximately 290 times compared with current parameter-efficient fine-tuning methods.
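As a rough illustration of the decompose-and-recombine idea described above (obtain bases by matrix decomposition, then fine-tune only the combination coefficients), the following is a hedged PyTorch sketch on a single linear layer. The class name and the choice of SVD as the decomposition are assumptions made for illustration, not the paper's exact procedure.

```python
# Illustrative sketch only (not SAM-PARSER's released code): factor a frozen weight
# matrix with SVD, keep the bases U and Vh fixed, and fine-tune only the coefficients
# that recombine them, so almost no new trainable parameters are introduced.
import torch
import torch.nn as nn

class CoefficientTunedLinear(nn.Module):
    def __init__(self, weight, bias=None):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)           # frozen left basis
        self.register_buffer("Vh", Vh)         # frozen right basis
        self.coeff = nn.Parameter(S.clone())   # the only trainable parameters
        self.register_buffer("bias", bias if bias is not None else torch.zeros(weight.shape[0]))

    def forward(self, x):
        # Reconstruct the weight as a linear combination of the fixed bases.
        w = self.U @ torch.diag(self.coeff) @ self.Vh
        return x @ w.T + self.bias

# Usage: wrap an existing layer's weight and train only `coeff`.
layer = CoefficientTunedLinear(torch.randn(256, 768))
out = layer(torch.randn(4, 768))               # -> shape (4, 256)
```

The number of trainable values here equals the number of singular values, which is how a decomposition-based reconstruction can stay close to "nearly zero" added parameters.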
\ No newline at end of file diff --git a/data/2024/aaai/SAME: Sample Reconstruction against Model Extraction Attacks b/data/2024/aaai/SAME: Sample Reconstruction against Model Extraction Attacks new file mode 100644 index 0000000000..c09b64bcab --- /dev/null +++ b/data/2024/aaai/SAME: Sample Reconstruction against Model Extraction Attacks @@ -0,0 +1 @@ +While deep learning models have shown significant performance across various domains, their deployment needs extensive resources and advanced computing infrastructure. As a solution, Machine Learning as a Service (MLaaS) has emerged, lowering the barriers for users to release or productize their deep learning models. However, previous studies have highlighted potential privacy and security concerns associated with MLaaS, and one primary threat is model extraction attacks. To address this, many defense solutions have been proposed, but they suffer from unrealistic assumptions and generalization issues, making them less practical for reliable protection. Driven by these limitations, we introduce a novel defense mechanism, SAME, based on the concept of sample reconstruction. This strategy imposes minimal prerequisites on the defender's capabilities, eliminating the need for auxiliary Out-of-Distribution (OOD) datasets, user query history, white-box model access, and additional intervention during model training. It is compatible with existing active defense methods. Our extensive experiments corroborate the superior efficacy of SAME over state-of-the-art solutions. Our code is available at https://github.com/xythink/SAME. \ No newline at end of file diff --git a/data/2024/aaai/SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model b/data/2024/aaai/SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model new file mode 100644 index 0000000000..52699cb8ce --- /dev/null +++ b/data/2024/aaai/SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model @@ -0,0 +1 @@ +Optical Flow Estimation aims to find the 2D dense motion field between two frames. Due to the limitations of model structures and training datasets, existing methods often rely too much on local clues and ignore the integrity of objects, resulting in fragmented motion estimation. Through theoretical analysis, we find that pre-trained large vision models are helpful in optical flow estimation, and we notice that the recent Segment Anything Model (SAM) demonstrates a strong ability to segment complete objects, which is suitable for solving the fragmentation problem. We thus propose a solution to embed the frozen SAM image encoder into FlowFormer to enhance object perception. To address the challenge of deeply utilizing SAM in non-segmentation tasks like optical flow estimation, we propose an Optical Flow Task-Specific Adaption scheme, including a Context Fusion Module to fuse the SAM encoder with the optical flow context encoder, and a Context Adaption Module to adapt the SAM features for the optical flow task with a Learned Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10 clean/final EPE and 3.55/12.32 EPE/F1-all on the Sintel and KITTI-15 training sets, surpassing FlowFormer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks, ranking #1 among all two-frame methods on the Sintel clean pass.
\ No newline at end of file diff --git a/data/2024/aaai/SAT-Based Algorithms for Regular Graph Pattern Matching b/data/2024/aaai/SAT-Based Algorithms for Regular Graph Pattern Matching new file mode 100644 index 0000000000..432761a817 --- /dev/null +++ b/data/2024/aaai/SAT-Based Algorithms for Regular Graph Pattern Matching @@ -0,0 +1,3 @@ +Graph matching is a fundamental problem in pattern recognition, with many applications such as software analysis and computational biology. One well-known type of graph matching problem is graph isomorphism, which consists of deciding if two graphs are identical. Despite its usefulness, the properties that one may check using graph isomorphism are rather limited, since it only allows strict equality checks between two graphs. For example, it does not allow one to check complex structural properties such as if the target graph is an arbitrary length sequence followed by an arbitrary size loop. + +We propose a generalization of graph isomorphism that allows one to check such properties through a declarative specification. This specification is given in the form of a Regular Graph Pattern (ReGaP), a special type of graph, inspired by regular expressions, that may contain wildcard nodes that represent arbitrary structures such as variable-sized sequences or subgraphs. We propose a SAT-based algorithm for checking if a target graph matches a given ReGaP. We also propose a preprocessing technique for improving the performance of the algorithm and evaluate it through an extensive experimental evaluation on benchmarks from the CodeSearchNet dataset. \ No newline at end of file diff --git a/data/2024/aaai/SAT-Based Techniques for Lexicographically Smallest Finite Models b/data/2024/aaai/SAT-Based Techniques for Lexicographically Smallest Finite Models new file mode 100644 index 0000000000..f3d1712738 --- /dev/null +++ b/data/2024/aaai/SAT-Based Techniques for Lexicographically Smallest Finite Models @@ -0,0 +1,3 @@ +This paper proposes SAT-based techniques to calculate a specific normal form of a given finite mathematical structure (model). The normal form is obtained by permuting the domain elements so that the representation of the structure is lexicographically smallest possible. Such a normal form is of interest to mathematicians as it enables easy cataloging of algebraic structures. In particular, two structures are isomorphic precisely when their normal forms are the same. This form is also natural to inspect as mathematicians have been using it routinely for many decades. + +We develop a novel approach where a SAT solver is used in a black-box fashion to compute the smallest representative. The approach constructs the representative gradually and searches the space of possible isomorphisms, requiring a small number of variables. However, the approach may lead to a large number of SAT calls and therefore we devise propagation techniques to reduce this number. The paper focuses on finite structures with a single binary operation (encompassing groups, semigroups, etc.). However, the approach is generalizable to arbitrary finite structures. We provide an implementation of the proposed algorithm and evaluate it on a variety of algebraic structures. 
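To make the notion of a lexicographically smallest representative concrete, the following is a naive brute-force sketch for a single binary operation on a tiny domain. It enumerates all permutations, which is exactly the blow-up that the paper's SAT-based search avoids; the function name and the small semigroup example are illustrative only.

```python
# Naive illustration of the "lexicographically smallest" normal form of a finite
# structure with one binary operation: try every permutation of the domain and keep
# the smallest flattened operation table. Feasible only for tiny domains; the paper
# replaces this exhaustive search with carefully guided SAT calls.
from itertools import permutations

def smallest_representative(table):
    """table[i][j] is the result of i * j over the domain {0, ..., n-1}."""
    n = len(table)
    best = None
    for perm in permutations(range(n)):
        inv = [0] * n
        for new, old in enumerate(perm):
            inv[old] = new
        # Relabel the operation table under the permutation old -> inv[old].
        relabeled = tuple(inv[table[perm[i]][perm[j]]] for i in range(n) for j in range(n))
        if best is None or relabeled < best:
            best = relabeled
    return best

# Two isomorphic 2-element semigroups (logical AND and OR tables) share one normal form.
assert smallest_representative([[0, 0], [0, 1]]) == smallest_representative([[0, 1], [1, 1]])
```

Because two structures are isomorphic exactly when their normal forms coincide, this form doubles as an isomorphism test, which is what makes it useful for cataloging.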
\ No newline at end of file diff --git a/data/2024/aaai/SAT-Based Tree Decomposition with Iterative Cascading Policy Selection b/data/2024/aaai/SAT-Based Tree Decomposition with Iterative Cascading Policy Selection new file mode 100644 index 0000000000..b8363f5839 --- /dev/null +++ b/data/2024/aaai/SAT-Based Tree Decomposition with Iterative Cascading Policy Selection @@ -0,0 +1,5 @@ +Solvers for propositional satisfiability (SAT) effectively tackle hard optimization problems. However, translating to SAT can cause a significant size increase, restricting its use to smaller instances. To mitigate this, frameworks using multiple local SAT calls for gradually improving a heuristic solution have been proposed. The performance of such algorithmic frameworks heavily relies on critical parameters, including the size of selected local instances and the time allocated per SAT call. + +This paper examines the automated configuration of the treewidth SAT-based local improvement method (TW-SLIM) framework, which uses multiple SAT calls for computing tree decompositions of small width, a fundamental problem in combinatorial optimization. We explore various TW-SLIM configuration methods, including offline learning and real-time adjustments, significantly outperforming default settings in multi-SAT scenarios with changing problems. + +Building upon insights gained from offline training and real-time configurations for TW-SLIM, we propose the iterative cascading policy, a novel hybrid technique that uniquely combines both. The iterative cascading policy employs a pool of 30 configurations obtained through clustering-based offline methods, deploying them in dynamic cascades across multiple rounds. In each round, the 30 configurations are tested according to the cascading ordering, and the best tree decomposition is retained for further improvement, with the option to adjust the following ordering of cascades. This iterative approach significantly enhances the performance of TW-SLIM beyond baseline results, even within varying global timeouts. This highlights the effectiveness of the proposed iterative cascading policy in enhancing the efficiency and efficacy of complex algorithmic frameworks like TW-SLIM. \ No newline at end of file diff --git a/data/2024/aaai/SAUI: Scale-Aware Unseen Imagineer for Zero-Shot Object Detection b/data/2024/aaai/SAUI: Scale-Aware Unseen Imagineer for Zero-Shot Object Detection new file mode 100644 index 0000000000..6ee8211579 --- /dev/null +++ b/data/2024/aaai/SAUI: Scale-Aware Unseen Imagineer for Zero-Shot Object Detection @@ -0,0 +1 @@ +Zero-shot object detection (ZSD) aims to localize and classify unseen objects without access to their training annotations. As a prevailing solution to ZSD, generation-based methods synthesize unseen visual features by taking seen features as references and class semantic embeddings as guidelines. Although previous works continuously improve the synthesis quality, they fail to consider the scale-varying nature of unseen objects. The generation process is performed over a single scale of object features and thus lacks scale-diversity among synthesized features. In this paper, we reveal the scale-varying challenge in ZSD and propose a Scale-Aware Unseen Imagineer (SAUI) to lead the way of a novel scale-aware ZSD paradigm. To obtain multi-scale features of seen-class objects, we design a specialized coarse-to-fine extractor to capture features through multiple scale-views.
To generate unseen features scale by scale, we introduce a Series-GAN synthesizer along with three scale-aware contrastive components to imagine separable, diverse and robust scale-wise unseen features. Extensive experiments on the PASCAL VOC, COCO and DIOR datasets demonstrate SAUI's better performance in different scenarios, especially for scale-varying and small objects. Notably, SAUI achieves new state-of-the-art performance on COCO and DIOR. \ No newline at end of file diff --git a/data/2024/aaai/SAVSR: Arbitrary-Scale Video Super-Resolution via a Learned Scale-Adaptive Network b/data/2024/aaai/SAVSR: Arbitrary-Scale Video Super-Resolution via a Learned Scale-Adaptive Network new file mode 100644 index 0000000000..0202be7743 --- /dev/null +++ b/data/2024/aaai/SAVSR: Arbitrary-Scale Video Super-Resolution via a Learned Scale-Adaptive Network @@ -0,0 +1 @@ +Deep learning-based video super-resolution (VSR) networks have gained significant performance improvements in recent years. However, existing VSR networks can only support a fixed integer-scale super-resolution task, and when we want to perform VSR at multiple scales, we need to train several models. This implementation certainly increases the consumption of computational and storage resources, which limits the application scenarios of VSR techniques. In this paper, we propose a novel Scale-adaptive Arbitrary-scale Video Super-Resolution network (SAVSR), which is the first work focusing on spatial VSR at arbitrary scales including both non-integer and asymmetric scales. We also present an omni-dimensional scale-attention convolution, which dynamically adapts according to the scale of the input to extract inter-frame features with stronger representational power. Moreover, the proposed spatio-temporal adaptive arbitrary-scale upsampling performs VSR tasks using both temporal features and scale information. In addition, we design an iterative bi-directional architecture for implicit feature alignment. Experiments at various scales on the benchmark datasets show that the proposed SAVSR outperforms state-of-the-art (SOTA) methods at non-integer and asymmetric scales. The source code is available at https://github.com/Weepingchestnut/SAVSR. \ No newline at end of file diff --git "a/data/2024/aaai/SA\302\262VP: Spatially Aligned-and-Adapted Visual Prompt" "b/data/2024/aaai/SA\302\262VP: Spatially Aligned-and-Adapted Visual Prompt" new file mode 100644 index 0000000000..e9e52a659f --- /dev/null +++ "b/data/2024/aaai/SA\302\262VP: Spatially Aligned-and-Adapted Visual Prompt" @@ -0,0 +1 @@ +As a prominent parameter-efficient fine-tuning technique in NLP, prompt tuning is now being explored for its potential in computer vision. Typical methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP, which represents an input image as a flattened sequence of token embeddings and then learns a set of unordered parameterized tokens prefixed to the sequence representation as the visual prompts for task adaptation of large vision models. While such a sequential modeling paradigm of visual prompts has shown great promise, there are two potential limitations. First, the learned visual prompts cannot model the underlying spatial relations in the input image, which is crucial for image encoding. Second, since all prompt tokens play the same role of prompting for all image tokens without distinction, it lacks the fine-grained prompting capability, i.e., individual prompting for different image tokens.
In this work, we propose the Spatially Aligned-and-Adapted Visual Prompt model (SA^2VP), which learns a two-dimensional prompt token map with a size equal (or proportionally scaled) to that of the image token map, thereby being able to spatially align with the image map. Each prompt token is designated to prompt knowledge only for the spatially corresponding image tokens. As a result, our model can conduct individual prompting for different image tokens in a fine-grained manner. Moreover, benefiting from the capability of preserving the spatial structure by the learned prompt token map, our SA^2VP is able to model the spatial relations in the input image, leading to more effective prompting. Extensive experiments on three challenging benchmarks for image classification demonstrate the superiority of our model over other state-of-the-art methods for visual prompt tuning. Code is available at https://github.com/tommy-xq/SA2VP. \ No newline at end of file diff --git a/data/2024/aaai/SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views b/data/2024/aaai/SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views new file mode 100644 index 0000000000..41389a34d1 --- /dev/null +++ b/data/2024/aaai/SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views @@ -0,0 +1 @@ +Recent neural surface reconstruction approaches using volume rendering have made much progress by achieving impressive surface reconstruction quality, but they are still limited to dense and highly accurate posed views. To overcome such drawbacks, this paper pays special attention to consistent surface reconstruction from sparse views with noisy camera poses. Unlike previous approaches, the key idea of this paper is to exploit the multi-view constraints directly from the explicit geometry of the neural surface, which can be used as effective regularization to jointly learn the neural surface and refine the camera poses. To build effective multi-view constraints, we introduce a fast differentiable on-surface intersection to generate on-surface points, and propose view-consistent losses on such differentiable points to regularize the neural surface learning. Based on this, we propose a joint learning strategy, named SC-NeuS, to perform geometry-consistent surface reconstruction in an end-to-end manner. With extensive evaluation on public datasets, our SC-NeuS achieves consistently better surface reconstruction results with fine-grained details than previous approaches, especially from sparse and noisy camera views. The source code is available at https://github.com/zouzx/sc-neus.git. \ No newline at end of file diff --git a/data/2024/aaai/SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition b/data/2024/aaai/SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition new file mode 100644 index 0000000000..22053342a7 --- /dev/null +++ b/data/2024/aaai/SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition @@ -0,0 +1 @@ +Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. Instead, this paper introduces a novel contrastive learning framework, namely the Spatiotemporal Clues Disentanglement Network (SCD-Net).
Specifically, we integrate the decoupling module with a feature extractor to derive explicit clues from spatial and temporal domains respectively. As for the training of SCD-Net, with a constructed global anchor, we encourage the interaction between the anchor and extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen the contextual associations, leveraging the latest development from masked image modelling into the proposed SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which outperforms the existing state-of-the-art (SOTA) approaches significantly. Our code and supplementary material can be found at https://github.com/cong-wu/SCD-Net. \ No newline at end of file diff --git a/data/2024/aaai/SCP: Spherical-Coordinate-Based Learned Point Cloud Compression b/data/2024/aaai/SCP: Spherical-Coordinate-Based Learned Point Cloud Compression new file mode 100644 index 0000000000..2e085ed20f --- /dev/null +++ b/data/2024/aaai/SCP: Spherical-Coordinate-Based Learned Point Cloud Compression @@ -0,0 +1 @@ +In recent years, the task of learned point cloud compression has gained prominence. An important type of point cloud, LiDAR point cloud, is generated by spinning LiDAR on vehicles. This process results in numerous circular shapes and azimuthal angle invariance features within the point clouds. However, these two features have been largely overlooked by previous methodologies. In this paper, we introduce a model-agnostic method called Spherical-Coordinate-based learned Point cloud compression (SCP), designed to fully leverage the features of circular shapes and azimuthal angle invariance. Additionally, we propose a multi-level Octree for SCP to mitigate the reconstruction error for distant areas within the Spherical-coordinate-based Octree. SCP exhibits excellent universality, making it applicable to various learned point cloud compression techniques. Experimental results demonstrate that SCP surpasses previous state-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate. \ No newline at end of file diff --git a/data/2024/aaai/SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation b/data/2024/aaai/SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation new file mode 100644 index 0000000000..48b0fb7363 --- /dev/null +++ b/data/2024/aaai/SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation @@ -0,0 +1 @@ +Recent real-time semantic segmentation methods usually adopt an additional semantic branch to pursue rich long-range context. However, the additional branch incurs undesirable computational overhead and slows inference speed. To eliminate this dilemma, we propose SCTNet, a single branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of lightweight single branch CNN. SCTNet utilizes a transformer as the training-only semantic branch considering its superb ability to extract long-range context. 
With the help of the proposed transformer-like CNN block CFBlock and the semantic information alignment module, SCTNet can capture rich semantic information from the transformer branch during training. During inference, only the single-branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves new state-of-the-art performance. The code and model are available at https://github.com/xzz777/SCTNet. \ No newline at end of file diff --git a/data/2024/aaai/SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM Optimization b/data/2024/aaai/SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM Optimization new file mode 100644 index 0000000000..2a53ba60ca --- /dev/null +++ b/data/2024/aaai/SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM Optimization @@ -0,0 +1 @@ +In this paper, we introduce Segmentation-Driven Deformation Multi-View Stereo (SD-MVS), a method that can effectively tackle challenges in the 3D reconstruction of textureless areas. We are the first to adopt the Segment Anything Model (SAM) to distinguish semantic instances in scenes and further leverage these constraints for pixelwise patch deformation on both matching cost and propagation. Concurrently, we propose a unique refinement strategy that combines spherical coordinates and gradient descent on normals and a pixelwise search interval on depths, significantly improving the completeness of the reconstructed 3D model. Furthermore, we adopt the Expectation-Maximization (EM) algorithm to alternately optimize the aggregate matching cost and hyperparameters, effectively mitigating the problem of parameters being excessively dependent on empirical tuning. Evaluations on the ETH3D high-resolution multi-view stereo benchmark and the Tanks and Temples dataset demonstrate that our method achieves state-of-the-art results with less time consumption. \ No newline at end of file diff --git a/data/2024/aaai/SDAC: A Multimodal Synthetic Dataset for Anomaly and Corner Case Detection in Autonomous Driving b/data/2024/aaai/SDAC: A Multimodal Synthetic Dataset for Anomaly and Corner Case Detection in Autonomous Driving new file mode 100644 index 0000000000..01d7246ccb --- /dev/null +++ b/data/2024/aaai/SDAC: A Multimodal Synthetic Dataset for Anomaly and Corner Case Detection in Autonomous Driving @@ -0,0 +1 @@ +Nowadays, closed-set perception methods for autonomous driving perform well on datasets containing normal scenes. However, they still struggle to handle anomalies in the real world, such as unknown objects that have never been seen during training. The lack of public datasets for evaluating model performance on anomalies and corner cases has hindered the development of reliable autonomous driving systems. Therefore, we propose a multimodal Synthetic Dataset for Anomaly and Corner case detection, called SDAC, which encompasses anomalies captured from multi-view cameras and the LiDAR sensor, providing a rich set of annotations for multiple mainstream perception tasks. SDAC is the first public dataset for autonomous driving that categorizes anomalies into object, scene, and scenario levels, allowing evaluation under different anomalous conditions. Experiments show that closed-set models suffer significant performance drops on anomaly subsets in SDAC.
Existing anomaly detection methods fail to achieve satisfactory performance, suggesting that anomaly detection remains a challenging problem. We anticipate that our SDAC dataset could foster the development of safe and reliable systems for autonomous driving. \ No newline at end of file diff --git a/data/2024/aaai/SDGAN: Disentangling Semantic Manipulation for Facial Attribute Editing b/data/2024/aaai/SDGAN: Disentangling Semantic Manipulation for Facial Attribute Editing new file mode 100644 index 0000000000..cd24a9e490 --- /dev/null +++ b/data/2024/aaai/SDGAN: Disentangling Semantic Manipulation for Facial Attribute Editing @@ -0,0 +1 @@ +Facial attribute editing has garnered significant attention, yet prevailing methods struggle with achieving precise attribute manipulation while preserving irrelevant details and controlling attribute styles. This challenge primarily arises from the strong correlations between different attributes and the interplay between attributes and identity. In this paper, we propose Semantic Disentangled GAN (SDGAN), a novel method addressing this challenge. SDGAN introduces two key concepts: a semantic disentanglement generator that assigns facial representations to distinct attribute-specific editing modules, enabling the decoupling of the facial attribute editing process, and a semantic mask alignment strategy that confines attribute editing to appropriate regions, thereby avoiding undesired modifications. Leveraging these concepts, SDGAN demonstrates accurate attribute editing and achieves high-quality attribute style manipulation through both latent-guided and reference-guided manners. We extensively evaluate our method on the CelebA-HQ database, providing both qualitative and quantitative analyses. Our results establish that SDGAN significantly outperforms state-of-the-art techniques, showcasing the effectiveness of our approach. To foster reproducibility and further research, we will provide the code for our method. \ No newline at end of file diff --git a/data/2024/aaai/SDGMNet: Statistic-Based Dynamic Gradient Modulation for Local Descriptor Learning b/data/2024/aaai/SDGMNet: Statistic-Based Dynamic Gradient Modulation for Local Descriptor Learning new file mode 100644 index 0000000000..b2e212ab0d --- /dev/null +++ b/data/2024/aaai/SDGMNet: Statistic-Based Dynamic Gradient Modulation for Local Descriptor Learning @@ -0,0 +1 @@ +Rescaling the backpropagated gradient of contrastive loss has made significant progress in descriptor learning. However, current gradient modulation strategies have no regard for the varying distribution of global gradients, so they would suffer from changes in training phases or datasets. In this paper, we propose a dynamic gradient modulation, named SDGMNet, for contrastive local descriptor learning. The core of our method is formulating modulation functions with dynamically estimated statistical characteristics. Firstly, we introduce angle for distance measure after deep analysis on backpropagation of pair-wise loss. On this basis, auto-focus modulation is employed to moderate the impact of statistically uncommon individual pairs in stochastic gradient descent optimization; probabilistic margin cuts off the gradients of proportional triplets that have achieved enough optimization; power adjustment balances the total weights of negative pairs and positive pairs. 
Extensive experiments demonstrate that our novel descriptor surpasses previous state-of-the-art methods in several tasks including patch verification, retrieval, pose estimation, and 3D reconstruction. \ No newline at end of file diff --git a/data/2024/aaai/SEA-GWNN: Simple and Effective Adaptive Graph Wavelet Neural Network b/data/2024/aaai/SEA-GWNN: Simple and Effective Adaptive Graph Wavelet Neural Network new file mode 100644 index 0000000000..6fd2452cc6 --- /dev/null +++ b/data/2024/aaai/SEA-GWNN: Simple and Effective Adaptive Graph Wavelet Neural Network @@ -0,0 +1 @@ +The utilization of wavelet-based techniques in graph neural networks (GNNs) has gained considerable attention, particularly in the context of node classification. Although existing wavelet-based approaches have shown promise, they are constrained by their reliance on pre-defined wavelet filters, rendering them incapable of effectively adapting to signals that reside on graphs based on the task at hand. Recent research endeavors address this issue through the introduction of a wavelet lifting transform. However, this technique necessitates the use of bipartite graphs, causing a transformation of the original graph structure into a bipartite configuration. This alteration of graph topology results in the generation of undesirable wavelet filters, thereby undermining the effectiveness of the method. In response to these challenges, we propose a novel, simple, and effective adaptive graph wavelet neural network (SEA-GWNN) class that employs the lifting scheme on arbitrary graph structures while upholding the original graph topology by leveraging multi-hop computation trees. A noteworthy aspect of the approach is the focus on local substructures represented as acyclic trees, wherein the lifting strategy is applied in a localized manner. This locally defined lifting scheme effectively combines high-pass and low-pass frequency information to enhance node representations. Furthermore, to reduce computing costs, we propose to decouple the higher-order lifting operators and induce them from the lower-order structures. Finally, we benchmark our model on several real-world datasets spanning four distinct categories, including citation networks, webpages, the film industry, and large-scale graphs, and the experimental results showcase the efficacy of the proposed SEA-GWNN. \ No newline at end of file diff --git a/data/2024/aaai/SEC: More Accurate Clustering Algorithm via Structural Entropy b/data/2024/aaai/SEC: More Accurate Clustering Algorithm via Structural Entropy new file mode 100644 index 0000000000..4fa9a31149 --- /dev/null +++ b/data/2024/aaai/SEC: More Accurate Clustering Algorithm via Structural Entropy @@ -0,0 +1 @@ +As one of the most popular machine learning tools in the field of unsupervised learning, clustering has been widely used in various practical applications. While numerous methods have been proposed for clustering, a commonly encountered issue is that the existing clustering methods rely heavily on local neighborhood information during the optimization process, which leads to suboptimal performance on real-world datasets. Besides, most existing clustering methods use Euclidean distances or densities to measure the similarity between data points. This could constrain the effectiveness of the algorithms for handling datasets with irregular patterns. Thus, a key challenge is how to effectively capture the global structural information in clustering instances to improve the clustering quality.
In this paper, we propose a new clustering algorithm, called SEC. This algorithm uses the global structural information extracted from an encoding tree to guide the clustering optimization process. Based on the relation between data points in the instance, a sparse graph of the clustering instance can be constructed. By leveraging the sparse graph constructed, we propose an iterative encoding tree method, where hierarchical abstractions of the encoding tree are iteratively extracted as new clustering features to obtain better clustering results. To avoid the influence of easily misclustered data points located on the boundaries of the clustering partitions, which we call "fringe points", we propose an iterative pre-deletion and reassignment technique such that the algorithm can delete and reassign the "fringe points" to obtain more resilient and precise clustering results. Empirical experiments on both synthetic and real-world datasets demonstrate that our proposed algorithm outperforms state-of-the-art clustering methods and achieves better clustering performances. On average, the clustering accuracy (ACC) is increased by 1.7% and the normalized mutual information (NMI) by 7.9% compared with the current state-of-the-art (SOTA) algorithm on synthetic datasets. On real-world datasets, our method outperforms other clustering methods with an average increase of 12.3% in ACC and 5.2% in NMI, respectively. \ No newline at end of file diff --git a/data/2024/aaai/SECap: Speech Emotion Captioning with Large Language Model b/data/2024/aaai/SECap: Speech Emotion Captioning with Large Language Model new file mode 100644 index 0000000000..aa1239724c --- /dev/null +++ b/data/2024/aaai/SECap: Speech Emotion Captioning with Large Language Model @@ -0,0 +1 @@ +Speech emotions are crucial in human communication and are extensively used in fields like speech synthesis and natural language understanding. Most prior studies, such as speech emotion recognition, have categorized speech emotions into a fixed set of classes. Yet, emotions expressed in human speech are often complex, and categorizing them into predefined groups can be insufficient to adequately represent speech emotions. On the contrary, describing speech emotions directly by means of natural language may be a more effective approach. Regrettably, there are not many studies available that have focused on this direction. Therefore, this paper proposes a speech emotion captioning framework named SECap, aiming at effectively describing speech emotions using natural language. Owing to the impressive capabilities of large language models in language comprehension and text generation, SECap employs LLaMA as the text decoder to allow the production of coherent speech emotion captions. In addition, SECap leverages HuBERT as the audio encoder to extract general speech features and Q-Former as the Bridge-Net to provide LLaMA with emotion-related speech features. To accomplish this, Q-Former utilizes mutual information learning to disentangle emotion-related speech features and speech contents, while implementing contrastive learning to extract more emotion-related speech features. The results of objective and subjective evaluations demonstrate that: 1) the SECap framework outperforms the HTSAT-BART baseline in all objective evaluations; 2) SECap can generate high-quality speech emotion captions that attain performance on par with human annotators in subjective mean opinion score tests. 
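As one plausible form of the contrastive component mentioned above, here is a generic InfoNCE-style loss that pulls each speech clip's emotion embedding toward its paired caption embedding. This is a standard recipe shown only for illustration; it makes no claim about SECap's actual objective, encoders, or hyperparameters, and the function and variable names are hypothetical.

```python
# Generic InfoNCE-style contrastive loss (illustrative only, not SECap's implementation).
import torch
import torch.nn.functional as F

def info_nce(speech_emb, caption_emb, temperature=0.07):
    """speech_emb, caption_emb: (B, D) paired embeddings; matched pairs are positives."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = speech_emb @ caption_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(speech_emb.size(0), device=speech_emb.device)
    # Symmetric cross-entropy: each speech clip should match its own caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```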
\ No newline at end of file diff --git a/data/2024/aaai/SEER: Backdoor Detection for Vision-Language Models through Searching Target Text and Image Trigger Jointly b/data/2024/aaai/SEER: Backdoor Detection for Vision-Language Models through Searching Target Text and Image Trigger Jointly new file mode 100644 index 0000000000..48ad6139b7 --- /dev/null +++ b/data/2024/aaai/SEER: Backdoor Detection for Vision-Language Models through Searching Target Text and Image Trigger Jointly @@ -0,0 +1 @@ +This paper proposes SEER, a novel backdoor detection algorithm for vision-language models, addressing the gap in the literature on multi-modal backdoor detection. While backdoor detection in single-modal models has been well studied, the investigation of such defenses in multi-modal models remains limited. Existing backdoor defense mechanisms cannot be directly applied to multi-modal settings due to their increased complexity and search space explosion. In this paper, we propose to detect backdoors in vision-language models by jointly searching for image triggers and malicious target texts in the feature space shared by the vision and language modalities. Our extensive experiments demonstrate that SEER can achieve a detection rate of over 92% on backdoor detection in vision-language models in various settings without accessing training data or knowledge of downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/SEIT: Structural Enhancement for Unsupervised Image Translation in Frequency Domain b/data/2024/aaai/SEIT: Structural Enhancement for Unsupervised Image Translation in Frequency Domain new file mode 100644 index 0000000000..0a281c9fed --- /dev/null +++ b/data/2024/aaai/SEIT: Structural Enhancement for Unsupervised Image Translation in Frequency Domain @@ -0,0 +1 @@ +For the task of unsupervised image translation, transforming the image style while preserving its original structure remains challenging. In this paper, we propose an unsupervised image translation method with structural enhancement in the frequency domain, named SEIT. Specifically, a frequency dynamic adaptive (FDA) module is designed for image style transformation that can effectively transfer the image style while maintaining the overall structure by decoupling the image content and style in the frequency domain. Moreover, a wavelet-based structure enhancement (WSE) module is proposed to improve the intermediate translation results by matching the high-frequency information, thus enriching the structural details. Furthermore, a multi-scale network architecture is designed to extract the domain-specific information using image-independent encoders for both the source and target domains. The extensive experimental results demonstrate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/SENCR: A Span Enhanced Two-Stage Network with Counterfactual Rethinking for Chinese NER b/data/2024/aaai/SENCR: A Span Enhanced Two-Stage Network with Counterfactual Rethinking for Chinese NER new file mode 100644 index 0000000000..508a9f163d --- /dev/null +++ b/data/2024/aaai/SENCR: A Span Enhanced Two-Stage Network with Counterfactual Rethinking for Chinese NER @@ -0,0 +1 @@ +Recently, many works that incorporate external lexicon information into character-level Chinese named entity recognition (NER) to overcome the lack of natural word delimiters have achieved strong performance. However, obtaining and maintaining high-quality lexicons is costly, especially in specialized domains.
In addition, the entity boundary bias caused by high mention coverage in some boundary characters poses a significant challenge to the generalization of NER models but receives little attention in the existing literature. To address these issues, we propose SENCR, a Span Enhanced Two-Stage Network with Counterfactual Rethinking for Chinese NER, which contains a boundary detector for boundary supervision, a convolution-based type classifier for better span representation, and a counterfactual rethinking (CR) strategy for debiased boundary detection during inference. The proposed boundary detector and type classifier are jointly trained with the same contextual encoder, and then the trained boundary detector is debiased by our proposed CR strategy without modifying any model parameters in the inference stage. Extensive experiments on four Chinese NER datasets show the effectiveness of our proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation b/data/2024/aaai/SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation new file mode 100644 index 0000000000..9ee7d43df9 --- /dev/null +++ b/data/2024/aaai/SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation @@ -0,0 +1 @@ +Image-level weakly supervised semantic segmentation has received increasing attention due to its low annotation cost. Existing methods mainly rely on Class Activation Mapping (CAM) to obtain pseudo-labels for training semantic segmentation models. In this work, we are the first to demonstrate that a long-tailed distribution in the training data can cause the CAM calculated through classifier weights to be over-activated for head classes and under-activated for tail classes, due to the shared features among head and tail classes. This degrades pseudo-label quality and further influences final semantic segmentation performance. To address this issue, we propose a Shared Feature Calibration (SFC) method for CAM generation. Specifically, we leverage the class prototypes, which carry positive shared features, and propose a Multi-Scaled Distribution-Weighted (MSDW) consistency loss for narrowing the gap between the CAMs generated through classifier weights and class prototypes during training. The MSDW loss counterbalances over-activation and under-activation by calibrating the shared features in head-/tail-class classifier weights. Experimental results show that our SFC significantly improves CAM boundaries and achieves new state-of-the-art performances. The project is available at https://github.com/Barrett-python/SFC. \ No newline at end of file diff --git a/data/2024/aaai/SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation b/data/2024/aaai/SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation new file mode 100644 index 0000000000..33988e88b1 --- /dev/null +++ b/data/2024/aaai/SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation @@ -0,0 +1 @@ +In this paper, we propose a novel model called SGFormer, a Semantic Graph TransFormer for point cloud-based 3D scene graph generation. The task aims to parse a point cloud-based scene into a semantic structural graph, with the core challenge of modeling the complex global structure. Existing methods based on graph convolutional networks (GCNs) suffer from the over-smoothing dilemma and can only propagate information from limited neighboring nodes.
In contrast, SGFormer uses Transformer layers as the base building block to allow global information passing, with two types of newly-designed layers tailored for the 3D scene graph generation task. Specifically, we introduce the graph embedding layer to best utilize the global information in graph edges while maintaining comparable computation costs. Furthermore, we propose the semantic injection layer to leverage linguistic knowledge from large-scale language model (i.e., ChatGPT), to enhance objects' visual features. We benchmark our SGFormer on the established 3DSSG dataset and achieve a 40.94% absolute improvement in relationship prediction's R@50 and an 88.36% boost on the subset with complex scenes over the state-of-the-art. Our analyses further show SGFormer's superiority in the long-tail and zero-shot scenarios. Our source code is available at https://github.com/Andy20178/SGFormer. \ No newline at end of file diff --git a/data/2024/aaai/SGNet: Structure Guided Network via Gradient-Frequency Awareness for Depth Map Super-resolution b/data/2024/aaai/SGNet: Structure Guided Network via Gradient-Frequency Awareness for Depth Map Super-resolution new file mode 100644 index 0000000000..f4726e3301 --- /dev/null +++ b/data/2024/aaai/SGNet: Structure Guided Network via Gradient-Frequency Awareness for Depth Map Super-resolution @@ -0,0 +1 @@ +Depth super-resolution (DSR) aims to restore high-resolution (HR) depth from low-resolution (LR) one, where RGB image is often used to promote this task. Recent image guided DSR approaches mainly focus on spatial domain to rebuild depth structure. However, since the structure of LR depth is usually blurry, only considering spatial domain is not very sufficient to acquire satisfactory results. In this paper, we propose structure guided network (SGNet), a method that pays more attention to gradient and frequency domains, both of which have the inherent ability to capture high-frequency structure. Specifically, we first introduce the gradient calibration module (GCM), which employs the accurate gradient prior of RGB to sharpen the LR depth structure. Then we present the Frequency Awareness Module (FAM) that recursively conducts multiple spectrum differencing blocks (SDB), each of which propagates the precise high-frequency components of RGB into the LR depth. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our SGNet, reaching the state-of-the-art (see Fig. 1). Codes and pre-trained models are available at https://github.com/yanzq95/SGNet. \ No newline at end of file diff --git a/data/2024/aaai/SHAP@k: Efficient and Probably Approximately Correct (PAC) Identification of Top-K Features b/data/2024/aaai/SHAP@k: Efficient and Probably Approximately Correct (PAC) Identification of Top-K Features new file mode 100644 index 0000000000..449021cae3 --- /dev/null +++ b/data/2024/aaai/SHAP@k: Efficient and Probably Approximately Correct (PAC) Identification of Top-K Features @@ -0,0 +1 @@ +The SHAP framework provides a principled method to explain the predictions of a model by computing feature importance. Motivated by applications in finance, we introduce the Top-k Identification Problem (TkIP) (and its ordered variant TkIP- O), where the objective is to identify the subset (or ordered subset for TkIP-O) of k features corresponding to the highest SHAP values with PAC guarantees. 
While any sampling-based method that estimates SHAP values (such as KernelSHAP and SamplingSHAP) can be trivially adapted to solve TkIP, doing so is highly sample inefficient. Instead, we leverage the connection between SHAP values and multi-armed bandits (MAB) to show that both TkIP and TkIP-O can be reduced to variants of problems in MAB literature. This reduction allows us to use insights from the MAB literature to develop sample-efficient variants of KernelSHAP and SamplingSHAP. We propose KernelSHAP@k and SamplingSHAP@k for solving TkIP; along with KernelSHAP-O and SamplingSHAP-O to solve the ordering problem in TkIP-O. We perform extensive experiments using several credit-related datasets to show that our methods offer significant improvements of up to 40× in sample efficiency and 39× in runtime. \ No newline at end of file diff --git a/data/2024/aaai/SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation b/data/2024/aaai/SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation new file mode 100644 index 0000000000..b5e26dfaee --- /dev/null +++ b/data/2024/aaai/SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation @@ -0,0 +1 @@ +High-resolution representation is essential for achieving good performance in human pose estimation models. To obtain such features, existing works utilize high-resolution input images or fine-grained image tokens. However, this dense high-resolution representation brings a significant computational burden. In this paper, we address the following question: "Only sparse human keypoint locations are detected for human pose estimation, is it really necessary to describe the whole image in a dense, high-resolution manner?" Based on dynamic transformer models, we propose a framework that only uses Sparse High-resolution Representations for human Pose estimation (SHaRPose). In detail, SHaRPose consists of two stages. At the coarse stage, the relations between image regions and keypoints are dynamically mined while a coarse estimation is generated. Then, a quality predictor is applied to decide whether the coarse estimation results should be refined. At the fine stage, SHaRPose builds sparse high-resolution representations only on the regions related to the keypoints and provides refined high-precision human pose estimations. Extensive experiments demonstrate the outstanding performance of the proposed method. Specifically, compared to the state-of-the-art method ViTPose, our model SHaRPose-Base achieves 77.4 AP (+0.5 AP) on the COCO validation set and 76.7 AP (+0.5 AP) on the COCO test-dev set, and infers at a speed of 1.4x faster than ViTPose-Base. Code is available at https://github.com/AnxQ/sharpose. \ No newline at end of file diff --git a/data/2024/aaai/SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking b/data/2024/aaai/SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking new file mode 100644 index 0000000000..4b7c23045d --- /dev/null +++ b/data/2024/aaai/SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking @@ -0,0 +1 @@ +Despite recent progress in Multiple Object Tracking (MOT), several obstacles such as occlusions, similar objects, and complex scenes remain an open challenge. Meanwhile, a systematic study of the cost-performance tradeoff for the popular tracking-by-detection paradigm is still lacking. 
This paper introduces SMILEtrack, an innovative object tracker that effectively addresses these challenges by integrating an efficient object detector with a Siamese network-based Similarity Learning Module (SLM). The technical contributions of SMILEtrack are twofold. First, we propose an SLM that calculates the appearance similarity between two objects, overcoming the limitations of feature descriptors in Separate Detection and Embedding (SDE) models. The SLM incorporates a Patch Self-Attention (PSA) block inspired by the vision Transformer, which generates reliable features for accurate similarity matching. Second, we develop a Similarity Matching Cascade (SMC) module with a novel GATE function for robust object matching across consecutive video frames, further enhancing MOT performance. Together, these innovations help SMILEtrack achieve an improved trade-off between cost (e.g., running speed) and performance (e.g., tracking accuracy) over several existing state-of-the-art trackers, including the popular BYTETrack method. SMILEtrack outperforms BYTETrack by 0.4-0.8 MOTA and 2.1-2.2 HOTA points on the MOT17 and MOT20 datasets. Code is available at http://github.com/pingyang1117/SMILEtrack_official. \ No newline at end of file diff --git a/data/2024/aaai/SNN-PDE: Learning Dynamic PDEs from Data with Simplicial Neural Networks b/data/2024/aaai/SNN-PDE: Learning Dynamic PDEs from Data with Simplicial Neural Networks new file mode 100644 index 0000000000..85650b07bd --- /dev/null +++ b/data/2024/aaai/SNN-PDE: Learning Dynamic PDEs from Data with Simplicial Neural Networks @@ -0,0 +1 @@ +Dynamics of many complex systems, from weather and climate to the spread of infectious diseases, can be described by partial differential equations (PDEs). Such PDEs involve unknown function(s), partial derivatives, and typically multiple independent variables. Traditional numerical methods for solving PDEs assume that the data are observed on a regular grid. However, in many applications, for example, weather and air pollution monitoring delivered by arbitrarily located weather stations of the National Weather Service, data records are irregularly spaced. Furthermore, in problems involving predictive analytics such as forecasting wildfire smoke plumes, the primary focus may be on a set of irregular locations associated with urban development. In recent years, deep learning (DL) methods and, in particular, graph neural networks (GNNs) have emerged as a promising new tool that can complement traditional PDE solvers in scenarios with irregularly spaced data, contributing to the growing research trend of physics-informed machine learning (PIML). However, most existing PIML methods tend to be limited in their ability to describe the higher-dimensional structural properties exhibited by real-world phenomena, especially ones that live on manifolds. To address this fundamental challenge, we bring elements of Hodge theory and, in particular, simplicial convolution defined on the Hodge Laplacian to the emerging nexus of DL and PDEs. In contrast to the conventional Laplacian and the associated convolution operation, simplicial convolution allows us to rigorously describe diffusion across higher-order structures and to better approximate the complex underlying topology and geometry of the data. The new approach, Simplicial Neural Networks for Partial Differential Equations (SNN-PDE), offers a computationally efficient yet effective solution for time-dependent PDEs.
Our studies of a broad range of synthetic data and wildfire processes demonstrate that SNN-PDE improves upon state-of-the-art baselines in handling unstructured grids and irregular time intervals of complex physical systems, and offers competitive forecasting capabilities for weather and air quality. \ No newline at end of file diff --git a/data/2024/aaai/SOCIALGYM 2.0: Simulator for Multi-Robot Learning and Navigation in Shared Human Spaces b/data/2024/aaai/SOCIALGYM 2.0: Simulator for Multi-Robot Learning and Navigation in Shared Human Spaces new file mode 100644 index 0000000000..1a896f4419 --- /dev/null +++ b/data/2024/aaai/SOCIALGYM 2.0: Simulator for Multi-Robot Learning and Navigation in Shared Human Spaces @@ -0,0 +1 @@ +We present Social Gym 2.0, a simulator for multi-agent navigation research. Our simulator enables navigation for multiple autonomous agents, replicating real-world dynamics in complex indoor environments, including doorways, hallways, intersections, and roundabouts. Unlike current simulators that concentrate on single robots in open spaces, Social Gym 2.0 employs multi-agent reinforcement learning (MARL) to develop optimal navigation policies for multiple robots with diverse, dynamic constraints in complex environments. Social Gym 2.0 also departs from accepted software design standards by employing a configuration-over-convention paradigm, providing the capability to benchmark different MARL algorithms as well as to customize observation and reward functions. Users can additionally create their own environments and evaluate various algorithms, based on both deep reinforcement learning and classical navigation, using a broad range of social navigation metrics. \ No newline at end of file diff --git a/data/2024/aaai/SOGDet: Semantic-Occupancy Guided Multi-View 3D Object Detection b/data/2024/aaai/SOGDet: Semantic-Occupancy Guided Multi-View 3D Object Detection new file mode 100644 index 0000000000..abf0504f5c --- /dev/null +++ b/data/2024/aaai/SOGDet: Semantic-Occupancy Guided Multi-View 3D Object Detection @@ -0,0 +1,10 @@ +In the field of autonomous driving, accurate and comprehensive perception of the 3D environment is crucial. +Bird's Eye View (BEV) based methods have emerged as a promising solution for 3D object detection using multi-view images as input. +However, existing 3D object detection methods often ignore the physical context in the environment, such as sidewalks and vegetation, resulting in sub-optimal performance. +In this paper, we propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection), which leverages a 3D semantic-occupancy branch to improve the accuracy of 3D object detection. +In particular, the physical context modeled by semantic occupancy helps the detector perceive scenes in a more holistic view. +Our SOGDet is flexible to use and can be seamlessly integrated with most existing BEV-based methods. +To evaluate its effectiveness, we apply this approach to several state-of-the-art baselines and conduct extensive experiments on the nuScenes dataset. +Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP). +This indicates that the combination of 3D object detection and 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby helping to build more robust autonomous driving systems.
+The code is available at: https://github.com/zhouqiu/SOGDet. \ No newline at end of file diff --git a/data/2024/aaai/SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space b/data/2024/aaai/SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space new file mode 100644 index 0000000000..d48d10cfa2 --- /dev/null +++ b/data/2024/aaai/SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space @@ -0,0 +1 @@ +Symmetric positive definite (SPD) matrices have shown important value and applications in statistics and machine learning, such as fMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on E(X|y), where y is a vector and X is an SPD matrix. However, these methods struggle to handle large-scale data. In this paper, inspired by the denoising diffusion probabilistic model (DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing a Gaussian distribution in the SPD space to estimate E(X|y). Moreover, our model can estimate p(X) unconditionally and flexibly without giving y. On the one hand, the model conditionally learns p(X|y) and utilizes the mean of samples to obtain E(X|y) as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data p(X) and generates samples that conform to this distribution. Furthermore, we propose a new SPD network that is much deeper than previous networks and allows for the inclusion of conditional factors. Experimental results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and conditionally. \ No newline at end of file diff --git a/data/2024/aaai/SPGroup3D: Superpoint Grouping Network for Indoor 3D Object Detection b/data/2024/aaai/SPGroup3D: Superpoint Grouping Network for Indoor 3D Object Detection new file mode 100644 index 0000000000..7987e39f13 --- /dev/null +++ b/data/2024/aaai/SPGroup3D: Superpoint Grouping Network for Indoor 3D Object Detection @@ -0,0 +1 @@ +Current 3D object detection methods for indoor scenes mainly follow the voting-and-grouping strategy to generate proposals. However, most methods utilize instance-agnostic groupings, such as ball query, leading to inconsistent semantic information and inaccurate regression of the proposals. To this end, we propose a novel superpoint grouping network for indoor anchor-free one-stage 3D object detection. Specifically, we first adopt an unsupervised approach to partition raw point clouds into superpoints, i.e., areas with semantic consistency and spatial similarity. Then, we design a geometry-aware voting module that adapts to the centerness in anchor-free detection by constraining the spatial relationship between superpoints and object centers. Next, we present a superpoint-based grouping module to explore the consistent representation within proposals. This module includes a superpoint attention layer to learn feature interaction between neighboring superpoints, and a superpoint-voxel fusion layer to propagate the superpoint-level information to the voxel level. Finally, we employ effective multiple matching to capitalize on the dynamic receptive fields of proposals based on superpoints during training. Experimental results demonstrate that our method achieves state-of-the-art performance on the ScanNet V2, SUN RGB-D, and S3DIS datasets for indoor one-stage 3D object detection.
Source code is available at https://github.com/zyrant/SPGroup3D. \ No newline at end of file diff --git a/data/2024/aaai/SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation b/data/2024/aaai/SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation new file mode 100644 index 0000000000..46e9578b07 --- /dev/null +++ b/data/2024/aaai/SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation @@ -0,0 +1 @@ +Recently, self-supervised monocular depth estimation has gained popularity with numerous applications in autonomous driving and robotics. However, existing solutions primarily seek to estimate depth from immediate visual features, and struggle to recover fine-grained scene details. In this paper, we introduce SQLdepth, a novel approach that can effectively learn fine-grained scene structure priors from ego-motion. In SQLdepth, we propose a novel Self Query Layer (SQL) to build a self-cost volume and infer depth from it, rather than inferring depth from feature maps. We show that, the self-cost volume is an effective inductive bias for geometry learning, which implicitly models the single-frame scene geometry, with each slice of it indicating a relative distance map between points and objects in a latent space. Experimental results on KITTI and Cityscapes show that our method attains remarkable state-of-the-art performance, and showcases computational efficiency, reduced training complexity, and the ability to recover fine-grained scene details. Moreover, the self-matching-oriented relative distance querying in SQL improves the robustness and zero-shot generalization capability of SQLdepth. Code is available at https://github.com/hisfog/SfMNeXt-Impl. \ No newline at end of file diff --git a/data/2024/aaai/SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression b/data/2024/aaai/SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression new file mode 100644 index 0000000000..379b36d9b2 --- /dev/null +++ b/data/2024/aaai/SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression @@ -0,0 +1 @@ +Existing techniques for text detection can be broadly classified into two primary groups: segmentation-based and regression-based methods. Segmentation models offer enhanced robustness to font variations but require intricate post-processing, leading to high computational overhead. Regression-based methods undertake instance-aware prediction but face limitations in robustness and data efficiency due to their reliance on high-level representations. In our academic pursuit, we propose SRFormer, a unified DETR-based model with amalgamated Segmentation and Regression, aiming at the synergistic harnessing of the inherent robustness in segmentation representations, along with the straightforward post-processing of instance-level regression. Our empirical analysis indicates that favorable segmentation predictions can be obtained at the initial decoder layers. In light of this, we constrain the incorporation of segmentation branches to the first few decoder layers and employ progressive regression refinement in subsequent layers, achieving performance gains while minimizing computational load from the mask. Furthermore, we propose a Mask-informed Query Enhancement module. We take the segmentation result as a natural soft-ROI to pool and extract robust pixel representations, which are then employed to enhance and diversify instance queries. 
Extensive experimentation across multiple benchmarks has yielded compelling findings, highlighting our method's exceptional robustness, superior training and data efficiency, as well as its state-of-the-art performance. Our code is available at https://github.com/retsuh-bqw/SRFormer-Text-Det. \ No newline at end of file diff --git a/data/2024/aaai/SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation b/data/2024/aaai/SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation new file mode 100644 index 0000000000..9b7e51c2d9 --- /dev/null +++ b/data/2024/aaai/SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation @@ -0,0 +1 @@ +Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, leading to insufficient spatial and semantic controllability of individual instances. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. Owing to rich spatial and semantic information encapsulated in well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former aims to model the relationships among multiple objects within scenes while the latter is designed to heighten the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability. \ No newline at end of file diff --git a/data/2024/aaai/STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering b/data/2024/aaai/STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering new file mode 100644 index 0000000000..39de3cd990 --- /dev/null +++ b/data/2024/aaai/STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering @@ -0,0 +1,4 @@ +Recently we have witnessed the rapid development of video question answering models. However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos. +To tackle this problem we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks, and a set of lightweight neural modules to complete each of these sub-tasks. +Though neural module networks are already widely studied on image-text tasks, applying them to videos is a non-trivial task, as reasoning on videos requires different abilities. 
In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. +Different from most prior works, modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes it easier to interpret and collaborate with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR \ No newline at end of file diff --git a/data/2024/aaai/STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models b/data/2024/aaai/STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models new file mode 100644 index 0000000000..8b72933ed7 --- /dev/null +++ b/data/2024/aaai/STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models @@ -0,0 +1 @@ +Information extraction tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies. They heavily rely on task-specific training data in the form of (passage, target structure) pairs to obtain reasonable performance. However, obtaining such data through human annotation is costly, leading to a pressing need for low-resource information extraction approaches that require minimal human labeling for real-world applications. Fine-tuning supervised models with synthesized training data would be a generalizable method, but the existing data generation methods either still rely on large-scale ground-truth data or cannot be applied to complicated IE tasks due to their poor performance. To address these challenges, we propose STAR, a data generation method that leverages Large Language Models (LLMs) to synthesize data instances given limited seed demonstrations, thereby boosting low-resource information extraction performance. Our approach involves generating target structures (Y) followed by generating passages (X), all accomplished with the aid of LLMs. We design fine-grained step-by-step instructions to obtain the initial data instances. We further reduce errors and improve data quality through self-reflection error identification and self-refinement with iterative revision. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data. Human assessment of the data quality shows STAR-generated data exhibit higher passage quality and better align with the task definitions compared with the human-curated data. 
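As an illustration of the structure-first generation order described in the STAR entry above, here is a minimal, hypothetical Python sketch: the call_llm helper, the prompt wording, and the single revision pass are assumptions made only for exposition, not the paper's actual pipeline or prompts.

import random

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    raise NotImplementedError

def generate_instance(seed_examples, event_types):
    """Sketch of structure-first (Y) then passage (X) generation, plus one revision pass."""
    # Step 1: synthesize a target structure Y, conditioned on seed demonstrations.
    y_prompt = (
        "Given these annotated examples:\n"
        + "\n".join(seed_examples)
        + f"\nPropose a new event record of type {random.choice(event_types)} as JSON."
    )
    structure = call_llm(y_prompt)
    # Step 2: synthesize a passage X that expresses exactly the structure Y.
    x_prompt = f"Write a short news passage that expresses this event record:\n{structure}"
    passage = call_llm(x_prompt)
    # Step 3: self-reflection, asking the model to flag mismatches and revise the passage.
    check_prompt = (
        f"Record:\n{structure}\nPassage:\n{passage}\n"
        "List any arguments missing or contradicted by the passage, then rewrite it."
    )
    passage = call_llm(check_prompt)
    return passage, structure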
\ No newline at end of file diff --git a/data/2024/aaai/STAS: Spatial-Temporal Return Decomposition for Solving Sparse Rewards Problems in Multi-agent Reinforcement Learning b/data/2024/aaai/STAS: Spatial-Temporal Return Decomposition for Solving Sparse Rewards Problems in Multi-agent Reinforcement Learning new file mode 100644 index 0000000000..cd42bbf85e --- /dev/null +++ b/data/2024/aaai/STAS: Spatial-Temporal Return Decomposition for Solving Sparse Rewards Problems in Multi-agent Reinforcement Learning @@ -0,0 +1 @@ +Centralized Training with Decentralized Execution (CTDE) has been proven to be an effective paradigm in cooperative multi-agent reinforcement learning (MARL). One of the major challenges is credit assignment, which aims to credit agents by their contributions. Existing methods lack the ability to model the complicated relations of the delayed global reward in the temporal dimension and suffer from inefficiencies. To tackle this, we introduce Spatial-Temporal Attention with Shapley (STAS), a novel method that learns credit assignment in both temporal and spatial dimensions. It first decomposes the global return back to each time step, then utilizes the Shapley Value to redistribute the individual payoff from the decomposed global reward. To mitigate the computational complexity of the Shapley Value, we introduce an approximation of the marginal contribution and utilize Monte Carlo sampling to estimate it. We evaluate our method on an Alice & Bob example and MPE environments across different scenarios. Our results demonstrate that our method effectively assigns spatial-temporal credit, outperforming all state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/STDiff: Spatio-Temporal Diffusion for Continuous Stochastic Video Prediction b/data/2024/aaai/STDiff: Spatio-Temporal Diffusion for Continuous Stochastic Video Prediction new file mode 100644 index 0000000000..35d4171c7b --- /dev/null +++ b/data/2024/aaai/STDiff: Spatio-Temporal Diffusion for Continuous Stochastic Video Prediction @@ -0,0 +1 @@ +Predicting future frames of a video is challenging because it is difficult to learn the uncertainty of the underlying factors influencing their contents. In this paper, we propose a novel video prediction model, which has infinite-dimensional latent variables over the spatio-temporal domain. Specifically, we first decompose the video motion and content information, then use a neural stochastic differential equation to predict the temporal motion information, and finally, an image diffusion model autoregressively generates the video frame by conditioning on the predicted motion feature and the previous frame. The better expressiveness and stronger stochasticity learning capability of our model lead to state-of-the-art video prediction performance. Moreover, our model is able to achieve temporally continuous prediction, i.e., predicting in an unsupervised way the future video frames with an arbitrarily high frame rate. Our code is available at https://github.com/XiYe20/STDiffProject.
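The STAS entry above relies on Monte Carlo sampling to approximate Shapley values. A generic permutation-sampling estimator of that kind is sketched below in Python; the coalition_value function is a hypothetical stand-in for whatever model supplies coalition returns, and this is not the paper's own approximation of the marginal contribution.

import random

def shapley_monte_carlo(agents, coalition_value, num_permutations=200):
    """Estimate each agent's Shapley value by averaging its marginal contribution
    over randomly sampled orderings of the agents."""
    estimates = {a: 0.0 for a in agents}
    for _ in range(num_permutations):
        order = random.sample(agents, len(agents))
        coalition = []
        prev_value = coalition_value(coalition)  # value of the empty coalition
        for agent in order:
            coalition.append(agent)
            value = coalition_value(coalition)
            estimates[agent] += value - prev_value  # marginal contribution of this agent
            prev_value = value
    return {a: v / num_permutations for a, v in estimates.items()}

# Toy usage: three agents whose joint value is simply the sum of fixed contributions.
true_contrib = {"a": 1.0, "b": 2.0, "c": 0.5}
print(shapley_monte_carlo(list(true_contrib),
                          lambda c: sum(true_contrib[a] for a in c)))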
\ No newline at end of file diff --git a/data/2024/aaai/STEM: Unleashing the Power of Embeddings for Multi-Task Recommendation b/data/2024/aaai/STEM: Unleashing the Power of Embeddings for Multi-Task Recommendation new file mode 100644 index 0000000000..1cf887d1aa --- /dev/null +++ b/data/2024/aaai/STEM: Unleashing the Power of Embeddings for Multi-Task Recommendation @@ -0,0 +1 @@ +Multi-task learning (MTL) has gained significant popularity in recommender systems as it enables simultaneous optimization of multiple objectives. A key challenge in MTL is negative transfer, but existing studies explored negative transfer on all samples, overlooking the inherent complexities within them. We split the samples according to the relative amount of positive feedback among tasks. Surprisingly, negative transfer still occurs in existing MTL methods on samples that receive comparable feedback across tasks. Existing work commonly employs a shared-embedding paradigm, limiting the ability of modeling diverse user preferences on different tasks. In this paper, we introduce a novel Shared and Task-specific EMbeddings (STEM) paradigm that aims to incorporate both shared and task-specific embeddings to effectively capture task-specific user preferences. Under this paradigm, we propose a simple model STEM-Net, which is equipped with an All Forward Task-specific Backward gating network to facilitate the learning of task-specific embeddings and direct knowledge transfer across tasks. Remarkably, STEM-Net demonstrates exceptional performance on comparable samples, achieving positive transfer. Comprehensive evaluation on three public MTL recommendation datasets demonstrates that STEM-Net outperforms state-of-the-art models by a substantial margin. Our code is released at https://github.com/LiangcaiSu/STEM. \ No newline at end of file diff --git a/data/2024/aaai/STViT: Improving Self-Supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization (Student Abstract) b/data/2024/aaai/STViT: Improving Self-Supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization (Student Abstract) new file mode 100644 index 0000000000..78707bc590 --- /dev/null +++ b/data/2024/aaai/STViT: Improving Self-Supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization (Student Abstract) @@ -0,0 +1 @@ +Multi-camera depth estimation has recently garnered significant attention due to its substantial practical implications in the realm of autonomous driving. In this paper, we delve into the task of self-supervised multi-camera depth estimation and propose an innovative framework, STViT, featuring several noteworthy enhancements: 1) we propose a Spatial-Temporal Transformer to comprehensively exploit both local connectivity and the global context of image features, meanwhile learning enriched spatial-temporal cross-view correlations to recover 3D geometry. 2) to alleviate the severe effect of adverse conditions, e.g., rainy weather and nighttime driving, we introduce a GAN-based Adversarial Geometry Regularization Module (AGR) to further constrain the depth estimation with unpaired normal-condition depth maps and prevent the model from being incorrectly trained. Experiments on challenging autonomous driving datasets Nuscenes and DDAD show that our method achieves state-of-the-art performance. 
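As a rough illustration of the shared-plus-task-specific embedding idea in the STEM entry above, the PyTorch sketch below gives every task tower access to all embeddings in the forward pass while letting gradients flow back only to the shared table and the tower's own task-specific table. This is one plausible reading offered for exposition, with invented class and argument names; it is not the STEM-Net architecture itself.

import torch
import torch.nn as nn

class SharedTaskEmbedding(nn.Module):
    """Illustrative sketch: one shared embedding table plus one table per task.
    Each task tower sees all embeddings, but its gradients only reach the shared
    table and its own task-specific table (the others are detached)."""
    def __init__(self, num_items, dim, num_tasks):
        super().__init__()
        self.shared = nn.Embedding(num_items, dim)
        self.task_specific = nn.ModuleList(
            nn.Embedding(num_items, dim) for _ in range(num_tasks))
        self.towers = nn.ModuleList(
            nn.Linear(dim * (num_tasks + 1), 1) for _ in range(num_tasks))

    def forward(self, item_ids):
        shared_e = self.shared(item_ids)
        task_es = [emb(item_ids) for emb in self.task_specific]
        logits = []
        for t, tower in enumerate(self.towers):
            feats = [shared_e] + [e if i == t else e.detach()
                                  for i, e in enumerate(task_es)]
            logits.append(tower(torch.cat(feats, dim=-1)))
        return torch.cat(logits, dim=-1)  # one logit per task

model = SharedTaskEmbedding(num_items=1000, dim=16, num_tasks=2)
print(model(torch.tensor([3, 7])).shape)  # torch.Size([2, 2])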
\ No newline at end of file diff --git a/data/2024/aaai/SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning b/data/2024/aaai/SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning new file mode 100644 index 0000000000..d3fd85305f --- /dev/null +++ b/data/2024/aaai/SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning @@ -0,0 +1 @@ +Offline-to-online reinforcement learning (RL) provides a promising solution to improving suboptimal offline pre-trained policies through online fine-tuning. However, one efficient method, unconstrained fine-tuning, often suffers from severe policy collapse due to excessive distribution shift. To ensure stability, existing methods retain offline constraints and employ additional techniques during fine-tuning, which hurts efficiency. In this work, we introduce a novel perspective: eliminating the policy collapse without imposing constraints. We observe that such policy collapse arises from the mismatch between unconstrained fine-tuning and the conventional RL training framework. To this end, we propose Stabilized Unconstrained Fine-tuning (SUF), a streamlined framework that benefits from the efficiency of unconstrained fine-tuning while ensuring stability by modifying the Update-To-Data ratio. With just a few lines of code adjustments, SUF demonstrates remarkable adaptability to diverse backbones and superior performance over state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/SURER: Structure-Adaptive Unified Graph Neural Network for Multi-View Clustering b/data/2024/aaai/SURER: Structure-Adaptive Unified Graph Neural Network for Multi-View Clustering new file mode 100644 index 0000000000..c55feb4435 --- /dev/null +++ b/data/2024/aaai/SURER: Structure-Adaptive Unified Graph Neural Network for Multi-View Clustering @@ -0,0 +1 @@ +Deep Multi-view Graph Clustering (DMGC) aims to partition instances into different groups using the graph information extracted from multi-view data. The mainstream framework of DMGC methods applies graph neural networks to embed structure information into the view-specific representations and fuse them for the consensus representation. However, on one hand, we find that the graph learned in advance is not ideal for clustering as it is constructed by original multi-view data and localized connecting. On the other hand, most existing methods learn the consensus representation in a late fusion manner, which fails to propagate the structure relations across multiple views. Inspired by the observations, we propose a Structure-adaptive Unified gRaph nEural network for multi-view clusteRing (SURER), which can jointly learn a heterogeneous multi-view unified graph and robust graph neural networks for multi-view clustering. Specifically, we first design a graph structure learning module to refine the original view-specific attribute graphs, which removes false edges and discovers the potential connection. According to the view-specific refined attribute graphs, we integrate them into a unified heterogeneous graph by linking the representations of the same sample from different views. Furthermore, we use the unified heterogeneous graph as the input of the graph neural network to learn the consensus representation for each instance, effectively integrating complementary information from various views. 
Extensive experiments on diverse datasets demonstrate the superior effectiveness of our method compared to other state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Safe Abductive Learning in the Presence of Inaccurate Rules b/data/2024/aaai/Safe Abductive Learning in the Presence of Inaccurate Rules new file mode 100644 index 0000000000..e3de133930 --- /dev/null +++ b/data/2024/aaai/Safe Abductive Learning in the Presence of Inaccurate Rules @@ -0,0 +1,2 @@ +Integrating the complementary strengths of raw data and logical rules to improve learning generalization has recently been shown to be promising and effective; e.g., abductive learning is one generic framework that can learn the perception model from data and reason with logical rules simultaneously. However, performance decreases severely when inaccurate logical rules appear, and may even fall below that of baselines using only raw data. +Efforts on this issue are highly desired yet remain limited. This paper proposes a simple and effective safe abductive learning method to alleviate the harm caused by inaccurate rules. Unlike existing methods, which directly use all rules without correctness checks, it utilizes them selectively by constructing a graphical model with an adaptive reasoning process to prevent performance hazards. Theoretically, we show that induction and abduction are mutually beneficial, and can be rigorously justified from a classical maximum likelihood estimation perspective. Experiments on diverse tasks show that our method can tolerate at least twice as many inaccurate rules as accurate ones and achieve highly competitive performance while other methods cannot. Moreover, the proposal can refine inaccurate rules and works well in extended weakly supervised scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration b/data/2024/aaai/Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration new file mode 100644 index 0000000000..60c12f6288 --- /dev/null +++ b/data/2024/aaai/Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration @@ -0,0 +1 @@ +This paper studies safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints, where unsafe actions must be avoided at each step. Existing studies have considered safe RL with hard instantaneous constraints, but their approaches rely on several key assumptions: (i) the RL agent knows a safe action set for every state or knows a safe graph in which all the state-action-state triples are safe, and (ii) the constraint/cost functions are linear. In this paper, we consider safe RL with instantaneous hard constraints without assumption (i) and generalize (ii) to a Reproducing Kernel Hilbert Space (RKHS). Our proposed algorithm, LSVI-AE, achieves O(√{d³H⁴K}) regret and O(H √{dK}) hard constraint violation when the cost function is linear, and O(Hγₖ √{K}) hard constraint violation when the cost function belongs to an RKHS. Here K is the learning horizon, H is the length of each episode, and γₖ is the information gain w.r.t. the kernel used to approximate cost functions. Our results achieve the optimal dependency on the learning horizon K, matching the lower bound we provide in this paper and demonstrating the efficiency of LSVI-AE.
Notably, the design of our approach encourages aggressive policy exploration, providing a unique perspective on safe RL with general cost functions and no prior knowledge of safe actions, which may be of independent interest. \ No newline at end of file diff --git a/data/2024/aaai/SafeAR: Safe Algorithmic Recourse by Risk-Aware Policies b/data/2024/aaai/SafeAR: Safe Algorithmic Recourse by Risk-Aware Policies new file mode 100644 index 0000000000..234f719292 --- /dev/null +++ b/data/2024/aaai/SafeAR: Safe Algorithmic Recourse by Risk-Aware Policies @@ -0,0 +1 @@ +With the growing use of machine learning (ML) models in critical domains such as finance and healthcare, the need to offer recourse for those adversely affected by the decisions of ML models has become more important; individuals ought to be provided with recommendations on actions to take for improving their situation and thus receiving a favorable decision. Prior work on sequential algorithmic recourse---which recommends a series of changes---focuses on action feasibility and uses the proximity of feature changes to determine action costs. However, the uncertainties of feature changes and the risk of higher than average costs in recourse have not been considered. It is undesirable if a recourse could (with some probability) result in a worse situation from which recovery requires an extremely high cost. It is essential to incorporate risks when computing and evaluating recourse. We call the recourse computed with such risk considerations as Safe Algorithmic Recourse (SafeAR). The objective is to empower people to choose a recourse based on their risk tolerance. In this work, we discuss and show how existing recourse desiderata can fail to capture the risk of higher costs. We present a method to compute recourse policies that consider variability in cost and connect algorithmic recourse literature with risk-sensitive reinforcement learning. We also adopt measures "Value at Risk" and "Conditional Value at Risk" from the financial literature to summarize risk concisely. We apply our method to two real-world datasets and compare policies with different risk-aversion levels using risk measures and recourse desiderata (sparsity and proximity). \ No newline at end of file diff --git a/data/2024/aaai/Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis b/data/2024/aaai/Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis new file mode 100644 index 0000000000..62f022109c --- /dev/null +++ b/data/2024/aaai/Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis @@ -0,0 +1 @@ +This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that the safety constraint violations are bounded at any point during learning. As enforcing safety during training might severely limit the agent’s exploration, we propose here a new architecture that handles the trade-off between efficient progress and safety during exploration. As the exploration progresses, we update via Bayesian inference Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the environment dynamics. We then propose a way to approximate moments of belief about the risk associated to the action selection policy. 
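The SafeAR entry above summarizes recourse risk with Value at Risk and Conditional Value at Risk. The standard empirical versions of these two measures can be computed from a sample of recourse costs as in the short Python sketch below; these are generic textbook definitions, not the paper's implementation, and the cost sample is simulated.

import numpy as np

def value_at_risk(costs, alpha=0.9):
    """Empirical VaR: the alpha-quantile of the cost sample."""
    return float(np.quantile(costs, alpha))

def conditional_value_at_risk(costs, alpha=0.9):
    """Empirical CVaR: mean cost over the worst (1 - alpha) tail."""
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)
    return float(costs[costs >= var].mean())

# Toy usage: simulated costs of following one recourse policy.
rng = np.random.default_rng(0)
costs = rng.lognormal(mean=1.0, sigma=0.5, size=10_000)
print(value_at_risk(costs), conditional_value_at_risk(costs))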
We demonstrate that this approach can be easily interleaved with RL, and we present experimental results to showcase the performance of the overall architecture. \ No newline at end of file diff --git a/data/2024/aaai/Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge b/data/2024/aaai/Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge new file mode 100644 index 0000000000..1190feefd5 --- /dev/null +++ b/data/2024/aaai/Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge @@ -0,0 +1 @@ +The sample complexity of online reinforcement learning is often studied in the literature without taking into account any partial knowledge about the system dynamics that could potentially accelerate the learning process. In this paper, we study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently. We focus on systems that evolve according to an additive disturbance model of the form S_{h+1} = ƒ(S_h, A_h) + W_h, where ƒ represents the underlying system dynamics, and W_h are unknown disturbances independent of states and actions. In the setting of finite episodic Markov decision processes with S states, A actions, and episode length H, we present an optimistic Q-learning algorithm that achieves Õ(Poly(H)√T) regret under perfect knowledge of ƒ, where T is the total number of interactions with the system. This is in contrast to the typical Õ(Poly(H)√SAT) regret for existing Q-learning methods. Further, if only a noisy estimate ƒ_hat of ƒ is available, our method can learn an approximately optimal policy in a number of samples that is independent of the cardinalities of the state and action spaces. The sub-optimality gap depends on the approximation error ƒ_hat − ƒ, as well as the Lipschitz constant of the corresponding optimal value function. Our approach does not require modeling of the transition probabilities and enjoys the same memory complexity as model-free methods. \ No newline at end of file diff --git a/data/2024/aaai/Sample-Constrained Black Box Optimization for Audio Personalization b/data/2024/aaai/Sample-Constrained Black Box Optimization for Audio Personalization new file mode 100644 index 0000000000..433fa15085 --- /dev/null +++ b/data/2024/aaai/Sample-Constrained Black Box Optimization for Audio Personalization @@ -0,0 +1,5 @@ +We consider the problem of personalizing audio to maximize user experience. Briefly, we aim to find a filter h*, which, applied to any music or speech, will maximize the user’s satisfaction. This is a black-box optimization problem since the user’s satisfaction function is unknown. Substantive work has been done on this topic, where the key idea is to play audio samples to the user, each shaped by a different filter hᵢ, and query the user for their satisfaction scores f(hᵢ). A family of “surrogate” functions is then designed to fit these scores, and the optimization method gradually refines these functions to arrive at the filter ĥ* that maximizes satisfaction. + +In certain applications, we observe that a second type of querying is possible where users can tell us the individual elements h*[j] of the optimal filter h*. Consider an analogy from cooking where the goal is to cook a recipe that maximizes user satisfaction. A user can be asked to score various cooked recipes (e.g., tofu fried rice) or to score individual ingredients (say, salt, sugar, rice, chicken, etc.).
Given a budget of B queries, where a query can be of either type, our goal is to find the recipe that will maximize this user’s satisfaction. + +Our proposal builds on Sparse Gaussian Process Regression (GPR) and shows how a hybrid approach can outperform any one type of querying. Our results are validated through simulations and real world experiments, where volunteers gave feedback on music/speech audio and were able to achieve high satisfaction levels. We believe this idea of hybrid querying opens new problems in black-box optimization and solutions can benefit other applications beyond audio personalization. \ No newline at end of file diff --git a/data/2024/aaai/Sample-Level Cross-View Similarity Learning for Incomplete Multi-View Clustering b/data/2024/aaai/Sample-Level Cross-View Similarity Learning for Incomplete Multi-View Clustering new file mode 100644 index 0000000000..8d15e1769d --- /dev/null +++ b/data/2024/aaai/Sample-Level Cross-View Similarity Learning for Incomplete Multi-View Clustering @@ -0,0 +1 @@ +Incomplete multi-view clustering has attracted much attention due to its ability to handle partial multi-view data. Recently, similarity-based methods have been developed to explore the complete relationship among incomplete multi-view data. Although widely applied to partial scenarios, most of the existing approaches are still faced with two limitations. Firstly, fusing similarities constructed individually on each view fails to yield a complete unified similarity. Moreover, incomplete similarity generation may lead to anomalous similarity values with column sum constraints, affecting the final clustering results. To solve the above challenging issues, we propose a Sample-level Cross-view Similarity Learning (SCSL) method for Incomplete Multi-view Clustering. Specifically, we project all samples to the same dimension and simultaneously construct a complete similarity matrix across views based on the inter-view sample relationship and the intra-view sample relationship. In addition, a simultaneously learning consensus representation ensures the validity of the projection, which further enhances the quality of the similarity matrix through the graph Laplacian regularization. Experimental results on six benchmark datasets demonstrate the ability of SCSL in processing incomplete multi-view clustering tasks. Our code is publicly available at https://github.com/Tracesource/SCSL. \ No newline at end of file diff --git a/data/2024/aaai/Sample-and-Bound for Non-convex Optimization b/data/2024/aaai/Sample-and-Bound for Non-convex Optimization new file mode 100644 index 0000000000..7d2eaa3e19 --- /dev/null +++ b/data/2024/aaai/Sample-and-Bound for Non-convex Optimization @@ -0,0 +1 @@ +Standard approaches for global optimization of non-convex functions, such as branch-and-bound, maintain partition trees to systematically prune the domain. The tree size grows exponentially in the number of dimensions. We propose new sampling-based methods for non-convex optimization that adapts Monte Carlo Tree Search (MCTS) to improve efficiency. Instead of the standard use of visitation count in Upper Confidence Bounds, we utilize numerical overapproximations of the objective as an uncertainty metric, and also take into account of sampled estimates of first-order and second-order information. The Monte Carlo tree in our approach avoids the usual fixed combinatorial patterns in growing the tree, and aggressively zooms into the promising regions, while still balancing exploration and exploitation. 
We evaluate the proposed algorithms on high-dimensional non-convex optimization benchmarks against competitive baselines and analyze the effects of the hyper parameters. \ No newline at end of file diff --git a/data/2024/aaai/Sampling for Beyond-Worst-Case Online Ranking b/data/2024/aaai/Sampling for Beyond-Worst-Case Online Ranking new file mode 100644 index 0000000000..521fc06ac1 --- /dev/null +++ b/data/2024/aaai/Sampling for Beyond-Worst-Case Online Ranking @@ -0,0 +1,3 @@ +The feedback arc set problem is one of the most fundamental and well-studied ranking problems where n objects are to be ordered based on their pairwise comparison. The problem enjoys several efficient approximation algorithms in the offline setting. Unfortunately, online there are strong lower bounds on the competitive ratio establishing that no algorithm can perform well in the worst case. +This paper introduces a new beyond-worst-case model for online feedback arc set. In the model, a sample of the input is given to the algorithm offline before the remaining instance is revealed online. This models the case in practice where yesterday's data is available and is similar to today's online instance. This sample is drawn from a known distribution which may not be uniform. We design an online algorithm with strong theoretical guarantees. The algorithm has a small constant competitive ratio when the sample is uniform---if not, we show we can recover the same result by adding a provably minimal sample. +Empirical results validate the theory and show that such algorithms can be used on temporal data to obtain strong results. \ No newline at end of file diff --git a/data/2024/aaai/Sampling-Resilient Multi-Object Tracking b/data/2024/aaai/Sampling-Resilient Multi-Object Tracking new file mode 100644 index 0000000000..a65acc349c --- /dev/null +++ b/data/2024/aaai/Sampling-Resilient Multi-Object Tracking @@ -0,0 +1 @@ +Multi-Object Tracking (MOT) is a cornerstone operator for video surveillance applications. To enable real-time processing of large-scale live video streams, we study an interesting scenario called down-sampled MOT, which performs object tracking only on a small subset of video frames. The problem is challenging for state-of-the-art MOT methods, which exhibit significant performance degradation under high frame reduction ratios. In this paper, we devise a sampling-resilient tracker with a novel sparse-observation Kalman filter (SOKF). It integrates an LSTM network to capture non-linear and dynamic motion patterns caused by sparse observations. Since the LSTM-based state transition is not compatible with the original noise estimation mechanism, we propose new estimation strategies based on Bayesian neural networks and derive the optimal Kalman gain for SOKF. To associate the detected bounding boxes robustly, we also propose a comprehensive similarity metric that systematically integrates multiple spatial matching signals. Experiments on three benchmark datasets show that our proposed tracker achieves the best trade-off between efficiency and accuracy. With the same tracking accuracy, we reduce the total processing time of ByteTrack by 2× in MOT17 and 3× in DanceTrack. 
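For orientation on the Sampling-Resilient MOT entry above: a textbook constant-velocity Kalman filter predict/update step is sketched below in Python, with a frame gap dt > 1 standing in for down-sampling. SOKF replaces this linear transition with an LSTM and learns its noise terms, which the sketch does not attempt to reproduce; all parameter values here are illustrative.

import numpy as np

def kf_predict(x, P, F, Q):
    """Standard Kalman prediction: propagate the state mean and covariance."""
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    """Standard Kalman update with observation z."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Constant-velocity model for one coordinate, with a gap of dt frames between
# observations (down-sampling makes dt > 1; SOKF instead learns this transition).
dt = 5.0
F = np.array([[1.0, dt], [0.0, 1.0]])        # position-velocity transition
H = np.array([[1.0, 0.0]])                   # only the position is observed
Q = 0.01 * np.eye(2)
R = np.array([[1.0]])

x, P = np.array([0.0, 1.0]), np.eye(2)       # start at position 0, unit velocity
x, P = kf_predict(x, P, F, Q)
x, P = kf_update(x, P, z=np.array([4.6]), H=H, R=R)
print(x)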
\ No newline at end of file diff --git a/data/2024/aaai/SasWOT: Real-Time Semantic Segmentation Architecture Search WithOut Training b/data/2024/aaai/SasWOT: Real-Time Semantic Segmentation Architecture Search WithOut Training new file mode 100644 index 0000000000..72329977d2 --- /dev/null +++ b/data/2024/aaai/SasWOT: Real-Time Semantic Segmentation Architecture Search WithOut Training @@ -0,0 +1 @@ +In this paper, we present SasWOT, the first training-free Semantic segmentation Architecture Search (SAS) framework via an auto-discovery proxy. Semantic segmentation is widely used in many real-time applications. For fast inference and memory efficiency, previous SAS methods seek the optimal segmenter via differentiable or RL-based search. However, the significant computational costs of these training-based SAS methods limit their practical usage. To improve search efficiency, we explore the training-free route but empirically observe that the existing zero-cost proxies designed for the classification task are sub-optimal on segmentation benchmarks. To address this challenge, we develop a customized proxy search framework for SAS tasks to augment the proxy's predictive capability. Specifically, we design the proxy search space based on the following observations: (1) different inputs of segmenter statistics can be well combined; (2) some basic operators can effectively improve the correlation. Thus, we build computational graphs with multiple statistics as inputs and advanced basis arithmetic as the primary operations to represent candidate proxies. Then, we employ an evolutionary algorithm to cross over and mutate the superior candidates in the population based on correlation evaluation. Finally, based on the searched proxy, we perform the segmenter search without candidate training. In this way, SasWOT not only enables automated proxy optimization for SAS tasks but also achieves significant search acceleration before the retraining stage. Extensive experiments on the Cityscapes and CamVid datasets demonstrate that SasWOT achieves a superior trade-off between accuracy and speed over several state-of-the-art techniques. More remarkably, on the Cityscapes dataset, SasWOT achieves 71.3% mIoU at 162 FPS. \ No newline at end of file diff --git a/data/2024/aaai/Say Anything with Any Style b/data/2024/aaai/Say Anything with Any Style new file mode 100644 index 0000000000..a9adc5cfad --- /dev/null +++ b/data/2024/aaai/Say Anything with Any Style @@ -0,0 +1 @@ +Generating stylized talking heads with diverse head motions is crucial for achieving natural-looking videos but remains challenging. Previous works either adopt a regressive method to capture the speaking style, resulting in a coarse style that is averaged across all training data, or employ a universal network to synthesize videos with different styles, which causes suboptimal performance. To address these issues, we propose a novel dynamic-weight method, namely Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, along with the generative model, enhances the precision and robustness when extracting the speaking styles of the given style clips.
By utilizing the extracted style, a residual architecture comprising a canonical branch and style-specific branch is employed to predict the mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we steer clear of employing a universal network by exploring an elaborate HyperStyle to produce the style-specific weights offset for the style branch. Furthermore, we construct a pose generator and a pose codebook to store the quantized pose representation, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-the-art methods in terms of both lip-synchronization and stylized expression. Besides, we extend our SAAS to video-driven style editing field and achieve satisfactory performance as well. \ No newline at end of file diff --git a/data/2024/aaai/SayCanPay: Heuristic Planning with Large Language Models Using Learnable Domain Knowledge b/data/2024/aaai/SayCanPay: Heuristic Planning with Large Language Models Using Learnable Domain Knowledge new file mode 100644 index 0000000000..437e613e56 --- /dev/null +++ b/data/2024/aaai/SayCanPay: Heuristic Planning with Large Language Models Using Learnable Domain Knowledge @@ -0,0 +1 @@ +Large Language Models (LLMs) have demonstrated impressive planning abilities due to their vast "world knowledge". Yet, obtaining plans that are both feasible (grounded in affordances) and cost-effective (in plan length), remains a challenge, despite recent progress. This contrasts with heuristic planning methods that employ domain knowledge (formalized in action models such as PDDL) and heuristic search to generate feasible, optimal plans. Inspired by this, we propose to combine the power of LLMs and heuristic planning by leveraging the world knowledge of LLMs and the principles of heuristic search. Our approach, SayCanPay, employs LLMs to generate actions (Say) guided by learnable domain knowledge, that evaluates actions' feasibility (Can) and long-term reward/payoff (Pay), and heuristic search to select the best sequence of actions. Our contributions are (1) a novel framing of the LLM planning problem in the context of heuristic planning, (2) integrating grounding and cost-effective elements into the generated plans, and (3) using heuristic search over actions. Our extensive evaluations show that our model surpasses other LLM planning approaches. \ No newline at end of file diff --git a/data/2024/aaai/Scalable Enumeration of Trap Spaces in Boolean Networks via Answer Set Programming b/data/2024/aaai/Scalable Enumeration of Trap Spaces in Boolean Networks via Answer Set Programming new file mode 100644 index 0000000000..c4b470618d --- /dev/null +++ b/data/2024/aaai/Scalable Enumeration of Trap Spaces in Boolean Networks via Answer Set Programming @@ -0,0 +1 @@ +Boolean Networks (BNs) are widely used as a modeling formalism in several domains, notably systems biology and computer science. A fundamental problem in BN analysis is the enumeration of trap spaces, which are hypercubes in the state space that cannot be escaped once entered. Several methods have been proposed for enumerating trap spaces, however they often suffer from scalability and efficiency issues, particularly for large and complex models. 
To our knowledge, the most efficient and recent methods for the trap space enumeration all rely on Answer Set Programming (ASP), which has been widely applied to the analysis of BNs. Motivated by these considerations, our work proposes a new method for enumerating trap spaces in BNs using ASP. We evaluate the method on a mix of 250+ real-world and 400+ randomly generated BNs, showing that it enables analysis of models beyond the capabilities of existing tools (namely pyboolnet, mpbn, trappist, and trapmvn). \ No newline at end of file diff --git a/data/2024/aaai/Scalable Geometric Fracture Assembly via Co-creation Space among Assemblers b/data/2024/aaai/Scalable Geometric Fracture Assembly via Co-creation Space among Assemblers new file mode 100644 index 0000000000..d2e6c22a0d --- /dev/null +++ b/data/2024/aaai/Scalable Geometric Fracture Assembly via Co-creation Space among Assemblers @@ -0,0 +1 @@ +Geometric fracture assembly presents a challenging practical task in archaeology and 3D computer vision. Previous methods have focused solely on assembling fragments based on semantic information, which has limited the quantity of objects that can be effectively assembled. Therefore, there is a need to develop a scalable framework for geometric fracture assembly without relying on semantic information. To improve the effectiveness of assembling geometric fractures without semantic information, we propose a co-creation space comprising several assemblers capable of gradually and unambiguously assembling fractures. Additionally, we introduce a novel loss function, i.e., the geometric-based collision loss, to address collision issues during the fracture assembly process and enhance the results. Our framework exhibits better performance on both PartNet and Breaking Bad datasets compared to existing state-of-the-art frameworks. Extensive experiments and quantitative comparisons demonstrate the effectiveness of our proposed framework, which features linear computational complexity, enhanced abstraction, and improved generalization. Our code is publicly available at https://github.com/Ruiyuan-Zhang/CCS. \ No newline at end of file diff --git a/data/2024/aaai/Scalable Motion Style Transfer with Constrained Diffusion Generation b/data/2024/aaai/Scalable Motion Style Transfer with Constrained Diffusion Generation new file mode 100644 index 0000000000..8b0cd7b633 --- /dev/null +++ b/data/2024/aaai/Scalable Motion Style Transfer with Constrained Diffusion Generation @@ -0,0 +1 @@ +Current training of motion style transfer systems relies on consistency losses across style domains to preserve contents, hindering its scalable application to a large number of domains and private data. Recent image transfer works show the potential of independent training on each domain by leveraging implicit bridging between diffusion models, with the content preservation, however, limited to simple data patterns. We address this by imposing biased sampling in backward diffusion while maintaining the domain independence in the training stage. We construct the bias from the source domain keyframes and apply them as the gradient of content constraints, yielding a framework with keyframe manifold constraint gradients (KMCGs). Our validation demonstrates the success of training separate models to transfer between as many as ten dance motion styles. Comprehensive experiments find a significant improvement in preserving motion contents in comparison to baseline and ablative diffusion-based style transfer models. 
In addition, we perform a human study for a subjective assessment of the quality of generated dance motions. The results validate the competitiveness of KMCGs. \ No newline at end of file diff --git a/data/2024/aaai/Scale Optimization Using Evolutionary Reinforcement Learning for Object Detection on Drone Imagery b/data/2024/aaai/Scale Optimization Using Evolutionary Reinforcement Learning for Object Detection on Drone Imagery new file mode 100644 index 0000000000..25847afbd3 --- /dev/null +++ b/data/2024/aaai/Scale Optimization Using Evolutionary Reinforcement Learning for Object Detection on Drone Imagery @@ -0,0 +1 @@ +Object detection in aerial imagery presents a significant challenge due to large scale variations among objects. This paper proposes an evolutionary reinforcement learning agent, integrated within a coarse-to-fine object detection framework, to optimize the scale for more effective detection of objects in such images. Specifically, a set of patches potentially containing objects are first generated. A set of rewards measuring the localization accuracy, the accuracy of predicted labels, and the scale consistency among nearby patches are designed in the agent to guide the scale optimization. The proposed scale-consistency reward ensures similar scales for neighboring objects of the same category. Furthermore, a spatial-semantic attention mechanism is designed to exploit the spatial semantic relations between patches. The agent employs the proximal policy optimization strategy in conjunction with the evolutionary strategy, effectively utilizing both the current patch status and historical experience embedded in the agent. The proposed model is compared with state-of-the-art methods on two benchmark datasets for object detection on drone imagery. It significantly outperforms all the compared methods. Code is available at https://github.com/UNNC-CV/EvOD/. \ No newline at end of file diff --git a/data/2024/aaai/Scaling Few-Shot Learning for the Open World b/data/2024/aaai/Scaling Few-Shot Learning for the Open World new file mode 100644 index 0000000000..2f83ecf9cc --- /dev/null +++ b/data/2024/aaai/Scaling Few-Shot Learning for the Open World @@ -0,0 +1 @@ +Few-shot learning (FSL) aims to enable learning models with the ability to automatically adapt to novel (unseen) domains in open-world scenarios. Nonetheless, there exists a significant disparity between the vast number of new concepts encountered in the open world and the restricted available scale of existing FSL works, which primarily focus on a limited number of novel classes. Such a gap hinders the practical applicability of FSL in realistic scenarios. To bridge this gap, we propose a new problem named Few-Shot Learning with Many Novel Classes (FSL-MNC) by substantially enlarging the number of novel classes, exceeding the count in the traditional FSL setup by over 500-fold. This new problem exhibits two major challenges, including the increased computation overhead during meta-training and the degraded classification performance by the large number of classes during meta-testing. To overcome these challenges, we propose a Simple Hierarchy Pipeline (SHA-Pipeline). Due to the inefficiency of traditional protocols of EML, we re-design a lightweight training strategy to reduce the overhead brought by much more novel classes. To capture discriminative semantics across numerous novel classes, we effectively reconstruct and leverage the class hierarchy information during meta-testing. 
Experiments show that the proposed SHA-Pipeline significantly outperforms not only the ProtoNet baseline but also the state-of-the-art alternatives across different numbers of novel classes. \ No newline at end of file diff --git a/data/2024/aaai/Scaling Offline Evaluation of Reinforcement Learning Agents through Abstraction b/data/2024/aaai/Scaling Offline Evaluation of Reinforcement Learning Agents through Abstraction new file mode 100644 index 0000000000..1d95989d7d --- /dev/null +++ b/data/2024/aaai/Scaling Offline Evaluation of Reinforcement Learning Agents through Abstraction @@ -0,0 +1 @@ +A critical challenge for the widescale adoption of reinforcement learning (RL) is the need to give domain experts assurance that learned policies will improve decision-making -- and not lead to unacceptable behavior. To meet this challenge, my work aims to develop new methods for offline policy evaluation in real world RL domains. There has been much recent interest in offline evaluation and many advances. However, recent benchmarking efforts have also shown that there remains a substantial gap between current state-of-the-art methods and real world domains such as robotics. Towards scalable offline evaluation, my group is investigating the use of methods for abstraction and representation learning. In this new faculty highlight, I will present our recent results that show the promise of this direction for scaling offline evaluation in RL domains. I will then describe future directions in this line of that work which will further realize the promise of offline policy evaluation for increasing confidence in deployed RL. \ No newline at end of file diff --git a/data/2024/aaai/Scaling Up Pareto Optimization for Tree Structures with Affine Transformations: Evaluating Hybrid Floating Solar-Hydropower Systems in the Amazon b/data/2024/aaai/Scaling Up Pareto Optimization for Tree Structures with Affine Transformations: Evaluating Hybrid Floating Solar-Hydropower Systems in the Amazon new file mode 100644 index 0000000000..c3fed3516c --- /dev/null +++ b/data/2024/aaai/Scaling Up Pareto Optimization for Tree Structures with Affine Transformations: Evaluating Hybrid Floating Solar-Hydropower Systems in the Amazon @@ -0,0 +1 @@ +Sustainability challenges inherently involve the consideration of multiple competing objectives. The Pareto frontier – the set of all optimal solutions that cannot be improved with respect to one objective without negatively affecting another – is a crucial decision-making tool for navigating sustainability challenges as it highlights the inherent trade-offs among conflicting objectives. Our research is motivated by the strategic planning of hydropower in the Amazon basin, one of the earth’s largest and most biodiverse river systems, where the need to increase energy production coincides with the pressing requirement of minimizing detrimental environmental impacts. We investigate an innovative strategy that pairs hydropower with Floating Photovoltaic Solar Panels (FPV). We provide a new extended multi-tree network formulation, which enables the consideration of multiple dam configurations. To address the computational challenge of scaling up the Pareto optimization framework to tackle multiple objectives across the entire Amazon basin, we further enhance the state-of-the-art algorithm for Pareto frontiers in tree-structured networks with two improvements. 
We introduce affine transformations induced by the sub-frontiers to compute Pareto dominance and provide strategies for merging sub-trees, significantly increasing the pruning of dominated solutions. Our experiments demonstrate considerable speedups, in some cases by more than an order of magnitude, while maintaining optimality guarantees, thus allowing us to more effectively approximate the Pareto frontiers. Moreover, our findings suggest significant shifts towards higher energy values in the Pareto frontier when pairing hybrid hydropower with FPV solutions, potentially amplifying energy production while mitigating adverse impacts. \ No newline at end of file diff --git a/data/2024/aaai/Scaling Up Semi-supervised Learning with Unconstrained Unlabelled Data b/data/2024/aaai/Scaling Up Semi-supervised Learning with Unconstrained Unlabelled Data new file mode 100644 index 0000000000..06327204df --- /dev/null +++ b/data/2024/aaai/Scaling Up Semi-supervised Learning with Unconstrained Unlabelled Data @@ -0,0 +1 @@ +We propose UnMixMatch, a semi-supervised learning framework which can learn effective representations from unconstrained unlabelled data in order to scale up performance. Most existing semi-supervised methods rely on the assumption that labelled and unlabelled samples are drawn from the same distribution, which limits the potential for improvement through the use of free-living unlabeled data. Consequently, the generalizability and scalability of semi-supervised learning are often hindered by this assumption. Our method aims to overcome these constraints and effectively utilize unconstrained unlabelled data in semi-supervised learning. UnMixMatch consists of three main components: a supervised learner with hard augmentations that provides strong regularization, a contrastive consistency regularizer to learn underlying representations from the unlabelled data, and a self-supervised loss to enhance the representations that are learnt from the unlabelled data. We perform extensive experiments on 4 commonly used datasets and demonstrate superior performance over existing semi-supervised methods with a performance boost of 4.79%. Extensive ablation and sensitivity studies show the effectiveness and impact of each of the proposed components of our method. The code for our work is publicly available. \ No newline at end of file diff --git a/data/2024/aaai/Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment b/data/2024/aaai/Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment new file mode 100644 index 0000000000..4656a03ecf --- /dev/null +++ b/data/2024/aaai/Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment @@ -0,0 +1 @@ +Quality assessment of images and videos emphasizes both local details and global semantics, whereas general data sampling methods (e.g., resizing, cropping or grid-based fragment) fail to catch them simultaneously. To address the deficiency, current approaches have to adopt multi-branch models and take as input the multi-resolution data, which burdens the model complexity. In this work, instead of stacking up models, a more elegant data sampling method (named as SAMA, scaling and masking) is explored, which compacts both the local and global content in a regular input size. The basic idea is to scale the data into a pyramid first, and reduce the pyramid into a regular data dimension with a masking strategy. 
Benefiting from the spatial and temporal redundancy in images and videos, the processed data maintains the multi-scale characteristics with a regular input size, thus can be processed by a single-branch model. We verify the sampling method in image and video quality assessment. Experiments show that our sampling method can improve the performance of current single-branch models significantly, and achieves competitive performance to the multi-branch models without extra model complexity. The source code will be available at https://github.com/Sissuire/SAMA. \ No newline at end of file diff --git a/data/2024/aaai/ScanERU: Interactive 3D Visual Grounding Based on Embodied Reference Understanding b/data/2024/aaai/ScanERU: Interactive 3D Visual Grounding Based on Embodied Reference Understanding new file mode 100644 index 0000000000..df8225c185 --- /dev/null +++ b/data/2024/aaai/ScanERU: Interactive 3D Visual Grounding Based on Embodied Reference Understanding @@ -0,0 +1 @@ +Aiming to link natural language descriptions to specific regions in a 3D scene represented as 3D point clouds, 3D visual grounding is a very fundamental task for human-robot interaction. The recognition errors can significantly impact the overall accuracy and then degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from the difficulty of low recognition accuracy in cases of multiple adjacent objects with similar appearance. To address this issue, this work intuitively introduces the human-robot interaction as a cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. Then a new dataset called ScanERU is constructed to evaluate the effectiveness of this idea. Different from existing datasets, our ScanERU dataset is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to enlighten the research of ERU. Experimental results demonstrate the superiority of the proposed method, especially in the recognition of multiple identical objects. Our codes and dataset are available in the ScanERU repository. \ No newline at end of file diff --git a/data/2024/aaai/Scene Flow Prior Based Point Cloud Completion with Masked Transformer (Student Abstract) b/data/2024/aaai/Scene Flow Prior Based Point Cloud Completion with Masked Transformer (Student Abstract) new file mode 100644 index 0000000000..2aef9d6d65 --- /dev/null +++ b/data/2024/aaai/Scene Flow Prior Based Point Cloud Completion with Masked Transformer (Student Abstract) @@ -0,0 +1 @@ +It is necessary to explore an effective point cloud completion mechanism that is of great significance for real-world tasks such as autonomous driving, robotics applications, and multi-target tracking. In this paper, we propose a point cloud completion method using a self-supervised transformer model based on the contextual constraints of scene flow. Our method uses the multi-frame point cloud context relationship as a guide to generate a series of token proposals, this priori condition ensures the stability of the point cloud completion. The experimental results show that the method proposed in this paper achieves high accuracy and good stability. 
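A minimal sketch of the scaling-and-masking sampling idea described in the SAMA abstract above: the frame is first scaled into a pyramid, and a grid mask then keeps one pyramid level per spatial cell, so local detail and global context coexist in a single regular-sized input. The grid-cell masking, patch size, and average-pooling downscaling used here are illustrative assumptions, not the paper's exact strategy.

import numpy as np

def build_pyramid(img, num_levels=3):
    """Repeatedly downscale by 2x (simple average pooling) to form a pyramid."""
    levels = [img]
    for _ in range(num_levels - 1):
        h, w, c = levels[-1].shape
        h2, w2 = h // 2 * 2, w // 2 * 2
        x = levels[-1][:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2, c).mean(axis=(1, 3))
        levels.append(x)
    return levels

def scale_and_mask(img, out_size=224, patch=32, num_levels=3, seed=0):
    """Compose a fixed-size input by filling each grid cell with a patch taken
    from one randomly chosen pyramid level, so fine detail and coarse context
    end up in the same regular-sized tensor."""
    rng = np.random.default_rng(seed)
    pyramid = build_pyramid(img, num_levels)
    out = np.zeros((out_size, out_size, img.shape[2]), dtype=img.dtype)
    cells = out_size // patch
    for i in range(cells):
        for j in range(cells):
            lvl = pyramid[rng.integers(num_levels)]
            h, w, _ = lvl.shape
            # pick the patch location inside the chosen level (clipped to its bounds)
            y = min(int(i / cells * h), max(h - patch, 0))
            x = min(int(j / cells * w), max(w - patch, 0))
            tile = lvl[y:y + patch, x:x + patch]
            out[i * patch:i * patch + tile.shape[0], j * patch:j * patch + tile.shape[1]] = tile
    return out

frame = np.random.rand(720, 1280, 3).astype(np.float32)  # dummy video frame
sample = scale_and_mask(frame)
print(sample.shape)  # (224, 224, 3)

Because the result keeps a regular input size, it can be consumed by an ordinary single-branch quality-assessment backbone, which is the point the abstract emphasizes.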
\ No newline at end of file diff --git a/data/2024/aaai/SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research b/data/2024/aaai/SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research new file mode 100644 index 0000000000..14fd09d224 --- /dev/null +++ b/data/2024/aaai/SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research @@ -0,0 +1 @@ +Recently, there has been growing interest in using Large Language Models (LLMs) for scientific research. Numerous benchmarks have been proposed to evaluate the ability of LLMs for scientific research. However, current benchmarks are mostly based on pre-collected objective questions. This design suffers from data leakage problem and lacks the evaluation of subjective Q/A ability. In this paper, we propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues. Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability. In particular, we design a "dynamic" subset based on scientific principles to prevent evaluation from potential data leakage. Both objective and subjective questions are included in SciEval. These characteristics make SciEval a more effective benchmark for scientific research ability evaluation of LLMs. Comprehensive experiments on most advanced LLMs show that, although GPT-4 achieves SOTA performance compared to other LLMs, there is still substantial room for improvement, especially for dynamic questions. The codes and data are publicly available on https://github.com/OpenDFM/SciEval. \ No newline at end of file diff --git a/data/2024/aaai/SciSpace Copilot: Empowering Researchers through Intelligent Reading Assistance b/data/2024/aaai/SciSpace Copilot: Empowering Researchers through Intelligent Reading Assistance new file mode 100644 index 0000000000..db56772119 --- /dev/null +++ b/data/2024/aaai/SciSpace Copilot: Empowering Researchers through Intelligent Reading Assistance @@ -0,0 +1 @@ +We introduce SciSpace Copilot, an AI research assistant that helps in understanding and reading research papers faster by providing a plethora of features. Answering questions from a document has recently become popular using the Retrieval Augmented Generation (RAG) approach. Our tool uses an advanced question-answering pipeline to get accurate answers and also provide exact citations for the same. We provide many more valuable features on scientific text, including generating explanations, generating summaries, adding notes and highlights, and finding related papers from our 200 million corpus. Our tool supports 100+ languages, making research more accessible across language barriers. Thousands of users use SciSpace Copilot on a daily basis by uploading their articles to understand research faster and better. Our tool can be accessed at this link: https://typeset.io. \ No newline at end of file diff --git a/data/2024/aaai/Scores for Learning Discrete Causal Graphs with Unobserved Confounders b/data/2024/aaai/Scores for Learning Discrete Causal Graphs with Unobserved Confounders new file mode 100644 index 0000000000..ce055d6435 --- /dev/null +++ b/data/2024/aaai/Scores for Learning Discrete Causal Graphs with Unobserved Confounders @@ -0,0 +1 @@ +Structural learning is arguably one of the most challenging and pervasive tasks found throughout the data sciences. 
There exists a growing literature that studies structural learning in non-parametric settings where conditional independence constraints are taken to define the equivalence class. In the presence of unobserved confounders, it is understood that non-conditional independence constraints are imposed over the observational distribution, including certain equalities and inequalities between functionals of the joint distribution. In this paper, we develop structural learning methods that leverage additional constraints beyond conditional independences. Specifically, we first introduce a score for arbitrary graphs combining Watanabe's asymptotic expansion of the marginal likelihood and new bounds over the cardinality of the exogenous variables. Second, we show that the new score has desirable properties in terms of expressiveness and computability. In terms of expressiveness, we prove that the score captures distinct constraints imprinted in the data, including Verma's and inequalities'. In terms of computability, we show properties of score equivalence and decomposability, which allows, in principle, to break the problem of structural learning in smaller and more manageable pieces. Third, we implement this score using an MCMC sampling algorithm and test its properties in several simulation scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label b/data/2024/aaai/Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label new file mode 100644 index 0000000000..8040fa563e --- /dev/null +++ b/data/2024/aaai/Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label @@ -0,0 +1 @@ +Scribble-based weakly-supervised semantic segmentation using sparse scribble supervision is gaining traction as it reduces annotation costs when compared to fully annotated alternatives. Existing methods primarily generate pseudo-labels by diffusing labeled pixels to unlabeled ones with local cues for supervision. However, this diffusion process fails to exploit global semantics and class-specific cues, which are important for semantic segmentation. In this study, we propose a class-driven scribble promotion network, which utilizes both scribble annotations and pseudo-labels informed by image-level classes and global semantics for supervision. Directly adopting pseudo-labels might misguide the segmentation model, thus we design a localization rectification module to correct foreground representations in the feature space. To further combine the advantages of both supervisions, we also introduce a distance entropy loss for uncertainty reduction, which adapts per-pixel confidence weights according to the reliable region determined by the scribble and pseudo-label's boundary. Experiments on the ScribbleSup dataset with different qualities of scribble annotations outperform all the previous methods, demonstrating the superiority and robustness of our method. The code is available at https://github.com/Zxl19990529/Class-driven-Scribble-Promotion-Network. 
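A small illustration of the confidence-weighting idea behind the distance entropy loss in the scribble-promotion abstract above: pseudo-label supervision is trusted more near annotated scribble pixels and less far away from them. The exponential decay, the sigma value, and the helper name are assumptions made for this sketch, not the paper's exact formulation.

import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def distance_weighted_pseudo_loss(logits, pseudo_labels, scribble_mask, sigma=20.0):
    """Per-pixel cross-entropy against pseudo-labels, weighted by a confidence
    that decays with the Euclidean distance to the nearest scribble pixel."""
    # distance (in pixels) from every location to the closest annotated scribble pixel
    dist = distance_transform_edt(~scribble_mask.numpy())
    weight = torch.exp(-torch.from_numpy(dist).float() / sigma)   # 1 on scribbles, ->0 far away
    per_pixel = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (weight * per_pixel).sum() / weight.sum()

# toy example: one image, 5 classes, 64x64 predictions
logits = torch.randn(1, 5, 64, 64)
pseudo = torch.randint(0, 5, (1, 64, 64))
scribble = torch.zeros(1, 64, 64, dtype=torch.bool)
scribble[0, 30:34, 10:50] = True            # a single horizontal scribble stroke
print(distance_weighted_pseudo_loss(logits, pseudo, scribble).item())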
\ No newline at end of file diff --git a/data/2024/aaai/SeTformer Is What You Need for Vision and Language b/data/2024/aaai/SeTformer Is What You Need for Vision and Language new file mode 100644 index 0000000000..8ef845bf3c --- /dev/null +++ b/data/2024/aaai/SeTformer Is What You Need for Vision and Language @@ -0,0 +1 @@ +The dot product self-attention (DPSA) is a fundamental component of transformers. However, scaling it to long sequences, like documents or high-resolution images, becomes prohibitively expensive due to the quadratic time and memory complexities arising from the softmax operation. Kernel methods are employed to simplify computations by approximating softmax but often lead to performance drops compared to softmax attention. We propose SeTformer, a novel transformer where DPSA is purely replaced by Self-optimal Transport (SeT) for achieving better performance and computational efficiency. SeT is based on two essential softmax properties: maintaining a non-negative attention matrix and using a nonlinear reweighting mechanism to emphasize important tokens in input sequences. By introducing a kernel cost function for optimal transport, SeTformer effectively satisfies these properties. In particular, with small and base-sized models, SeTformer achieves impressive top-1 accuracies of 84.7% and 86.2% on ImageNet-1K. In object detection, SeTformer-base outperforms the FocalNet counterpart by +2.2 mAP, using 38% fewer parameters and 29% fewer FLOPs. In semantic segmentation, our base-size model surpasses NAT by +3.5 mIoU with 33% fewer parameters. SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark. These findings highlight SeTformer's applicability for vision and language tasks. \ No newline at end of file diff --git a/data/2024/aaai/Secure Distributed Sparse Gaussian Process Models Using Multi-Key Homomorphic Encryption b/data/2024/aaai/Secure Distributed Sparse Gaussian Process Models Using Multi-Key Homomorphic Encryption new file mode 100644 index 0000000000..e015d4242e --- /dev/null +++ b/data/2024/aaai/Secure Distributed Sparse Gaussian Process Models Using Multi-Key Homomorphic Encryption @@ -0,0 +1 @@ +Distributed sparse Gaussian process (dGP) models provide an ability to achieve accurate predictive performance using data from multiple devices in a time-efficient and scalable manner. The distributed computation of the model, however, risks exposure of privately owned data to public manipulation. In this paper, we propose a secure solution for dGP regression models using multi-key homomorphic encryption. Experimental results show that with a small sacrifice in terms of time complexity, we achieve a secure dGP model without deteriorating the predictive performance compared to traditional non-secure dGP models. We also present a practical implementation of the proposed model using several Nvidia Jetson Nano Developer Kit modules to simulate a real-world scenario. Thus, the secure dGP model addresses the data security issues of dGP and provides a secure and trustworthy solution for multiple devices to use privately owned data for model computation in a distributed environment, while retaining the speed, scalability, and robustness of dGP.
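The SeT operator itself is not spelled out in the SeTformer abstract above; the sketch below only illustrates the general flavour of replacing softmax attention with an optimal-transport-style normalisation (a non-negative Gibbs kernel rebalanced by a few Sinkhorn iterations), which keeps the two softmax properties the abstract mentions. It should not be read as the authors' exact formulation.

import torch

def sinkhorn_attention(q, k, v, n_iters=5, eps=0.1):
    """Attention where the softmax is replaced by Sinkhorn normalisation of a
    non-negative kernel matrix, jointly rebalancing rows and columns."""
    sim = torch.einsum("bnd,bmd->bnm", q, k) / q.shape[-1] ** 0.5
    sim = sim - sim.amax(dim=-1, keepdim=True)            # numerical stability
    kernel = torch.exp(sim / eps)                          # non-negative Gibbs kernel
    for _ in range(n_iters):                               # Sinkhorn iterations
        kernel = kernel / kernel.sum(dim=-1, keepdim=True)  # rows sum to 1
        kernel = kernel / kernel.sum(dim=-2, keepdim=True)  # columns sum to 1
    attn = kernel / kernel.sum(dim=-1, keepdim=True)       # final row normalisation
    return torch.einsum("bnm,bmd->bnd", attn, v)

q = torch.randn(2, 16, 32)   # (batch, tokens, dim)
k = torch.randn(2, 16, 32)
v = torch.randn(2, 16, 32)
print(sinkhorn_attention(q, k, v).shape)   # torch.Size([2, 16, 32])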
\ No newline at end of file diff --git a/data/2024/aaai/Securing Billion Bluetooth Devices Leveraging Learning-Based Techniques b/data/2024/aaai/Securing Billion Bluetooth Devices Leveraging Learning-Based Techniques new file mode 100644 index 0000000000..98f0a79d5c --- /dev/null +++ b/data/2024/aaai/Securing Billion Bluetooth Devices Leveraging Learning-Based Techniques @@ -0,0 +1 @@ +As the most popular low-power communication protocol, cybersecurity research on Bluetooth Low Energy (BLE) has garnered significant attention. Due to BLE’s inherent security limitations and firmware vulnerabilities, spoofing attacks can easily compromise BLE devices and tamper with privacy data. In this paper, we proposed BLEGuard, a hybrid detection mechanism combined cyber-physical features with learning-based techniques. We established a physical network testbed to conduct attack simulations and capture advertising packets. Four different network features were utilized to implement detection and classification algorithms. Preliminary results have verified the feasibility of our proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains b/data/2024/aaai/Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains new file mode 100644 index 0000000000..e19c93dc1a --- /dev/null +++ b/data/2024/aaai/Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains @@ -0,0 +1 @@ +Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, without mentioning the domain gaps between training and inference data if the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines. Code and data are available at: https://github.com/yuzhimanhua/SEType. \ No newline at end of file diff --git a/data/2024/aaai/Seeing Dark Videos via Self-Learned Bottleneck Neural Representation b/data/2024/aaai/Seeing Dark Videos via Self-Learned Bottleneck Neural Representation new file mode 100644 index 0000000000..167cd1bd65 --- /dev/null +++ b/data/2024/aaai/Seeing Dark Videos via Self-Learned Bottleneck Neural Representation @@ -0,0 +1 @@ +Enhancing low-light videos in a supervised style presents a set of challenges, including limited data diversity, misalignment, and the domain gap introduced through the dataset construction pipeline. 
Our paper tackles these challenges by constructing a self-learned enhancement approach that removes the reliance on any external training data. The challenge of self-supervised learning lies in fitting high-quality signal representations solely from input signals. Our work designs a bottleneck neural representation mechanism that extracts those signals. In more detail, we encode the frame-wise representation with a compact deep embedding and utilize a neural network to parameterize the video-level manifold consistently. Then, an entropy constraint is applied to the enhanced results based on the adjacent spatial-temporal context to filter out the degraded visual signals, e.g., noise and frame inconsistency. Last, a novel Chromatic Retinex decomposition is proposed to effectively align the reflectance distribution temporally. It benefits the entropy control on different components of each frame and facilitates noise-to-noise training, successfully suppressing the temporal flicker. Extensive experiments demonstrate the robustness and superior effectiveness of our proposed method. Our project is publicly available at: https://huangerbai.github.io/SLBNR/. \ No newline at end of file diff --git a/data/2024/aaai/Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation b/data/2024/aaai/Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation new file mode 100644 index 0000000000..1992eae3b4 --- /dev/null +++ b/data/2024/aaai/Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation @@ -0,0 +1 @@ +Augmented Reality (AR) devices, emerging as prominent mobile interaction platforms, face challenges in user safety, particularly concerning oncoming vehicles. While some solutions leverage onboard camera arrays, these cameras often have limited field-of-view (FoV) with front or downward perspectives. Addressing this, we propose a new out-of-view semantic segmentation task and Segment Beyond View (SBV), a novel audio-visual semantic segmentation method. SBV supplements the visual modality, which misses the information beyond the FoV, with the auditory information using a teacher-student distillation model (Omni2Ego). The model consists of a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes views with limited FoV and binaural audio as input and produces semantic segmentation for objects outside the FoV. SBV outperforms existing models in comparative evaluations and shows consistent performance across varying FoV ranges and in monaural audio settings. \ No newline at end of file diff --git a/data/2024/aaai/Select and Augment: Enhanced Dense Retrieval Knowledge Graph Augmentation (Abstract Reprint) b/data/2024/aaai/Select and Augment: Enhanced Dense Retrieval Knowledge Graph Augmentation (Abstract Reprint) new file mode 100644 index 0000000000..f5fb440b5b --- /dev/null +++ b/data/2024/aaai/Select and Augment: Enhanced Dense Retrieval Knowledge Graph Augmentation (Abstract Reprint) @@ -0,0 +1 @@ +Injecting textual information into knowledge graph (KG) entity representations has been a worthwhile expedition in terms of improving performance in KG-oriented tasks within the NLP community.
External knowledge often adopted to enhance KG embeddings ranges from semantically rich lexical dependency parsed features to a set of relevant key words to entire text descriptions supplied from an external corpus such as wikipedia and many more. Despite the gains this innovation (Text-enhanced KG embeddings) has made, the proposal in this work suggests that it can be improved even further. Instead of using a single text description (which would not sufficiently represent an entity because of the inherent lexical ambiguity of text), we propose a multi-task framework that jointly selects a set of text descriptions relevant to KG entities as well as align or augment KG embeddings with text descriptions. Different from prior work that plugs formal entity descriptions declared in knowledge bases, this framework leverages a retriever model to selectively identify richer or highly relevant text descriptions to use in augmenting entities. Furthermore, the framework treats the number of descriptions to use in augmentation process as a parameter, which allows the flexibility of enumerating across several numbers before identifying an appropriate number. Experiment results for Link Prediction demonstrate a 5.5% and 3.5% percentage increase in the Mean Reciprocal Rank (MRR) and Hits@10 scores respectively, in comparison to text-enhanced knowledge graph augmentation methods using traditional CNNs. \ No newline at end of file diff --git a/data/2024/aaai/Selective Deep Autoencoder for Unsupervised Feature Selection b/data/2024/aaai/Selective Deep Autoencoder for Unsupervised Feature Selection new file mode 100644 index 0000000000..483895853f --- /dev/null +++ b/data/2024/aaai/Selective Deep Autoencoder for Unsupervised Feature Selection @@ -0,0 +1 @@ +In light of the advances in big data, high-dimensional datasets are often encountered. Incorporating them into data-driven models can enhance performance; however, this comes at the cost of high computation and the risk of overfitting, particularly due to abundant redundant features. Identifying an informative subset of the features helps in reducing the dimensionality and enhancing model interpretability. In this paper, we propose a novel framework for unsupervised feature selection, called Selective Deep Auto-Encoder (SDAE). It aims to reduce the number of features used in unlabeled datasets without compromising the quality of information obtained. It achieves this by selecting sufficient features - from the original feature set - capable of representing the entire feature space and reconstructing them. Architecturally, it leverages the use of highly nonlinear latent representations in deep Autoencoders and intrinsically learns, in an unsupervised fashion, the relevant and globally representative subset of features through a customized Selective Layer. Extensive experimental results on three high-dimensional public datasets have shown promising feature selection performance by SDAE in comparison to other existing state-of-the-art unsupervised feature selection methods. 
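A generic sketch of the "selective layer" idea from the SDAE abstract above: a learnable per-feature gate multiplies the input before the autoencoder, and a sparsity penalty drives most gates towards zero so that the surviving features form the selected subset. The gating function, penalty, and network sizes are assumptions made for illustration, not the paper's exact architecture.

import torch
import torch.nn as nn

class SelectiveAutoencoder(nn.Module):
    """Autoencoder with a learnable per-feature gate ('selective layer'):
    inputs are scaled by sigmoid gates before encoding, and a sparsity
    penalty keeps only a small set of informative features active."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_features))
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x):
        gates = torch.sigmoid(self.gate_logits)
        return self.decoder(self.encoder(x * gates)), gates

def train_step(model, x, optimizer, sparsity_weight=1e-2):
    recon, gates = model(x)
    loss = ((recon - x) ** 2).mean() + sparsity_weight * gates.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

x = torch.randn(256, 100)                       # unlabeled data with 100 features
model = SelectiveAutoencoder(n_features=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    train_step(model, x, opt)
selected = torch.topk(torch.sigmoid(model.gate_logits), k=10).indices
print(selected)                                 # indices of the 10 highest-weighted features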
\ No newline at end of file diff --git a/data/2024/aaai/Selective Focus: Investigating Semantics Sensitivity in Post-training Quantization for Lane Detection b/data/2024/aaai/Selective Focus: Investigating Semantics Sensitivity in Post-training Quantization for Lane Detection new file mode 100644 index 0000000000..4181ad4df9 --- /dev/null +++ b/data/2024/aaai/Selective Focus: Investigating Semantics Sensitivity in Post-training Quantization for Lane Detection @@ -0,0 +1 @@ +Lane detection (LD) plays a crucial role in enhancing the L2+ capabilities of autonomous driving, capturing widespread attention. The Post-Processing Quantization (PTQ) could facilitate the practical application of LD models, enabling fast speeds and limited memories without labeled data. However, prior PTQ methods do not consider the complex LD outputs that contain physical semantics, such as offsets, locations, etc., and thus cannot be directly applied to LD models. In this paper, we pioneeringly investigate semantic sensitivity to post-processing for lane detection with a novel Lane Distortion Score. Moreover, we identify two main factors impacting the LD performance after quantization, namely intra-head sensitivity and inter-head sensitivity, where a small quantization error in specific semantics can cause significant lane distortion. Thus, we propose a Selective Focus framework deployed with Semantic Guided Focus and Sensitivity Aware Selection modules, to incorporate post-processing information into PTQ reconstruction. Based on the observed intra-head sensitivity, Semantic Guided Focus is introduced to prioritize foreground-related semantics using a practical proxy. For inter-head sensitivity, we present Sensitivity Aware Selection, efficiently recognizing influential prediction heads and refining the optimization objectives at runtime. Extensive experiments have been done on a wide variety of models including keypoint-, anchor-, curve-, and segmentation-based ones. Our method produces quantized models in minutes on a single GPU and can achieve 6.4\% F1 Score improvement on the CULane dataset. Code and supplementary statement can be found at https://github.com/PannenetsF/SelectiveFocus. \ No newline at end of file diff --git a/data/2024/aaai/Selective and Orthogonal Feature Activation for Pedestrian Attribute Recognition b/data/2024/aaai/Selective and Orthogonal Feature Activation for Pedestrian Attribute Recognition new file mode 100644 index 0000000000..7b6cfdfb88 --- /dev/null +++ b/data/2024/aaai/Selective and Orthogonal Feature Activation for Pedestrian Attribute Recognition @@ -0,0 +1 @@ +Pedestrian Attribute Recognition (PAR) involves identifying the attributes of individuals in person images. Existing PAR methods typically rely on CNNs as the backbone network to extract pedestrian features. However, CNNs process only one adjacent region at a time, leading to the loss of long-range inter-relations between different attribute-specific regions. To address this limitation, we leverage the Vision Transformer (ViT) instead of CNNs as the backbone for PAR, aiming to model long-range relations and extract more robust features. However, PAR suffers from an inherent attribute imbalance issue, causing ViT to naturally focus more on attributes that appear frequently in the training set and ignore some pedestrian attributes that appear less. The native features extracted by ViT are not able to tolerate the imbalance attribute distribution issue. 
To tackle this issue, we propose two novel components: the Selective Feature Activation Method (SFAM) and the Orthogonal Feature Activation Loss. SFAM smartly suppresses the more informative attribute-specific features, compelling the PAR model to capture discriminative features from regions that are easily overlooked. The proposed loss enforces an orthogonal constraint on the original feature extracted by ViT and the suppressed features from SFAM, promoting the complementarity of features in space. We conduct experiments on several benchmark PAR datasets, including PETA, PA100K, RAPv1, and RAPv2, demonstrating the effectiveness of our method. Specifically, our method outperforms existing state-of-the-art approaches, including GRL, IAA-Caps, ALM, and SSC, in terms of mA on the four datasets. \ No newline at end of file diff --git a/data/2024/aaai/Self-Distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach b/data/2024/aaai/Self-Distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach new file mode 100644 index 0000000000..6b6a7e4d5e --- /dev/null +++ b/data/2024/aaai/Self-Distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach @@ -0,0 +1 @@ +Text recognition methods are developing rapidly. Some advanced techniques, e.g., powerful modules, language models, and un- and semi-supervised learning schemes, consecutively push the performance on public benchmarks forward. However, the problem of how to better optimize a text recognition model from the perspective of loss functions is largely overlooked. CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with accuracy degradation. This is because CTC loss emphasizes the optimization of the entire sequence target while neglecting to learn individual characters. We propose a self-distillation scheme for CTC-based models to address this issue. It incorporates a framewise regularization term in the CTC loss to emphasize individual supervision, and leverages maximum-a-posteriori estimation of the latent alignment to solve the inconsistency problem that arises in distillation between CTC-based models. We refer to the regularized CTC loss as Distillation Connectionist Temporal Classification (DCTC) loss. DCTC loss is module-free, requiring no extra parameters, longer inference lag, or additional training data or phases. Extensive experiments on public benchmarks demonstrate that DCTC can boost text recognition model accuracy by up to 2.6%, without any of these drawbacks. \ No newline at end of file diff --git a/data/2024/aaai/Self-Interpretable Graph Learning with Sufficient and Necessary Explanations b/data/2024/aaai/Self-Interpretable Graph Learning with Sufficient and Necessary Explanations new file mode 100644 index 0000000000..a50a6d038c --- /dev/null +++ b/data/2024/aaai/Self-Interpretable Graph Learning with Sufficient and Necessary Explanations @@ -0,0 +1 @@ +Self-interpretable graph learning methods provide insights to unveil the black-box nature of GNNs by providing predictions with built-in explanations. However, current works suffer from performance degradation compared to GNNs trained without built-in explanations.
We argue the main reason is that they fail to generate explanations satisfying both sufficiency and necessity, and the biased explanations further hurt GNNs' performance. In this work, we propose a novel framework for generating SUfficient aNd NecessarY explanations (SUNNY-GNN for short) that benefit GNNs' predictions. The key idea is to conduct augmentations by structurally perturbing given explanations and employ a contrastive loss to guide the learning of explanations toward sufficiency and necessity directions. SUNNY-GNN introduces two coefficients to generate hard and reliable contrastive samples. We further extend SUNNY-GNN to heterogeneous graphs. Empirical results on various GNNs and real-world graphs show that SUNNY-GNN yields accurate predictions and faithful explanations, outperforming the state-of-the-art methods by improving 3.5% prediction accuracy and 13.1% explainability fidelity on average. Our code and data are available at https://github.com/SJTU-Quant/SUNNY-GNN. \ No newline at end of file diff --git a/data/2024/aaai/Self-Paced Unified Representation Learning for Hierarchical Multi-Label Classification b/data/2024/aaai/Self-Paced Unified Representation Learning for Hierarchical Multi-Label Classification new file mode 100644 index 0000000000..4fbcbe6fb7 --- /dev/null +++ b/data/2024/aaai/Self-Paced Unified Representation Learning for Hierarchical Multi-Label Classification @@ -0,0 +1 @@ +Hierarchical Multi-Label Classification (HMLC) is a well-established problem that aims at assigning data instances to multiple classes stored in a hierarchical structure. Despite its importance, existing approaches often face two key limitations: (i) They employ dense networks to solely explore the class hierarchy as hard criterion for maintaining taxonomic consistency among predicted classes, yet without leveraging rich semantic relationships between instances and classes; (ii) They struggle to generalize in settings with deep class levels, since the mini-batches uniformly sampled from different levels ignore the varying complexities of data and result in a non-smooth model adaptation to sparse data. To mitigate these issues, we present a Self-Paced Unified Representation (SPUR) learning framework, which focuses on the interplay between instance and classes to flexibly organize the training process of HMLC algorithms. Our framework consists of two lightweight encoders designed to capture the semantics of input features and the topological information of the class hierarchy. These encoders generate unified embeddings of instances and class hierarchy, which enable SPUR to exploit semantic dependencies between them and produce predictions in line with taxonomic constraints. Furthermore, we introduce a dynamic hardness measurement strategy that considers both class hierarchy and instance features to estimate the learning difficulty of each instance. This strategy is achieved by incorporating the propagation loss obtained at each hierarchical level, allowing for a more comprehensive assessment of learning complexity. Extensive experiments on several empirical benchmarks demonstrate the effectiveness and efficiency of SPUR compared to state-of-the-art methods, especially in scenarios with missing features. 
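A toy sketch of the self-paced idea in the SPUR abstract above: per-instance losses accumulated over the hierarchy levels act as a hardness score, and a threshold that grows during training gradually admits harder instances. The quantile schedule and sigmoid gate below are assumptions, not the paper's exact dynamic hardness measure.

import torch

def self_paced_weights(level_losses, epoch, max_epochs, slope=5.0):
    """Generic self-paced weighting: an instance's hardness is its summed loss
    across hierarchy levels; a growing threshold softly admits harder instances."""
    hardness = level_losses.sum(dim=1)                       # (num_instances,)
    q = min(0.2 + 0.8 * epoch / max_epochs, 1.0)             # threshold quantile grows with training
    threshold = torch.quantile(hardness, q)
    return torch.sigmoid(slope * (threshold - hardness))     # ~1 for easy, ~0 for hard

# toy example: 8 instances with losses from a 3-level class hierarchy
level_losses = torch.rand(8, 3)
for epoch in (0, 5, 10):
    w = self_paced_weights(level_losses, epoch, max_epochs=10)
    weighted_loss = (w * level_losses.sum(dim=1)).sum() / w.sum()
    print(epoch, weighted_loss.item())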
\ No newline at end of file diff --git a/data/2024/aaai/Self-Prompt Mechanism for Few-Shot Image Recognition b/data/2024/aaai/Self-Prompt Mechanism for Few-Shot Image Recognition new file mode 100644 index 0000000000..d60f645fe4 --- /dev/null +++ b/data/2024/aaai/Self-Prompt Mechanism for Few-Shot Image Recognition @@ -0,0 +1 @@ +Few-shot learning poses a formidable challenge as it necessitates effective recognition of novel classes based on a limited set of examples. Recent studies have sought to address the challenge of rare samples by tuning visual features through the utilization of external text prompts. However, the performance of these methods is constrained due to the inherent modality gap between the prompt text and image features. Instead of naively utilizing the external semantic information generated from text to guide the training of the image encoder, we propose a novel self-prompt mechanism (SPM) to adaptively adjust the neural network according to unseen data. Specifically, SPM involves a systematic selection of intrinsic semantic features generated by the image encoder across spatial and channel dimensions, thereby engendering self-prompt information. Subsequently, upon backpropagation of this self-prompt information to the deeper layers of the neural network, it effectively steers the network toward the learning and adaptation of new samples. Meanwhile, we propose a novel parameter-efficient tuning method that exclusively fine-tunes the parameters relevant to self-prompt (prompts are no more than 2% of the total parameters), and the incorporation of additional learnable parameters as self-prompt ensures the retention of prior knowledge through frozen encoder weights. Therefore, our method is highly suited for few-shot recognition tasks that require both information retention and adaptive adjustment of network parameters with limited labeling data constraints. Extensive experiments demonstrate the effectiveness of the proposed SPM in both 5-way 1-shot and 5-way 5-shot settings for standard single-domain and cross-domain few-shot recognition datasets, respectively. Our code is available at https://github.com/codeshop715/SPM. \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised 3D Human Mesh Recovery from a Single Image with Uncertainty-Aware Learning b/data/2024/aaai/Self-Supervised 3D Human Mesh Recovery from a Single Image with Uncertainty-Aware Learning new file mode 100644 index 0000000000..90d0eaaee9 --- /dev/null +++ b/data/2024/aaai/Self-Supervised 3D Human Mesh Recovery from a Single Image with Uncertainty-Aware Learning @@ -0,0 +1 @@ +Despite achieving impressive improvement in accuracy, most existing monocular 3D human mesh reconstruction methods require large-scale 2D/3D ground-truths for supervision, which limits their applications on unlabeled in-the-wild data that is ubiquitous. To alleviate the reliance on 2D/3D ground-truths, we present a self-supervised 3D human pose and shape reconstruction framework that relies only on self-consistency between intermediate representations of images and projected 2D predictions. Specifically, we extract 2D joints and depth maps from monocular images as proxy inputs, which provides complementary clues to infer accurate 3D human meshes. 
Furthermore, to reduce the impacts from noisy and ambiguous inputs while better concentrate on the high-quality information, we design an uncertainty-aware module to automatically learn the reliability of the inputs at body-joint level based on the consistency between 2D joints and depth map. Experiments on benchmark datasets show that our approach outperforms other state-of-the-art methods at similar supervision levels. \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals b/data/2024/aaai/Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals new file mode 100644 index 0000000000..f57423f46b --- /dev/null +++ b/data/2024/aaai/Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals @@ -0,0 +1 @@ +Learning the dense bird's eye view (BEV) motion flow in a self-supervised manner is an emerging research for robotics and autonomous driving. Current self-supervised methods mainly rely on point correspondences between point clouds, which may introduce the problems of fake flow and inconsistency, hindering the model’s ability to learn accurate and realistic motion. In this paper, we introduce a novel cross-modality self-supervised training framework that effectively addresses these issues by leveraging multi-modality data to obtain supervision signals. We design three innovative supervision signals to preserve the inherent properties of scene motion, including the masked Chamfer distance loss, the piecewise rigidity loss, and the temporal consistency loss. Through extensive experiments, we demonstrate that our proposed self-supervised framework outperforms all previous self-supervision methods for the motion prediction task. \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction b/data/2024/aaai/Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction new file mode 100644 index 0000000000..40679065be --- /dev/null +++ b/data/2024/aaai/Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction @@ -0,0 +1 @@ +Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information. However, in the task of target speech extraction, certain elements of global and local semantic information in the reference speech, which are irrelevant to speaker identity, can lead to speaker confusion within the speech extraction network. To overcome this challenge, we propose a self-supervised disentangled representation learning method. Our approach tackles this issue through a two-phase process, utilizing a reference speech encoding network and a global information disentanglement network to gradually disentangle the speaker identity information from other irrelevant factors. We exclusively employ the disentangled speaker identity information to guide the speech extraction network. Moreover, we introduce the adaptive modulation Transformer to ensure that the acoustic representation of the mixed signal remains undisturbed by the speaker embeddings. This component incorporates speaker embeddings as conditional information, facilitating natural and efficient guidance for the speech extraction network. Experimental results substantiate the effectiveness of our meticulously crafted approach, showcasing a substantial reduction in the likelihood of speaker confusion. 
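A minimal illustration of the masked Chamfer distance supervision named in the cross-modality BEV motion-prediction abstract above: source points warped by the predicted flow are pulled towards their nearest neighbours in the next sweep, with a mask deciding which points contribute. The 2-D toy data and optimisation loop are assumptions; the paper's full objective also includes the rigidity and temporal-consistency terms.

import torch

def masked_chamfer_loss(src_points, pred_flow, tgt_points, src_mask):
    """Chamfer-style self-supervision: warp source points by the predicted flow
    and penalise distances to nearest neighbours in the next frame (both ways),
    counting only points kept by the mask."""
    warped = (src_points + pred_flow)[src_mask]               # (M, 2) BEV coordinates
    d = torch.cdist(warped, tgt_points)                       # pairwise distances
    forward = d.min(dim=1).values.mean()                      # warped -> target
    backward = d.min(dim=0).values.mean()                     # target -> warped
    return forward + backward

# toy example: 2-D BEV points from two consecutive sweeps
src = torch.rand(500, 2) * 50.0
tgt = src + torch.tensor([0.5, 0.0])                          # scene shifted by 0.5 m
flow = torch.zeros_like(src, requires_grad=True)
mask = torch.ones(500, dtype=torch.bool)                      # e.g. drop unreliable points here
opt = torch.optim.Adam([flow], lr=0.05)
for _ in range(100):
    loss = masked_chamfer_loss(src, flow, tgt, mask)
    opt.zero_grad(); loss.backward(); opt.step()
print(flow.mean(dim=0))                                       # should drift towards roughly (0.5, 0.0)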
\ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Framework Based on Subject-Wise Clustering for Human Subject Time Series Data b/data/2024/aaai/Self-Supervised Framework Based on Subject-Wise Clustering for Human Subject Time Series Data new file mode 100644 index 0000000000..22354a8b01 --- /dev/null +++ b/data/2024/aaai/Self-Supervised Framework Based on Subject-Wise Clustering for Human Subject Time Series Data @@ -0,0 +1 @@ +With the widespread adoption of IoT, wearable devices, and sensors, time series data from human subjects are significantly increasing in the healthcare domain. Due to the laborious nature of manual annotation in time series data and the requirement for human experts, self-supervised learning methods are attempted to alleviate the limited label situations. While existing self-supervised methods have been successful to achieve comparable performance to the fully supervised methods, there are still some limitations that need to be addressed, considering the nature of time series data from human subjects: In real-world clinical settings, data labels (e.g., sleep stages) are usually annotated by subject-level, and there is a substantial variation in patterns between subjects. Thus, a model should be designed to deal with not only the label scarcity but also subject-wise nature of data to ensure high performance in real-world scenarios. To mitigate these issues, we propose a novel self-supervised learning framework for human subject time series data: Subject-Aware Time Series Clustering (SA-TSC). In the unsupervised representation learning phase, SA-TSC adopts a subject-wise learning strategy rather than instance-wise learning which randomly samples data instances from different subjects within the batch during training. Specifically, we generate subject-graphs with our graph construction method based on Gumbel-Softmax and perform graph spectral clustering on each subject-graph. In addition, we utilize graph neural networks to capture dependencies between channels and design our own graph learning module motivated from self-supervised loss. Experimental results show the outstanding performance of our SA-TSC with the limited & subject-wise label setting, leading to its high applicability to the healthcare industry. The code is available at: https://github.com/DILAB-HYU/SA-TSC \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Likelihood Estimation with Energy Guidance for Anomaly Segmentation in Urban Scenes b/data/2024/aaai/Self-Supervised Likelihood Estimation with Energy Guidance for Anomaly Segmentation in Urban Scenes new file mode 100644 index 0000000000..bf1adbbed6 --- /dev/null +++ b/data/2024/aaai/Self-Supervised Likelihood Estimation with Energy Guidance for Anomaly Segmentation in Urban Scenes @@ -0,0 +1 @@ +Robust autonomous driving requires agents to accurately identify unexpected areas (anomalies) in urban scenes. To this end, some critical issues remain open: how to design advisable metric to measure anomalies, and how to properly generate training samples of anomaly data? Classical effort in anomaly detection usually resorts to pixel-wise uncertainty or sample synthesis, which ignores the contextual information and sometimes requires auxiliary data with fine-grained annotations. 
On the contrary, in this paper, we exploit the strong context-dependent nature of the segmentation task and design an energy-guided self-supervised framework for anomaly segmentation, which optimizes an anomaly head by maximizing the likelihood of self-generated anomaly pixels. For this purpose, we design two estimators to model the anomaly likelihood: one is a task-agnostic binary estimator, and the other depicts the likelihood as the residual of a task-oriented joint energy. Based on the proposed estimators, we devise an adaptive self-supervised training framework, which exploits the contextual reliance and estimated likelihood to refine mask annotations in anomaly areas. We conduct extensive experiments on the challenging Fishyscapes and Road Anomaly benchmarks, demonstrating that without any auxiliary data or synthetic models, our method still achieves comparable performance to supervised competitors. Code is available at https://github.com/yuanpengtu/SLEEG. \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search b/data/2024/aaai/Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search new file mode 100644 index 0000000000..90f97229f1 --- /dev/null +++ b/data/2024/aaai/Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search @@ -0,0 +1 @@ +Deep cross-modal hashing technology provides an effective and efficient cross-modal unified representation learning solution for cross-modal search. However, the existing methods neglect the implicit fine-grained multimodal knowledge relations between these modalities, such as when the image contains information that is not directly described in the text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, in order to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multimodal knowledge relations between the image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason on the multi-modal knowledge graph to sufficiently learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses the global coarse-grained and local fine-grained embeddings by a multi-head attention mechanism for inter-modal and intra-modal contrastive learning, so as to enhance the cross-modal unified representations with stronger discriminativeness and semantic consistency preserving power. With the joint training of intra-modal and inter-modal contrast, the invariant and modal-specific information of different modalities can be maintained in the final unified cross-modal hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods.
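As a point of reference for the inter-modal contrastive learning described in the abstract above, the snippet below is a generic symmetric InfoNCE-style contrastive loss between paired image and text embeddings. It is only a sketch of the general technique; the CMGCH objective, its hashing layers, and its attention-based fusion are not reproduced here, and the array shapes are illustrative assumptions.

import numpy as np

def cross_modal_info_nce(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) arrays; row i of both matrices comes from the
    # same image-text pair, so the diagonal holds positives and every
    # off-diagonal entry acts as a negative.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) scaled cosine similarities

    def nll_of_diagonal(m):
        # Cross-entropy of each row against its matching (diagonal) column.
        log_softmax = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_softmax))

    # Symmetric loss: image-to-text retrieval plus text-to-image retrieval.
    return 0.5 * (nll_of_diagonal(logits) + nll_of_diagonal(logits.T))

rng = np.random.default_rng(0)
print(cross_modal_info_nce(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))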
\ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Representation Learning with Meta Comprehensive Regularization b/data/2024/aaai/Self-Supervised Representation Learning with Meta Comprehensive Regularization new file mode 100644 index 0000000000..9db4d78b70 --- /dev/null +++ b/data/2024/aaai/Self-Supervised Representation Learning with Meta Comprehensive Regularization @@ -0,0 +1 @@ +Self-Supervised Learning (SSL) methods harness the concept of semantic invariance by utilizing data augmentation strategies to produce similar representations for different deformations of the same input. Essentially, the model captures the shared information among multiple augmented views of samples, while disregarding the non-shared information that may be beneficial for downstream tasks. To address this issue, we introduce a module called CompMod with Meta Comprehensive Regularization (MCR), embedded into existing self-supervised frameworks, to make the learned representations more comprehensive. Specifically, we update our proposed model through a bi-level optimization mechanism, enabling it to capture comprehensive features. Additionally, guided by the constrained extraction of features using maximum entropy coding, the self-supervised learning model learns more comprehensive features on top of learning consistent features. In addition, we provide theoretical support for our proposed method from information-theoretic and causal counterfactual perspectives. Experimental results show that our method achieves significant improvement in classification, object detection and semantic segmentation tasks on multiple benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Self-Training Based Few-Shot Node Classification by Knowledge Distillation b/data/2024/aaai/Self-Training Based Few-Shot Node Classification by Knowledge Distillation new file mode 100644 index 0000000000..9305818ced --- /dev/null +++ b/data/2024/aaai/Self-Training Based Few-Shot Node Classification by Knowledge Distillation @@ -0,0 +1,2 @@ +Self-training based few-shot node classification (FSNC) methods have shown excellent performance in real applications, but they cannot make full use of the information in the base set and are easily affected by the quality of pseudo-labels. To address these issues, this paper proposes a new self-training FSNC method that involves representation distillation and pseudo-label distillation. Specifically, the representation distillation includes two knowledge distillation methods (i.e., the local representation distillation and the global representation distillation) to transfer the information in the base set to the novel set. The pseudo-label distillation is designed to conduct knowledge distillation on the pseudo-labels to improve their quality. +Experimental results show that our method achieves superior performance compared with state-of-the-art methods. Our code and a comprehensive theoretical version are available at https://github.com/zongqianwu/KD-FSNC. \ No newline at end of file diff --git a/data/2024/aaai/SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency b/data/2024/aaai/SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency new file mode 100644 index 0000000000..4d72bb2849 --- /dev/null +++ b/data/2024/aaai/SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency @@ -0,0 +1 @@ +This work presents an effective depth-consistency Self-Prompt Transformer, termed SelfPromer, for image dehazing.
It is motivated by an observation that the estimated depths of an image with haze residuals and its clear counterpart vary. Enforcing the depth consistency of dehazed images with clear ones, therefore, is essential for dehazing. For this purpose, we develop a prompt based on the features of depth differences between the hazy input images and corresponding clear counterparts that can guide dehazing models for better restoration. Specifically, we first apply deep features extracted from the input images to the depth difference features for generating the prompt that contains the haze residual information in the input. Then we propose a prompt embedding module that is designed to perceive the haze residuals, by linearly adding the prompt to the deep features. Further, we develop an effective prompt attention module to pay more attention to haze residuals for better removal. By incorporating the prompt, prompt embedding, and prompt attention into an encoder-decoder network based on VQGAN, we can achieve better perception quality. As the depths of clear images are not available at inference, and the dehazed images with one-time feed-forward execution may still contain a portion of haze residuals, we propose a new continuous self-prompt inference that can iteratively correct the dehazing model towards better haze-free image generation. Extensive experiments show that our SelfPromer performs favorably against the state-of-the-art approaches on both synthetic and real-world datasets in terms of perception metrics including NIQE, PI, and PIQE. The source codes will be made available at https://github.com/supersupercong/SelfPromer. \ No newline at end of file diff --git a/data/2024/aaai/SemLa: A Visual Analysis System for Fine-Grained Text Classification b/data/2024/aaai/SemLa: A Visual Analysis System for Fine-Grained Text Classification new file mode 100644 index 0000000000..77957e0671 --- /dev/null +++ b/data/2024/aaai/SemLa: A Visual Analysis System for Fine-Grained Text Classification @@ -0,0 +1 @@ +Fine-grained text classification requires models to distinguish between many fine-grained classes that are hard to tell apart. However, despite the increased risk of models relying on confounding features and predictions being especially difficult to interpret in this context, existing work on the interpretability of fine-grained text classification is severely limited. Therefore, we introduce our visual analysis system, SemLa, which incorporates novel visualization techniques that are tailored to this challenge. Our evaluation based on case studies and expert feedback shows that SemLa can be a powerful tool for identifying model weaknesses, making decisions about data annotation, and understanding the root cause of errors. \ No newline at end of file diff --git a/data/2024/aaai/SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation b/data/2024/aaai/SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation new file mode 100644 index 0000000000..a841540cd7 --- /dev/null +++ b/data/2024/aaai/SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation @@ -0,0 +1 @@ +This work explores the zero-shot adaptation capability of semantic skills, semantically interpretable experts' behavior patterns, in cross-domain settings, where a user input in interleaved multi-modal snippets can prompt a new long-horizon task for different domains. 
In these cross-domain settings, we present a semantic skill translator framework, SemTra, which utilizes a set of multi-modal models to extract skills from the snippets, and leverages the reasoning capabilities of a pretrained language model to adapt these extracted skills to the target domain. The framework employs a two-level hierarchy for adaptation: task adaptation and skill adaptation. During task adaptation, seq-to-seq translation by the language model transforms the extracted skills into a semantic skill sequence, which is tailored to fit the cross-domain contexts. Skill adaptation focuses on optimizing each semantic skill for the target domain context, through parametric instantiations that are facilitated by language prompting and contrastive learning-based context inferences. This hierarchical adaptation empowers the framework to not only infer a complex task specification in one shot from the interleaved multi-modal snippets, but also adapt it to new domains with zero-shot learning abilities. We evaluate our framework with Meta-World, Franka Kitchen, RLBench, and CARLA environments. The results demonstrate the framework's superiority in performing long-horizon tasks and adapting to different domains, showing its broad applicability in practical use cases, such as cognitive robots interpreting abstract instructions and autonomous vehicles operating under varied configurations. \ No newline at end of file diff --git a/data/2024/aaai/Semantic Complete Scene Forecasting from a 4D Dynamic Point Cloud Sequence b/data/2024/aaai/Semantic Complete Scene Forecasting from a 4D Dynamic Point Cloud Sequence new file mode 100644 index 0000000000..e94e7a3d8b --- /dev/null +++ b/data/2024/aaai/Semantic Complete Scene Forecasting from a 4D Dynamic Point Cloud Sequence @@ -0,0 +1 @@ +We study a new problem of semantic complete scene forecasting (SCSF) in this work. Given a 4D dynamic point cloud sequence, our goal is to forecast the complete scene corresponding to the next future frame along with its semantic labels. To tackle this challenging problem, we properly model the synergetic relationship between future forecasting and semantic scene completion through a novel network named SCSFNet. SCSFNet leverages a hybrid geometric representation for high-resolution complete scene forecasting. To leverage multi-frame observation as well as the understanding of scene dynamics to ease the completion task, SCSFNet introduces an attention-based skip connection scheme. To ease the need to model occlusion variations and to better focus on the occluded part, SCSFNet utilizes auxiliary visibility grids to guide the forecasting task. To evaluate the effectiveness of SCSFNet, we conduct experiments on various benchmarks including two large-scale indoor benchmarks we contributed and the outdoor SemanticKITTI benchmark. Extensive experiments show SCSFNet outperforms baseline methods on multiple metrics by a large margin, and also prove the synergy between future forecasting and semantic scene completion. The project page with code is available at scsfnet.github.io.
\ No newline at end of file diff --git a/data/2024/aaai/Semantic Lens: Instance-Centric Semantic Alignment for Video Super-resolution b/data/2024/aaai/Semantic Lens: Instance-Centric Semantic Alignment for Video Super-resolution new file mode 100644 index 0000000000..4024ec5325 --- /dev/null +++ b/data/2024/aaai/Semantic Lens: Instance-Centric Semantic Alignment for Video Super-resolution @@ -0,0 +1 @@ +As a critical clue of video super-resolution (VSR), inter-frame alignment significantly impacts overall performance. However, accurate pixel-level alignment is a challenging task due to the intricate motion interweaving in the video. In response to this issue, we introduce a novel paradigm for VSR named Semantic Lens, predicated on semantic priors drawn from degraded videos. Specifically, video is modeled as instances, events, and scenes via a Semantic Extractor. Those semantics assist the Pixel Enhancer in understanding the recovered contents and generating more realistic visual results. The distilled global semantics embody the scene information of each frame, while the instance-specific semantics assemble the spatial-temporal contexts related to each instance. Furthermore, we devise a Semantics-Powered Attention Cross-Embedding (SPACE) block to bridge the pixel-level features with semantic knowledge, composed of a Global Perspective Shifter (GPS) and an Instance-Specific Semantic Embedding Encoder (ISEE). Concretely, the GPS module generates pairs of affine transformation parameters for pixel-level feature modulation conditioned on global semantics. After that the ISEE module harnesses the attention mechanism to align the adjacent frames in the instance-centric semantic space. In addition, we incorporate a simple yet effective pre-alignment module to alleviate the difficulty of model training. Extensive experiments demonstrate the superiority of our model over existing state-of-the-art VSR methods. \ No newline at end of file diff --git a/data/2024/aaai/Semantic Segmentation in Multiple Adverse Weather Conditions with Domain Knowledge Retention b/data/2024/aaai/Semantic Segmentation in Multiple Adverse Weather Conditions with Domain Knowledge Retention new file mode 100644 index 0000000000..6d37ad4149 --- /dev/null +++ b/data/2024/aaai/Semantic Segmentation in Multiple Adverse Weather Conditions with Domain Knowledge Retention @@ -0,0 +1 @@ +Semantic segmentation's performance is often compromised when applied to unlabeled adverse weather conditions. Unsupervised domain adaptation is a potential approach to enhancing the model's adaptability and robustness to adverse weather. However, existing methods encounter difficulties when sequentially adapting the model to multiple unlabeled adverse weather conditions. They struggle to acquire new knowledge while also retaining previously learned knowledge. To address these problems, we propose a semantic segmentation method for multiple adverse weather conditions that incorporates adaptive knowledge acquisition, pseudo-label blending, and weather composition replay. Our adaptive knowledge acquisition enables the model to avoid learning from extreme images that could potentially cause the model to forget. In our approach of blending pseudo-labels, we not only utilize the current model but also integrate the previously learned model into the ongoing learning process. This collaboration between the current teacher and the previous model enhances the robustness of the pseudo-labels for the current target. 
Our weather composition replay mechanism allows the model to continuously refine its previously learned weather information while simultaneously learning from the new target domain. Our method consistently outperforms the state-of-the-art methods, and obtains the best performance with an averaged mIoU (%) of 65.7 and the lowest forgetting (%) of 3.6, against 60.1 and 11.3, on the ACDC datasets for a four-target continual multi-target domain adaptation. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning b/data/2024/aaai/Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning new file mode 100644 index 0000000000..ef699010cb --- /dev/null +++ b/data/2024/aaai/Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning @@ -0,0 +1 @@ +The development of autoregressive modeling (AM) in computer vision lags behind natural language processing (NLP) in self-supervised pre-training. This is mainly caused by the challenge that images are not sequential signals and lack a natural order when applying autoregressive modeling. In this study, inspired by human beings’ way of grasping an image, i.e., focusing on the main object first, we present a semantic-aware autoregressive image modeling (SemAIM) method to tackle this challenge. The key insight of SemAIM is to autoregressively model images from the semantic patches to the less semantic patches. To this end, we first calculate a semantic-aware permutation of patches according to their feature similarities and then perform the autoregression procedure based on the permutation. In addition, considering that the raw pixels of patches are low-level signals and are not ideal prediction targets for learning high-level semantic representation, we also explore utilizing the patch features as the prediction targets. Extensive experiments are conducted on a broad range of downstream tasks, including image classification, object detection, and instance/semantic segmentation, to evaluate the performance of SemAIM. The results demonstrate SemAIM achieves state-of-the-art performance compared with other self-supervised methods. Specifically, with ViT-B, SemAIM achieves 84.1% top-1 accuracy for fine-tuning on ImageNet, 51.3% AP and 45.4% AP for object detection and instance segmentation on COCO, which outperforms the vanilla MAE by 0.5%, 1.0%, and 0.5%, respectively. Code is available at https://github.com/skyoux/SemAIM. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Aware Data Augmentation for Text-to-Image Synthesis b/data/2024/aaai/Semantic-Aware Data Augmentation for Text-to-Image Synthesis new file mode 100644 index 0000000000..851c1a9da6 --- /dev/null +++ b/data/2024/aaai/Semantic-Aware Data Augmentation for Text-to-Image Synthesis @@ -0,0 +1 @@ +Data augmentation has been recently leveraged as an effective regularizer in various vision-language deep neural networks. However, in text-to-image synthesis (T2Isyn), current augmentation wisdom still suffers from the semantic mismatch between augmented paired data. Even worse, semantic collapse may occur when generated images are less semantically constrained. In this paper, we develop a novel Semantic-aware Data Augmentation (SADA) framework dedicated to T2Isyn.
In particular, we propose to augment texts in the semantic space via an Implicit Textual Semantic Preserving Augmentation, in conjunction with a specifically designed Image Semantic Regularization Loss as Generated Image Semantic Conservation, to cope well with semantic mismatch and collapse. As one major contribution, we theoretically show that the Implicit Textual Semantic Preserving Augmentation certifies better text-image consistency, while the Image Semantic Regularization Loss, by regularizing the semantics of generated images, avoids semantic collapse and enhances image quality. Extensive experiments validate that SADA enhances text-image consistency and improves image quality significantly in T2Isyn models across various backbones. Notably, incorporating SADA during the tuning process of Stable Diffusion models also yields performance improvements. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Aware Transformation-Invariant RoI Align b/data/2024/aaai/Semantic-Aware Transformation-Invariant RoI Align new file mode 100644 index 0000000000..7789f29ceb --- /dev/null +++ b/data/2024/aaai/Semantic-Aware Transformation-Invariant RoI Align @@ -0,0 +1 @@ +Great progress has been made in learning-based object detection methods in the last decade. Two-stage detectors often have higher detection accuracy than one-stage detectors, due to the use of region of interest (RoI) feature extractors which extract transformation-invariant RoI features for different RoI proposals, making refinement of bounding boxes and prediction of object categories more robust and accurate. However, previous RoI feature extractors can only extract invariant features under limited transformations. In this paper, we propose a novel RoI feature extractor, termed Semantic RoI Align (SRA), which is capable of extracting invariant RoI features under a variety of transformations for two-stage detectors. Specifically, we propose a semantic attention module to adaptively determine different sampling areas by leveraging the global and local semantic relationship within the RoI. We also propose a Dynamic Feature Sampler which dynamically samples features based on the RoI aspect ratio to enhance the efficiency of SRA, and a new position embedding, i.e., Area Embedding, to provide more accurate position information for SRA through an improved sampling area representation. Experiments show that our model significantly outperforms baseline models with slight computational overhead. In addition, it shows excellent generalization ability and can be used to improve performance with various state-of-the-art backbones and detection methods. The code is available at https://github.com/cxjyxxme/SemanticRoIAlign. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification b/data/2024/aaai/Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification new file mode 100644 index 0000000000..6c14163179 --- /dev/null +++ b/data/2024/aaai/Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification @@ -0,0 +1 @@ +Existing image augmentation methods consist of two categories: perturbation-based methods and generative methods. Perturbation-based methods apply pre-defined perturbations to augment an original image, but only locally vary the image, thus lacking image diversity.
In contrast, generative methods bring more image diversity in the augmented images but may not preserve semantic consistency, thus may incorrectly change the essential semantics of the original image. To balance image diversity and semantic consistency in augmented images, we propose SGID, a Semantic-guided Generative Image augmentation method with Diffusion models for image classification. Specifically, SGID employs diffusion models to generate augmented images with good image diversity. More importantly, SGID takes image labels and captions as guidance to maintain semantic consistency between the augmented and original images. Experimental results show that SGID outperforms the best augmentation baseline by 1.72% on ResNet-50 (from scratch), 0.33% on ViT (ImageNet-21k), and 0.14% on CLIP-ViT (LAION-2B). Moreover, SGID can be combined with other image augmentation baselines and further improves the overall performance. We demonstrate the semantic consistency and image diversity of SGID through quantitative human and automated evaluations, as well as qualitative case studies. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Guided Novel Category Discovery b/data/2024/aaai/Semantic-Guided Novel Category Discovery new file mode 100644 index 0000000000..68c5f64dac --- /dev/null +++ b/data/2024/aaai/Semantic-Guided Novel Category Discovery @@ -0,0 +1 @@ +The Novel Category Discovery problem aims to cluster an unlabeled set with the help of a labeled set consisting of disjoint but related classes. However, existing models treat class names as discrete one-hot labels and ignore the semantic understanding of these classes. In this paper, we propose a new setting named Semantic-guided Novel Category Discovery (SNCD), which requires the model to not only cluster the unlabeled images but also semantically recognize these images based on a set of their class names. The first challenge we confront pertains to effectively leveraging the class names of unlabeled images, given the inherent gap between the visual and linguistic domains. To address this issue, we incorporate a semantic-aware recognition mechanism. This is achieved by constructing dynamic class-wise visual prototypes as well as a semantic similarity matrix that enables the projection of visual features into the semantic space. The second challenge originates from the granularity disparity between the classification and clustering tasks. To deal with this, we develop a semantic-aware clustering process to facilitate the exchange of knowledge between the two tasks. Through extensive experiments, we demonstrate the mutual benefits of the recognition and clustering tasks, which can be jointly optimized. Experimental results on multiple datasets confirm the effectiveness of our proposed method. Our code is available at https://github.com/wang-weishuai/Semantic-guided-NCD. \ No newline at end of file diff --git a/data/2024/aaai/Semi-Supervised Blind Image Quality Assessment through Knowledge Distillation and Incremental Learning b/data/2024/aaai/Semi-Supervised Blind Image Quality Assessment through Knowledge Distillation and Incremental Learning new file mode 100644 index 0000000000..76276bdd14 --- /dev/null +++ b/data/2024/aaai/Semi-Supervised Blind Image Quality Assessment through Knowledge Distillation and Incremental Learning @@ -0,0 +1 @@ +Blind Image Quality Assessment (BIQA) aims to simulate human assessment of image quality. It has a great demand for labeled data, which is often insufficient in practice. 
Some researchers employ unsupervised methods to address this issue, but it is challenging for such methods to emulate the human subjective system. To this end, we introduce a unified framework that combines semi-supervised and incremental learning to address the mentioned issue. Specifically, when training data is limited, semi-supervised learning is necessary to leverage extensive unlabeled data. To facilitate semi-supervised learning, we use knowledge distillation to assign pseudo-labels to unlabeled data, preserving analytical capability. To gradually improve the quality of pseudo labels, we introduce incremental learning. However, incremental learning can lead to catastrophic forgetting. We employ Experience Replay by selecting representative samples during multiple rounds of semi-supervised learning, to alleviate forgetting and ensure model stability. Experimental results show that the proposed approach achieves state-of-the-art performance across various benchmark datasets. After being trained on the LIVE dataset, our method can be directly transferred to the CSIQ dataset. Compared with other methods, it significantly outperforms unsupervised methods on the CSIQ dataset with a marginal performance drop (-0.002) on the LIVE dataset. In conclusion, our proposed method demonstrates its potential to tackle the challenges in real-world production processes. \ No newline at end of file diff --git a/data/2024/aaai/Semi-factual Explanations in AI b/data/2024/aaai/Semi-factual Explanations in AI new file mode 100644 index 0000000000..ca084be114 --- /dev/null +++ b/data/2024/aaai/Semi-factual Explanations in AI @@ -0,0 +1 @@ +Most of the recent works on post-hoc example-based eXplainable AI (XAI) methods revolve around employing counterfactual explanations to provide justification for the predictions made by AI systems. Counterfactuals show what changes to the input features change the output decision. However, a lesser-known special case of the counterfactual is the semi-factual, which provides explanations about what changes to the input features do not change the output decision. Semi-factuals are potentially as useful as counterfactuals but have received little attention in the XAI literature. My doctoral research aims to establish a comprehensive framework for the use of semi-factuals in XAI by developing novel methods for their computation, supported by user tests. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised 3D Object Detection with PatchTeacher and PillarMix b/data/2024/aaai/Semi-supervised 3D Object Detection with PatchTeacher and PillarMix new file mode 100644 index 0000000000..b31d7e6329 --- /dev/null +++ b/data/2024/aaai/Semi-supervised 3D Object Detection with PatchTeacher and PillarMix @@ -0,0 +1 @@ +Semi-supervised learning aims to leverage abundant unlabeled data to improve model performance. Current semi-supervised 3D object detection methods typically use a teacher to generate pseudo labels for a student, and the quality of the pseudo labels is essential for the final performance. In this paper, we propose PatchTeacher, which focuses on partial scene 3D object detection to provide high-quality pseudo labels for the student. Specifically, we divide a complete scene into a series of patches and feed them to our PatchTeacher sequentially.
PatchTeacher leverages the low memory consumption advantage of partial scene detection to process point clouds with a high-resolution voxelization, which can minimize the information loss of quantization and extract more fine-grained features. However, it is non-trivial to train a detector on fractions of the scene. Therefore, we introduce three key techniques, i.e., Patch Normalizer, Quadrant Align, and Fovea Selection, to improve the performance of PatchTeacher. Moreover, we devise PillarMix, a strong data augmentation strategy that mixes truncated pillars from different LiDAR scans to generate diverse training samples and thus help the model learn a more general representation. Extensive experiments conducted on the Waymo and ONCE datasets verify the effectiveness and superiority of our method and we achieve new state-of-the-art results, surpassing existing methods by a large margin. Codes are available at https://github.com/LittlePey/PTPM. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised Active Learning for Video Action Detection b/data/2024/aaai/Semi-supervised Active Learning for Video Action Detection new file mode 100644 index 0000000000..d49c9d626b --- /dev/null +++ b/data/2024/aaai/Semi-supervised Active Learning for Video Action Detection @@ -0,0 +1 @@ +In this work, we focus on label efficient learning for video action detection. We develop a novel semi-supervised active learning approach which utilizes both labeled as well as unlabeled data along with informative sample selection for action detection. Video action detection requires spatio-temporal localization along with classification, which poses several challenges for both active learning (informative sample selection) as well as semi-supervised learning (pseudo label generation). First, we propose NoiseAug, a simple augmentation strategy which effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering which enables effective utilization of pseudo labels for SSL in video action detection by emphasizing relevant activity regions within a video. We evaluate the proposed approach on three different benchmark datasets, UCF-101-24, JHMDB-21, and Youtube-VOS. First, we demonstrate its effectiveness on video action detection, where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning along with several baseline approaches on both UCF101-24 and JHMDB-21. Next, we also show its effectiveness on Youtube-VOS for video object segmentation, demonstrating its generalization capability for other dense prediction tasks in videos. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix b/data/2024/aaai/Semi-supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix new file mode 100644 index 0000000000..85b8d02daa --- /dev/null +++ b/data/2024/aaai/Semi-supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix @@ -0,0 +1 @@ +Class-agnostic motion prediction methods aim to comprehend motion within open-world scenarios, holding significance for autonomous driving systems. However, training a high-performance model in a fully-supervised manner always requires substantial amounts of manually annotated data, which can be both expensive and time-consuming to obtain.
To address this challenge, our study explores the potential of semi-supervised learning (SSL) for class-agnostic motion prediction. Our SSL framework adopts a consistency-based self-training paradigm, enabling the model to learn from unlabeled data by generating pseudo labels through test-time inference. To improve the quality of pseudo labels, we propose a novel motion selection and re-generation module. This module effectively selects reliable pseudo labels and re-generates unreliable ones. Furthermore, we propose two data augmentation strategies: temporal sampling and BEVMix. These strategies facilitate consistency regularization in SSL. Experiments conducted on nuScenes demonstrate that our SSL method can surpass the self-supervised approach by a large margin by utilizing only a tiny fraction of labeled data. Furthermore, our method exhibits comparable performance to weakly supervised and some fully supervised methods. These results highlight the ability of our method to strike a favorable balance between annotation costs and performance. Code will be available at https://github.com/kwwcv/SSMP. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised Learning of Dynamical Systems with Neural Ordinary Differential Equations: A Teacher-Student Model Approach b/data/2024/aaai/Semi-supervised Learning of Dynamical Systems with Neural Ordinary Differential Equations: A Teacher-Student Model Approach new file mode 100644 index 0000000000..b87bb24050 --- /dev/null +++ b/data/2024/aaai/Semi-supervised Learning of Dynamical Systems with Neural Ordinary Differential Equations: A Teacher-Student Model Approach @@ -0,0 +1,3 @@ +Modeling dynamical systems is crucial for a wide range of tasks, but it remains challenging due to complex nonlinear dynamics, limited observations, or lack of prior knowledge. Recently, data-driven approaches such as Neural Ordinary Differential Equations (NODE) have shown promising results by leveraging the expressive power of neural networks to model unknown dynamics. However, these approaches often suffer from limited labeled training data, leading to poor generalization and suboptimal predictions. On the other hand, semi-supervised algorithms can utilize abundant unlabeled data and have demonstrated good performance in classification and regression tasks. +We propose TS-NODE, the first semi-supervised approach to modeling dynamical systems with NODE. TS-NODE explores cheaply generated synthetic pseudo rollouts to broaden exploration in the state space and to tackle the challenges brought by the lack of ground-truth system data under a teacher-student model. TS-NODE employs a unified optimization framework that corrects the teacher model based on the student's feedback while mitigating the potential false system dynamics present in pseudo rollouts. +TS-NODE demonstrates significant performance improvements over a baseline Neural ODE model on multiple dynamical system modeling tasks. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised Open-World Object Detection b/data/2024/aaai/Semi-supervised Open-World Object Detection new file mode 100644 index 0000000000..88a66405a7 --- /dev/null +++ b/data/2024/aaai/Semi-supervised Open-World Object Detection @@ -0,0 +1 @@ +The conventional open-world object detection (OWOD) problem setting first distinguishes known and unknown classes and then incrementally learns the unknown objects when they are introduced with labels in the subsequent tasks.
However, the current OWOD formulation heavily relies on the external human oracle for knowledge input during the incremental learning stages. Such run-time reliance makes this formulation less realistic in a real-world deployment. To address this, we introduce a more realistic formulation, named semi-supervised open-world detection (SS-OWOD), that reduces the annotation cost by casting the incremental learning stages of OWOD in a semi-supervised manner. We demonstrate that the performance of the state-of-the-art OWOD detector dramatically deteriorates in the proposed SS-OWOD setting. Therefore, we introduce a novel SS-OWOD detector, named SS-OWFormer, that utilizes a feature-alignment scheme to better align the object query representations between the original and augmented images to leverage the large unlabeled and few labeled data. We further introduce a pseudo-labeling scheme for unknown detection that exploits the inherent capability of decoder object queries to capture object-specific information. On the COCO dataset, our SS-OWFormer using only 50% of the labeled data achieves detection performance that is on par with the state-of-the-art (SOTA) OWOD detector using 100% of the labeled data. Further, our SS-OWFormer achieves an absolute gain of 4.8% in unknown recall over the SOTA OWOD detector. Lastly, we demonstrate the effectiveness of our SS-OWOD problem setting and approach for remote sensing object detection, proposing carefully curated splits and baseline performance evaluations. Our experiments on 4 datasets including MS COCO, PASCAL, Objects365 and DOTA demonstrate the effectiveness of our approach. Our source code, models and splits are available at https://github.com/sahalshajim/SS-OWFormer \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised TEE Segmentation via Interacting with SAM Equipped with Noise-Resilient Prompting b/data/2024/aaai/Semi-supervised TEE Segmentation via Interacting with SAM Equipped with Noise-Resilient Prompting new file mode 100644 index 0000000000..f941426efc --- /dev/null +++ b/data/2024/aaai/Semi-supervised TEE Segmentation via Interacting with SAM Equipped with Noise-Resilient Prompting @@ -0,0 +1 @@ +Semi-supervised learning (SSL) is a powerful tool to address the challenge of insufficient annotated data in medical segmentation problems. However, existing semi-supervised methods mainly rely on internal knowledge for pseudo labeling, which is biased due to the distribution mismatch between the highly imbalanced labeled and unlabeled data. Segmenting the left atrial appendage (LAA) from transesophageal echocardiogram (TEE) images is a typical medical image segmentation task characterized by a scarcity of professional annotations and diverse data distributions, for which existing SSL models cannot achieve satisfactory performance. In this paper, we propose a novel strategy to mitigate the inherent challenge of distribution mismatch in SSL by, for the first time, incorporating a large foundation model (i.e., SAM in our implementation) into an SSL model to improve the quality of pseudo labels. We further propose a new self-reconstruction mechanism to generate both noise-resilient prompts to improve SAM’s generalization capability over TEE images and self-perturbations to stabilize the training process and reduce the impact of noisy labels. We conduct extensive experiments on an in-house TEE dataset; experimental results demonstrate that our method achieves better performance than state-of-the-art SSL models.
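To make the SAM-based pseudo-labeling idea in the abstract above more concrete, here is a minimal sketch of prompting the publicly released Segment Anything model with point prompts so that rough foreground hints become a pseudo mask. It only illustrates generic SAM prompting; the paper's noise-resilient prompting and self-reconstruction mechanisms are not reproduced, the checkpoint path and the hard-coded prompt point are placeholders, and the official ViT-B checkpoint is assumed to have been downloaded.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor  # pip install segment-anything

# Load the (assumed) ViT-B SAM checkpoint and wrap it in a predictor.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def sam_pseudo_label(image_rgb, point_prompts):
    # image_rgb:     (H, W, 3) uint8 image, e.g. a TEE frame converted to RGB.
    # point_prompts: (K, 2) array of (x, y) foreground clicks; in practice these
    #                would come from an SSL model's confident predictions
    #                (hypothetical source), not be hand-picked.
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        point_coords=point_prompts,
        point_labels=np.ones(len(point_prompts), dtype=np.int64),  # 1 = foreground
        multimask_output=True,
    )
    # Keep the candidate mask SAM itself scores highest as the pseudo label.
    return masks[int(np.argmax(scores))]

pseudo_mask = sam_pseudo_label(np.zeros((256, 256, 3), dtype=np.uint8),
                               np.array([[128, 128]]))
print(pseudo_mask.shape)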
\ No newline at end of file diff --git a/data/2024/aaai/SentinelLMs: Encrypted Input Adaptation and Fine-Tuning of Language Models for Private and Secure Inference b/data/2024/aaai/SentinelLMs: Encrypted Input Adaptation and Fine-Tuning of Language Models for Private and Secure Inference new file mode 100644 index 0000000000..c78b6efd24 --- /dev/null +++ b/data/2024/aaai/SentinelLMs: Encrypted Input Adaptation and Fine-Tuning of Language Models for Private and Secure Inference @@ -0,0 +1 @@ +This paper addresses the privacy and security concerns associated with deep neural language models, which serve as crucial components in various modern AI-based applications. These models are often used after being pre-trained and fine-tuned for specific tasks, with deployment on servers accessed through the internet. However, this introduces two fundamental risks: (a) the transmission of user inputs to the server via the network gives rise to interception vulnerabilities, and (b) privacy concerns emerge as organizations that deploy such models store user data with restricted context. To address this, we propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text. The original pre-trained language model first undergoes a quick adaptation (without any further pre-training) with a series of irreversible transformations applied to the tokenizer and token embeddings. This enables the model to perform inference on encrypted inputs while preventing reverse engineering of text from model parameters and intermediate outputs. After adaptation, models are fine-tuned on encrypted versions of existing training datasets. Experimental evaluation employing adapted versions of renowned models (e.g., BERT, RoBERTa) across established benchmark English and multilingual datasets for text classification and sequence labeling shows that encrypted models achieve performance parity with their original counterparts. This serves to safeguard performance, privacy, and security cohesively. \ No newline at end of file diff --git a/data/2024/aaai/Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation b/data/2024/aaai/Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation new file mode 100644 index 0000000000..1231ec9337 --- /dev/null +++ b/data/2024/aaai/Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation @@ -0,0 +1 @@ +Large language models (LLMs) have been widely used in various applications but are known to suffer from issues related to untruthfulness and toxicity. While parameter-efficient modules (PEMs) have demonstrated their effectiveness in equipping models with new skills, leveraging PEMs for deficiency unlearning remains underexplored. In this work, we propose a PEM operation approach, namely Extraction-before-Subtraction (Ext-Sub), to enhance the truthfulness and detoxification of LLMs through the integration of an "expert" PEM and an "anti-expert" PEM. Remarkably, even anti-expert PEMs possess valuable capabilities due to their proficiency in generating fabricated content, which necessitates language modeling and logical narrative competence. Rather than merely negating the parameters, our approach involves extracting and eliminating solely the deficiency capability within the anti-expert PEM while preserving the general capabilities.
To evaluate the effectiveness of our approach in terms of truthfulness and detoxification, we conduct extensive experiments on LLMs, encompassing additional abilities such as language modelling and mathematical reasoning. Our empirical results demonstrate that our approach effectively improves truthfulness and detoxification, while largely preserving the fundamental abilities of LLMs. \ No newline at end of file diff --git a/data/2024/aaai/SeqGPT: An Out-of-the-Box Large Language Model for Open Domain Sequence Understanding b/data/2024/aaai/SeqGPT: An Out-of-the-Box Large Language Model for Open Domain Sequence Understanding new file mode 100644 index 0000000000..a87a340a84 --- /dev/null +++ b/data/2024/aaai/SeqGPT: An Out-of-the-Box Large Language Model for Open Domain Sequence Understanding @@ -0,0 +1 @@ +Large language models (LLMs) have shown impressive abilities for open-domain NLP tasks. However, LLMs are sometimes too footloose for natural language understanding (NLU) tasks which always have restricted output and input format. Their performances on NLU tasks are highly related to prompts or demonstrations and are shown to be poor at performing several representative NLU tasks, such as event extraction and entity typing. To this end, we present SeqGPT, a bilingual (i.e., English and Chinese) open-source autoregressive model specially enhanced for open-domain natural language understanding. We express all NLU tasks with two atomic tasks, which define fixed instructions to restrict the input and output format but still ``open'' for arbitrarily varied label sets. The model is first instruction-tuned with extremely fine-grained labeled data synthesized by ChatGPT and then further fine-tuned by 233 different atomic tasks from 152 datasets across various domains. The experimental results show that SeqGPT has decent classification and extraction ability, and is capable of performing language understanding tasks on unseen domains. We also conduct empirical studies on the scaling of data and model size as well as on the transfer across tasks. Our models are accessible at https://github.com/Alibaba-NLP/SeqGPT. \ No newline at end of file diff --git a/data/2024/aaai/SeqRank: Sequential Ranking of Salient Objects b/data/2024/aaai/SeqRank: Sequential Ranking of Salient Objects new file mode 100644 index 0000000000..99cd736e08 --- /dev/null +++ b/data/2024/aaai/SeqRank: Sequential Ranking of Salient Objects @@ -0,0 +1 @@ +Salient Object Ranking (SOR) is the process of predicting the order of an observer's attention to objects when viewing a complex scene. Existing SOR methods primarily focus on ranking various scene objects simultaneously by exploring their spatial and semantic properties. However, their solutions of simultaneously ranking all salient objects do not align with human viewing behavior, and may result in incorrect attention shift predictions. We observe that humans view a scene through a sequential and continuous process involving a cycle of foveating to objects of interest with our foveal vision while using peripheral vision to prepare for the next fixation location. For instance, when we see a flying kite, our foveal vision captures the kite itself, while our peripheral vision can help us locate the person controlling it such that we can smoothly divert our attention to it next. By repeatedly carrying out this cycle, we can gain a thorough understanding of the entire scene. 
Based on this observation, we propose to model the dynamic interplay between foveal and peripheral vision to predict human attention shifts sequentially. To this end, we propose a novel SOR model, SeqRank, which reproduces foveal vision to extract high-acuity visual features for accurate salient instance segmentation while also modeling peripheral vision to select the object that is likely to grab the viewer’s attention next. By incorporating both types of vision, our model can mimic human viewing behavior better and provide a more faithful ranking among various scene objects. Most notably, our model improves the SA-SOR/MAE scores by +6.1%/-13.0% on IRSR, compared with the state-of-the-art. Extensive experiments show the superior performance of our model on the SOR benchmarks. Code is available at https://github.com/guanhuankang/SeqRank. \ No newline at end of file diff --git a/data/2024/aaai/Sequential Fusion Based Multi-Granularity Consistency for Space-Time Transformer Tracking b/data/2024/aaai/Sequential Fusion Based Multi-Granularity Consistency for Space-Time Transformer Tracking new file mode 100644 index 0000000000..7961f0eb30 --- /dev/null +++ b/data/2024/aaai/Sequential Fusion Based Multi-Granularity Consistency for Space-Time Transformer Tracking @@ -0,0 +1 @@ +Regarded as a template-matching task for a long time, visual object tracking has witnessed significant progress in space-wise exploration. However, since tracking is performed on videos with substantial time-wise information, it is important to simultaneously mine the temporal contexts which have not yet been deeply explored. Previous supervised works mostly consider template reform as the breakthrough point, but they are often limited by additional computational burdens or the quality of chosen templates. To address this issue, we propose a Space-Time Consistent Transformer Tracker (STCFormer), which uses a sequential fusion framework with multi-granularity consistency constraints to learn spatiotemporal context information. We design a sequential fusion framework that recombines template and search images based on tracking results from chronological frames, fusing updated tracking states in training. To further overcome the over-reliance on the fixed template without increasing computational complexity, we design three space-time consistent constraints: Label Consistency Loss (LCL) for label-level consistency, Attention Consistency Loss (ACL) for patch-level ROI consistency, and Semantic Consistency Loss (SCL) for feature-level semantic consistency. Specifically, in ACL and SCL, the label information is used to constrain the attention and feature consistency of the target and the background, respectively, to avoid mutual interference. Extensive experiments have shown that our STCFormer outperforms many of the best-performing trackers on several popular benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Sequential Model-Based Diagnosis by Systematic Search (Abstract Reprint) b/data/2024/aaai/Sequential Model-Based Diagnosis by Systematic Search (Abstract Reprint) new file mode 100644 index 0000000000..590c3c9b50 --- /dev/null +++ b/data/2024/aaai/Sequential Model-Based Diagnosis by Systematic Search (Abstract Reprint) @@ -0,0 +1,9 @@ +Model-based diagnosis aims at identifying the real cause of a system's malfunction based on a formal system model and observations of the system behavior. 
To discriminate between multiple fault hypotheses (diagnoses), sequential diagnosis approaches iteratively pose queries to an oracle to acquire additional knowledge about the diagnosed system. Depending on the system type, queries can capture, e.g., system tests, probes, measurements, or expert questions. + +As the determination of optimal queries is NP-hard, state-of-the-art sequential diagnosis methods rely on a myopic one-step-lookahead analysis which has proven to constitute a particularly favorable trade-off between computational efficiency and diagnostic effectivity. Yet, this solves only a part of the problem, as various sources of complexity, such as the reliance on costly reasoning services and large numbers of or not explicitly given query candidates, remain. + +To deal with such issues, existing approaches often make assumptions about the (i) type of diagnosed system, (ii) formalism to describe the system, (iii) inference engine, (iv) type of query to be of interest, (v) query quality criterion to be adopted, or (vi) diagnosis computation algorithm to be employed. Moreover, they (vii) often cannot deal with large or implicit query spaces or with expressive logics, or (viii) require inputs that cannot always be provided. + +As a remedy, we propose a novel one-step lookahead query computation technique for sequential diagnosis that overcomes the said issues of existing methods. Our approach (1) is based on a solid theory, (2) involves a systematic search for optimal queries, (3) can operate on implicit and huge query spaces, (4) allows for a two-stage optimization of queries (wrt. their number and cost), (5) is designed to reduce expensive logical inferences to a minimum, and (6) is generally applicable. The latter means that it can deal with any type of diagnosis problem as per Reiter's theory, is applicable with any monotonic knowledge representation language, can interact with a multitude of diagnosis engines and logical reasoners, and allows for a quality optimization of queries based on any of the common criteria in the literature. + +We extensively study the performance of the novel technique using a benchmark of real-world diagnosis problems. Our findings are that our approach enables the computation of optimal queries with hardly any delay, independently of the size and complexity of the considered benchmark problem. Moreover, it proves to be highly scalable, and it outperforms the state-of-the-art method in the domain of our benchmarks by orders of magnitude in terms of computation time while always returning a qualitatively as good or better query. \ No newline at end of file diff --git a/data/2024/aaai/Sequential Modeling of Complex Marine Navigation: Case Study on a Passenger Vessel (Student Abstract) b/data/2024/aaai/Sequential Modeling of Complex Marine Navigation: Case Study on a Passenger Vessel (Student Abstract) new file mode 100644 index 0000000000..c69debbf76 --- /dev/null +++ b/data/2024/aaai/Sequential Modeling of Complex Marine Navigation: Case Study on a Passenger Vessel (Student Abstract) @@ -0,0 +1 @@ +The maritime industry's continuous commitment to sustainability has led to a dedicated exploration of methods to reduce vessel fuel consumption. This paper undertakes this challenge through a machine learning approach, leveraging a real-world dataset spanning two years of a passenger vessel in west coast Canada. Our focus centers on the creation of a time series forecasting model given the dynamic and static states, actions, and disturbances. 
This model is designed to predict dynamic states based on the actions provided, subsequently serving as an evaluative tool to assess the proficiency of the vessel's operation under the captain's guidance. Additionally, it lays the foundation for future optimization algorithms, providing valuable feedback on decision-making processes. To facilitate future studies, our code is available at https://github.com/pagand/model_optimze_vessel/tree/AAAI. \ No newline at end of file diff --git a/data/2024/aaai/Set Prediction Guided by Semantic Concepts for Diverse Video Captioning b/data/2024/aaai/Set Prediction Guided by Semantic Concepts for Diverse Video Captioning new file mode 100644 index 0000000000..787f429097 --- /dev/null +++ b/data/2024/aaai/Set Prediction Guided by Semantic Concepts for Diverse Video Captioning @@ -0,0 +1 @@ +Diverse video captioning aims to generate a set of sentences to describe the given video in various aspects. Mainstream methods are trained with independent pairs of a video and a caption from its ground-truth set without exploiting the intra-set relationship, resulting in low diversity of generated captions. Different from them, we formulate diverse captioning into a semantic-concept-guided set prediction (SCG-SP) problem by fitting the predicted caption set to the ground-truth set, where the set-level relationship is fully captured. Specifically, our set prediction consists of two synergistic tasks, i.e., caption generation and an auxiliary task of concept combination prediction providing extra semantic supervision. Each caption in the set is attached to a concept combination indicating the primary semantic content of the caption and facilitating element alignment in set prediction. Furthermore, we apply a diversity regularization term on concepts to encourage the model to generate semantically diverse captions with various concept combinations. These two tasks share multiple semantics-specific encodings as input, which are obtained by iterative interaction between visual features and conceptual queries. The correspondence between the generated captions and specific concept combinations further guarantees the interpretability of our model. Extensive experiments on benchmark datasets show that the proposed SCG-SP achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics. \ No newline at end of file diff --git a/data/2024/aaai/Settling Decentralized Multi-Agent Coordinated Exploration by Novelty Sharing b/data/2024/aaai/Settling Decentralized Multi-Agent Coordinated Exploration by Novelty Sharing new file mode 100644 index 0000000000..1f2f8cae6e --- /dev/null +++ b/data/2024/aaai/Settling Decentralized Multi-Agent Coordinated Exploration by Novelty Sharing @@ -0,0 +1 @@ +Exploration in decentralized cooperative multi-agent reinforcement learning faces two challenges. One is that the novelty of global states is unavailable, while the novelty of local observations is biased. The other is how agents can explore in a coordinated way. To address these challenges, we propose MACE, a simple yet effective multi-agent coordinated exploration method. By communicating only local novelty, agents can take into account other agents' local novelty to approximate the global novelty. Further, we newly introduce weighted mutual information to measure the influence of one agent's action on other agents' accumulated novelty. 
We convert it as an intrinsic reward in hindsight to encourage agents to exert more influence on other agents' exploration and boost coordinated exploration. Empirically, we show that MACE achieves superior performance in three multi-agent environments with sparse rewards. \ No newline at end of file diff --git a/data/2024/aaai/Several Stories about High-Multiplicity EFx Allocation (Student Abstract) b/data/2024/aaai/Several Stories about High-Multiplicity EFx Allocation (Student Abstract) new file mode 100644 index 0000000000..8c8d6ca534 --- /dev/null +++ b/data/2024/aaai/Several Stories about High-Multiplicity EFx Allocation (Student Abstract) @@ -0,0 +1 @@ +Fair division is a topic that has significant social and industrial value. In this work, we study allocations that simultaneously satisfy definitions of fairness and efficiency: EFx and PO. First, we prove that the problem of finding such allocations is NP-hard for two agents. Then, we propose a concept for an ILP-based solving algorithm, the running time of which depends on the number of EFx allocations. We generate input data and analyze algorithm's running time based on the results obtained. \ No newline at end of file diff --git a/data/2024/aaai/Shadow Generation with Decomposed Mask Prediction and Attentive Shadow Filling b/data/2024/aaai/Shadow Generation with Decomposed Mask Prediction and Attentive Shadow Filling new file mode 100644 index 0000000000..98322a6fed --- /dev/null +++ b/data/2024/aaai/Shadow Generation with Decomposed Mask Prediction and Attentive Shadow Filling @@ -0,0 +1 @@ +Image composition refers to inserting a foreground object into a background image to obtain a composite image. In this work, we focus on generating plausible shadows for the inserted foreground object to make the composite image more realistic. To supplement the existing small-scale dataset, we create a large-scale dataset called RdSOBA with rendering techniques. Moreover, we design a two-stage network named DMASNet with decomposed mask prediction and attentive shadow filling. Specifically, in the first stage, we decompose shadow mask prediction into box prediction and shape prediction. In the second stage, we attend to reference background shadow pixels to fill the foreground shadow. Abundant experiments prove that our DMASNet achieves better visual effects and generalizes well to real composite images. \ No newline at end of file diff --git a/data/2024/aaai/Shallow Diffusion for Fast Speech Enhancement (Student Abstract) b/data/2024/aaai/Shallow Diffusion for Fast Speech Enhancement (Student Abstract) new file mode 100644 index 0000000000..61cd9d4737 --- /dev/null +++ b/data/2024/aaai/Shallow Diffusion for Fast Speech Enhancement (Student Abstract) @@ -0,0 +1 @@ +Recently, the field of Speech Enhancement has witnessed the success of diffusion-based generative models. However, these diffusion-based methods used to take multiple iterations to generate high-quality samples, leading to high computational costs and inefficiency. In this paper, we propose SDFEN (Shallow Diffusion for Fast spEech eNhancement), a novel approach for addressing the inefficiency problem while enhancing the quality of generated samples by reducing the iterative steps in the reverse process of diffusion method. Specifically, we introduce the shallow diffusion strategy initiating the reverse process with an adaptive time step to accelerate inference. In addition, a dedicated noisy predictor is further proposed to guide the adaptive selection of time step. 
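The shallow-diffusion strategy above can be pictured with a small sketch: rather than starting the reverse process from pure noise, a coarse enhanced signal is diffused forward to an adaptively chosen intermediate step and denoised from there, so only a fraction of the reverse steps is executed. The DDPM update, the placeholder denoiser, and the fixed t_star below are generic assumptions, not SDFEN's actual components.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoiser(x_t, t):
    return np.zeros_like(x_t)                    # placeholder for the learned noise predictor

def shallow_enhance(x_coarse, t_star):
    """Start the reverse diffusion at t_star from a coarse estimate instead of pure noise."""
    eps = np.random.randn(*x_coarse.shape)
    x = np.sqrt(alpha_bar[t_star]) * x_coarse + np.sqrt(1.0 - alpha_bar[t_star]) * eps
    for t in range(t_star, -1, -1):              # t_star + 1 reverse steps instead of T
        eps_hat = denoiser(x, t)
        x = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * np.random.randn(*x.shape)
    return x

enhanced = shallow_enhance(np.zeros(16000), t_star=40)   # t_star would come from the predictor
```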
Experiment results demonstrate the superiority of the proposed SDFEN in effectiveness and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/ShapeBoost: Boosting Human Shape Estimation with Part-Based Parameterization and Clothing-Preserving Augmentation b/data/2024/aaai/ShapeBoost: Boosting Human Shape Estimation with Part-Based Parameterization and Clothing-Preserving Augmentation new file mode 100644 index 0000000000..e2529afc64 --- /dev/null +++ b/data/2024/aaai/ShapeBoost: Boosting Human Shape Estimation with Part-Based Parameterization and Clothing-Preserving Augmentation @@ -0,0 +1 @@ +Accurate human shape recovery from a monocular RGB image is a challenging task because humans come in different shapes and sizes and wear different clothes. In this paper, we propose ShapeBoost, a new human shape recovery framework that achieves pixel-level alignment even for rare body shapes and high accuracy for people wearing different types of clothes. Unlike previous approaches that rely on the use of PCA-based shape coefficients, we adopt a new human shape parameterization that decomposes the human shape into bone lengths and the mean width of each part slice. This part-based parameterization technique achieves a balance between flexibility and validity using a semi-analytical shape reconstruction algorithm. Based on this new parameterization, a clothing-preserving data augmentation module is proposed to generate realistic images with diverse body shapes and accurate annotations. Experimental results show that our method outperforms other state-of-the-art methods in diverse body shape situations as well as in varied clothing situations. \ No newline at end of file diff --git a/data/2024/aaai/Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor Selection b/data/2024/aaai/Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor Selection new file mode 100644 index 0000000000..f73c532b05 --- /dev/null +++ b/data/2024/aaai/Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor Selection @@ -0,0 +1 @@ +Machine learning techniques, such as deep learning and ensemble methods, are widely used in various domains due to their ability to handle complex real-world tasks. However, their black-box nature has raised multiple concerns about the fairness, trustworthiness, and transparency of computer-assisted decision-making. This has led to the emergence of local post-hoc explainability methods, which offer explanations for individual decisions made by black-box algorithms. Among these methods, Kernel SHAP is widely used due to its model-agnostic nature and its well-founded theoretical framework. Despite these strengths, Kernel SHAP suffers from high instability: different executions of the method with the same inputs can lead to significantly different explanations, which diminishes the relevance of the explanations. The contribution of this paper is two-fold. On the one hand, we show that Kernel SHAP's instability is caused by its stochastic neighbor selection procedure, which we adapt to achieve full stability without compromising explanation fidelity. On the other hand, we show that by restricting the neighbors generation to perturbations of size 1 -- which we call the coalitions of Layer 1 -- we obtain a novel feature-attribution method that is fully stable, computationally efficient, and still meaningful. 
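As a rough picture of what restricting neighbors to Layer-1 coalitions buys, the sketch below scores each feature with a single deterministic size-1 perturbation toward a fixed reference, so repeated runs return identical attributions; the coalition weighting actually used in the paper may differ.

```python
import numpy as np

def layer1_attributions(model, x, reference):
    """Deterministic size-1 perturbations: no sampled neighbors, hence no run-to-run instability."""
    base = model(x[None])[0]
    scores = np.zeros(len(x))
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = reference[i]                 # toggle exactly one feature to its reference value
        scores[i] = base - model(x_pert[None])[0]
    return scores

black_box = lambda X: X.sum(axis=1)              # stand-in for any trained model
phi = layer1_attributions(black_box, np.array([1.0, 2.0, 3.0]), np.zeros(3))   # [1., 2., 3.]
```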
\ No newline at end of file diff --git a/data/2024/aaai/ShareBERT: Embeddings Are Capable of Learning Hidden Layers b/data/2024/aaai/ShareBERT: Embeddings Are Capable of Learning Hidden Layers new file mode 100644 index 0000000000..cc1a7fbed1 --- /dev/null +++ b/data/2024/aaai/ShareBERT: Embeddings Are Capable of Learning Hidden Layers @@ -0,0 +1,4 @@ +The deployment of Pre-trained Language Models in memory-limited devices is hindered by their massive number of parameters, which motivated the interest in developing smaller architectures. +Established works in the model compression literature showcased that small models often present a noticeable performance degradation and need to be paired with transfer learning methods, such as Knowledge Distillation. +In this work, we propose a parameter-sharing method that consists of sharing parameters between embeddings and the hidden layers, enabling the design of near-zero parameter encoders. To demonstrate its effectiveness, we present an architecture design called ShareBERT, which can preserve up to 95.5% +of BERT Base performances, using only 5M parameters (21.9× fewer parameters) without the help of Knowledge Distillation. We demonstrate empirically that our proposal does not negatively affect the model learning capabilities and that it is even beneficial for representation learning. Code will be available at https://github.com/jchenghu/sharebert. \ No newline at end of file diff --git a/data/2024/aaai/Sharpness-Aware Model-Agnostic Long-Tailed Domain Generalization b/data/2024/aaai/Sharpness-Aware Model-Agnostic Long-Tailed Domain Generalization new file mode 100644 index 0000000000..54512c4b57 --- /dev/null +++ b/data/2024/aaai/Sharpness-Aware Model-Agnostic Long-Tailed Domain Generalization @@ -0,0 +1 @@ +Domain Generalization (DG) aims to improve the generalization ability of models trained on a specific group of source domains, enabling them to perform well on new, unseen target domains. Recent studies have shown that methods that converge to smooth optima can enhance the generalization performance of supervised learning tasks such as classification. In this study, we examine the impact of smoothness-enhancing formulations on domain adversarial training, which combines task loss and adversarial loss objectives. Our approach leverages the fact that converging to a smooth minimum with respect to task loss can stabilize the task loss and lead to better performance on unseen domains. Furthermore, we recognize that the distribution of objects in the real world often follows a long-tailed class distribution, resulting in a mismatch between machine learning models and our expectations of their performance on all classes of datasets with long-tailed class distributions. To address this issue, we consider the domain generalization problem from the perspective of the long-tail distribution and propose using the maximum square loss to balance different classes which can improve model generalizability. Our method's effectiveness is demonstrated through comparisons with state-of-the-art methods on various domain generalization datasets. Code: https://github.com/bamboosir920/SAMALTDG. 
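For the class-balancing term mentioned in the long-tailed domain generalization abstract above, a minimal sketch of the maximum square loss on softmax outputs is given below; averaging it per sample and adding it to the task and adversarial losses is an assumed integration, not necessarily the paper's exact objective.

```python
import torch

def max_square_loss(logits):
    """Maximum square loss: -0.5 * mean_n sum_c p_nc^2. Its gradient grows only linearly
    in the predicted probability, so confident head classes dominate the update less than
    with entropy-style confidence objectives."""
    probs = torch.softmax(logits, dim=1)
    return -(probs ** 2).sum(dim=1).mean() / 2.0

balance_term = max_square_loss(torch.randn(8, 10))   # add to the task + adversarial losses
```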
\ No newline at end of file diff --git a/data/2024/aaai/Shrinking Your TimeStep: Towards Low-Latency Neuromorphic Object Recognition with Spiking Neural Networks b/data/2024/aaai/Shrinking Your TimeStep: Towards Low-Latency Neuromorphic Object Recognition with Spiking Neural Networks new file mode 100644 index 0000000000..268f5e0e3a --- /dev/null +++ b/data/2024/aaai/Shrinking Your TimeStep: Towards Low-Latency Neuromorphic Object Recognition with Spiking Neural Networks @@ -0,0 +1 @@ +Neuromorphic object recognition with spiking neural networks (SNNs) is the cornerstone of low-power neuromorphic computing. However, existing SNNs suffer from significant latency, utilizing 10 to 40 timesteps or more, to recognize neuromorphic objects. At low latencies, the performance of existing SNNs is drastically degraded. In this work, we propose the Shrinking SNN (SSNN) to achieve low-latency neuromorphic object recognition without reducing performance. Concretely, we alleviate the temporal redundancy in SNNs by dividing SNNs into multiple stages with progressively shrinking timesteps, which significantly reduces the inference latency. During timestep shrinkage, the temporal transformer smoothly transforms the temporal scale and preserves the information maximally. Moreover, we add multiple early classifiers to the SNN during training to mitigate the mismatch between the surrogate gradient and the true gradient, as well as the gradient vanishing/exploding, thus eliminating the performance degradation at low latency. Extensive experiments on neuromorphic datasets, CIFAR10-DVS, N-Caltech101, and DVS-Gesture have revealed that SSNN is able to improve the baseline accuracy by 6.55% ~ 21.41%. With only 5 average timesteps and without any data augmentation, SSNN is able to achieve an accuracy of 73.63% on CIFAR10-DVS. This work presents a heterogeneous temporal scale SNN and provides valuable insights into the development of high-performance, low-latency SNNs. \ No newline at end of file diff --git a/data/2024/aaai/Shuffled Deep Regression b/data/2024/aaai/Shuffled Deep Regression new file mode 100644 index 0000000000..a54f7aa985 --- /dev/null +++ b/data/2024/aaai/Shuffled Deep Regression @@ -0,0 +1 @@ +Shuffled regression is the problem of learning regression models from shuffled data that consists of a set of input features and a set of target outputs where the correspondence between the input and output is unknown. This study proposes a new deep learning method for shuffled regression called Shuffled Deep Regression (SDR). We derive the sparse and stochastic variant of the Expectation-Maximization algorithm for SDR that iteratively updates discrete latent variables and the parameters of neural networks. The effectiveness of the proposal is confirmed by benchmark data experiments. \ No newline at end of file diff --git a/data/2024/aaai/Signed Graph Neural Ordinary Differential Equation for Modeling Continuous-Time Dynamics b/data/2024/aaai/Signed Graph Neural Ordinary Differential Equation for Modeling Continuous-Time Dynamics new file mode 100644 index 0000000000..68c72ef504 --- /dev/null +++ b/data/2024/aaai/Signed Graph Neural Ordinary Differential Equation for Modeling Continuous-Time Dynamics @@ -0,0 +1 @@ +Modeling continuous-time dynamics constitutes a foundational challenge, and uncovering inter-component correlations within complex systems holds promise for enhancing the efficacy of dynamic modeling. 
The prevailing approach of integrating graph neural networks with ordinary differential equations has demonstrated promising performance. However, they disregard the crucial signed information potential on graphs, impeding their capacity to accurately capture real-world phenomena and leading to subpar outcomes. In response, we introduce a novel approach: a signed graph neural ordinary differential equation, adeptly addressing the limitations of miscapturing signed information. Our proposed solution boasts both flexibility and efficiency. To substantiate its effectiveness, we seamlessly integrate our devised strategies into three preeminent graph-based dynamic modeling frameworks: graph neural ordinary differential equations, graph neural controlled differential equations, and graph recurrent neural networks. Rigorous assessments encompass three intricate dynamic scenarios from physics and biology, as well as scrutiny across four authentic real-world traffic datasets. Remarkably outperforming the trio of baselines, empirical results underscore the substantial performance enhancements facilitated by our proposed approach. Our code can be found at https://github.com/beautyonce/SGODE. \ No newline at end of file diff --git a/data/2024/aaai/Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees (Abstract Reprint) b/data/2024/aaai/Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/SimCS: Simulation for Domain Incremental Online Continual Segmentation b/data/2024/aaai/SimCS: Simulation for Domain Incremental Online Continual Segmentation new file mode 100644 index 0000000000..a45b832a0a --- /dev/null +++ b/data/2024/aaai/SimCS: Simulation for Domain Incremental Online Continual Segmentation @@ -0,0 +1 @@ +Continual Learning is a step towards lifelong intelligence where models continuously learn from recently collected data without forgetting previous knowledge. Existing continual learning approaches mostly focus on image classification in the class-incremental setup with clear task boundaries and unlimited computational budget. This work explores the problem of Online Domain-Incremental Continual Segmentation (ODICS), where the model is continually trained over batches of densely labeled images from different domains, with limited computation and no information about the task boundaries. ODICS arises in many practical applications. In autonomous driving, this may correspond to the realistic scenario of training a segmentation model over time on a sequence of cities. We analyze several existing continual learning methods and show that they perform poorly in this setting despite working well in class-incremental segmentation. We propose SimCS, a parameter-free method complementary to existing ones that uses simulated data to regularize continual learning. Experiments show that SimCS provides consistent improvements when combined with different CL methods. 
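The SimCS abstract above does not spell out how the simulated data enters training, so the sketch below simply mixes a simulated batch into every online update as a regularizer; the loss weighting and the source of simulated frames are assumptions made for illustration only.

```python
import torch

def odics_update(model, optimizer, criterion, real_batch, sim_batch, sim_weight=0.5):
    """One online continual-segmentation step regularized with simulated data."""
    x_real, y_real = real_batch     # current densely labeled frames from the non-stationary stream
    x_sim, y_sim = sim_batch        # frames rendered by a simulator, available in any amount
    loss = criterion(model(x_real), y_real) + sim_weight * criterion(model(x_sim), y_sim)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```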
\ No newline at end of file diff --git a/data/2024/aaai/SimCalib: Graph Neural Network Calibration Based on Similarity between Nodes b/data/2024/aaai/SimCalib: Graph Neural Network Calibration Based on Similarity between Nodes new file mode 100644 index 0000000000..d38b396d1d --- /dev/null +++ b/data/2024/aaai/SimCalib: Graph Neural Network Calibration Based on Similarity between Nodes @@ -0,0 +1 @@ +Graph neural networks (GNNs) have exhibited impressive performance in modeling graph data as exemplified in various applications. Recently, the GNN calibration problem has attracted increasing attention, especially in cost-sensitive scenarios. Previous work has gained empirical insights on the issue, and devised effective approaches for it, but theoretical supports still fall short. In this work, we shed light on the relationship between GNN calibration and nodewise similarity via theoretical analysis. A novel calibration framework, named SimCalib, is accordingly proposed to consider similarity between nodes at global and local levels. At the global level, the Mahalanobis distance between the current node and class prototypes is integrated to implicitly consider similarity between the current node and all nodes in the same class. At the local level, the similarity of node representation movement dynamics, quantified by nodewise homophily and relative degree, is considered. Informed about the application of nodewise movement patterns in analyzing nodewise behavior on the over-smoothing problem, we empirically present a possible relationship between over-smoothing and GNN calibration problem. Experimentally, we discover a correlation between nodewise similarity and model calibration improvement, in alignment with our theoretical results. Additionally, we conduct extensive experiments investigating different design factors and demonstrate the effectiveness of our proposed SimCalib framework for GNN calibration by achieving state-of-the-art performance on 14 out of 16 benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection b/data/2024/aaai/SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection new file mode 100644 index 0000000000..1008c912ee --- /dev/null +++ b/data/2024/aaai/SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection @@ -0,0 +1 @@ +Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging and may lead to inferior performance. Although distilling precise 3D geometry knowledge from LiDAR data could help tackle this challenge, the benefits of LiDAR information could be greatly hindered by the significant modality gap between different sensory modalities. To address this issue, we propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy. Specifically, we devise multi-modal architectures for both teacher and student models, including a LiDAR-camera fusion-based teacher and a simulated fusion-based student. Owing to the ``identical'' architecture design, the student can mimic the teacher to generate multi-modal features with merely multi-view images as input, where a geometry compensation module is introduced to bridge the modality gap. 
Furthermore, we propose a comprehensive multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal fusion distillation simultaneously in the Bird's-eye-view space. Incorporating them together, our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment. Extensive experiments validate the effectiveness and superiority of SimDistill over state-of-the-art methods, achieving an improvement of 4.8% mAP and 4.1% NDS over the baseline detector. The source code will be released at https://github.com/ViTAE-Transformer/SimDistill. \ No newline at end of file diff --git a/data/2024/aaai/SimFair: Physics-Guided Fairness-Aware Learning with Simulation Models b/data/2024/aaai/SimFair: Physics-Guided Fairness-Aware Learning with Simulation Models new file mode 100644 index 0000000000..c5f950eb2c --- /dev/null +++ b/data/2024/aaai/SimFair: Physics-Guided Fairness-Aware Learning with Simulation Models @@ -0,0 +1 @@ +Fairness-awareness has emerged as an essential building block for the responsible use of artificial intelligence in real applications. In many cases, inequity in performance is due to the change in distribution over different regions. While techniques have been developed to improve the transferability of fairness, a solution to the problem is not always feasible with no samples from the new regions, which is a bottleneck for pure data-driven attempts. Fortunately, physics-based mechanistic models have been studied for many problems with major social impacts. We propose SimFair, a physics-guided fairness-aware learning framework, which bridges the data limitation by integrating physical-rule-based simulation and inverse modeling into the training design. Using temperature prediction as an example, we demonstrate the effectiveness of the proposed SimFair in fairness preservation. \ No newline at end of file diff --git a/data/2024/aaai/SimPSI: A Simple Strategy to Preserve Spectral Information in Time Series Data Augmentation b/data/2024/aaai/SimPSI: A Simple Strategy to Preserve Spectral Information in Time Series Data Augmentation new file mode 100644 index 0000000000..6a76254f91 --- /dev/null +++ b/data/2024/aaai/SimPSI: A Simple Strategy to Preserve Spectral Information in Time Series Data Augmentation @@ -0,0 +1 @@ +Data augmentation is a crucial component in training neural networks to overcome the limitation imposed by data size, and several techniques have been studied for time series. Although these techniques are effective in certain tasks, they have yet to be generalized to time series benchmarks. We find that current data augmentation techniques ruin the core information contained within the frequency domain. To address this issue, we propose a simple strategy to preserve spectral information (SimPSI) in time series data augmentation. SimPSI preserves the spectral information by mixing the original and augmented input spectrum weighted by a preservation map, which indicates the importance score of each frequency. Specifically, our experimental contributions are to build three distinct preservation maps: magnitude spectrum, saliency map, and spectrum-preservative map. We apply SimPSI to various time series data augmentations and evaluate its effectiveness across a wide range of time series benchmarks. Our experimental results support that SimPSI considerably enhances the performance of time series data augmentations by preserving core spectral information. 
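A minimal sketch of the mixing rule described above, using the magnitude spectrum as the preservation map (one of the three maps mentioned); the max-normalization of the map is an illustrative choice.

```python
import numpy as np

def simpsi_mix(x, x_aug):
    """Blend original and augmented spectra so that important frequencies survive augmentation."""
    X, X_aug = np.fft.rfft(x), np.fft.rfft(x_aug)
    m = np.abs(X) / (np.abs(X).max() + 1e-8)     # preservation map in [0, 1] from the magnitude spectrum
    X_mixed = m * X + (1.0 - m) * X_aug          # keep original content where the map is high
    return np.fft.irfft(X_mixed, n=len(x))

t = np.linspace(0.0, 1.0, 256, endpoint=False)
x = np.sin(2 * np.pi * 5 * t)                    # dominant 5 Hz component should be preserved
x_aug = x + 0.3 * np.random.randn(256)           # stand-in for an arbitrary augmentation
x_mixed = simpsi_mix(x, x_aug)
```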
The source code used in the paper is available at https://github.com/Hyun-Ryu/simpsi. \ No newline at end of file diff --git a/data/2024/aaai/Simple Image-Level Classification Improves Open-Vocabulary Object Detection b/data/2024/aaai/Simple Image-Level Classification Improves Open-Vocabulary Object Detection new file mode 100644 index 0000000000..2583cbf39a --- /dev/null +++ b/data/2024/aaai/Simple Image-Level Classification Improves Open-Vocabulary Object Detection @@ -0,0 +1 @@ +Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained. Recent OVOD methods focus on adapting the image-level pre-trained vision-language models (VLMs), such as CLIP, to a region-level object detection task via, eg., region-level knowledge distillation, regional prompt learning, or region-text pre-training, to expand the detection vocabulary. These methods have demonstrated remarkable performance in recognizing regional visual concepts, but they are weak in exploiting the VLMs' powerful global scene understanding ability learned from the billion-scale image-level text descriptions. This limits their capability in detecting hard objects of small, blurred, or occluded appearance from novel/base categories, whose detection heavily relies on contextual information. To address this, we propose a novel approach, namely Simple Image-level Classification for Context-Aware Detection Scoring (SIC-CADS), to leverage the superior global knowledge yielded from CLIP for complementing the current OVOD models from a global perspective. The core of SIC-CADS is a multi-modal multi-label recognition (MLR) module that learns the object co-occurrence-based contextual information from CLIP to recognize all possible object categories in the scene. These image-level MLR scores can then be utilized to refine the instance-level detection scores of the current OVOD models in detecting those hard objects. This is verified by extensive empirical results on two popular benchmarks, OV-LVIS and OV-COCO, which show that SIC-CADS achieves significant and consistent improvement when combined with different types of OVOD models. Further, SIC-CADS also improves the cross-dataset generalization ability on Objects365 and OpenImages. Code is available at https://github.com/mala-lab/SIC-CADS. \ No newline at end of file diff --git a/data/2024/aaai/Simple Orthogonal Graph Representation Learning (Student Abstract) b/data/2024/aaai/Simple Orthogonal Graph Representation Learning (Student Abstract) new file mode 100644 index 0000000000..9c5e0f4cc3 --- /dev/null +++ b/data/2024/aaai/Simple Orthogonal Graph Representation Learning (Student Abstract) @@ -0,0 +1 @@ +Graph neural networks (GNNs) have attracted significant interest recently since they can effectively process and analyze graph-structured data commonly found in real-world applications. However, the predicament that GNNs are difficult to train becomes worse as the layers increase. The essence of this problem is that stacking layers will reduce the stability of forward propagation and gradient back-propagation. And as the increasing scale of models (measured by the number of parameters), how to efficiently and effectively adapt it to particular downstream tasks becomes an intriguing research issue. 
In this work, motivated by the effect of orthogonality constraints, we propose a simple orthogonal training framework to impose the orthogonality constraints on GNNs, which can help models find a solution vector in a specific low dimensional subspace and stabilize the signaling processes at both the forward and backward directions. Specifically, we propose a novel polar decomposition-based orthogonal initialization (PDOI-R) algorithm, which can identify the low intrinsic dimension within the Stiefel Manifold and stabilize the training process. Extensive experiments demonstrate the effectiveness of the proposed method in multiple downstream tasks, showcasing its generality. The simple method can help existing state-of-the-art models achieve better performance. \ No newline at end of file diff --git a/data/2024/aaai/Simple Weak Coresets for Non-decomposable Classification Measures b/data/2024/aaai/Simple Weak Coresets for Non-decomposable Classification Measures new file mode 100644 index 0000000000..9e11ba87c2 --- /dev/null +++ b/data/2024/aaai/Simple Weak Coresets for Non-decomposable Classification Measures @@ -0,0 +1 @@ +While coresets have been growing in terms of their application, barring few exceptions, they have mostly been limited to unsupervised settings. We consider supervised classification problems, and non-decomposable evaluation measures in such settings. We show that stratified uniform sampling based coresets have excellent empirical performance that are backed by theoretical guarantees too. We focus on the F1 score and Matthews Correlation Coefficient, two widely used non-decomposable objective functions that are nontrivial to optimize for and show that uniform coresets attain a lower bound for coreset size, and have good empirical performance, comparable with ``smarter'' coreset construction strategies. \ No newline at end of file diff --git a/data/2024/aaai/Simplicity Bias in Overparameterized Machine Learning b/data/2024/aaai/Simplicity Bias in Overparameterized Machine Learning new file mode 100644 index 0000000000..1f4f73557c --- /dev/null +++ b/data/2024/aaai/Simplicity Bias in Overparameterized Machine Learning @@ -0,0 +1 @@ +A thorough theoretical understanding of the surprising generalization ability of deep networks (and other overparameterized models) is still lacking. Here we demonstrate that simplicity bias is a major phenomenon to be reckoned with in overparameterized machine learning. In addition to explaining the outcome of simplicity bias, we also study its source: following concrete rigorous examples, we argue that (i) simplicity bias can explain generalization in overparameterized learning models such as neural networks; (ii) simplicity bias and excellent generalization are optimizer-independent, as our example shows, and although the optimizer affects training, it is not the driving force behind simplicity bias; (iii) simplicity bias in pre-training models, and subsequent posteriors, is universal and stems from the subtle fact that uniformly-at-random constructed priors are not uniformly-at-random sampled ; and (iv) in neural network models, the biasing mechanism in wide (and shallow) networks is different from the biasing mechanism in deep (and narrow) networks. 
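For the polar-decomposition-based orthogonal initialization (PDOI-R) proposed in the orthogonal graph learning abstract above, a minimal sketch is to keep only the orthogonal polar factor of a random weight matrix; the restart mechanism and the identification of a low intrinsic dimension are beyond this sketch.

```python
import numpy as np

def polar_orthogonal_init(shape, seed=0):
    """Initialize weights with the orthogonal factor of their polar decomposition.
    For W = U S Vt (SVD), U @ Vt is the closest matrix with orthonormal columns to W,
    which keeps signal norms stable in both the forward and backward passes."""
    W = np.random.default_rng(seed).standard_normal(shape)
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

W0 = polar_orthogonal_init((64, 32))
assert np.allclose(W0.T @ W0, np.eye(32), atol=1e-6)   # columns are orthonormal
```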
\ No newline at end of file diff --git a/data/2024/aaai/Simplifying Complex Observation Models in Continuous POMDP Planning with Probabilistic Guarantees and Practice b/data/2024/aaai/Simplifying Complex Observation Models in Continuous POMDP Planning with Probabilistic Guarantees and Practice new file mode 100644 index 0000000000..f3549b03ce --- /dev/null +++ b/data/2024/aaai/Simplifying Complex Observation Models in Continuous POMDP Planning with Probabilistic Guarantees and Practice @@ -0,0 +1 @@ +Solving partially observable Markov decision processes (POMDPs) with high dimensional and continuous observations, such as camera images, is required for many real life robotics and planning problems. Recent researches suggested machine learned probabilistic models as observation models, but their use is currently too computationally expensive for online deployment. We deal with the question of what would be the implication of using simplified observation models for planning, while retaining formal guarantees on the quality of the solution. Our main contribution is a novel probabilistic bound based on a statistical total variation distance of the simplified model. We show that it bounds the theoretical POMDP value w.r.t. original model, from the empirical planned value with the simplified model, by generalizing recent results of particle-belief MDP concentration bounds. Our calculations can be separated into offline and online parts, and we arrive at formal guarantees without having to access the costly model at all during planning, which is also a novel result. Finally, we demonstrate in simulation how to integrate the bound into the routine of an existing continuous online POMDP solver. \ No newline at end of file diff --git a/data/2024/aaai/Simultaneous Optimization of Bid Shading and Internal Auction for Demand-Side Platforms b/data/2024/aaai/Simultaneous Optimization of Bid Shading and Internal Auction for Demand-Side Platforms new file mode 100644 index 0000000000..956b7eb0c9 --- /dev/null +++ b/data/2024/aaai/Simultaneous Optimization of Bid Shading and Internal Auction for Demand-Side Platforms @@ -0,0 +1 @@ +Online advertising has been one of the most important sources for industry's growth, where the demand-side platforms (DSP) play an important role via bidding to the ad exchanges on behalf of their advertiser clients. Since more and more ad exchanges have shifted from second to first price auctions, it is challenging for DSPs to adjust bidding strategy in the volatile environment. Recent studies on bid shading in first-price auctions may have limited performance due to relatively strong hypotheses about winning probability distribution. Moreover, these studies do not consider the incentive of advertiser clients, which can be crucial for a reliable advertising platform. In this work, we consider both the optimization of bid shading technique and the design of internal auction which is ex-post incentive compatible (IC) for the management of a DSP. Firstly, we prove that the joint design of bid shading and ex-post IC auction can be reduced to choosing one monotone bid function for each advertiser without loss of optimality. Then we propose a parameterized neural network to implement the monotone bid functions. With well-designed surrogate loss, the objective can be optimized in an end-to-end manner. Finally, our experimental results demonstrate the effectiveness and superiority of our algorithm. 
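The bid-shading abstract above reduces the joint design to one monotone bid function per advertiser, implemented by a parameterized neural network. One common way to hard-wire monotonicity is to constrain the network weights to be non-negative, as in the sketch below; this particular construction is an assumption and not necessarily the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneBidNet(nn.Module):
    """Maps a non-negative advertiser value to a shaded bid that is nondecreasing in the value."""
    def __init__(self, hidden=32):
        super().__init__()
        self.w1 = nn.Parameter(0.1 * torch.randn(hidden, 1))
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(0.1 * torch.randn(1, hidden))
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, value):
        h = torch.tanh(F.linear(value, F.softplus(self.w1), self.b1))      # non-negative weights
        ratio = torch.sigmoid(F.linear(h, F.softplus(self.w2), self.b2))   # shading ratio in (0, 1)
        return ratio * value                # the shaded first-price bid never exceeds the value

bids = MonotoneBidNet()(torch.tensor([[1.0], [2.0], [5.0]]))
```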
\ No newline at end of file diff --git a/data/2024/aaai/Situation-Dependent Causal Influence-Based Cooperative Multi-Agent Reinforcement Learning b/data/2024/aaai/Situation-Dependent Causal Influence-Based Cooperative Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..55b122eb70 --- /dev/null +++ b/data/2024/aaai/Situation-Dependent Causal Influence-Based Cooperative Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Learning to collaborate has witnessed significant progress in multi-agent reinforcement learning (MARL). However, promoting coordination among agents and enhancing exploration capabilities remain challenges. In multi-agent environments, interactions between agents are limited in specific situations. Effective collaboration between agents thus requires a nuanced understanding of when and how agents' actions influence others. To this end, in this paper, we propose a novel MARL algorithm named Situation-Dependent Causal Influence-Based Cooperative Multi-agent Reinforcement Learning (SCIC), which incorporates a novel intrinsic reward mechanism based on a new cooperation criterion measured by situation-dependent causal influence among agents. Our approach aims to detect inter-agent causal influences in specific situations based on the criterion using causal intervention and conditional mutual information. This effectively assists agents in exploring states that can positively impact other agents, thus promoting cooperation between agents. The resulting update links coordinated exploration and intrinsic reward distribution, which enhances overall collaboration and performance. Experimental results on various MARL benchmarks demonstrate the superiority of our method compared to state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/SkeletonGait: Gait Recognition Using Skeleton Maps b/data/2024/aaai/SkeletonGait: Gait Recognition Using Skeleton Maps new file mode 100644 index 0000000000..798571184e --- /dev/null +++ b/data/2024/aaai/SkeletonGait: Gait Recognition Using Skeleton Maps @@ -0,0 +1 @@ +The choice of representation is essential for deep gait recognition methods. The binary silhouettes and skeletal coordinates are two dominant representations in recent literature, achieving remarkable advances in many scenarios. However, inherent challenges remain: silhouettes are not always guaranteed in unconstrained scenes, and structural cues from skeletons have not been fully utilized. In this paper, we introduce a novel skeletal gait representation named skeleton map, together with SkeletonGait, a skeleton-based method to exploit structural information from human skeleton maps. Specifically, the skeleton map represents the coordinates of human joints as a heatmap with Gaussian approximation, exhibiting a silhouette-like image devoid of exact body structure. Beyond achieving state-of-the-art performances over five popular gait datasets, more importantly, SkeletonGait uncovers novel insights about how important structural features are in describing gait and when they play a role. Furthermore, we propose a multi-branch architecture, named SkeletonGait++, to make use of complementary features from both skeletons and silhouettes. Experiments indicate that SkeletonGait++ outperforms existing state-of-the-art methods by a significant margin in various scenarios. For instance, it achieves an impressive rank-1 accuracy of over 85% on the challenging GREW dataset. The source code is available at https://github.com/ShiqiYu/OpenGait.
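A minimal sketch of the skeleton-map construction described above: each 2D joint is rendered as an isotropic Gaussian and the per-pixel maximum is kept, producing a silhouette-like image without exact body structure; the resolution, sigma, and max-composition are illustrative choices.

```python
import numpy as np

def skeleton_map(joints, size=64, sigma=2.0):
    """Render 2D joint coordinates (already scaled to pixel space) as a Gaussian heatmap."""
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float32)
    heat = np.zeros((size, size), dtype=np.float32)
    for x, y in joints:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)               # keep the strongest joint response per pixel
    return heat

frame = skeleton_map([(32, 10), (30, 25), (34, 25), (32, 40)])   # a few toy joints
```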
\ No newline at end of file diff --git a/data/2024/aaai/Sketched Newton Value Iteration for Large-Scale Markov Decision Processes b/data/2024/aaai/Sketched Newton Value Iteration for Large-Scale Markov Decision Processes new file mode 100644 index 0000000000..30b6e43e48 --- /dev/null +++ b/data/2024/aaai/Sketched Newton Value Iteration for Large-Scale Markov Decision Processes @@ -0,0 +1 @@ +Value Iteration (VI) is one of the most classic algorithms for solving Markov Decision Processes (MDPs), which lays the foundations for various more advanced reinforcement learning algorithms, such as Q-learning. VI may take a large number of iterations to converge as it is a first-order method. In this paper, we introduce the Newton Value Iteration (NVI) algorithm, which eliminates the impact of action space dimension compared to some previous second-order methods. Consequently, NVI can efficiently handle MDPs with large action spaces. Building upon NVI, we propose a novel approach called Sketched Newton Value Iteration (SNVI) to tackle MDPs with both large state and action spaces. SNVI not only inherits the stability and fast convergence advantages of second-order algorithms, but also significantly reduces computational complexity, making it highly scalable. Extensive experiments demonstrate the superiority of our algorithms over traditional VI and previously proposed second-order VI algorithms. \ No newline at end of file diff --git a/data/2024/aaai/SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract) b/data/2024/aaai/SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract) new file mode 100644 index 0000000000..5ed906abc5 --- /dev/null +++ b/data/2024/aaai/SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract) @@ -0,0 +1 @@ +When humans are posed with a difficult problem, they often approach it by identifying key skills, honing them, and finally effectively combining them. We propose a novel method and apply it for the VizWiz VQA task to predict the visual skills needed to answer a question, and leverage expert modules to produce intermediary outputs and fuse them in a skill-aware manner. Unlike prior works in visual question-answering (VQA) that use intermediate outputs such as detected objects and Optical Character Recognition (OCR), our approach explicitly guides the model with a skill embedding on what to focus on. While our results show that using skill-aware fusion outperforms skill-unaware models for only a subset of questions, we believe our results provide interesting directions for future work. We also release our code, model, and illustrative demonstrations for future research purposes. \ No newline at end of file diff --git a/data/2024/aaai/Skip-GANomaly++: Skip Connections and Residual Blocks for Anomaly Detection (Student Abstract) b/data/2024/aaai/Skip-GANomaly++: Skip Connections and Residual Blocks for Anomaly Detection (Student Abstract) new file mode 100644 index 0000000000..069fca2648 --- /dev/null +++ b/data/2024/aaai/Skip-GANomaly++: Skip Connections and Residual Blocks for Anomaly Detection (Student Abstract) @@ -0,0 +1 @@ +Anomaly detection is a critical task across various domains. Fundamentally, anomaly detection models offer methods to identify unusual patterns that do not align with expected behaviors. Notably, in the medical field, detecting anomalies in medical imagery or biometrics can facilitate early diagnosis of diseases. 
Consequently, we propose the Skip-GANomaly++ model, an enhanced and more efficient version of the conventional anomaly detection models. The proposed model's performance was evaluated through comparative experiments. Experimental results demonstrated superior performance across most classes compared to the previous models. \ No newline at end of file diff --git a/data/2024/aaai/SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution b/data/2024/aaai/SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution new file mode 100644 index 0000000000..5d6d3798e3 --- /dev/null +++ b/data/2024/aaai/SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution @@ -0,0 +1,2 @@ +It is well-known that image quality assessment usually meets with the problem of perception-distortion (p-d) tradeoff. The existing deep image super-resolution (SR) methods either focus on high fidelity with pixel-level objectives or high perception with generative models. The emergence of diffusion model paves a fresh way for image restoration, which has the potential to offer a brand-new solution for p-d trade-off. We experimentally observed that the perceptual quality and distortion change in an opposite direction with the increase of sampling steps. In light of this property, we propose an adaptive skip diffusion model (SkipDiff), which aims to achieve +high-fidelity perceptual image SR with fewer sampling steps. Specifically, it decouples the sampling procedure into coarse skip approximation and fine skip refinement stages. A coarse-grained skip diffusion is first performed as a high-fidelity prior to obtaining a latent approximation of the full diffusion. Then, a fine-grained skip diffusion is followed to further refine the latent sample for promoting perception, where the fine time steps are adaptively learned by deep reinforcement learning. Meanwhile, this approach also enables faster sampling of diffusion model through skipping the intermediate denoising process to shorten the effective steps of the computation. Extensive experimental results show that our SkipDiff achieves superior perceptual quality with plausible reconstruction accuracy and a faster sampling speed. \ No newline at end of file diff --git a/data/2024/aaai/SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing b/data/2024/aaai/SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing new file mode 100644 index 0000000000..30f0b597ee --- /dev/null +++ b/data/2024/aaai/SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing @@ -0,0 +1,2 @@ +Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs). A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images. Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale. 
In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags. +With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets. It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval. We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis. \ No newline at end of file diff --git a/data/2024/aaai/Sleep-Like Unsupervised Replay Improves Performance When Data Are Limited or Unbalanced (Student Abstract) b/data/2024/aaai/Sleep-Like Unsupervised Replay Improves Performance When Data Are Limited or Unbalanced (Student Abstract) new file mode 100644 index 0000000000..8fafe053a8 --- /dev/null +++ b/data/2024/aaai/Sleep-Like Unsupervised Replay Improves Performance When Data Are Limited or Unbalanced (Student Abstract) @@ -0,0 +1 @@ +The performance of artificial neural networks (ANNs) degrades when training data are limited or imbalanced. In contrast, the human brain can learn quickly from just a few examples. Here, we investigated the role of sleep in improving the performance of ANNs trained with limited data on the MNIST and Fashion MNIST datasets. Sleep was implemented as an unsupervised phase with local Hebbian-type learning rules. We found a significant boost in accuracy after the sleep phase for models trained with limited data in the range of 0.5-10% of the total MNIST or Fashion MNIST datasets. When more than 10% of the total data was used, sleep alone had a slight negative impact on performance, but this was remedied by fine-tuning on the original data. This study sheds light on a potential synaptic weight dynamics strategy employed by the brain during sleep to enhance memory performance when training data are limited or imbalanced. \ No newline at end of file diff --git a/data/2024/aaai/SlowTrack: Increasing the Latency of Camera-Based Perception in Autonomous Driving Using Adversarial Examples b/data/2024/aaai/SlowTrack: Increasing the Latency of Camera-Based Perception in Autonomous Driving Using Adversarial Examples new file mode 100644 index 0000000000..6fdcb4d6aa --- /dev/null +++ b/data/2024/aaai/SlowTrack: Increasing the Latency of Camera-Based Perception in Autonomous Driving Using Adversarial Examples @@ -0,0 +1 @@ +In Autonomous Driving (AD), real-time perception is a critical component responsible for detecting surrounding objects to ensure safe driving. While researchers have extensively explored the integrity of AD perception due to its safety and security implications, the aspect of availability (real-time performance) or latency has received limited attention. Existing works on latency-based attacks have focused mainly on object detection, i.e., a component in camera-based AD perception, overlooking the entire camera-based AD perception pipeline, which hinders them from achieving effective system-level effects, such as vehicle crashes.
In this paper, we propose SlowTrack, a novel framework for generating adversarial attacks to increase the execution time of camera-based AD perception. We propose a novel two-stage attack strategy along with the three new loss function designs. Our evaluation is conducted on four popular camera-based AD perception pipelines, and the results demonstrate that SlowTrack significantly outperforms existing latency-based attacks while maintaining comparable imperceptibility levels. Furthermore, we perform the evaluation on Baidu Apollo, an industry-grade full-stack AD system, and LGSVL, a production-grade AD simulator, with two scenarios to compare the system-level effects of SlowTrack and existing attacks. Our evaluation results show that the system-level effects can be significantly improved, i.e., the vehicle crash rate of SlowTrack is around 95% on average while existing works only have around 30%. \ No newline at end of file diff --git a/data/2024/aaai/Small Language Model Can Self-Correct b/data/2024/aaai/Small Language Model Can Self-Correct new file mode 100644 index 0000000000..b5f9af1c6c --- /dev/null +++ b/data/2024/aaai/Small Language Model Can Self-Correct @@ -0,0 +1 @@ +Generative Language Models (LMs) such as ChatGPT have exhibited remarkable performance across various downstream tasks. Nevertheless, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. Previous studies have devised sophisticated pipelines and prompts to induce large LMs to exhibit the capability for self-correction. However, large LMs are explicitly prompted to verify and modify their answers separately rather than completing all steps spontaneously like humans. Moreover, these complex prompts are extremely challenging for small LMs to follow. In this paper, we introduce the Intrinsic Self-Correction (ISC) in generative language models, aiming to correct the initial output of LMs in a self-triggered manner, even for those small LMs with 6 billion parameters. Specifically, we devise a pipeline for constructing self-correction data and propose Partial Answer Masking (PAM), aiming to endow the model with the capability for intrinsic self-correction through fine-tuning. We conduct experiments using LMs with parameters sizes ranging from 6 billion to 13 billion in two tasks, including commonsense reasoning and factual knowledge reasoning. Our experiments demonstrate that the outputs generated using ISC outperform those generated without self-correction. We believe that the output quality of even small LMs can be further improved by empowering them with the ability to intrinsic self-correct. \ No newline at end of file diff --git a/data/2024/aaai/Social Physics Informed Diffusion Model for Crowd Simulation b/data/2024/aaai/Social Physics Informed Diffusion Model for Crowd Simulation new file mode 100644 index 0000000000..3b169951c5 --- /dev/null +++ b/data/2024/aaai/Social Physics Informed Diffusion Model for Crowd Simulation @@ -0,0 +1 @@ +Crowd simulation holds crucial applications in various domains, such as urban planning, architectural design, and traffic arrangement. In recent years, physics-informed machine learning methods have achieved state-of-the-art performance in crowd simulation but fail to model the heterogeneity and multi-modality of human movement comprehensively. In this paper, we propose a social physics-informed diffusion model named SPDiff to mitigate the above gap. 
SPDiff takes both the interactive and historical information of crowds in the current timeframe to reverse the diffusion process, thereby generating the distribution of pedestrian movement in the subsequent timeframe. Inspired by the well-known social physics model, i.e., Social Force, regarding crowd dynamics, we design a crowd interaction encoder to guide the denoising process and further enhance this module with the equivariant properties of crowd interactions. To mitigate error accumulation in long-term simulations, we propose a multi-frame rollout training algorithm for diffusion modeling. Experiments conducted on two real-world datasets demonstrate the superior performance of SPDiff in terms of both macroscopic and microscopic evaluation metrics. Code and appendix are available at https://github.com/tsinghua-fib-lab/SPDiff. \ No newline at end of file diff --git a/data/2024/aaai/Social-Aware Group Display Configuration in VR Conference b/data/2024/aaai/Social-Aware Group Display Configuration in VR Conference new file mode 100644 index 0000000000..d817c1d837 --- /dev/null +++ b/data/2024/aaai/Social-Aware Group Display Configuration in VR Conference @@ -0,0 +1 @@ +Virtual Reality (VR) has emerged due to advancements in hardware and computer graphics. During the pandemic, conferences and exhibitions leveraging VR have gained attention. However, large-scale VR conferences, face a significant problem not yet studied in the literature -- displaying too many irrelevant users on the screen which may negatively impact the user experience. To address this issue, we formulate a new research problem, Social-Aware VR Conference Group Display Configuration (SVGD). Accordingly, we design the Social Utility-Aware VR Conference Group Formation (SVC) algorithm, which is a 2-approximation algorithm to SVGD. SVC iteratively selects either the P-Configuration or S-Configuration based on their effective ratios. This ensures that in each iteration, SVC identifies and chooses the solution with the highest current effectiveness. Experiments on real metaverse datasets show that the proposed SVC outperforms 11 baselines by 75% in terms of solution quality. \ No newline at end of file diff --git a/data/2024/aaai/SocialCVAE: Predicting Pedestrian Trajectory via Interaction Conditioned Latents b/data/2024/aaai/SocialCVAE: Predicting Pedestrian Trajectory via Interaction Conditioned Latents new file mode 100644 index 0000000000..d537002f1e --- /dev/null +++ b/data/2024/aaai/SocialCVAE: Predicting Pedestrian Trajectory via Interaction Conditioned Latents @@ -0,0 +1 @@ +Pedestrian trajectory prediction is the key technology in many applications for providing insights into human behavior and anticipating human future motions. Most existing empirical models are explicitly formulated by observed human behaviors using explicable mathematical terms with deterministic nature, while recent work has focused on developing hybrid models combined with learning-based techniques for powerful expressiveness while maintaining explainability. However, the deterministic nature of the learned steering behaviors from the empirical models limits the models' practical performance. To address this issue, this work proposes the social conditional variational autoencoder (SocialCVAE) for predicting pedestrian trajectories, which employs a CVAE to explore behavioral uncertainty in human motion decisions. 
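Both SPDiff above, which is explicitly inspired by the Social Force model, and the energy-based interaction map used by SocialCVAE below build on the intuition that nearby pedestrians repel each other. A minimal sketch of the classic Social Force repulsion term follows; the constants are illustrative, and neither paper's learned interaction module reduces to this formula.

```python
import numpy as np

def repulsion_force(p_i, p_j, A=2.0, B=0.5, radius=0.6):
    """Helbing-style pedestrian repulsion: magnitude decays exponentially with the gap."""
    diff = np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)
    dist = np.linalg.norm(diff) + 1e-8
    n_ij = diff / dist                           # unit vector pushing pedestrian i away from j
    return A * np.exp((radius - dist) / B) * n_ij

f = repulsion_force([0.0, 0.0], [0.4, 0.3])      # strong push, since the gap is below `radius`
```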
SocialCVAE learns socially reasonable motion randomness by utilizing a socially explainable interaction energy map as the CVAE's condition, which illustrates the future occupancy of each pedestrian's local neighborhood area. The energy map is generated using an energy-based interaction model, which anticipates the energy cost (i.e., repulsion intensity) of pedestrians' interactions with neighbors. Experimental results on two public benchmarks including 25 scenes demonstrate that SocialCVAE significantly improves prediction accuracy compared with the state-of-the-art methods, with up to 16.85% improvement in Average Displacement Error (ADE) and 69.18% improvement in Final Displacement Error (FDE). Code is available at: https://github.com/ViviXiang/SocialCVAE. \ No newline at end of file diff --git a/data/2024/aaai/SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models b/data/2024/aaai/SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models new file mode 100644 index 0000000000..6538de64ad --- /dev/null +++ b/data/2024/aaai/SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models @@ -0,0 +1,3 @@ +Current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. In this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. Taking inspiration from social science research, we start with a documented list of 93 US-centric stigmas and curate a question-answering (QA) dataset which involves simple social situations. Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. We present results for SocialStigmaQA with two open source generative language models and we find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. We demonstrate that the deliberate design of the templates in our benchmark (e.g., adding biasing text to the prompt or using different verbs that change the answer that indicates bias) impacts the model tendencies to generate socially biased output. Additionally, through manual evaluation, we discover problematic patterns in the generated chain-of-thought output that range from subtle bias to lack of reasoning. + +Warning: This paper contains examples of text which are toxic, biased, and potentially harmful. \ No newline at end of file diff --git a/data/2024/aaai/SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger b/data/2024/aaai/SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger new file mode 100644 index 0000000000..21aeebe63b --- /dev/null +++ b/data/2024/aaai/SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger @@ -0,0 +1 @@ +During the preceding biennium, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, where the pairs are entirely exclusive of each other, remains a challenging task, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target, which is generated from the fine-grained intra-modal self-similarity. 
The intra-modal guidance is indicative, enabling two pairs to have some local similarities and allowing the model to capture many-to-many relationships between the two modalities. Besides, since the positive pair still dominates the softened target distribution, we disentangle the negatives in the distribution to further boost relation alignment with the negatives in cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on the ImageNet zero-shot classification task, using CC3M/CC12M as the pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline. \ No newline at end of file diff --git a/data/2024/aaai/Solar Power Generation Forecasting via Multimodal Feature Fusion (Student Abstract) b/data/2024/aaai/Solar Power Generation Forecasting via Multimodal Feature Fusion (Student Abstract) new file mode 100644 index 0000000000..d7196e807c --- /dev/null +++ b/data/2024/aaai/Solar Power Generation Forecasting via Multimodal Feature Fusion (Student Abstract) @@ -0,0 +1,2 @@ +Solar power generation has recently been in the spotlight as global warming continues to worsen. However, two significant problems may hinder solar power generation, considering that solar panels are installed outside. The first is soiling, which accumulates on solar panels, and the second is a decrease in sunlight owing to bad weather. +In this paper, we demonstrate that solar power generation forecasting accuracy can increase when considering soiling and sunlight information. We first introduce a dataset containing images of clean and soiled solar panels, sky images, and weather information. For accurate solar power generation forecasting, we propose a new multimodal model that aggregates various features related to weather, soiling, and sunlight. The experimental results demonstrate the high accuracy of our proposed multimodal model. \ No newline at end of file diff --git a/data/2024/aaai/Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization b/data/2024/aaai/Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization new file mode 100644 index 0000000000..fc0a6fd1be --- /dev/null +++ b/data/2024/aaai/Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization @@ -0,0 +1,2 @@ +In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. +In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an alpha-radius from a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDPs and applying policy visitation frequency regularization. We introduce a policy-gradient method and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.
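The reward-robust MDP abstract above connects non-rectangular reward uncertainty with policy visitation frequency regularization. A sketch of the kind of identity this alludes to, assuming a norm ball of radius alpha around a nominal reward r_0 and writing d_pi for the discounted state-action visitation frequencies of policy pi, is

\[
\min_{\|r - r_0\| \le \alpha} \langle d_\pi, r \rangle \;=\; \langle d_\pi, r_0 \rangle \;-\; \alpha \,\| d_\pi \|_{*},
\]

so maximizing the worst-case return over the reward ball amounts to maximizing the nominal return penalized by the dual norm of the visitation frequencies. The exact norm and formulation used in the paper may differ; this is only the standard duality step behind such connections.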
\ No newline at end of file diff --git a/data/2024/aaai/Solving Satisfiability Modulo Counting for Symbolic and Statistical AI Integration with Provable Guarantees b/data/2024/aaai/Solving Satisfiability Modulo Counting for Symbolic and Statistical AI Integration with Provable Guarantees new file mode 100644 index 0000000000..87ee00774b --- /dev/null +++ b/data/2024/aaai/Solving Satisfiability Modulo Counting for Symbolic and Statistical AI Integration with Provable Guarantees @@ -0,0 +1 @@ +Satisfiability Modulo Counting (SMC) encompasses problems that require both symbolic decision-making and statistical reasoning. Its general formulation captures many real-world problems at the intersection of symbolic and statistical AI. SMC searches for policy interventions to control probabilistic outcomes. Solving SMC is challenging because of its highly intractable nature (NP^PP-complete), incorporating statistical inference and symbolic reasoning. Previous research on SMC solving lacks provable guarantees and/or suffers from suboptimal empirical performance, especially when combinatorial constraints are present. We propose XOR-SMC, a polynomial algorithm with access to NP-oracles, to solve highly intractable SMC problems with constant approximation guarantees. XOR-SMC transforms the highly intractable SMC into satisfiability problems by replacing the model counting in SMC with SAT formulae subject to randomized XOR constraints. Experiments on solving important SMC problems in AI for social good demonstrate that XOR-SMC outperforms several baselines both in solution quality and running time. \ No newline at end of file diff --git a/data/2024/aaai/Solving Spectrum Unmixing as a Multi-Task Bayesian Inverse Problem with Latent Factors for Endmember Variability b/data/2024/aaai/Solving Spectrum Unmixing as a Multi-Task Bayesian Inverse Problem with Latent Factors for Endmember Variability new file mode 100644 index 0000000000..944111724b --- /dev/null +++ b/data/2024/aaai/Solving Spectrum Unmixing as a Multi-Task Bayesian Inverse Problem with Latent Factors for Endmember Variability @@ -0,0 +1,5 @@ +With the increasing customization of spectrometers, spectral unmixing has become a widely used technique in fields such as remote sensing, textiles, and environmental protection. +However, endmember variability is a common issue for unmixing, where changes in lighting, atmospheric, temporal conditions, or the intrinsic spectral characteristics of materials, can all result in variations in the measured spectrum. +Recent studies have employed deep neural networks to tackle endmember variability. However, these approaches rely on generic networks to implicitly resolve the issue, which struggles with the ill-posed nature and lack of effective convergence constraints for endmember variability. This paper proposes a streamlined multi-task learning model to rectify this problem, incorporating abundance regression and multi-label classification with Unmixing as a Bayesian Inverse Problem, denoted as BIPU. +To address the issue of the ill-posed nature, the uncertainty of unmixing is quantified and minimized through the Laplace approximation in a Bayesian inverse solver. In addition, to improve convergence under the influence of endmember variability, the paper introduces two types of constraints. 
The first separates background factors of variants from the initial factors for each endmember, while the second identifies and eliminates the influence of non-existent endmembers via multi-label classification during convergence. +The effectiveness of this model is demonstrated not only on a self-collected near-infrared spectral textile dataset (FENIR), but also on three commonly used remote sensing hyperspectral image datasets, where it achieves state-of-the-art unmixing performance and exhibits strong generalization capabilities. \ No newline at end of file diff --git a/data/2024/aaai/Some Like It Small: Czech Semantic Embedding Models for Industry Applications b/data/2024/aaai/Some Like It Small: Czech Semantic Embedding Models for Industry Applications new file mode 100644 index 0000000000..61d0978cbf --- /dev/null +++ b/data/2024/aaai/Some Like It Small: Czech Semantic Embedding Models for Industry Applications @@ -0,0 +1 @@ +This article focuses on the development and evaluation of Small-sized Czech sentence embedding models. Small models are important components for real-time industry applications in resource-constrained environments. Given the limited availability of labeled Czech data, alternative approaches, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, are investigated. Comprehensive intrinsic and extrinsic analyses are conducted, showcasing the competitive performance of our models compared to significantly larger counterparts, with approximately 8 times smaller size and 5 times faster speed than conventional Base-sized models. To promote cooperation and reproducibility, both the models and the evaluation pipeline are made publicly accessible. Ultimately, this article presents practical applications of the developed sentence embedding models in Seznam.cz, the Czech search engine. These models have effectively replaced previous counterparts, enhancing the overall search experience, for instance in organic search, featured snippets, and image search. This transition has yielded improved performance. \ No newline at end of file diff --git a/data/2024/aaai/SpFormer: Spatio-Temporal Modeling for Scanpaths with Transformer b/data/2024/aaai/SpFormer: Spatio-Temporal Modeling for Scanpaths with Transformer new file mode 100644 index 0000000000..7acadf84a0 --- /dev/null +++ b/data/2024/aaai/SpFormer: Spatio-Temporal Modeling for Scanpaths with Transformer @@ -0,0 +1 @@ +Saccadic scanpath, a data representation of human visual behavior, has received broad interest in multiple domains. Scanpath is a complex eye-tracking data modality that includes the sequences of fixation positions and fixation duration, coupled with image information. However, previous methods usually face the spatial misalignment problem of fixation features and loss of critical temporal data (including temporal correlation and fixation duration). In this study, we propose a Transformer-based scanpath model, SpFormer, to alleviate these problems. First, we propose a fixation-centric paradigm to extract the aligned spatial fixation features and tokenize the scanpaths. Then, according to the visual working memory mechanism, we design a local meta attention to reduce the semantic redundancy of fixations and guide the model to focus on the meta scanpath. Finally, we progressively integrate the duration information and fuse it with the fixation features to resolve location ambiguity as the number of Transformer blocks increases.
We conduct extensive experiments on four databases under three tasks. The SpFormer establishes new state-of-the-art results in distinct settings, verifying its flexibility and versatility in practical applications. The code can be obtained from https://github.com/wenqizhong/SpFormer. \ No newline at end of file diff --git a/data/2024/aaai/SpaceGTN: A Time-Agnostic Graph Transformer Network for Handwritten Diagram Recognition and Segmentation b/data/2024/aaai/SpaceGTN: A Time-Agnostic Graph Transformer Network for Handwritten Diagram Recognition and Segmentation new file mode 100644 index 0000000000..334d7818fa --- /dev/null +++ b/data/2024/aaai/SpaceGTN: A Time-Agnostic Graph Transformer Network for Handwritten Diagram Recognition and Segmentation @@ -0,0 +1 @@ +Online handwriting recognition is pivotal in domains like note-taking, education, healthcare, and office tasks. Existing diagram recognition algorithms mainly rely on the temporal information of strokes, resulting in a decline in recognition performance when dealing with notes that have been modified or have no temporal information. The current datasets are drawn based on templates and cannot reflect the real free-drawing situation. To address these challenges, we present SpaceGTN, a time-agnostic Graph Transformer Network, leveraging spatial integration and removing the need for temporal data. Extensive experiments on multiple datasets have demonstrated that our method consistently outperforms existing methods and achieves state-of-the-art performance. We also propose a pipeline that seamlessly connects offline and online handwritten diagrams. By integrating a stroke restoration technique with SpaceGTN, it enables intelligent editing of previously uneditable offline diagrams at the stroke level. In addition, we have also launched the first online handwritten diagram dataset, OHSD, which is collected using a free-drawing method and comes with modification annotations. \ No newline at end of file diff --git a/data/2024/aaai/Span Graph Transformer for Document-Level Named Entity Recognition b/data/2024/aaai/Span Graph Transformer for Document-Level Named Entity Recognition new file mode 100644 index 0000000000..3f6cbd67be --- /dev/null +++ b/data/2024/aaai/Span Graph Transformer for Document-Level Named Entity Recognition @@ -0,0 +1 @@ +Named Entity Recognition (NER), which aims to identify the span and category of entities within text, is a fundamental task in natural language processing. Recent NER approaches have featured pre-trained transformer-based models (e.g., BERT) as a crucial encoding component to achieve state-of-the-art performance. However, due to the length limit for input text, these models typically consider text at the sentence-level and cannot capture the long-range contextual dependency within a document. To address this issue, we propose a novel Span Graph Transformer (SGT) method for document-level NER, which constructs long-range contextual dependencies at both the token and span levels. Specifically, we first retrieve relevant contextual sentences in the document for each target sentence, and jointly encode them by BERT to capture token-level dependencies. Then, our proposed model extracts candidate spans from each sentence and integrates these spans into a document-level span graph, where nested spans within sentences and identical spans across sentences are connected. 
By leveraging the power of Graph Transformer and well-designed position encoding, our span graph can fully exploit span-level dependencies within the document. Extensive experiments on both resource-rich nested and flat NER datasets, as well as low-resource distantly supervised NER datasets, demonstrate that the proposed SGT model achieves better performance than previous state-of-the-art models. \ No newline at end of file diff --git a/data/2024/aaai/Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales b/data/2024/aaai/Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales new file mode 100644 index 0000000000..c28e5a1b4e --- /dev/null +++ b/data/2024/aaai/Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales @@ -0,0 +1 @@ +With the alarming rise of hate speech in online communities, the demand for effective NLP models to identify instances of offensive language has reached a critical point. However, the development of such models heavily relies on the availability of annotated datasets, which are scarce, particularly for less-studied languages. To bridge this gap for the Persian language, we present a novel dataset specifically tailored to multi-label hate speech detection. Our dataset, called Phate, consists of an extensive collection of over seven thousand manually-annotated Persian tweets, offering a rich resource for training and evaluating hate speech detection models in this language. Notably, each annotation in our dataset specifies the targeted group of hate speech and includes a span of the tweet which elucidates the rationale behind the assigned label. The incorporation of this information expands the potential applications of our dataset, facilitating the detection of targeted online harm and allowing the benchmark to support research on the interpretability of hate speech detection models. The dataset, annotation guidelines, and all associated code are accessible at https://github.com/Zahra-D/Phate. \ No newline at end of file diff --git a/data/2024/aaai/Sparse Bayesian Deep Learning for Cross Domain Medical Image Reconstruction b/data/2024/aaai/Sparse Bayesian Deep Learning for Cross Domain Medical Image Reconstruction new file mode 100644 index 0000000000..de7380fbb1 --- /dev/null +++ b/data/2024/aaai/Sparse Bayesian Deep Learning for Cross Domain Medical Image Reconstruction @@ -0,0 +1 @@ +Cross domain medical image reconstruction aims to address the issue that deep learning models trained solely on one source dataset might not generalize effectively to unseen target datasets from different hospitals. Some recent methods achieve satisfactory reconstruction performance, but often at the expense of extensive parameters and time consumption. To strike a balance between cross domain image reconstruction quality and model computational efficiency, we propose a lightweight sparse Bayesian deep learning method. Notably, we apply a fixed-form variational Bayes (FFVB) approach to quantify pixel-wise uncertainty priors derived from the degradation distribution of the source domain. Furthermore, by integrating the uncertainty prior into the posterior sampled through stochastic gradient Langevin dynamics (SGLD), we develop a training strategy that dynamically generates and optimizes the prior distribution on the network weights for each unseen domain.
This strategy enhances generalizability and ensures robust reconstruction performance. When evaluated on medical image reconstruction tasks, our proposed approach demonstrates impressive performance across various previously unseen domains. \ No newline at end of file diff --git a/data/2024/aaai/Sparse Enhanced Network: An Adversarial Generation Method for Robust Augmentation in Sequential Recommendation b/data/2024/aaai/Sparse Enhanced Network: An Adversarial Generation Method for Robust Augmentation in Sequential Recommendation new file mode 100644 index 0000000000..43bd23f462 --- /dev/null +++ b/data/2024/aaai/Sparse Enhanced Network: An Adversarial Generation Method for Robust Augmentation in Sequential Recommendation @@ -0,0 +1,3 @@ +Sequential Recommendation plays a significant role in daily recommendation systems, such as e-commerce platforms like Amazon and Taobao. However, even with the advent of large models, these platforms often face sparsity issues in the historical browsing records of individual users due to new users joining or the introduction of new products. As a result, existing sequence recommendation algorithms may not perform well. To address this, sequence-based data augmentation methods have garnered attention. + +Existing sequence enhancement methods typically rely on augmenting existing data, employing techniques like cropping, masking prediction, random reordering, and random replacement of the original sequence. While these methods have shown improvements, they often overlook the exploration of the deep embedding space of the sequence. To tackle these challenges, we propose a Sparse Enhanced Network (SparseEnNet), which is a robust adversarial generation method. SparseEnNet aims to fully explore the hidden space in sequence recommendation, generating more robust enhanced items. Additionally, we adopt an adversarial generation method, allowing the model to differentiate between data augmentation categories and achieve better prediction performance for the next item in the sequence. Experiments have demonstrated that our method achieves a remarkable 4-14% improvement over existing methods when evaluated on real-world datasets. (https://github.com/junyachen/SparseEnNet) \ No newline at end of file diff --git a/data/2024/aaai/Sparse Variational Student-t Processes b/data/2024/aaai/Sparse Variational Student-t Processes new file mode 100644 index 0000000000..ff36a49975 --- /dev/null +++ b/data/2024/aaai/Sparse Variational Student-t Processes @@ -0,0 +1 @@ +The theory of Bayesian learning incorporates the use of Student-t Processes to model heavy-tailed distributions and datasets with outliers. However, despite Student-t Processes having a computational complexity similar to that of Gaussian Processes, there has been limited emphasis on the sparse representation of this model. This is mainly due to the increased difficulty in modeling and computation compared to previous sparse Gaussian Processes. Our motivation is to address the need for a sparse representation framework that reduces computational complexity, allowing Student-t Processes to be more flexible for real-world datasets. To achieve this, we leverage the conditional distribution of Student-t Processes to introduce sparse inducing points. Bayesian methods and variational inference are then utilized to derive a well-defined lower bound, facilitating more efficient optimization of our model through stochastic gradient descent.
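The sparse Bayesian reconstruction abstract above samples network weights with stochastic gradient Langevin dynamics (SGLD). A minimal, generic SGLD update on a toy quadratic energy is sketched below; the energy function and step size are illustrative stand-ins for the network loss plus prior used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_energy(theta):
    # toy energy U(theta) = 0.5 * ||theta||^2, i.e. a standard Gaussian target
    return theta

def sgld_step(theta, step=1e-2):
    """One SGLD update: half a gradient step on U plus Gaussian noise of variance `step`."""
    noise = rng.normal(size=theta.shape) * np.sqrt(step)
    return theta - 0.5 * step * grad_energy(theta) + noise

theta = np.ones(4)
for _ in range(1000):
    theta = sgld_step(theta)
print(theta)  # iterates drift toward, and then sample around, the high-probability region
```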
We propose two methods for computing the variational lower bound, one utilizing Monte Carlo sampling and the other employing Jensen's inequality to compute the KL regularization term in the loss function. We propose adopting these approaches as viable alternatives to Gaussian processes when the data might contain outliers or exhibit heavy-tailed behavior, and we provide specific recommendations for their applicability. We evaluate the two proposed approaches on various synthetic and real-world datasets from UCI and Kaggle, demonstrating their effectiveness compared to baseline methods in terms of computational complexity and accuracy, as well as their robustness to outliers. \ No newline at end of file diff --git a/data/2024/aaai/Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views b/data/2024/aaai/Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views new file mode 100644 index 0000000000..2280b189b1 --- /dev/null +++ b/data/2024/aaai/Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views @@ -0,0 +1 @@ +Reconstructing 3D objects from extremely sparse views is a long-standing and challenging problem. While recent techniques employ image diffusion models for generating plausible images at novel viewpoints or for distilling pre-trained diffusion priors into 3D representations using score distillation sampling (SDS), these methods often struggle to simultaneously achieve high-quality, consistent, and detailed results for both novel-view synthesis (NVS) and geometry. In this work, we present Sparse3D, a novel 3D reconstruction method tailored for sparse view inputs. Our approach distills robust priors from a multiview-consistent diffusion model to refine a neural radiance field. Specifically, we employ a controller that harnesses epipolar features from input views, guiding a pre-trained diffusion model, such as Stable Diffusion, to produce novel-view images that maintain 3D consistency with the input. By tapping into 2D priors from powerful image diffusion models, our integrated model consistently delivers high-quality results, even when faced with open-world objects. To address the blurriness introduced by conventional SDS, we introduce the category-score distillation sampling (C-SDS) to enhance detail. We conduct experiments on CO3DV2 which is a multi-view dataset of real-world objects. Both quantitative and qualitative evaluations demonstrate that our approach outperforms previous state-of-the-art works on the metrics regarding NVS and geometry reconstruction. \ No newline at end of file diff --git a/data/2024/aaai/SparseGNV: Generating Novel Views of Indoor Scenes with Sparse RGB-D Images b/data/2024/aaai/SparseGNV: Generating Novel Views of Indoor Scenes with Sparse RGB-D Images new file mode 100644 index 0000000000..7b2beac5e8 --- /dev/null +++ b/data/2024/aaai/SparseGNV: Generating Novel Views of Indoor Scenes with Sparse RGB-D Images @@ -0,0 +1 @@ +We study to generate novel views of indoor scenes given sparse input views. The challenge is to achieve both photorealism and view consistency. We present SparseGNV: a learning framework that incorporates 3D structures and image generative models to generate novel views with three modules. The first module builds a neural point cloud as underlying geometry, providing scene context and guidance for the target novel view. 
The second module utilizes a transformer-based network to map the scene context and the guidance into a shared latent space and autoregressively decodes the target view in the form of discrete image tokens. The third module reconstructs the tokens back to the image of the target view. SparseGNV is trained across a large-scale indoor scene dataset to learn generalizable priors. Once trained, it can efficiently generate novel views of an unseen indoor scene in a feed-forward manner. We evaluate SparseGNV on real-world indoor scenes and demonstrate that it outperforms state-of-the-art methods based on either neural radiance fields or conditional image generation. \ No newline at end of file diff --git a/data/2024/aaai/Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention b/data/2024/aaai/Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention new file mode 100644 index 0000000000..335e22f17f --- /dev/null +++ b/data/2024/aaai/Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention @@ -0,0 +1 @@ +Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. However, the enigmatic ``black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. While past approaches, such as attention visualization, pivotal subnetwork extraction, and concept-based analyses, offer some insight, they often focus on either local or global explanations within a single dimension, occasionally falling short in providing comprehensive clarity. In response, we propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs. Our framework, termed SparseCBM, innovatively integrates sparsity to elucidate three intertwined layers of interpretation: input, subnetwork, and concept levels. In addition, the newly introduced dimension of interpretable inference-time intervention facilitates dynamic adjustments to the model during deployment. Through rigorous empirical evaluations on real-world datasets, we demonstrate that SparseCBM delivers a profound understanding of LLM behaviors, setting it apart in both interpreting and ameliorating model inaccuracies. Codes are provided in supplements. \ No newline at end of file diff --git a/data/2024/aaai/Spatial Transform Decoupling for Oriented Object Detection b/data/2024/aaai/Spatial Transform Decoupling for Oriented Object Detection new file mode 100644 index 0000000000..cf13181da2 --- /dev/null +++ b/data/2024/aaai/Spatial Transform Decoupling for Oriented Object Detection @@ -0,0 +1 @@ +Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, and this limitation may be inherently attributed to the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD utilizes separate network branches to predict the position, size, and angle of bounding boxes, effectively harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. 
Moreover, by aggregating cascaded activation masks (CAMs) computed upon the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on the benchmark datasets including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), which demonstrates the effectiveness of the proposed method. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling. \ No newline at end of file diff --git a/data/2024/aaai/Spatial Voting with Incomplete Voter Information b/data/2024/aaai/Spatial Voting with Incomplete Voter Information new file mode 100644 index 0000000000..ac2b56fc96 --- /dev/null +++ b/data/2024/aaai/Spatial Voting with Incomplete Voter Information @@ -0,0 +1,2 @@ +We consider spatial voting where candidates are located in the Euclidean d-dimensional space, and each voter ranks candidates based on their distance from the voter's ideal point. We explore the case where information about the location of voters' ideal points is incomplete: for each dimension, we are given an interval of possible values. We study the computational complexity of finding the possible and necessary winners for positional scoring rules. Our results show that we retain tractable cases of the classic model where voters have partial-order preferences. Moreover, we show that there are positional scoring rules under which the possible-winner problem is intractable for partial orders, but tractable in the one-dimensional spatial setting. +We also consider approval voting in this setting. We show that for up to two dimensions, the necessary-winner problem is tractable, while the possible-winner problem is hard for any number of dimensions. \ No newline at end of file diff --git a/data/2024/aaai/Spatial-Contextual Discrepancy Information Compensation for GAN Inversion b/data/2024/aaai/Spatial-Contextual Discrepancy Information Compensation for GAN Inversion new file mode 100644 index 0000000000..117dd7b04d --- /dev/null +++ b/data/2024/aaai/Spatial-Contextual Discrepancy Information Compensation for GAN Inversion @@ -0,0 +1 @@ +Most existing GAN inversion methods either achieve accurate reconstruction but lack editability or offer strong editability at the cost of fidelity. Hence, how to balance the distortion-editability trade-off is a significant challenge for GAN inversion. To address this challenge, we introduce a novel spatial-contextual discrepancy information compensation-based GAN-inversion method (SDIC), which consists of a discrepancy information prediction network (DIPN) and a discrepancy information compensation network (DICN). SDIC follows a ``compensate-and-edit'' paradigm and successfully bridges the gap in image details between the original image and the reconstructed/edited image. On the one hand, DIPN encodes the multi-level spatial-contextual information of the original and initial reconstructed images and then predicts a spatial-contextual guided discrepancy map with two hourglass modules. In this way, a reliable discrepancy map that models the contextual relationship and captures fine-grained image details is learned. On the other hand, DICN incorporates the predicted discrepancy information into both the latent code and the GAN generator with different transformations, generating high-quality reconstructed/edited images. This effectively compensates for the loss of image details during GAN inversion. 
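The spatial voting abstract above asks whether a candidate can possibly win when each voter's ideal point is only known to lie in an interval. The sketch below is a naive, discretized check for one-dimensional plurality (a positional scoring rule): it enumerates a grid of ideal-point placements inside every voter's interval and reports whether some combination makes the target candidate a winner. This brute force only illustrates the problem statement, not the paper's algorithms, and a finite grid can miss witnesses.

```python
from itertools import product
import numpy as np

def possible_winner_naive(cands, intervals, target, grid=5):
    """Brute-force possible-winner check for 1D plurality over a discretized grid."""
    cands = np.asarray(cands, dtype=float)
    choices = [np.linspace(lo, hi, grid) for lo, hi in intervals]
    for ideal_pts in product(*choices):                  # one scenario of ideal points
        votes = np.zeros(len(cands), dtype=int)
        for x in ideal_pts:
            votes[np.argmin(np.abs(cands - x))] += 1     # each voter picks the closest candidate
        if votes[target] == votes.max():                 # target (co-)wins this scenario
            return True
    return False

# candidates at 0, 1, 2; two voters with uncertain ideal points
print(possible_winner_naive([0.0, 1.0, 2.0], [(0.4, 1.6), (1.2, 2.0)], target=0))
```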
Both quantitative and qualitative experiments demonstrate that our proposed method achieves the excellent distortion-editability trade-off at a fast inference speed for both image inversion and editing tasks. Our code is available at https://github.com/ZzqLKED/SDIC. \ No newline at end of file diff --git a/data/2024/aaai/Spatial-Logic-Aware Weakly Supervised Learning for Flood Mapping on Earth Imagery b/data/2024/aaai/Spatial-Logic-Aware Weakly Supervised Learning for Flood Mapping on Earth Imagery new file mode 100644 index 0000000000..fbf8e2f0a7 --- /dev/null +++ b/data/2024/aaai/Spatial-Logic-Aware Weakly Supervised Learning for Flood Mapping on Earth Imagery @@ -0,0 +1 @@ +Flood mapping on Earth imagery is crucial for disaster management, but its efficacy is hampered by the lack of high-quality training labels. Given high-resolution Earth imagery with coarse and noisy training labels, a base deep neural network model, and a spatial knowledge base with label constraints, our problem is to infer the true high-resolution labels while training neural network parameters. Traditional methods are largely based on specific physical properties and thus fall short of capturing the rich domain constraints expressed by symbolic logic. Neural-symbolic models can capture rich domain knowledge, but existing methods do not address the unique spatial challenges inherent in flood mapping on high-resolution imagery. To fill this gap, we propose a spatial-logic-aware weakly supervised learning framework. Our framework integrates symbolic spatial logic inference into probabilistic learning in a weakly supervised setting. To reduce the time costs of logic inference on vast high-resolution pixels, we propose a multi-resolution spatial reasoning algorithm to infer true labels while training neural network parameters. Evaluations of real-world flood datasets show that our model outperforms several baselines in prediction accuracy. The code is available at https://github.com/spatialdatasciencegroup/SLWSL. \ No newline at end of file diff --git a/data/2024/aaai/Spatial-Temporal Augmentation for Crime Prediction (Student Abstract) b/data/2024/aaai/Spatial-Temporal Augmentation for Crime Prediction (Student Abstract) new file mode 100644 index 0000000000..9c27c6dd55 --- /dev/null +++ b/data/2024/aaai/Spatial-Temporal Augmentation for Crime Prediction (Student Abstract) @@ -0,0 +1 @@ +Crime prediction stands as a pivotal concern within the realm of urban management due to its potential threats to public safety. While prior research has predominantly focused on unraveling the intricate dependencies among urban regions and temporal dynamics, the challenges posed by the scarcity and uncertainty of historical crime data have not been thoroughly investigated. This study introduces an innovative spatial-temporal augmented learning framework for crime prediction, namely STAug. In STAug, we devise a CrimeMix to improve the ability of generalization. Furthermore, we harness a spatial-temporal aggregation to capture and incorporate multiple correlations covering the temporal, spatial, and crime-type aspects. Experiments on two real-world datasets underscore the superiority of STAug over several baselines. 
\ No newline at end of file diff --git a/data/2024/aaai/Spatial-Temporal Interplay in Human Mobility: A Hierarchical Reinforcement Learning Approach with Hypergraph Representation b/data/2024/aaai/Spatial-Temporal Interplay in Human Mobility: A Hierarchical Reinforcement Learning Approach with Hypergraph Representation new file mode 100644 index 0000000000..a3e0373985 --- /dev/null +++ b/data/2024/aaai/Spatial-Temporal Interplay in Human Mobility: A Hierarchical Reinforcement Learning Approach with Hypergraph Representation @@ -0,0 +1 @@ +In the realm of human mobility, the decision-making process for selecting the next-visit location is intricately influenced by a trade-off between spatial and temporal constraints, which are reflective of individual needs and preferences. This trade-off, however, varies across individuals, making the modeling of these spatial-temporal dynamics a formidable challenge. To address the problem, in this work, we introduce the "Spatial-temporal Induced Hierarchical Reinforcement Learning" (STI-HRL) framework, for capturing the interplay between spatial and temporal factors in human mobility decision-making. Specifically, STI-HRL employs a two-tiered decision-making process: the low-level focuses on disentangling spatial and temporal preferences using dedicated agents, while the high-level integrates these considerations to finalize the decision. To complement the hierarchical decision setting, we construct a hypergraph to organize historical data, encapsulating the multi-aspect semantics of human mobility. We propose a cross-channel hypergraph embedding module to learn the representations as the states to facilitate the decision-making cycle. Our extensive experiments on two real-world datasets validate the superiority of STI-HRL over state-of-the-art methods in predicting users' next visits across various performance metrics. \ No newline at end of file diff --git a/data/2024/aaai/Spatio-Temporal Fusion for Human Action Recognition via Joint Trajectory Graph b/data/2024/aaai/Spatio-Temporal Fusion for Human Action Recognition via Joint Trajectory Graph new file mode 100644 index 0000000000..d9cba4f9c9 --- /dev/null +++ b/data/2024/aaai/Spatio-Temporal Fusion for Human Action Recognition via Joint Trajectory Graph @@ -0,0 +1 @@ +Graph Convolutional Networks (GCNs) and Transformers have been widely applied to skeleton-based human action recognition, with each offering unique advantages in capturing spatial relationships and long-range dependencies. However, for most GCN methods, the construction of topological structures relies solely on the spatial information of human joints, limiting their ability to directly capture richer spatio-temporal dependencies. Additionally, the self-attention modules of many Transformer methods lack topological structure information, restricting the robustness and generalization of the models. To address these issues, we propose a Joint Trajectory Graph (JTG) that integrates spatio-temporal information into a uniform graph structure. We also present a Joint Trajectory GraphFormer (JT-GraphFormer), which directly captures the spatio-temporal relationships among all joint trajectories for human action recognition. To better integrate topological information into spatio-temporal relationships, we introduce a Spatio-Temporal Dijkstra Attention (STDA) mechanism to calculate relationship scores for all the joints in JTG. 
Furthermore, we incorporate the Koopman operator into the classification stage to enhance the model's representation ability and classification performance. Experiments demonstrate that JT-GraphFormer achieves outstanding performance in human action recognition tasks, outperforming state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and N-UCLA datasets. \ No newline at end of file diff --git a/data/2024/aaai/Spatio-Temporal Pivotal Graph Neural Networks for Traffic Flow Forecasting b/data/2024/aaai/Spatio-Temporal Pivotal Graph Neural Networks for Traffic Flow Forecasting new file mode 100644 index 0000000000..f7f86a0468 --- /dev/null +++ b/data/2024/aaai/Spatio-Temporal Pivotal Graph Neural Networks for Traffic Flow Forecasting @@ -0,0 +1 @@ +Traffic flow forecasting is a classical spatio-temporal data mining problem with many real-world applications. Recently, various methods based on Graph Neural Networks (GNN) have been proposed for the problem and achieved impressive prediction performance. However, we argue that the majority of existing methods disregard the importance of certain nodes (referred to as pivotal nodes) that naturally exhibit extensive connections with multiple other nodes. Predicting on pivotal nodes poses a challenge due to their complex spatio-temporal dependencies compared to other nodes. In this paper, we propose a novel GNN-based method called Spatio-Temporal Pivotal Graph Neural Networks (STPGNN) to address the above limitation. We introduce a pivotal node identification module for identifying pivotal nodes. We propose a novel pivotal graph convolution module, enabling precise capture of spatio-temporal dependencies centered around pivotal nodes. Moreover, we propose a parallel framework capable of extracting spatio-temporal traffic features on both pivotal and non-pivotal nodes. Experiments on seven real-world traffic datasets verify our proposed method's effectiveness and efficiency compared to state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Spear and Shield: Adversarial Attacks and Defense Methods for Model-Based Link Prediction on Continuous-Time Dynamic Graphs b/data/2024/aaai/Spear and Shield: Adversarial Attacks and Defense Methods for Model-Based Link Prediction on Continuous-Time Dynamic Graphs new file mode 100644 index 0000000000..869ef89129 --- /dev/null +++ b/data/2024/aaai/Spear and Shield: Adversarial Attacks and Defense Methods for Model-Based Link Prediction on Continuous-Time Dynamic Graphs @@ -0,0 +1,10 @@ +Real-world graphs are dynamic, constantly evolving with new interactions, such as financial transactions in financial networks. +Temporal Graph Neural Networks (TGNNs) have been developed to effectively capture the evolving patterns in dynamic graphs. +While these models have demonstrated their superiority, being widely adopted in various important fields, their vulnerabilities against adversarial attacks remain largely unexplored. +In this paper, we propose T-SPEAR, a simple and effective adversarial attack method for link prediction on continuous-time dynamic graphs, focusing on investigating the vulnerabilities of TGNNs. +Specifically, before the training procedure of a victim model, which is a TGNN for link prediction, we inject edge perturbations into the data that are unnoticeable in terms of the four constraints we propose, and yet effective enough to cause malfunction of the victim model. +Moreover, we propose a robust training approach T-SHIELD to mitigate the impact of adversarial attacks.
+By using edge filtering and enforcing temporal smoothness to node embeddings, we enhance the robustness of the victim model. +Our experimental study shows that T-SPEAR significantly degrades the victim model's performance on link prediction tasks, and even more, our attacks are transferable to other TGNNs, which differ from the victim model assumed by the attacker. +Moreover, we demonstrate that T-SHIELD effectively filters out adversarial edges and exhibits robustness against adversarial attacks, surpassing the link prediction performance of the naive TGNN by up to 11.2% under T-SPEAR. +The code and datasets are available at https://github.com/wooner49/T-spear-shield \ No newline at end of file diff --git a/data/2024/aaai/Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation b/data/2024/aaai/Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation new file mode 100644 index 0000000000..db828eb273 --- /dev/null +++ b/data/2024/aaai/Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation @@ -0,0 +1,6 @@ +Recently, CLIP has found practical utility in the domain of pixel-level zero-shot segmentation tasks. +The present landscape features two-stage methodologies beset by issues such as intricate pipelines and elevated computational costs. While current one-stage approaches alleviate these concerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's generalization capacity, they still fall short in fully harnessing CLIP's potential for pixel-level unseen class demarcation and precise pixel predictions. +To further stimulate CLIP's zero-shot dense prediction capability, we propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from image to pixel. +Specifically, we initially introduce Spectral Prompt Tuning (SPT), incorporating spectral prompts into the CLIP visual encoder's shallow layers to capture structural intricacies of images, thereby enhancing comprehension of unseen classes. +Subsequently, we introduce the Spectral Guided Decoder (SGD), utilizing both high and low-frequency information to steer the network's spatial focus towards more prominent classification features, enabling precise pixel-level prediction outcomes. +Through extensive experiments on two public datasets, we demonstrate the superiority of our method over state-of-the-art approaches, performing well across all classes and particularly excelling in handling unseen classes. \ No newline at end of file diff --git a/data/2024/aaai/Spectral-Based Graph Neural Networks for Complementary Item Recommendation b/data/2024/aaai/Spectral-Based Graph Neural Networks for Complementary Item Recommendation new file mode 100644 index 0000000000..e9b27aa353 --- /dev/null +++ b/data/2024/aaai/Spectral-Based Graph Neural Networks for Complementary Item Recommendation @@ -0,0 +1,2 @@ +Modeling complementary relationships greatly helps recommender systems to accurately and promptly recommend the subsequent items when one item is purchased. Unlike traditional similar relationships, items with complementary relationships may be purchased successively (such as iPhone and Airpods Pro), and they not only share relevance but also exhibit dissimilarity. Since the two attributes are opposites, modeling complementary relationships is challenging. 
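The T-SPEAR/T-SHIELD abstract above mentions two defenses: filtering suspicious edges and enforcing temporal smoothness on node embeddings. A generic sketch of both ideas follows; the cosine-similarity threshold and the L2 smoothness penalty are illustrative choices, not necessarily the paper's exact criteria.

```python
import numpy as np

def filter_edges(src_emb, dst_emb, edges, threshold=0.0):
    """Keep only edges whose endpoint embeddings are sufficiently similar (cosine)."""
    kept = []
    for u, v in edges:
        a, b = src_emb[u], dst_emb[v]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos >= threshold:
            kept.append((u, v))
    return kept

def temporal_smoothness_loss(emb_prev, emb_curr):
    """Penalize abrupt changes of node embeddings between consecutive time steps."""
    return float(np.mean(np.sum((emb_curr - emb_prev) ** 2, axis=1)))

emb_t0 = np.random.default_rng(0).normal(size=(5, 8))
emb_t1 = emb_t0 + 0.1 * np.random.default_rng(1).normal(size=(5, 8))
print(filter_edges(emb_t1, emb_t1, [(0, 1), (2, 3)]))
print(temporal_smoothness_loss(emb_t0, emb_t1))
```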
Previous attempts to exploit these relationships have either ignored or oversimplified the dissimilarity attribute, resulting in ineffective modeling and an inability to balance the two attributes. Since Graph Neural Networks (GNNs) can capture the relevance and dissimilarity between nodes in the spectral domain, we can leverage spectral-based GNNs to effectively understand and model complementary relationships. +In this study, we present a novel approach called Spectral-based Complementary Graph Neural Networks (SComGNN) that utilizes the spectral properties of complementary item graphs. We make the first observation that complementary relationships consist of low-frequency and mid-frequency components, corresponding to the relevance and dissimilarity attributes, respectively. Based on this spectral observation, we design spectral graph convolutional networks with low-pass and mid-pass filters to capture the low-frequency and mid-frequency components. Additionally, we propose a two-stage attention mechanism to adaptively integrate and balance the two attributes. Experimental results on four e-commerce datasets demonstrate the effectiveness of our model, with SComGNN significantly outperforming existing baseline models. \ No newline at end of file diff --git a/data/2024/aaai/SpectralNeRF: Physically Based Spectral Rendering with Neural Radiance Field b/data/2024/aaai/SpectralNeRF: Physically Based Spectral Rendering with Neural Radiance Field new file mode 100644 index 0000000000..9312a7aa89 --- /dev/null +++ b/data/2024/aaai/SpectralNeRF: Physically Based Spectral Rendering with Neural Radiance Field @@ -0,0 +1 @@ +In this paper, we propose SpectralNeRF, an end-to-end Neural Radiance Field (NeRF)-based architecture for high-quality physically based rendering from a novel spectral perspective. We modify the classical spectral rendering into two main steps, 1) the generation of a series of spectrum maps spanning different wavelengths, 2) the combination of these spectrum maps for the RGB output. Our SpectralNeRF follows these two steps through the proposed multi-layer perceptron (MLP)-based architecture (SpectralMLP) and Spectrum Attention UNet (SAUNet). Given the ray origin and the ray direction, the SpectralMLP constructs the spectral radiance field to obtain spectrum maps of novel views, which are then sent to the SAUNet to produce RGB images of white-light illumination. Applying NeRF to build up the spectral rendering is a more physically-based way from the perspective of ray-tracing. Further, the spectral radiance fields decompose difficult scenes and improve the performance of NeRF-based methods. Comprehensive experimental results demonstrate the proposed SpectralNeRF is superior to recent NeRF-based methods when synthesizing new views on synthetic and real datasets. The codes and datasets are available at https://github.com/liru0126/SpectralNeRF. \ No newline at end of file diff --git a/data/2024/aaai/Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile b/data/2024/aaai/Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile new file mode 100644 index 0000000000..54de31445c --- /dev/null +++ b/data/2024/aaai/Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile @@ -0,0 +1 @@ +Currently, image generation and synthesis have remarkably progressed with generative models. 
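The SComGNN abstract above builds graph convolutions with a low-pass filter (capturing relevance) and a mid-pass filter (capturing dissimilarity) on the item graph's spectrum. Below is a minimal sketch using simple polynomial frequency responses on the normalized Laplacian, h_low(lambda) = 1 - lambda/2 and h_mid(lambda) = 1 - (lambda - 1)^2, which emphasize low and mid frequencies respectively; the paper's actual filter designs may differ.

```python
import numpy as np

def normalized_laplacian(A):
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    return np.eye(len(A)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def low_pass(L):
    # h(lambda) = 1 - lambda/2: emphasizes smooth (low-frequency) signal components
    return np.eye(len(L)) - 0.5 * L

def mid_pass(L):
    # h(lambda) = 1 - (lambda - 1)^2: peaks at lambda = 1, suppresses the spectrum's extremes
    I = np.eye(len(L))
    return I - (L - I) @ (L - I)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))   # node (item) features
L = normalized_laplacian(A)
print(low_pass(L) @ X)   # relevance-oriented propagation
print(mid_pass(L) @ X)   # dissimilarity-oriented propagation
```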
Despite photo-realistic results, intrinsic discrepancies are still observed in the frequency domain. The spectral discrepancy appears not only in generative adversarial networks but also in diffusion models. In this study, we propose a framework to effectively mitigate the frequency-domain disparity of the generated images and improve the generative performance of both GAN and diffusion models. This is realized by spectrum translation for the refinement of image generation (STIG) based on contrastive learning. We draw on the theoretical analysis of frequency components in various generative networks. The key idea here is to refine the spectrum of the generated image via the concept of image-to-image translation and contrastive learning in terms of digital signal processing. We evaluate our framework across eight fake image datasets and various cutting-edge models to demonstrate the effectiveness of STIG. Our framework outperforms other cutting-edge methods, showing significant decreases in FID and in the log frequency distance of the spectrum. We further emphasize that STIG improves image quality by reducing spectral anomalies. Additionally, validation results show that a frequency-based deepfake detector is more easily confused when fake spectra are manipulated by STIG. \ No newline at end of file diff --git a/data/2024/aaai/SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model b/data/2024/aaai/SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model new file mode 100644 index 0000000000..d74ea8d9d9 --- /dev/null +++ b/data/2024/aaai/SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model @@ -0,0 +1 @@ +Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce a novel framework of SphereDiffusion to address these unique challenges, for better generating high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of the planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these specific techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and relatively reduces FID by around 35% on average.
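The STIG abstract above reports improvements in FID and in the log frequency distance of the spectrum. One plausible reading of that metric, sketched below under the assumption that it compares azimuthally averaged log-amplitude spectra of real and generated images, is:

```python
import numpy as np

def radial_profile(img):
    """Azimuthally averaged amplitude spectrum of a grayscale image."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices(img.shape)
    r = np.sqrt((y - cy) ** 2 + (x - cx) ** 2).astype(int)
    sums = np.bincount(r.ravel(), weights=spec.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

def log_frequency_distance(img_a, img_b, eps=1e-8):
    """RMSE between log radial spectra (an assumed form of the metric, for illustration)."""
    pa, pb = radial_profile(img_a), radial_profile(img_b)
    n = min(len(pa), len(pb))
    return float(np.sqrt(np.mean((np.log(pa[:n] + eps) - np.log(pb[:n] + eps)) ** 2)))

rng = np.random.default_rng(0)
real, fake = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
print(log_frequency_distance(real, fake))
```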
\ No newline at end of file diff --git a/data/2024/aaai/Spherical Pseudo-Cylindrical Representation for Omnidirectional Image Super-resolution b/data/2024/aaai/Spherical Pseudo-Cylindrical Representation for Omnidirectional Image Super-resolution new file mode 100644 index 0000000000..2a5a8e4719 --- /dev/null +++ b/data/2024/aaai/Spherical Pseudo-Cylindrical Representation for Omnidirectional Image Super-resolution @@ -0,0 +1 @@ +Omnidirectional images have attracted significant attention in recent years due to the rapid development of virtual reality technologies. Equirectangular projection (ERP), a naive form to store and transfer omnidirectional images, however, is challenging for existing two-dimensional (2D) image super-resolution (SR) methods due to its inhomogeneously distributed sampling density and distortion across latitude. In this paper, we make one of the first attempts to design a spherical pseudo-cylindrical representation, which not only allows pixels at different latitudes to adaptively adopt the best distinct sampling density but also is model-agnostic to most off-the-shelf SR methods, enhancing their performances. Specifically, we start by upsampling each latitude of the input ERP image and design a computationally tractable optimization algorithm to adaptively obtain a (sub)-optimal sampling density for each latitude of the ERP image. Addressing the distortion of ERP, we introduce a new viewport-based training loss based on the original 3D sphere format of the omnidirectional image, which inherently lacks distortion. Finally, we present a simple yet effective recursive progressive omnidirectional SR network to showcase the feasibility of our idea. The experimental results on public datasets demonstrate the effectiveness of the proposed method as well as the consistently superior performance of our method over most state-of-the-art methods both quantitatively and qualitatively. \ No newline at end of file diff --git a/data/2024/aaai/Spiking NeRF: Representing the Real-World Geometry by a Discontinuous Representation b/data/2024/aaai/Spiking NeRF: Representing the Real-World Geometry by a Discontinuous Representation new file mode 100644 index 0000000000..9c26278142 --- /dev/null +++ b/data/2024/aaai/Spiking NeRF: Representing the Real-World Geometry by a Discontinuous Representation @@ -0,0 +1,2 @@ +A crucial reason for the success of existing NeRF-based methods is to build a neural density field for the geometry representation via multi-layer perceptrons (MLPs). +MLPs are continuous functions; however, the real geometry or density field is frequently discontinuous at the interface between air and surface. This contradiction leads to unfaithful geometry representation. To this end, this paper proposes spiking NeRF, which leverages spiking neurons and a hybrid Artificial Neural Network (ANN)-Spiking Neural Network (SNN) framework to build a discontinuous density field for faithful geometry representation. Specifically, we first demonstrate why continuous density fields introduce inaccuracy. Then, we propose to use spiking neurons to build a discontinuous density field. We conduct a comprehensive analysis of the problems with existing spiking neuron models and then derive the numerical relationship between the spiking neuron's parameter and the theoretical accuracy of the geometry. Based on this, we propose a bounded spiking neuron to build the discontinuous density field. Our method achieves SOTA performance.
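The omnidirectional SR abstract above assigns each ERP latitude its own sampling density to counteract oversampling near the poles. A minimal sketch of one such pseudo-cylindrical allocation, simply making each row's sample count proportional to cos(latitude) (whereas the paper chooses densities via an adaptive optimization), is:

```python
import numpy as np

def pseudo_cylindrical_widths(erp_height, erp_width, min_samples=8):
    """Samples per latitude row, proportional to cos(latitude) on the sphere."""
    # row centres mapped to latitudes in (-pi/2, pi/2)
    lat = (np.arange(erp_height) + 0.5) / erp_height * np.pi - np.pi / 2
    widths = np.round(erp_width * np.cos(lat)).astype(int)
    return np.maximum(widths, min_samples)

widths = pseudo_cylindrical_widths(erp_height=16, erp_width=64)
print(widths)                                   # few samples near the poles, full width at the equator
print(int(widths.sum()), "samples instead of", 16 * 64)
```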
The source code and the supplementary material are available at https://github.com/liaozhanfeng/Spiking-NeRF. \ No newline at end of file diff --git a/data/2024/aaai/SpikingBERT: Distilling BERT to Train Spiking Language Models Using Implicit Differentiation b/data/2024/aaai/SpikingBERT: Distilling BERT to Train Spiking Language Models Using Implicit Differentiation new file mode 100644 index 0000000000..858a3b24fe --- /dev/null +++ b/data/2024/aaai/SpikingBERT: Distilling BERT to Train Spiking Language Models Using Implicit Differentiation @@ -0,0 +1 @@ +Large Language Models (LLMs), though growing exceedingly powerful, comprise orders of magnitude fewer neurons and synapses than the human brain. However, they require significantly more power/energy to operate. In this work, we propose a novel bio-inspired spiking language model (LM) which aims to reduce the computational cost of conventional LMs by drawing motivation from the synaptic information flow in the brain. In this paper, we demonstrate a framework that leverages the average spiking rate of neurons at equilibrium to train a neuromorphic spiking LM using an implicit differentiation technique, thereby overcoming the non-differentiability problem of spiking neural network (SNN) based algorithms without using any type of surrogate gradient. The steady-state convergence of the spiking neurons also allows us to design a spiking attention mechanism, which is critical in developing a scalable spiking LM. Moreover, the convergence of the average spiking rate of neurons at equilibrium is utilized to develop a novel ANN-SNN knowledge distillation based technique wherein we use a pre-trained BERT model as “teacher” to train our “student” spiking architecture. While the primary architecture proposed in this paper is motivated by BERT, the technique can be potentially extended to different kinds of LLMs. Our work is the first one to demonstrate the performance of an operational spiking LM architecture on multiple different tasks in the GLUE benchmark. Our implementation source code is available at https://github.com/NeuroCompLab-psu/SpikingBERT. \ No newline at end of file diff --git a/data/2024/aaai/Spot the Error: Non-autoregressive Graphic Layout Generation with Wireframe Locator b/data/2024/aaai/Spot the Error: Non-autoregressive Graphic Layout Generation with Wireframe Locator new file mode 100644 index 0000000000..3ad0676778 --- /dev/null +++ b/data/2024/aaai/Spot the Error: Non-autoregressive Graphic Layout Generation with Wireframe Locator @@ -0,0 +1 @@ +Layout generation is a critical step in graphic design to achieve meaningful compositions of elements. Most previous works view it as a sequence generation problem by concatenating element attribute tokens (i.e., category, size, position). So far the autoregressive approach (AR) has achieved promising results, but is still limited in global context modeling and suffers from error propagation since it can only attend to the previously generated tokens. Recent non-autoregressive attempts (NAR) have shown competitive results, which provide a wider context range and the flexibility to refine with iterative decoding. However, current works only use simple heuristics to recognize erroneous tokens for refinement, which is inaccurate. This paper first conducts an in-depth analysis to better understand the difference between the AR and NAR frameworks.
Furthermore, based on our observation that pixel space is more sensitive in capturing spatial patterns of graphic layouts (e.g., overlap, alignment), we propose a learning-based locator to detect erroneous tokens, which takes the wireframe image rendered from the generated layout sequence as input. We show that it serves as a complementary modality to the element sequence in object space and contributes greatly to the overall performance. Experiments on two public datasets show that our approach outperforms both AR and NAR baselines. Extensive studies further prove the effectiveness of different modules with interesting findings. Our code will be available at https://github.com/ffffatgoose/SpotError. \ No newline at end of file diff --git a/data/2024/aaai/Spotting the Unseen: Reciprocal Consensus Network Guided by Visual Archetypes b/data/2024/aaai/Spotting the Unseen: Reciprocal Consensus Network Guided by Visual Archetypes new file mode 100644 index 0000000000..1d148a54cc --- /dev/null +++ b/data/2024/aaai/Spotting the Unseen: Reciprocal Consensus Network Guided by Visual Archetypes @@ -0,0 +1 @@ +Humans often require only a few visual archetypes to spot novel objects. Based on this observation, we present a strategy rooted in ``spotting the unseen" by establishing dense correspondences between potential query image regions and a visual archetype, and we propose the Consensus Network (CoNet). Our method leverages relational patterns within and across images via Auto-Correlation Representation (ACR) and Mutual-Correlation Representation (MCR). Within each image, the ACR module is capable of encoding both local self-similarity and global context simultaneously. Between the query and support images, the MCR module computes the cross-correlation across the two image representations and introduces a reciprocal consistency constraint, which can be incorporated to exclude outliers and enhance model robustness. To overcome the challenges of low-resource training data, particularly in one-shot learning scenarios, we incorporate an adaptive margin strategy to better handle diverse instances. The experimental results indicate the effectiveness of the proposed method across diverse domains such as object detection in natural scenes and text spotting in both historical manuscripts and natural scenes, which demonstrates its remarkable generalization ability. Our code is available at: https://github.com/infinite-hwb/conet. \ No newline at end of file diff --git a/data/2024/aaai/Stability Analysis of Switched Linear Systems with Neural Lyapunov Functions b/data/2024/aaai/Stability Analysis of Switched Linear Systems with Neural Lyapunov Functions new file mode 100644 index 0000000000..946bfeb68f --- /dev/null +++ b/data/2024/aaai/Stability Analysis of Switched Linear Systems with Neural Lyapunov Functions @@ -0,0 +1,2 @@ +Neural-based, data-driven analysis and control of dynamical systems have recently been investigated and have shown great promise, e.g. for safety verification or stability analysis. Indeed, not only do neural networks allow for an entirely model-free, data-driven approach, but they also allow for handling arbitrarily complex functions via their power of representation (as opposed to, e.g. algebraic optimization techniques that are restricted to polynomial functions).
Whilst classical Lyapunov techniques allow one to provide a formal and robust guarantee of stability of a switched dynamical system, very little is yet known about correctness guarantees for Neural Lyapunov functions, nor about their performance (amount of data needed for a certain accuracy). +We formally introduce Neural Lyapunov functions for the stability analysis of switched linear systems: we benchmark them on this paradigmatic problem, which is notoriously difficult (and in general Turing-undecidable), but which admits recently developed techniques and theoretical results. Inspired by switched systems theory, we provide theoretical guarantees on the representative power of neural networks, leveraging recent results from the ML community. We additionally demonstrate experimentally how Neural Lyapunov functions compete with state-of-the-art results and techniques, while leaving a wide margin for improvement, both in theory and in practice. This study intends to improve our understanding of the opportunities and current limitations of neural-based data-driven analysis and control of complex dynamical systems. \ No newline at end of file diff --git a/data/2024/aaai/Stability in Online Coalition Formation b/data/2024/aaai/Stability in Online Coalition Formation new file mode 100644 index 0000000000..d479fa03ab --- /dev/null +++ b/data/2024/aaai/Stability in Online Coalition Formation @@ -0,0 +1 @@ +Coalition formation is concerned with the question of how to partition a set of agents into disjoint coalitions according to their preferences. Deviating from most of the previous work, we consider an online variant of the problem, where agents arrive in sequence and whenever an agent arrives, they have to be assigned to a coalition immediately and irrevocably. The scarce existing literature on online coalition formation has focused on the objective of maximizing social welfare, a demanding requirement, even in the offline setting. Instead, we seek to achieve stable coalition structures in an online setting, and focus on stability concepts based on deviations by single agents. We present a comprehensive picture in additively separable hedonic games, leading to dichotomies, where positive results are obtained by deterministic algorithms and negative results even hold for randomized algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Stability of Multi-Agent Learning in Competitive Networks: Delaying the Onset of Chaos b/data/2024/aaai/Stability of Multi-Agent Learning in Competitive Networks: Delaying the Onset of Chaos new file mode 100644 index 0000000000..2008fc832c --- /dev/null +++ b/data/2024/aaai/Stability of Multi-Agent Learning in Competitive Networks: Delaying the Onset of Chaos @@ -0,0 +1,2 @@ +The behaviour of multi-agent learning in competitive network games is often studied within the context of zero-sum games, in which convergence guarantees may be obtained. However, outside of this class the behaviour of learning is known to display complex behaviours and convergence cannot always be guaranteed. Nonetheless, in order to develop a complete picture of the behaviour of multi-agent learning in competitive settings, the zero-sum assumption must be lifted. +Motivated by this, we study the Q Learning dynamics, a popular model of exploration and exploitation in multi-agent learning, in competitive network games. We determine how the degree of competition, exploration rate and network connectivity impact the convergence of Q Learning.
To study generic competitive games, we parameterise network games in terms of correlations between agent payoffs and study the average behaviour of the Q Learning dynamics across all games drawn from a choice of this parameter. This statistical approach establishes choices of parameters for which the Q Learning dynamics converge to a stable fixed point. In contrast to previous works, we find that the stability of Q Learning is explicitly dependent only on the network connectivity rather than the total number of agents. Our experiments validate these findings and show that, under certain network structures, the total number of agents can be increased without increasing the likelihood of unstable or chaotic behaviours. \ No newline at end of file diff --git a/data/2024/aaai/Stable Model Semantics for Description Logic Terminologies b/data/2024/aaai/Stable Model Semantics for Description Logic Terminologies new file mode 100644 index 0000000000..45a0ac4ba5 --- /dev/null +++ b/data/2024/aaai/Stable Model Semantics for Description Logic Terminologies @@ -0,0 +1 @@ +This paper studies a stable model semantics for Description Logic (DL) knowledge bases (KBs) and for (possibly cyclic) terminologies, ultimately showing that terminologies under the proposed semantics can be equipped with effective reasoning algorithms. The semantics is derived using Quantified Equilibrium Logic, and---in contrast to the usual semantics of DLs based on classical logic---supports default negation and allows one to combine the open-world and the closed-world assumptions in a natural way. Towards understanding the computational properties of this and related formalisms, we show a strong undecidability result that applies not only to KBs under the stable model semantics, but also to the more basic setting of minimal model reasoning. Specifically, we show that concept satisfiability in minimal models of an ALCIO KB is undecidable. We then turn our attention to (possibly cyclic) DL terminologies, where ontological axioms are limited to definitions of concept names in terms of complex concepts. This restriction still yields a very rich setting. We show that standard reasoning problems, like concept satisfiability and subsumption, are ExpTime-complete for terminologies expressed in ALCI under the stable model semantics. \ No newline at end of file diff --git a/data/2024/aaai/Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise b/data/2024/aaai/Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise new file mode 100644 index 0000000000..5aebed87cd --- /dev/null +++ b/data/2024/aaai/Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise @@ -0,0 +1 @@ +The open sourcing of large amounts of image data promotes the development of deep learning techniques. Along with this comes the privacy risk of these image datasets being exploited by unauthorized third parties to train deep learning models for commercial or illegal purposes. To avoid the abuse of data, a poisoning-based technique, "unlearnable example", has been proposed to significantly degrade the generalization performance of models by adding imperceptible noise to the data. To further enhance its robustness against adversarial training, existing works leverage iterative adversarial training on both the defensive noise and the surrogate model.
However, it still remains unknown whether the robustness of unlearnable examples primarily comes from the effect of enhancement in the surrogate model or the defensive noise. Observing that simply removing the adversarial perturbation on the training process of the defensive noise can improve the performance of robust unlearnable examples, we identify that solely the surrogate model's robustness contributes to the performance. Furthermore, we found a negative correlation exists between the robustness of defensive noise and the protection performance, indicating defensive noise's instability issue. Motivated by this, to further boost the robust unlearnable example, we introduce Stable Error-Minimizing noise (SEM), which trains the defensive noise against random perturbation instead of the time-consuming adversarial perturbation to improve the stability of defensive noise. Through comprehensive experiments, we demonstrate that SEM achieves a new state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet Subset regarding both effectiveness and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/Statistical Spatially Inhomogeneous Diffusion Inference b/data/2024/aaai/Statistical Spatially Inhomogeneous Diffusion Inference new file mode 100644 index 0000000000..4d4b51270e --- /dev/null +++ b/data/2024/aaai/Statistical Spatially Inhomogeneous Diffusion Inference @@ -0,0 +1,4 @@ +Inferring a diffusion equation from discretely observed measurements is a statistical challenge of significant importance in a variety of fields, from single-molecule tracking in biophysical systems to modeling financial instruments. +Assuming that the underlying dynamical process obeys a d-dimensional stochastic differential equation of the form dx_t = b(x_t)dt + \Sigma(x_t)dw_t, we propose neural network-based estimators of both the drift b and the spatially-inhomogeneous diffusion tensor D = \Sigma\Sigma^T/2 and provide statistical convergence guarantees when b and D are s-Hölder continuous. +Notably, our bound aligns with the minimax optimal rate N^{-\frac{2s}{2s+d}} for nonparametric function estimation even in the presence of correlation within observational data, which necessitates careful handling when establishing fast-rate generalization bounds. +Our theoretical results are bolstered by numerical experiments demonstrating accurate inference of spatially-inhomogeneous diffusion tensors. \ No newline at end of file diff --git a/data/2024/aaai/Statistically Principled Deep Learning for SAR Image Segmentation b/data/2024/aaai/Statistically Principled Deep Learning for SAR Image Segmentation new file mode 100644 index 0000000000..c4d4438f1b --- /dev/null +++ b/data/2024/aaai/Statistically Principled Deep Learning for SAR Image Segmentation @@ -0,0 +1 @@ +This paper proposes a novel approach for Synthetic Aperture Radar (SAR) image segmentation by incorporating known statistical properties of SAR into deep learning models. We generate synthetic data using the Generalized Gamma distribution, modify the U-Net architecture to encompass statistical moments, and employ stochastic distance losses for improved segmentation performance. Evaluation against traditional methods will reveal the potential of this approach to advance SAR image analysis, with broader applications in environmental monitoring and general image segmentation tasks. 
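To make the statistical grounding of the SAR segmentation abstract above concrete, the following minimal sketch (Python, not the authors' code) draws synthetic SAR-like amplitude patches from SciPy's generalized Gamma distribution and computes the kind of moments a moment-aware segmentation network could consume; the class names and parameter values are illustrative assumptions, not values from the paper.

import numpy as np
from scipy.stats import gengamma

def synth_patch(a, c, scale, size=(64, 64)):
    # Sample a patch whose amplitudes follow a Generalized Gamma distribution.
    return gengamma.rvs(a, c, scale=scale, size=size, random_state=0)

# Two hypothetical land-cover classes with different texture statistics.
water = synth_patch(a=1.5, c=2.0, scale=0.4)
urban = synth_patch(a=3.0, c=1.2, scale=1.0)

# First three central moments, the kind of statistics a moment-aware U-Net could use.
for name, patch in [("water", water), ("urban", urban)]:
    centered = patch - patch.mean()
    print(name, patch.mean(), (centered ** 2).mean(), (centered ** 3).mean())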
\ No newline at end of file diff --git a/data/2024/aaai/Stealthy Adversarial Attacks on Stochastic Multi-Armed Bandits b/data/2024/aaai/Stealthy Adversarial Attacks on Stochastic Multi-Armed Bandits new file mode 100644 index 0000000000..7e18a1add4 --- /dev/null +++ b/data/2024/aaai/Stealthy Adversarial Attacks on Stochastic Multi-Armed Bandits @@ -0,0 +1 @@ +Adversarial attacks against stochastic multi-armed bandit (MAB) algorithms have been extensively studied in the literature. In this work, we focus on reward poisoning attacks and find most existing attacks can be easily detected by our proposed detection method based on the test of homogeneity, due to their aggressive nature in reward manipulations. This motivates us to study the notion of stealthy attack against stochastic MABs and investigate the resulting attackability. Our analysis shows that against two popularly employed MAB algorithms, UCB1 and $\epsilon$-greedy, the success of a stealthy attack depends on the environmental conditions and the realized reward of the arm pulled in the first round. We also analyze the situation for general MAB algorithms equipped with our attack detection method and find that it is possible to have a stealthy attack that almost always succeeds. This brings new insights into the security risks of MAB algorithms. \ No newline at end of file diff --git a/data/2024/aaai/StegFormer: Rebuilding the Glory of Autoencoder-Based Steganography b/data/2024/aaai/StegFormer: Rebuilding the Glory of Autoencoder-Based Steganography new file mode 100644 index 0000000000..1a33c2a0b5 --- /dev/null +++ b/data/2024/aaai/StegFormer: Rebuilding the Glory of Autoencoder-Based Steganography @@ -0,0 +1 @@ +Image hiding aims to conceal one or more secret images within a cover image of the same resolution. Due to strict capacity requirements, image hiding is commonly called large-capacity steganography. In this paper, we propose StegFormer, a novel autoencoder-based image-hiding model. StegFormer can conceal one or multiple secret images within a cover image of the same resolution while preserving the high visual quality of the stego image. In addition, to mitigate the limitations of current steganographic models in real-world scenarios, we propose a normalizing training strategy and a restrict loss to improve the reliability of the steganographic models under realistic conditions. Furthermore, we propose an efficient steganographic capacity expansion method to increase the capacity of steganography and enhance the efficiency of secret communication. Through this approach, we can increase the relative payload of StegFormer to 96 bits per pixel without any training strategy modifications. Experiments demonstrate that our StegFormer outperforms existing state-of-the-art (SOTA) models. In the case of single-image steganography, there is an improvement of more than 3 dB and 5 dB in PSNR for secret/recovery image pairs and cover/stego image pairs. 
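Since StegFormer's quality gains above are quoted in PSNR for secret/recovery and cover/stego pairs, a small illustrative snippet (Python, assuming images stored as float arrays in [0, 1]) shows how that metric is typically computed; it is a generic sketch, not the authors' evaluation code.

import numpy as np

def psnr(reference, distorted, peak=1.0):
    # Peak signal-to-noise ratio in dB for images scaled to [0, peak].
    diff = np.asarray(reference, dtype=np.float64) - np.asarray(distorted, dtype=np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

cover = np.random.rand(256, 256, 3)                                        # stand-in cover image
stego = np.clip(cover + 0.01 * np.random.randn(*cover.shape), 0.0, 1.0)    # stand-in stego image
print(f"cover/stego PSNR: {psnr(cover, stego):.2f} dB")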
\ No newline at end of file diff --git a/data/2024/aaai/Step Vulnerability Guided Mean Fluctuation Adversarial Attack against Conditional Diffusion Models b/data/2024/aaai/Step Vulnerability Guided Mean Fluctuation Adversarial Attack against Conditional Diffusion Models new file mode 100644 index 0000000000..f9a6c5434a --- /dev/null +++ b/data/2024/aaai/Step Vulnerability Guided Mean Fluctuation Adversarial Attack against Conditional Diffusion Models @@ -0,0 +1 @@ +The high-quality generation results of conditional diffusion models have brought about concerns regarding privacy and copyright issues. As a possible technique for preventing the abuse of diffusion models, the adversarial attack against diffusion models has attracted academic attention recently. In this work, utilizing the phenomenon that diffusion models are highly sensitive to the mean value of the input noise, we propose the Mean Fluctuation Attack (MFA) to introduce mean fluctuations by shifting the mean values of the estimated noises during the reverse process. In addition, we reveal that the vulnerability of different reverse steps against adversarial attacks actually varies significantly. By modeling the step vulnerability and using it as guidance to sample the target steps for generating adversarial examples, the effectiveness of adversarial attacks can be substantially enhanced. Extensive experiments show that our algorithm can steadily cause the mean shift of the predicted noises so as to disrupt the entire reverse generation process and degrade the generation results significantly. We also demonstrate that the step vulnerability is intrinsic to the reverse process by verifying its effectiveness in an attack method other than MFA. Code and supplementary material are available at https://github.com/yuhongwei22/MFA \ No newline at end of file diff --git a/data/2024/aaai/Stereo Vision Conversion from Planar Videos Based on Temporal Multiplane Images b/data/2024/aaai/Stereo Vision Conversion from Planar Videos Based on Temporal Multiplane Images new file mode 100644 index 0000000000..3498749f4c --- /dev/null +++ b/data/2024/aaai/Stereo Vision Conversion from Planar Videos Based on Temporal Multiplane Images @@ -0,0 +1 @@ +With the rapid development of 3D movie and light-field displays, there is a growing demand for stereo videos. However, generating high-quality stereo videos from planar videos remains a challenging task. Traditional depth-image-based rendering techniques struggle to effectively handle the problem of occlusion exposure, which occurs when the occluded contents become visible in other views. Recently, the single-view multiplane images (MPI) representation has shown promising performance for planar video stereoscopy. However, the MPI still lacks real details that are occluded in the current frame, resulting in blurry artifacts in occlusion exposure regions. In fact, planar videos can leverage complementary information from adjacent frames to predict a more complete scene representation for the current frame. Therefore, this paper extends the MPI from still frames to the temporal domain, introducing the temporal MPI (TMPI). By extracting complementary information from adjacent frames based on optical flow guidance, obscured regions in the current frame can be effectively repaired. Additionally, a new module called masked optical flow warping (MOFW) is introduced to improve the propagation of pixels along optical flow trajectories.
Experimental results demonstrate that the proposed method can generate high-quality stereoscopic or light-field videos from a single view and reproduce better occluded details than other state-of-the-art (SOTA) methods. https://github.com/Dio3ding/TMPI \ No newline at end of file diff --git a/data/2024/aaai/Sterling: Synergistic Representation Learning on Bipartite Graphs b/data/2024/aaai/Sterling: Synergistic Representation Learning on Bipartite Graphs new file mode 100644 index 0000000000..bc6b04ead9 --- /dev/null +++ b/data/2024/aaai/Sterling: Synergistic Representation Learning on Bipartite Graphs @@ -0,0 +1 @@ +A fundamental challenge of bipartite graph representation learning is how to extract informative node embeddings. Self-Supervised Learning (SSL) is a promising paradigm to address this challenge. Most recent bipartite graph SSL methods are based on contrastive learning, which learns embeddings by discriminating positive and negative node pairs. Contrastive learning usually requires a large number of negative node pairs, which could lead to computational burden and semantic errors. In this paper, we introduce a novel synergistic representation learning model (STERLING) to learn node embeddings without negative node pairs. STERLING preserves the unique local and global synergies in bipartite graphs. The local synergies are captured by maximizing the similarity of the inter-type and intra-type positive node pairs, and the global synergies are captured by maximizing the mutual information of co-clusters. Theoretical analysis demonstrates that STERLING could improve the connectivity between different node types in the embedding space. Extensive empirical evaluation on various benchmark datasets and tasks demonstrates the effectiveness of STERLING for extracting node embeddings. \ No newline at end of file diff --git a/data/2024/aaai/Stitching Segments and Sentences towards Generalization in Video-Text Pre-training b/data/2024/aaai/Stitching Segments and Sentences towards Generalization in Video-Text Pre-training new file mode 100644 index 0000000000..cd18e1640b --- /dev/null +++ b/data/2024/aaai/Stitching Segments and Sentences towards Generalization in Video-Text Pre-training @@ -0,0 +1 @@ +Video-language pre-training models have recently achieved remarkable results on various multi-modal downstream tasks. However, most of these models rely on contrastive learning or masked modeling to align global features across modalities, neglecting the local associations between video frames and text tokens. This limits the model’s ability to perform fine-grained matching and generalization, especially for tasks that require selecting segments in long videos based on query texts. To address this issue, we propose a novel stitching and matching pretext task for video-language pre-training that encourages fine-grained interactions between modalities. Our task involves stitching video frames or sentences into longer sequences and predicting the positions of cross-modal queries in the stitched sequences. The individual frame and sentence representations are thus aligned via the stitching and matching strategy, encouraging fine-grained interactions between videos and texts in the stitched sequences for the cross-modal query. We conduct extensive experiments on various benchmarks covering text-to-video retrieval, video question answering, video captioning, and moment retrieval.
Our results demonstrate that the proposed method significantly improves the generalization capacity of the video-text pre-training models. \ No newline at end of file diff --git a/data/2024/aaai/StockMixer: A Simple Yet Strong MLP-Based Architecture for Stock Price Forecasting b/data/2024/aaai/StockMixer: A Simple Yet Strong MLP-Based Architecture for Stock Price Forecasting new file mode 100644 index 0000000000..15e7933824 --- /dev/null +++ b/data/2024/aaai/StockMixer: A Simple Yet Strong MLP-Based Architecture for Stock Price Forecasting @@ -0,0 +1 @@ +Stock price forecasting is a fundamental yet challenging task in quantitative investment. Various researchers have developed a combination of neural network models (e.g., RNNs, GNNs, Transformers) for capturing complex indicator, temporal and stock correlations of the stock data. While complex architectures are highly expressive, they are often difficult to optimize and their performance is often compromised by the limited stock data. In this paper, we propose a simple MLP-based architecture named StockMixer, which is easy to optimize and enjoys strong predictive performance. StockMixer performs indicator mixing, followed by time mixing, and finally stock mixing. Unlike the standard MLP-based mixing, we devise the time mixing to exchange multi-scale time patch information and realize the stock mixing by exploiting stock-to-market and market-to-stock influences explicitly. Extensive experiments on real stock benchmarks demonstrate that our proposed StockMixer outperforms various state-of-the-art forecasting methods by a notable margin while reducing memory usage and runtime cost. Code is available at https://github.com/SJTU-Quant/StockMixer. \ No newline at end of file diff --git a/data/2024/aaai/Stop! Planner Time: Metareasoning for Probabilistic Planning Using Learned Performance Profiles b/data/2024/aaai/Stop! Planner Time: Metareasoning for Probabilistic Planning Using Learned Performance Profiles new file mode 100644 index 0000000000..4973b1aae2 --- /dev/null +++ b/data/2024/aaai/Stop! Planner Time: Metareasoning for Probabilistic Planning Using Learned Performance Profiles @@ -0,0 +1 @@ +The metareasoning framework aims to enable autonomous agents to factor in planning costs when making decisions. In this work, we develop the first non-myopic metareasoning algorithm for planning with Markov decision processes. Our method learns the behaviour of anytime probabilistic planning algorithms from performance data. Specifically, we propose a novel model for metareasoning, based on contextual performance profiles that predict the value of the planner's current solution given the time spent planning, the state of the planning algorithm's internal parameters, and the difficulty of the planning problem being solved. This model removes the need to assume that the current solution quality is always known, broadening the class of metareasoning problems that can be addressed. We then employ deep reinforcement learning to learn a policy that decides, at each timestep, whether to continue planning or start executing the current plan, and how to set hyperparameters of the planner to enhance its performance. We demonstrate our algorithm's ability to perform effective metareasoning in two domains.
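The contextual performance profile described above can be read as a function mapping problem context and planning time to a predicted solution value. The sketch below (Python) illustrates a deliberately simplified, myopic stop-or-continue rule built on such a profile; the paper itself learns this decision non-myopically with deep reinforcement learning, and the toy profile and cost constant here are illustrative assumptions only.

import math

def toy_profile(context, t):
    # Toy performance profile: solution value saturates with planning time,
    # more slowly for harder problems. Purely illustrative.
    return 1.0 - math.exp(-t / (5.0 * context["difficulty"]))

def should_continue(profile, context, elapsed, step=1.0, time_cost=0.05):
    # Continue planning only while the predicted value gain over the next
    # step outweighs the (assumed) cost of that extra planning time.
    gain = profile(context, elapsed + step) - profile(context, elapsed)
    return gain > time_cost * step

elapsed = 0.0
context = {"difficulty": 2.0}
while should_continue(toy_profile, context, elapsed):
    elapsed += 1.0  # placeholder for running the anytime planner one more step
print(f"stop planning after {elapsed:.0f}s and execute the current plan")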
\ No newline at end of file diff --git a/data/2024/aaai/Strategic Recommendation: Revenue Optimal Matching for Online Platforms (Student Abstract) b/data/2024/aaai/Strategic Recommendation: Revenue Optimal Matching for Online Platforms (Student Abstract) new file mode 100644 index 0000000000..0d254def80 --- /dev/null +++ b/data/2024/aaai/Strategic Recommendation: Revenue Optimal Matching for Online Platforms (Student Abstract) @@ -0,0 +1,3 @@ +We consider a platform in a two-sided market with unit-supply sellers and unit-demand buyers. Each buyer can transact with a subset of sellers it knows off platform and another seller that the platform recommends. Given the choice of sellers, transactions and prices form a competitive equilibrium. The platform selects one seller for each buyer, and charges a fixed percentage of the price on all transactions that it recommends. The platform seeks to maximize total revenue. + +We show that the platform's problem is NP-hard, even when each buyer knows at most two sellers off platform. Finally, when each buyer values all sellers equally and knows only one seller off platform, we provide a polynomial time algorithm that optimally solves the problem. \ No newline at end of file diff --git a/data/2024/aaai/Strategyproof Mechanisms for Group-Fair Obnoxious Facility Location Problems b/data/2024/aaai/Strategyproof Mechanisms for Group-Fair Obnoxious Facility Location Problems new file mode 100644 index 0000000000..dbc65115c5 --- /dev/null +++ b/data/2024/aaai/Strategyproof Mechanisms for Group-Fair Obnoxious Facility Location Problems @@ -0,0 +1 @@ +We study the group-fair obnoxious facility location problems from the mechanism design perspective, where agents belong to different groups and have private location preferences on the undesirable locations of the facility. Our main goal is to design strategyproof mechanisms that elicit the true location preferences from the agents and determine a facility location that approximately optimizes several group-fair objectives. We first consider the maximum total and average group cost (group-fair) objectives. For these objectives, we propose deterministic mechanisms that achieve 3-approximation ratios and provide matching lower bounds. We then provide the characterization of 2-candidate strategyproof randomized mechanisms. Leveraging the characterization, we design randomized mechanisms with improved approximation ratios of 2 for both objectives. We also provide randomized lower bounds of 5/4 for both objectives. Moreover, we investigate intergroup and intragroup fairness (IIF) objectives, addressing fairness between groups and within each group. We present a mechanism that achieves a 4-approximation for the IIF objectives and provide tight lower bounds. \ No newline at end of file diff --git a/data/2024/aaai/Stratified GNN Explanations through Sufficient Expansion b/data/2024/aaai/Stratified GNN Explanations through Sufficient Expansion new file mode 100644 index 0000000000..47d0b65437 --- /dev/null +++ b/data/2024/aaai/Stratified GNN Explanations through Sufficient Expansion @@ -0,0 +1 @@ +Explaining the decisions made by Graph Neural Networks (GNNs) is vital for establishing trust and ensuring fairness in critical applications such as medicine and science. The prevalence of hierarchical structure in real-world graphs/networks raises an important question on GNN interpretability: "On each level of the graph structure, which specific fraction imposes the highest influence over the prediction?"
Currently, the prevailing two categories of methods are incapable of achieving multi-level GNN explanation due to their flat or motif-centric nature. In this work, we formulate the problem of learning multi-level explanations out of GNN models and introduce a stratified explainer module, namely STFExplainer, that utilizes the concept of sufficient expansion to generate explanations on each stratum. Specifically, we learn a higher-level subgraph generator by leveraging both hierarchical structure and GNN-encoded input features. Experiment results on both synthetic and real-world datasets demonstrate the superiority of our stratified explainer on standard interpretability tasks and metrics such as fidelity and explanation recall, with an average improvement of 11% and 8% over the best alternative on each data type. The case study on material domains also confirms the value of our approach through detected multi-level graph patterns accurately reconstructing the knowledge-based ground truth. \ No newline at end of file diff --git a/data/2024/aaai/Strong Baselines for Parameter-Efficient Few-Shot Fine-Tuning b/data/2024/aaai/Strong Baselines for Parameter-Efficient Few-Shot Fine-Tuning new file mode 100644 index 0000000000..ce6bc319a5 --- /dev/null +++ b/data/2024/aaai/Strong Baselines for Parameter-Efficient Few-Shot Fine-Tuning @@ -0,0 +1 @@ +Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase on a set of base classes. Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC. Fine-tuning ViTs, however, is expensive in time, compute and storage. This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters. While these methods have shown promise, inconsistencies in experimental conditions make it difficult to disentangle their advantage from other experimental factors including the feature extractor architecture, pre-trained initialization and fine-tuning algorithm, amongst others. In our paper, we conduct a large-scale, experimentally consistent, empirical analysis to study PEFTs for few-shot image classification. Through a battery of over 1.8k controlled experiments on large-scale few-shot benchmarks including Meta-Dataset and ORBIT, we uncover novel insights on PEFTs that cast light on their efficacy in fine-tuning ViTs for few-shot classification. Through our controlled empirical study, we have two main findings: (i) Fine-tuning just the LayerNorm parameters (which we call LN-Tune) during few-shot adaptation is an extremely strong baseline across ViTs pre-trained with both self-supervised and supervised objectives, (ii) For self-supervised ViTs, we find that simply learning a set of scaling parameters for each attention matrix (which we call Attn-Scale) along with a domain-residual adapter (DRA) module leads to state-of-the-art performance (while being ~9x more parameter-efficient) on Meta-Dataset. Our empirical findings set strong baselines and call for rethinking the current design of PEFT methods for FSC. 
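As an illustration of the LN-Tune baseline mentioned above, the sketch below (PyTorch) freezes every parameter of a pre-trained ViT except the LayerNorm affine parameters; torchvision's vit_b_16 is used only as a convenient stand-in backbone, and in practice a classification head for the novel classes would also be adapted.

import torch.nn as nn
from torchvision.models import vit_b_16

model = vit_b_16(weights="IMAGENET1K_V1")  # stand-in pre-trained ViT backbone

# Freeze everything, then unfreeze only the LayerNorm affine parameters.
for param in model.parameters():
    param.requires_grad = False
for module in model.modules():
    if isinstance(module, nn.LayerNorm):
        for param in module.parameters():
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"tuning {trainable}/{total} parameters ({100 * trainable / total:.3f}%)")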
\ No newline at end of file diff --git a/data/2024/aaai/Stronger and Transferable Node Injection Attacks b/data/2024/aaai/Stronger and Transferable Node Injection Attacks new file mode 100644 index 0000000000..6371a47133 --- /dev/null +++ b/data/2024/aaai/Stronger and Transferable Node Injection Attacks @@ -0,0 +1 @@ +Despite the increasing popularity of graph neural networks (GNNs), the security risks associated with their deployment have not been well explored. Existing works follow the standard adversarial attacks to maximize cross-entropy loss within an L-infinity norm bound. We analyze the robustness of GNNs against node injection attacks (NIAs) in black-box settings by allowing new nodes to be injected and attacked. In this work, we propose to design stronger and transferable NIAs. First, we propose the margin-aware attack (MAA) that uses a maximum margin loss to generate NIAs. We then propose a novel margin- and direction-aware attack (MDA) that diversifies the initial directions of the MAA attack by minimizing the cosine similarity of the injected nodes with respect to their respective random initialization in addition to the maximization of max-margin loss. This makes the NIAs stronger. We further observe that using the L2 norm of gradients in the attack step leads to an enhanced diversity amongst the node features, thereby further enhancing the strength of the attack. We incorporate transferability in NIAs by perturbing the surrogate model before generating the attack. An analysis of the eigenspectrum density of the Hessian of the loss emphasizes that perturbing the weights of the surrogate model improves the transferability. Our experimental results demonstrate that the proposed resilient node injection attack (R-NIA) consistently outperforms PGD by margins of about 7-15% on both large and small graph datasets. R-NIA is significantly stronger and more transferable than existing NIAs on graph robustness benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Structural Entropy Based Graph Structure Learning for Node Classification b/data/2024/aaai/Structural Entropy Based Graph Structure Learning for Node Classification new file mode 100644 index 0000000000..a9f564d1c8 --- /dev/null +++ b/data/2024/aaai/Structural Entropy Based Graph Structure Learning for Node Classification @@ -0,0 +1 @@ +As one of the most common tasks in graph data analysis, node classification is frequently solved by using graph structure learning (GSL) techniques to optimize graph structures and learn suitable graph neural networks. Most of the existing GSL methods focus on fusing different structural features (basic views) extracted from the graph, but very little graph semantics, like hierarchical communities, has been incorporated. Thus, they might be insufficient when dealing with graphs containing noise from real-world complex systems. To address this issue, we propose a novel and effective GSL framework for node classification based on structural information theory. Specifically, we first prove that an encoding tree with the minimal structural entropy could contain sufficient information for node classification and eliminate redundant noise via the graph's hierarchical abstraction. Then, we provide an efficient algorithm for constructing the encoding tree to enhance the basic views. Combining the community influence deduced from the encoding tree and the prediction confidence of each view, we further fuse the enhanced views to generate the optimal structure.
Finally, we conduct extensive experiments on a variety of datasets. The results demonstrate that our method outperforms state-of-the-art competitors in effectiveness and robustness. \ No newline at end of file diff --git a/data/2024/aaai/Structural Information Enhanced Graph Representation for Link Prediction b/data/2024/aaai/Structural Information Enhanced Graph Representation for Link Prediction new file mode 100644 index 0000000000..f0c9721e7f --- /dev/null +++ b/data/2024/aaai/Structural Information Enhanced Graph Representation for Link Prediction @@ -0,0 +1 @@ +Link prediction is a fundamental task of graph machine learning, and Graph Neural Network (GNN) based methods have become the mainstream approach due to their good performance. However, the typical practice learns node representations through neighborhood aggregation, lacking awareness of the structural relationships between target nodes. Recently, some methods have attempted to address this issue by using node labeling tricks. However, they still rely on the node-centric neighborhood message passing of GNNs, which we believe involves two limitations in terms of information perception and transmission for link prediction. First, it cannot perceive long-range structural information due to the restricted receptive fields. Second, a node-centric model may lose information on a link-centric task. In addition, we empirically find that the neighbor node features could introduce noise for link prediction. To address these issues, we propose a structural information enhanced link prediction framework, which removes the neighbor node features while fitting neighborhood graph structures in a more focused manner through the GNN. Furthermore, we introduce the Binary Structural Transformer (BST) to encode the structural relationships between target nodes, compensating for the deficiency of the GNN. Our approach achieves remarkable results on multiple popular benchmarks, including ranking first on ogbl-ppa, ogbl-citation2 and Pubmed. \ No newline at end of file diff --git a/data/2024/aaai/Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception b/data/2024/aaai/Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception new file mode 100644 index 0000000000..757efc3ce9 --- /dev/null +++ b/data/2024/aaai/Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception @@ -0,0 +1 @@ +Understanding vehicles in images is important for various applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception in different tasks and might thus lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates structural information, including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions, for effective masked vehicle appearance reconstruction. To be specific, we explicitly extract the sketch lines of vehicles as a form of the spatial structure to guide vehicle reconstruction.
More comprehensive knowledge, distilled from the large CLIP model based on the similarity between paired/unpaired vehicle image-text samples, is further taken into consideration to help achieve a better understanding of vehicles. A large-scale dataset is built to pre-train our model, termed Autobot1M, which contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE. The source code and pre-trained models will be released at https://github.com/Event-AHU/VehicleMAE. \ No newline at end of file diff --git a/data/2024/aaai/Structurally Guided Task Decomposition in Spatial Navigation Tasks (Student Abstract) b/data/2024/aaai/Structurally Guided Task Decomposition in Spatial Navigation Tasks (Student Abstract) new file mode 100644 index 0000000000..5ef01ef3f5 --- /dev/null +++ b/data/2024/aaai/Structurally Guided Task Decomposition in Spatial Navigation Tasks (Student Abstract) @@ -0,0 +1 @@ +How are people able to plan so efficiently despite limited cognitive resources? We aimed to answer this question by extending an existing model of human task decomposition that can explain a wide range of simple planning problems by adding structure information to the task to facilitate planning in more complex tasks. The extended model was then applied to the more complex planning domain of spatial navigation. Our results suggest that our framework can correctly predict the navigation strategies of the majority of the participants in an online experiment. \ No newline at end of file diff --git a/data/2024/aaai/Structure-Aware Multimodal Sequential Learning for Visual Dialog b/data/2024/aaai/Structure-Aware Multimodal Sequential Learning for Visual Dialog new file mode 100644 index 0000000000..7057baf7b9 --- /dev/null +++ b/data/2024/aaai/Structure-Aware Multimodal Sequential Learning for Visual Dialog @@ -0,0 +1 @@ +With the ability to collect vast amounts of image and natural language data from the web, there has been a remarkable advancement in Large-scale Language Models (LLMs). This progress has led to the emergence of chatbots and dialogue systems capable of fluent conversations with humans. As the variety of devices enabling interactions between humans and agents expands, and the performance of text-based dialogue systems improves, research on visual dialog has recently been proposed. However, visual dialog requires understanding sequences of pairs consisting of images and sentences, making it challenging to gather sufficient data for training large-scale models from the web. In this paper, we propose a new multimodal learning method leveraging existing large-scale models designed for each modality, to enable model training for visual dialog with small visual dialog datasets. The key ideas of our approach are: 1) storing the history or context during the progression of visual dialog in the form of spatiotemporal graphs, and 2) introducing small modulation blocks between modality-specific models and the graphs to align the semantic spaces. For implementation, we introduce a novel structure-aware cross-attention method, which retrieves relevant image and text knowledge for utterance generation from the pretrained models. For experiments, we achieved a new state-of-the-art performance on three visual dialog datasets, including the most challenging one, COMET.
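A minimal sketch of the retrieval step behind the structure-aware cross-attention described above (PyTorch): an utterance query attends over node embeddings of a spatiotemporal dialog-history graph. The dimensions and random tensors are illustrative placeholders rather than the paper's actual module.

import torch
import torch.nn as nn

d_model, num_nodes = 256, 12
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

query = torch.randn(1, 1, d_model)                 # current utterance representation
graph_nodes = torch.randn(1, num_nodes, d_model)   # dialog-history graph node embeddings

# The query attends over the graph nodes; the attended summary would be fed
# to the decoder for utterance generation.
retrieved, attn_weights = cross_attn(query, graph_nodes, graph_nodes)
print(retrieved.shape, attn_weights.shape)         # (1, 1, 256) and (1, 1, 12)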
\ No newline at end of file diff --git a/data/2024/aaai/Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations b/data/2024/aaai/Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations new file mode 100644 index 0000000000..c80dc1b888 --- /dev/null +++ b/data/2024/aaai/Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations @@ -0,0 +1 @@ +Large-scale vision-language pre-training has achieved significant performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. The models cannot make a distinction between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning multi-modal representations. In this paper, we present an end-to-end framework Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. Firstly, we use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations. Moreover, a Knowledge-Enhance Encoder (KEE) is proposed to leverage SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on VG-Attribution and VG-Relation datasets, with 12.5% and 4.1% ahead of the multi-modal SOTA model respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances the structured representations while maintaining the ability of general representations. Our code is available at https://github.com/zjukg/Structure-CLIP. \ No newline at end of file diff --git a/data/2024/aaai/Students' Perceptions and Preferences of Generative Artificial Intelligence Feedback for Programming b/data/2024/aaai/Students' Perceptions and Preferences of Generative Artificial Intelligence Feedback for Programming new file mode 100644 index 0000000000..3fd23abbe7 --- /dev/null +++ b/data/2024/aaai/Students' Perceptions and Preferences of Generative Artificial Intelligence Feedback for Programming @@ -0,0 +1 @@ +The rapid evolution of artificial intelligence (AI), specifically large language models (LLMs), has opened opportunities for various educational applications. This paper explored the feasibility of utilizing ChatGPT, one of the most popular LLMs, for automating feedback for Java programming assignments in an introductory computer science (CS1) class. Specifically, this study focused on three questions: 1) To what extent do students view LLM-generated feedback as formative? 2) How do students see the comparative affordances of feedback prompts that include their code, vs. those that exclude it? 3) What enhancements do students suggest for improving LLM-generated feedback? To address these questions, we generated automated feedback using the ChatGPT API for four lab assignments in a CS1 class. The survey results revealed that students perceived the feedback as aligning well with formative feedback guidelines established by Shute. 
Additionally, students showed a clear preference for feedback generated by including the students' code as part of the LLM prompt, and our thematic study indicated that the preference was mainly attributed to the specificity, clarity, and corrective nature of the feedback. Moreover, this study found that students generally expected specific and corrective feedback with sufficient code examples, but had divergent opinions on the tone of the feedback. This study demonstrated that ChatGPT could generate Java programming assignment feedback that students perceived as formative. It also offered insights into the specific improvements that would make the ChatGPT-generated feedback useful for students. \ No newline at end of file diff --git a/data/2024/aaai/Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style b/data/2024/aaai/Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style new file mode 100644 index 0000000000..a0047f4a8f --- /dev/null +++ b/data/2024/aaai/Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style @@ -0,0 +1 @@ +Although automatically animating audio-driven talking heads has recently received growing interest, previous efforts have mainly concentrated on achieving lip synchronization with the audio, neglecting two crucial elements for generating expressive videos: emotion style and art style. In this paper, we present an innovative audio-driven talking face generation method called Style2Talker. It involves two stylized stages, namely Style-E and Style-A, which integrate text-controlled emotion style and picture-controlled art style into the final output. In order to prepare the scarce emotional text descriptions corresponding to the videos, we propose a labor-free paradigm that employs large-scale pretrained models to automatically annotate emotional text labels for existing audio-visual datasets. Incorporating the synthetic emotion texts, the Style-E stage utilizes a large-scale CLIP model to extract emotion representations, which are combined with the audio, serving as the condition for an efficient latent diffusion model designed to produce emotional motion coefficients of a 3DMM model. Moving on to the Style-A stage, we develop a coefficient-driven motion generator and an art-specific style path embedded in the well-known StyleGAN. This allows us to synthesize high-resolution artistically stylized talking head videos using the generated emotional motion coefficients and an art style source picture. Moreover, to better preserve image details and avoid artifacts, we provide StyleGAN with the multi-scale content features extracted from the identity image and refine its intermediate feature maps by the designed content encoder and refinement network, respectively. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art methods in terms of audio-lip synchronization and performance of both emotion style and art style.
\ No newline at end of file diff --git a/data/2024/aaai/StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis b/data/2024/aaai/StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis new file mode 100644 index 0000000000..7edaf1c6ba --- /dev/null +++ b/data/2024/aaai/StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis @@ -0,0 +1 @@ +Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, the endeavor to model the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expressiveness. Moreover, existing SVS methods encounter a decline in the quality of synthesized singing voices in OOD scenarios, as they rest upon the assumption that the target vocal attributes are discernible during the training phase. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches for enhanced effectiveness: 1) the Residual Style Adaptor (RSA) which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN) to perturb the style attributes within the content representation during the training phase and thus improve the model generalization. Our extensive evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Access to singing voice samples can be found at https://stylesinger.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/Submodel Enumeration for CTL Is Hard b/data/2024/aaai/Submodel Enumeration for CTL Is Hard new file mode 100644 index 0000000000..0607d0ff65 --- /dev/null +++ b/data/2024/aaai/Submodel Enumeration for CTL Is Hard @@ -0,0 +1 @@ +Expressing system specifications using Computation Tree Logic (CTL) formulas, formalising programs using Kripke structures, and then model checking the system is an established workflow in program verification and has wide applications in AI. In this paper, we consider the task of model enumeration, which asks for a uniform stream of output systems that satisfy the given specification. We show that, given a CTL formula and a system (potentially falsified by the formula), enumerating satisfying submodels is always hard for CTL--regardless of which subset of CTL-operators is considered. As a silver lining on the horizon, we present fragments via restrictions on the allowed Boolean functions that still allow for fast enumeration. \ No newline at end of file diff --git a/data/2024/aaai/Successive POI Recommendation via Brain-Inspired Spatiotemporal Aware Representation b/data/2024/aaai/Successive POI Recommendation via Brain-Inspired Spatiotemporal Aware Representation new file mode 100644 index 0000000000..147672c5a6 --- /dev/null +++ b/data/2024/aaai/Successive POI Recommendation via Brain-Inspired Spatiotemporal Aware Representation @@ -0,0 +1 @@ +Existing approaches usually perform spatiotemporal representation in the spatial and temporal dimensions, respectively, which isolates the spatial and temporal natures of the target and leads to sub-optimal embeddings. 
Neuroscience research has shown that the mammalian brain entorhinal-hippocampal system provides efficient graph representations for general knowledge. Moreover, entorhinal grid cells present concise spatial representations, while hippocampal place cells represent perception conjunctions effectively. Thus, the entorhinal-hippocampal system provides a novel angle for spatiotemporal representation, which inspires us to propose the SpatioTemporal aware Embedding framework (STE) and apply it to POIs (STEP). STEP considers two types of POI-specific representations: sequential representation and spatiotemporal conjunctive representation, learned using sparse unlabeled data based on the proposed graph-building policies. Notably, STEP jointly represents the spatiotemporal natures of POIs using both observations and contextual information from integrated spatiotemporal dimensions by constructing a spatiotemporal context graph. Furthermore, we introduce a successive POI recommendation method using STEP, which achieves state-of-the-art performance on two benchmarks. In addition, we demonstrate the excellent performance of the STE representation approach in other spatiotemporal representation-centered tasks through a case study of the traffic flow prediction problem. Therefore, this work provides a novel solution to spatiotemporal representation and paves a new way for spatiotemporal modeling-related tasks. \ No newline at end of file diff --git a/data/2024/aaai/Summarizing Stream Data for Memory-Constrained Online Continual Learning b/data/2024/aaai/Summarizing Stream Data for Memory-Constrained Online Continual Learning new file mode 100644 index 0000000000..a093fd6109 --- /dev/null +++ b/data/2024/aaai/Summarizing Stream Data for Memory-Constrained Online Continual Learning @@ -0,0 +1 @@ +Replay-based methods have proved their effectiveness on online continual learning by rehearsing past samples from an auxiliary memory. With many efforts made on improving training schemes based on the memory, however, the information carried by each sample in the memory remains under-investigated. Under circumstances with restricted storage space, the informativeness of the memory becomes critical for effective replay. Although some works design specific strategies to select representative samples, by only employing a small number of original images, the storage space is still not well utilized. To this end, we propose to Summarize the knowledge from the Stream Data (SSD) into more informative samples by distilling the training characteristics of real images. Through maintaining the consistency of training gradients and relationship to the past tasks, the summarized samples are more representative for the stream data compared to the original images. Extensive experiments are conducted on multiple online continual learning benchmarks to support that the proposed SSD method significantly enhances the replay effects. We demonstrate that with limited extra computational overhead, SSD provides more than 3% accuracy boost for sequential CIFAR-100 under extremely restricted memory buffer. Code in https://github.com/vimar-gu/SSD. 
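The gradient-based summarization idea above can be illustrated with a small PyTorch sketch: synthetic memory samples are optimized so the training gradient they induce matches that of an incoming real batch. The tiny linear model, batch sizes, and hyperparameters are illustrative stand-ins, not the SSD implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # tiny stand-in network
criterion = nn.CrossEntropyLoss()

real_x, real_y = torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))
syn_x = torch.randn(10, 3, 32, 32, requires_grad=True)           # summarized memory samples
syn_y = torch.arange(10)
optimizer = torch.optim.SGD([syn_x], lr=0.1)

# Gradient induced by the real stream batch, treated as a fixed target.
real_grads = [g.detach() for g in torch.autograd.grad(
    criterion(model(real_x), real_y), model.parameters())]

for _ in range(10):
    syn_grads = torch.autograd.grad(
        criterion(model(syn_x), syn_y), model.parameters(), create_graph=True)
    match_loss = sum(
        1.0 - F.cosine_similarity(sg.flatten(), rg.flatten(), dim=0)
        for sg, rg in zip(syn_grads, real_grads))
    optimizer.zero_grad()
    match_loss.backward()
    optimizer.step()
print(float(match_loss))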
\ No newline at end of file diff --git a/data/2024/aaai/Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection b/data/2024/aaai/Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection new file mode 100644 index 0000000000..de898423c4 --- /dev/null +++ b/data/2024/aaai/Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection @@ -0,0 +1 @@ +LiDAR-based 3D object detection models inevitably struggle under rainy conditions due to the degraded and noisy scanning signals. Previous research has attempted to address this by simulating the noise from rain to improve the robustness of detection models. However, significant disparities exist between simulated and actual rain-impacted data points. In this work, we propose a novel rain simulation method, termed DRET, that unifies Dynamics and Rainy Environment Theory to provide a cost-effective means of expanding the available realistic rain data for 3D detection training. Furthermore, we present a Sunny-to-Rainy Knowledge Distillation (SRKD) approach to enhance 3D detection under rainy conditions. Extensive experiments on the Waymo Open Dataset show that, when combined with the state-of-the-art DSVT model and other classical 3D detectors, our proposed framework demonstrates significant detection accuracy improvements, without losing efficiency. Remarkably, our framework also improves detection capabilities under sunny conditions, thereby offering a robust solution for 3D detection regardless of whether the weather is rainy or sunny. \ No newline at end of file diff --git a/data/2024/aaai/SuperJunction: Learning-Based Junction Detection for Retinal Image Registration b/data/2024/aaai/SuperJunction: Learning-Based Junction Detection for Retinal Image Registration new file mode 100644 index 0000000000..d3671c5e39 --- /dev/null +++ b/data/2024/aaai/SuperJunction: Learning-Based Junction Detection for Retinal Image Registration @@ -0,0 +1 @@ +Keypoint-based approaches have been shown to be promising for retinal image registration, which superimpose two or more images from different views based on keypoint detection and description. However, existing approaches suffer from ineffective keypoint detector and descriptor training. Meanwhile, the non-linear mapping from 3D retinal structure to 2D images is often neglected. In this paper, we propose a novel learning-based junction detection approach for retinal image registration, which enhances both the keypoint detector and descriptor training. To improve the keypoint detection, it uses multi-task vessel detection to regularize the model training, which helps to learn more representative features and reduce the risk of over-fitting. To achieve effective training for keypoint description, a new constrained negative sampling approach is proposed to compute the descriptor loss. Moreover, we also consider the non-linearity between retinal images from different views during matching. Experimental results on the FIRE dataset show that our method achieves a mean area under curve of 0.850, which is 12.6% higher than the 0.755 achieved by the state-of-the-art method. All code is available at https://github.com/samjcheng/SuperJunction. 
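The SuperJunction abstract above does not spell out how its constrained negative sampling works. The sketch below shows one plausible instantiation, where negatives for the descriptor loss are restricted to keypoints lying farther than a chosen pixel radius from the true match; the function name, the spatial-distance constraint, and the triplet-style hinge are all assumptions for illustration, not the authors' formulation:

```python
import torch
import torch.nn.functional as F

def constrained_descriptor_loss(desc_a, desc_b, kpts_b, margin=1.0, min_dist=32.0):
    """Triplet-style descriptor loss where, for each matched pair
    (desc_a[i], desc_b[i]), negatives may only come from keypoints of image B
    lying farther than `min_dist` pixels from the true match.

    desc_a, desc_b: (N, D) descriptors of matched keypoints; kpts_b: (N, 2) pixels.
    """
    pos = (desc_a - desc_b).pow(2).sum(dim=1)            # (N,) positive distances

    ddist = torch.cdist(desc_a, desc_b)                  # (N, N) descriptor distances
    sdist = torch.cdist(kpts_b, kpts_b)                  # (N, N) pixel distances

    # Forbid the true match and any spatially close keypoint as a negative.
    ddist = ddist.masked_fill(sdist < min_dist, float("inf"))
    hardest_neg = ddist.min(dim=1).values.pow(2)         # (N,) hardest allowed negative

    return F.relu(margin + pos - hardest_neg).mean()
```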
\ No newline at end of file diff --git a/data/2024/aaai/Superposed Atomic Representation for Robust High-Dimensional Data Recovery of Multiple Low-Dimensional Structures b/data/2024/aaai/Superposed Atomic Representation for Robust High-Dimensional Data Recovery of Multiple Low-Dimensional Structures new file mode 100644 index 0000000000..9f8d4fafdb --- /dev/null +++ b/data/2024/aaai/Superposed Atomic Representation for Robust High-Dimensional Data Recovery of Multiple Low-Dimensional Structures @@ -0,0 +1 @@ +This paper proposes a unified Superposed Atomic Representation (SAR) framework for high-dimensional data recovery with multiple low-dimensional structures. The data can be in various forms ranging from vectors to tensors. The goal of SAR is to recover different components from their sum, where each component has a low-dimensional structure, such as sparsity, low-rankness, or lying in a low-dimensional subspace. Examples of SAR include, but are not limited to, Robust Sparse Representation (RSR), Robust Principal Component Analysis (RPCA), Tensor RPCA (TRPCA), and Outlier Pursuit (OP). We establish a theoretical guarantee for SAR. To further improve SAR, we also develop a Weighted SAR (WSAR) framework by paying more attention to the significant atoms of each component and penalizing them less. An effective optimization algorithm is devised for WSAR and the convergence of the algorithm is rigorously proved. By leveraging WSAR as a general platform, several new methods are proposed for high-dimensional data recovery. Experiments on real data demonstrate the superiority of WSAR for various data recovery problems. \ No newline at end of file diff --git a/data/2024/aaai/Supervision Interpolation via LossMix: Generalizing Mixup for Object Detection and Beyond b/data/2024/aaai/Supervision Interpolation via LossMix: Generalizing Mixup for Object Detection and Beyond new file mode 100644 index 0000000000..f1ed16fedf --- /dev/null +++ b/data/2024/aaai/Supervision Interpolation via LossMix: Generalizing Mixup for Object Detection and Beyond @@ -0,0 +1 @@ +The success of data mixing augmentations in image classification tasks has been widely recognized. However, these techniques cannot be readily applied to object detection due to challenges such as spatial misalignment, foreground/background distinction, and plurality of instances. To tackle these issues, we first introduce a novel conceptual framework called Supervision Interpolation (SI), which offers a fresh perspective on interpolation-based augmentations by relaxing and generalizing Mixup. Based on SI, we propose LossMix, a simple yet versatile and effective regularization that enhances the performance and robustness of object detectors and more. Our key insight is that we can effectively regularize the training on mixed data by interpolating their loss errors instead of ground truth labels. Empirical results on the PASCAL VOC and MS COCO datasets demonstrate that LossMix can consistently outperform state-of-the-art methods widely adopted for detection. Furthermore, by jointly leveraging LossMix with unsupervised domain adaptation, we successfully improve existing approaches and set a new state of the art for cross-domain object detection. 
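The core idea stated in the LossMix abstract above, regularizing training on mixed inputs by interpolating loss errors rather than labels, can be sketched as follows. The classification form is shown for brevity, and the Beta-sampled mixing ratio is an assumption carried over from standard Mixup practice:

```python
import torch
import torch.nn.functional as F

def lossmix_step(model, x_a, y_a, x_b, y_b, alpha=1.0):
    """Supervision Interpolation in its simplest form: mix the inputs, then
    interpolate the loss errors of the two original targets instead of
    mixing the labels themselves."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_a + (1.0 - lam) * x_b            # standard Mixup on the inputs
    logits = model(x_mix)

    # Interpolate the per-target losses rather than the ground-truth labels.
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
```

For classification with cross-entropy this coincides with ordinary Mixup; the abstract's point is that the loss-interpolation view also carries over to detection losses, where mixing ground-truth labels is ill-defined.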
\ No newline at end of file diff --git a/data/2024/aaai/Supporting Upper Elementary Students in Learning AI Concepts with Story-Driven Game-Based Learning b/data/2024/aaai/Supporting Upper Elementary Students in Learning AI Concepts with Story-Driven Game-Based Learning new file mode 100644 index 0000000000..309ceff8f1 --- /dev/null +++ b/data/2024/aaai/Supporting Upper Elementary Students in Learning AI Concepts with Story-Driven Game-Based Learning @@ -0,0 +1 @@ +Artificial intelligence (AI) is quickly finding broad application in every sector of society. This rapid expansion of AI has increased the need to cultivate an AI-literate workforce, and it calls for introducing AI education into K-12 classrooms to foster students’ awareness and interest in AI. With rich narratives and opportunities for situated problem solving, story-driven game-based learning offers a promising approach for creating engaging and effective K-12 AI learning experiences. In this paper, we present our ongoing work to iteratively design, develop, and evaluate a story-driven game-based learning environment focused on AI education for upper elementary students (ages 8 to 11). The game features a science inquiry problem centering on an endangered species and incorporates a Use-Modify-Create scaffolding framework to promote student learning. We present findings from an analysis of data collected from 16 students playing the game's quest focused on AI planning. Results suggest that the scaffolding framework provided students with the knowledge they needed to advance through the quest and that overall, students experienced positive learning outcomes. \ No newline at end of file diff --git a/data/2024/aaai/Suppressing Uncertainty in Gaze Estimation b/data/2024/aaai/Suppressing Uncertainty in Gaze Estimation new file mode 100644 index 0000000000..c5587a175a --- /dev/null +++ b/data/2024/aaai/Suppressing Uncertainty in Gaze Estimation @@ -0,0 +1 @@ +Uncertainty in gaze estimation manifests in two aspects: 1) low-quality images caused by occlusion, blurriness, inconsistent eye movements, or even non-face images; 2) uncorrected labels resulting from the misalignment between the labeled and actual gaze points during the annotation process. Allowing these uncertainties to participate in training hinders the improvement of gaze estimation. To tackle these challenges, in this paper, we propose an effective solution, named Suppressing Uncertainty in Gaze Estimation (SUGE), which introduces a novel triplet-label consistency measurement to estimate and reduce the uncertainties. Specifically, for each training sample, we propose to estimate a novel ``neighboring label'' calculated by a linearly weighted projection from the neighbors to capture the similarity relationship between image features and their corresponding labels, which can be incorporated with the predicted pseudo label and ground-truth label for uncertainty estimation. By modeling such triplet-label consistency, we can largely reduce the negative effects of unqualified images and wrong labels through our designed sample weighting and label correction strategies. Experimental results on the gaze estimation benchmarks indicate that our proposed SUGE achieves state-of-the-art performance. 
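As a rough illustration of the "neighboring label" used by SUGE above, the sketch below averages the gaze labels of the k nearest neighbors in feature space. The paper describes a linearly weighted projection from the neighbors; the softmax-over-cosine-similarity weighting, k, and temperature here are stand-in assumptions, not the authors' exact construction:

```python
import torch
import torch.nn.functional as F

def neighboring_labels(features, labels, k=8, tau=0.1):
    """For every sample, build a 'neighboring label' as a similarity-weighted
    combination of the gaze labels of its k nearest neighbors in feature space.

    features: (N, D) image features; labels: (N, 2) gaze angles (yaw, pitch).
    Returns an (N, 2) tensor of neighboring labels.
    """
    f = F.normalize(features, dim=1)
    sim = f @ f.t()                                    # (N, N) cosine similarities
    sim.fill_diagonal_(float("-inf"))                  # a sample is not its own neighbor
    topk_sim, topk_idx = sim.topk(k, dim=1)            # (N, k)
    weights = torch.softmax(topk_sim / tau, dim=1)     # weights over the k neighbors
    return torch.einsum("nk,nkd->nd", weights, labels[topk_idx])
```

In SUGE the triplet-label consistency between this neighboring label, the model's pseudo label, and the ground truth is then used for sample weighting and label correction; that part is not sketched here.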
\ No newline at end of file diff --git a/data/2024/aaai/SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation b/data/2024/aaai/SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation new file mode 100644 index 0000000000..f375a67ffc --- /dev/null +++ b/data/2024/aaai/SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation @@ -0,0 +1 @@ +The Segment Anything Model (SAM) is a powerful foundation model that has revolutionised image segmentation. To apply SAM to surgical instrument segmentation, a common approach is to locate precise points or boxes of instruments and then use them as prompts for SAM in a zero-shot manner. However, we observe two problems with this naive pipeline: (1) the domain gap between natural objects and surgical instruments leads to inferior generalisation of SAM; and (2) SAM relies on precise point or box locations for accurate segmentation, requiring either extensive manual guidance or a well-performing specialist detector for prompt preparation, which leads to a complex multi-stage pipeline. To address these problems, we introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to effectively integrate surgical-specific information with SAM’s pre-trained knowledge for improved generalisation. Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which directly generates prompt embeddings from class prototypes and eliminates the use of explicit prompts for improved robustness and a simpler pipeline. In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning, further enhancing the discrimination of the class prototypes for more accurate class prompting. The results of extensive experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that SurgicalSAM achieves state-of-the-art performance while only requiring a small number of tunable parameters. The source code is available at https://github.com/wenxi-yue/SurgicalSAM. \ No newline at end of file diff --git a/data/2024/aaai/Sustainability of Data Center Digital Twins with Reinforcement Learning b/data/2024/aaai/Sustainability of Data Center Digital Twins with Reinforcement Learning new file mode 100644 index 0000000000..829cd27b54 --- /dev/null +++ b/data/2024/aaai/Sustainability of Data Center Digital Twins with Reinforcement Learning @@ -0,0 +1 @@ +In recent years, the increasing emphasis on sustainability and carbon footprint reduction has required the exploration of innovative optimization techniques for data center operators. In this paper, we introduce a Concurrent Carbon Footprint Reduction (C2FR) Reinforcement Learning framework, designed to optimize data center energy consumption, load shifting, and battery operation decisions in real time. The C2FR framework utilizes short-term forecasts and incorporates Reinforcement Learning Energy ($A_{E}$), Battery ($A_{BAT}$) and Load-Shifting ($A_{LS}$) agents to optimize and effectively manage the intricate dependencies and information exchange between these individual optimization strategies, thus overcoming the limitations of existing isolated approaches. When compared to state-of-the-art algorithms, the C2FR framework demonstrates its effectiveness across various data center scenarios. The AE agent achieves a 7.9% reduction in pollutant emissions and a 7.8% reduction in energy cost on average. 
Moreover, the C2FR framework enables further emission reductions through the application of the battery and load-shifting optimization, leading to a total reduction of 10.17% in pollutant emissions on average over different data center configurations. This highlights the potential of the C2FR framework in addressing data center sustainability challenges and improving real-time carbon footprint optimization. \ No newline at end of file diff --git a/data/2024/aaai/Swift-Mapping: Online Neural Implicit Dense Mapping in Urban Scenes b/data/2024/aaai/Swift-Mapping: Online Neural Implicit Dense Mapping in Urban Scenes new file mode 100644 index 0000000000..a58d7cb030 --- /dev/null +++ b/data/2024/aaai/Swift-Mapping: Online Neural Implicit Dense Mapping in Urban Scenes @@ -0,0 +1 @@ +Online dense mapping of urban scenes is of paramount importance for scene understanding in autonomous navigation. Traditional online dense mapping methods fuse sensor measurements (vision, lidar, etc.) across time and space via explicit geometric correspondence. Recently, NeRF-based methods have proved the superiority of neural implicit representations by high-fidelity reconstruction of large-scale city scenes. However, it remains an open problem how to integrate powerful neural implicit representations into online dense mapping. Existing methods are restricted to constrained indoor environments and are too computationally expensive to meet online requirements. To this end, we propose Swift-Mapping, an online neural implicit dense mapping framework in urban scenes. We introduce a novel neural implicit octomap (NIO) structure that provides efficient neural representation for large and dynamic urban scenes while retaining online update capability. Based on that, we propose an online neural dense mapping framework that effectively manages and updates neural octree voxel features. Our approach achieves SOTA reconstruction accuracy while being more than 10x faster in reconstruction speed, demonstrating the superior performance of our method in both accuracy and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/SwiftPillars: High-Efficiency Pillar Encoder for Lidar-Based 3D Detection b/data/2024/aaai/SwiftPillars: High-Efficiency Pillar Encoder for Lidar-Based 3D Detection new file mode 100644 index 0000000000..0a7c5e7322 --- /dev/null +++ b/data/2024/aaai/SwiftPillars: High-Efficiency Pillar Encoder for Lidar-Based 3D Detection @@ -0,0 +1 @@ +Lidar-based 3D Detection is one of the significant components of Autonomous Driving. However, current methods over-focus on improving the performance of 3D Lidar perception, which causes network architectures to become complicated and hard to deploy. Thus, such methods are difficult to apply to real-time processing in Autonomous Driving. In this paper, we propose a high-efficiency network, SwiftPillars, which includes a Swift Pillar Encoder (SPE) and a Multi-scale Aggregation Decoder (MAD). The SPE is constructed from a concise Dual-attention Module with lightweight operators. The Dual-attention Module utilizes feature pooling, matrix multiplication, etc. to speed up point-wise and channel-wise attention extraction and fusion. The MAD interconnects multiple scale features extracted by SPE with minimal computational cost to boost performance. In our experiments, our proposal accomplishes 61.3% NDS and 53.2% mAP on the nuScenes dataset. 
In addition, we evaluate inference time on several platforms (P4, T4, A2, MLU370, RTX3080), where SwiftPillars achieves up to 13.3ms (75FPS) on NVIDIA Tesla T4. Compared with PointPillars, SwiftPillars is on average 26.58% faster in inference speed with equivalent GPUs and a higher mAP of approximately 3.2% in the nuScenes dataset. \ No newline at end of file diff --git a/data/2024/aaai/SwitchTab: Switched Autoencoders Are Effective Tabular Learners b/data/2024/aaai/SwitchTab: Switched Autoencoders Are Effective Tabular Learners new file mode 100644 index 0000000000..984388597d --- /dev/null +++ b/data/2024/aaai/SwitchTab: Switched Autoencoders Are Effective Tabular Learners @@ -0,0 +1 @@ +Self-supervised representation learning methods have achieved significant success in computer vision and natural language processing (NLP), where data samples exhibit explicit spatial or semantic dependencies. However, applying these methods to tabular data is challenging due to the less pronounced dependencies among data samples. In this paper, we address this limitation by introducing SwitchTab, a novel self-supervised method specifically designed to capture latent dependencies in tabular data. SwitchTab leverages an asymmetric encoder-decoder framework to decouple mutual and salient features among data pairs, resulting in more representative embeddings. These embeddings, in turn, contribute to better decision boundaries and lead to improved results in downstream tasks. To validate the effectiveness of SwitchTab, we conduct extensive experiments across various domains involving tabular data. The results showcase superior performance in end-to-end prediction tasks with fine-tuning. Moreover, we demonstrate that pre-trained salient embeddings can be utilized as plug-and-play features to enhance the performance of various traditional classification methods (e.g., Logistic Regression, XGBoost, etc.). Lastly, we highlight the capability of SwitchTab to create explainable representations through visualization of decoupled mutual and salient features in the latent space. \ No newline at end of file diff --git a/data/2024/aaai/SyFormer: Structure-Guided Synergism Transformer for Large-Portion Image Inpainting b/data/2024/aaai/SyFormer: Structure-Guided Synergism Transformer for Large-Portion Image Inpainting new file mode 100644 index 0000000000..e50b2197b0 --- /dev/null +++ b/data/2024/aaai/SyFormer: Structure-Guided Synergism Transformer for Large-Portion Image Inpainting @@ -0,0 +1 @@ +Image inpainting is in full bloom accompanied by the progress of convolutional neural networks (CNNs) and transformers, revolutionizing the practical management of abnormity disposal, image editing, etc. However, due to the ever-mounting image resolutions and missing areas, the challenges of distorted long-range dependencies from cluttered background distributions and reduced reference information in image domain inevitably rise, which further cause severe performance degradation. To address the challenges, we propose a novel large-portion image inpainting approach, namely the Structure-Guided Synergism Transformer (SyFormer), to rectify the discrepancies in feature representation and enrich the structural cues from limited reference. Specifically, we devise a dual-routing filtering module that employs a progressive filtering strategy to eliminate invalid noise interference and establish global-level texture correlations. 
Simultaneously, the structurally compact perception module maps an affinity matrix within the introduced structural priors from a structure-aware generator, assisting in matching and filling the corresponding patches of images with large damaged portions. Moreover, we carefully assemble the aforementioned modules to achieve feature complementarity. Finally, a feature decoding alignment scheme is introduced in the decoding process, which meticulously achieves texture amalgamation across hierarchical features. Extensive experiments are conducted on two publicly available datasets, i.e., CelebA-HQ and Places2, to qualitatively and quantitatively demonstrate the superiority of our model over state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Symbol Description Reading b/data/2024/aaai/Symbol Description Reading new file mode 100644 index 0000000000..40738d4ffc --- /dev/null +++ b/data/2024/aaai/Symbol Description Reading @@ -0,0 +1 @@ +Mathematical formulas give concise representations of a document's key ideas in many natural sciences and engineering domains. The symbols that make up formulas carry semantic meaning that may differ by document or equation. What does ? mean in a given paper? Interpreting the symbols that comprise formulas requires identifying descriptions from the surrounding text. We approach this task of symbol description reading as an application of current AI technologies targeting the tuning of large language models for particular domains and automation of machine learning. Our pipeline integrates AI question answering and natural language processing to read symbol descriptions. We consider extractive and generative AI model variations and apply our pipeline on two example tasks of symbol description reading. Promising results provide motivation for wider deployment, for which we describe a microservice architecture and related challenges. \ No newline at end of file diff --git a/data/2024/aaai/Symbolic Cognitive Diagnosis via Hybrid Optimization for Intelligent Education Systems b/data/2024/aaai/Symbolic Cognitive Diagnosis via Hybrid Optimization for Intelligent Education Systems new file mode 100644 index 0000000000..61d535c5d9 --- /dev/null +++ b/data/2024/aaai/Symbolic Cognitive Diagnosis via Hybrid Optimization for Intelligent Education Systems @@ -0,0 +1 @@ +Cognitive diagnosis assessment is a fundamental and crucial task for student learning. It models the student-exercise interaction, and discovers the students' proficiency levels on each knowledge attribute. In real-world intelligent education systems, generalization and interpretability of cognitive diagnosis methods are of equal importance. However, most existing methods can hardly make the best of both worlds due to the complicated student-exercise interaction. To this end, this paper proposes a symbolic cognitive diagnosis (SCD) framework to simultaneously enhance generalization and interpretability. The SCD framework incorporates the symbolic tree to explicably represent the complicated student-exercise interaction function, and utilizes gradient-based optimization methods to effectively learn the student and exercise parameters. Meanwhile, the accompanying challenge is that we need to bridge the discrete symbolic representation and the continuous parameter optimization. To address this challenge, we propose to hybridly optimize the representation and parameters in an alternating manner. 
To fulfill SCD, it alternately learns the symbolic tree by derivative-free genetic programming and learns the student and exercise parameters via gradient-based Adam. Extensive experimental results on various real-world datasets show the superiority of SCD in both generalization and interpretability. The ablation study verifies the efficacy of each ingredient in SCD, and the case study explicitly showcases how the interpretability of SCD works. \ No newline at end of file diff --git a/data/2024/aaai/Symbolic Numeric Planning with Patterns b/data/2024/aaai/Symbolic Numeric Planning with Patterns new file mode 100644 index 0000000000..02497dbdff --- /dev/null +++ b/data/2024/aaai/Symbolic Numeric Planning with Patterns @@ -0,0 +1 @@ +In this paper, we propose a novel approach for solving linear numeric planning problems, called Symbolic Pattern Planning. Given a planning problem Pi, a bound n and a pattern --defined as an arbitrary sequence of actions-- we encode the problem of finding a plan for Pi with bound n as a formula with fewer variables and/or clauses than the state-of-the-art rolled-up and relaxed-relaxed-exists encodings. More importantly, we prove that for any given bound, it is never the case that the latter two encodings allow finding a valid plan while ours does not. On the experimental side, we consider 6 other planning systems --including the ones which participated in this year's International Planning Competition (IPC)-- and we show that our planner Patty has remarkably good comparative performances on this year's IPC problems. \ No newline at end of file diff --git a/data/2024/aaai/Symbolic Reasoning Methods for AI Planning b/data/2024/aaai/Symbolic Reasoning Methods for AI Planning new file mode 100644 index 0000000000..4c674d47d9 --- /dev/null +++ b/data/2024/aaai/Symbolic Reasoning Methods for AI Planning @@ -0,0 +1,29 @@ +Planning is the act of deliberative thinking before acting. +It is based on a symbolic model of the world and the options to act in it, usually defined in function-free first-order logic. +The task is to find a sequence of actions (a plan) that leads from a given current state to a desired goal state. +The basic, purely physical description may be augmented with a partially ordered grammar-like structure (a Hierarchical Task Network or HTN), which can describe expert knowledge, or practical, legal, or operational requirements. + + +In this talk, I will survey a variety of methods for automatically deriving plans using symbolic methods for planning -- from both my past and future research. +These symbolic methods -- in some sense -- translate planning problems into other, simpler symbolic representations and reason over them to find plans. + + +As a basis for these methods, I will first introduce relevant theoretical results on planning. +First, I will discuss the expressive power of planning formalisms (ECAI'14, ICAPS'16) and second, the computational complexity of HTN planning and related tasks such as HTN plan verification, plan modification, and plan recognition (ICAPS'15, ICAPS'16). + + +Based on these theoretical results, I will develop why SAT-based HTN planning is possible and how it can be implemented. +To this end, I will survey several of my publications at top-tier conferences, including papers at ICAPS'17, AAAI'18, AAAI'19, IJCAI'19, AAAI'20, and ICAPS'21 -- in which I developed a highly efficient SAT-based planner for HTN problems including the ability to find optimal plans as well as the grounding as a preprocessing step. 
+Here I will also give an outlook on future developments and new ideas that I propose for SAT-based planning -- including the exploitation of structures in plans (e.g., landmarks or operator-counting constraints). + +Next, I will present the idea of expressing lifted classical planning as SAT (ICAPS'22). +The resulting planner LiSAT was the first lifted SAT-based planner -- and proved highly efficient and outperformed all other lifted planners at the time of publication. +Notably, LiSAT was the first planner (lifted or grounded) and still is the only one to solve the challenging OrganicSynthesis benchmark -- and could even prove optimality for all plans. +I will also outline future ideas to further improve the efficiency of LiSAT. + + +Lastly, I introduce the notion of planning with symbolic representations (AAAI'21 and ICAPS'23). +Here one uses Binary Decision Diagrams to encode large sets of states efficiently. +For expressing the additional structure encoded by HTNs, I show how BDDs can be suitably integrated into finite automata. +Based on this representation, an efficient and optimal planning algorithm can be derived. +Additionally, I show how this algorithm can be extended to also cover oversubscription planning. \ No newline at end of file diff --git a/data/2024/aaai/Symbolic Regression Enhanced Decision Trees for Classification Tasks b/data/2024/aaai/Symbolic Regression Enhanced Decision Trees for Classification Tasks new file mode 100644 index 0000000000..d66fd5ce27 --- /dev/null +++ b/data/2024/aaai/Symbolic Regression Enhanced Decision Trees for Classification Tasks @@ -0,0 +1 @@ +We introduce a conceptually simple yet effective method to create small, compact decision trees - by using splits found via Symbolic Regression (SR). Traditional decision tree (DT) algorithms partition a dataset on axis-parallel splits. When the true boundaries are not along the feature axes, DT is likely to have a complicated structure and a dense decision boundary. In this paper, we introduce SR-Enhanced DT (SREDT) - a method which utilizes SR to increase the richness of the class of possible DT splits. We evaluate SREDT on both synthetic and real-world datasets. Despite its simplicity, our method produces surprisingly small trees that outperform both DT and oblique DT (ODT) on supervised classification tasks in terms of accuracy and F-score. We show empirically that SREDTs decrease inference time (compared to DT and ODT) and argue that they allow us to obtain more explainable descriptions of the decision process. SREDT also performs competitively against state-of-the-art tabular classification methods, including tree ensembles and deep models. Finally, we introduce a local search mechanism to improve SREDT and evaluate it on 56 PMLB datasets. This mechanism shows improved performance on 77.2% of the datasets, outperforming DT and ODT. In terms of F-Score, local SREDT outperforms DT and ODT in 82.5% and 73.7% of the datasets, respectively, and in terms of inference time, local SREDT requires 25.8% and 26.6% less inference time than DT and ODT, respectively. 
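The SREDT abstract above enriches decision-tree splits with expressions found by symbolic regression. The toy sketch below fixes a small list of candidate expressions by hand (in SREDT they would be discovered by symbolic regression, so the candidate list and Gini criterion are illustrative assumptions) and picks the node split minimizing weighted Gini impurity, which is enough to capture a boundary no axis-parallel split can:

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, expressions):
    """Pick the best node split among candidate symbolic expressions.
    Each expression maps the feature matrix X (n, d) to one value per row;
    the split is `expr(X) <= threshold`, chosen to minimize weighted Gini."""
    best = (None, None, np.inf)
    for expr in expressions:
        v = expr(X)
        for t in np.unique(v):
            left, right = y[v <= t], y[v > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (expr, t, score)
    return best

# Candidate splits: axis-parallel features plus a few symbolic combinations.
candidates = [
    lambda X: X[:, 0],
    lambda X: X[:, 1],
    lambda X: X[:, 0] * X[:, 1],            # a non-axis-parallel boundary
    lambda X: X[:, 0] / (X[:, 1] + 1e-8),
]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # no axis-parallel split separates this
expr, thr, score = best_split(X, y, candidates)
print("chosen threshold:", thr, "weighted Gini:", round(score, 3))
```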
\ No newline at end of file diff --git a/data/2024/aaai/Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning b/data/2024/aaai/Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning new file mode 100644 index 0000000000..ff573d737c --- /dev/null +++ b/data/2024/aaai/Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning @@ -0,0 +1 @@ +In deep reinforcement learning, estimating the value function to evaluate the quality of states and actions is essential. The value function is often trained using the least squares method, which implicitly assumes a Gaussian error distribution. However, a recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman operator, and violates the implicit assumption of normal error distribution in the least squares method. To address this, we proposed a method called Symmetric Q-learning, in which the synthetic noise generated from a zero-mean distribution is added to the target values to generate a Gaussian error distribution. We evaluated the proposed method on continuous control benchmark tasks in MuJoCo. It improved the sample efficiency of a state-of-the-art reinforcement learning method by reducing the skewness of the error distribution. \ No newline at end of file diff --git a/data/2024/aaai/Symmetric Self-Paced Learning for Domain Generalization b/data/2024/aaai/Symmetric Self-Paced Learning for Domain Generalization new file mode 100644 index 0000000000..7f00135df4 --- /dev/null +++ b/data/2024/aaai/Symmetric Self-Paced Learning for Domain Generalization @@ -0,0 +1,8 @@ +Deep learning methods often suffer performance degradation due to domain shift, where discrepancies exist between training and testing data distributions. +Domain generalization mitigates this problem by leveraging information from multiple source domains to enhance model generalization capabilities for unseen domains. +However, existing domain generalization methods typically present examples to the model in a random manner, overlooking the potential benefits of structured data presentation. +To bridge this gap, we propose a novel learning strategy, Symmetric Self-Paced Learning (SSPL), for domain generalization. +SSPL consists of a Symmetric Self-Paced training scheduler and a Gradient-based Difficulty Measure (GDM). +Specifically, the proposed training scheduler initially focuses on easy examples, gradually shifting emphasis to harder examples as training progresses. +GDM dynamically evaluates example difficulty through the gradient magnitude with respect to the example itself. +Experiments across five popular benchmark datasets demonstrate the effectiveness of the proposed learning strategy. \ No newline at end of file diff --git a/data/2024/aaai/Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos b/data/2024/aaai/Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos new file mode 100644 index 0000000000..21253f7ec1 --- /dev/null +++ b/data/2024/aaai/Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos @@ -0,0 +1 @@ +Recent advancements in 4D scene reconstruction using neural radiance fields (NeRF) have demonstrated the ability to represent dynamic scenes from multi-view videos. However, they fail to reconstruct the dynamic scenes and struggle to fit even the training views in unsynchronized settings. 
This happens because they employ a single latent embedding for a frame while the multi-view images at the same frame were actually captured at different moments. To address this limitation, we introduce time offsets for individual unsynchronized videos and jointly optimize the offsets with NeRF. By design, our method is applicable to various baselines and improves them by large margins. Furthermore, finding the offsets effectively synchronizes the videos without manual effort. Experiments are conducted on the common Plenoptic Video Dataset and a newly built Unsynchronized Dynamic Blender Dataset to verify the performance of our method. Project page: https://seoha-kim.github.io/sync-nerf \ No newline at end of file diff --git a/data/2024/aaai/Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction b/data/2024/aaai/Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction new file mode 100644 index 0000000000..79565edc56 --- /dev/null +++ b/data/2024/aaai/Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction @@ -0,0 +1 @@ +Few-shot Relation Extraction (FSRE) aims to extract relational facts from a sparse set of labeled corpora. Recent studies have shown promising results in FSRE by employing Pre-trained Language Models (PLMs) within the framework of supervised contrastive learning, which considers both instances and label facts. However, how to effectively harness massive instance-label pairs so that the learned representation is semantically rich has not been fully explored in this learning paradigm. To address this gap, we introduce a novel synergistic anchored contrastive pre-training framework. This framework is motivated by the insight that the diverse viewpoints conveyed through instance-label pairs capture incomplete yet complementary intrinsic textual semantics. Specifically, our framework involves a symmetrical contrastive objective that encompasses both sentence-anchored and label-anchored contrastive losses. By combining these two losses, the model establishes a robust and uniform representation space. This space effectively captures the reciprocal alignment of feature distributions among instances and relational facts, simultaneously enhancing the maximization of mutual information across diverse perspectives within the same relation. Experimental results demonstrate that our framework achieves significant performance enhancements compared to baseline models in downstream FSRE tasks. Furthermore, our approach exhibits superior adaptability to handle the challenges of domain shift and zero-shot relation extraction. Our code is available online at https://github.com/AONE-NLP/FSRE-SaCon. \ No newline at end of file diff --git a/data/2024/aaai/Synergistic Multiscale Detail Refinement via Intrinsic Supervision for Underwater Image Enhancement b/data/2024/aaai/Synergistic Multiscale Detail Refinement via Intrinsic Supervision for Underwater Image Enhancement new file mode 100644 index 0000000000..f32b7781aa --- /dev/null +++ b/data/2024/aaai/Synergistic Multiscale Detail Refinement via Intrinsic Supervision for Underwater Image Enhancement @@ -0,0 +1 @@ +Visually restoring underwater scenes primarily involves mitigating interference from underwater media. Existing methods ignore the inherent scale-related characteristics in underwater scenes. Therefore, we present the synergistic multi-scale detail refinement via intrinsic supervision (SMDR-IS) for enhancing underwater scene details, which contains multiple stages. 
The low-degradation stage from the original images furnishes the original stage with multi-scale details, achieved through feature propagation using the Adaptive Selective Intrinsic Supervised Feature (ASISF) module. By using intrinsic supervision, the ASISF module can precisely control and guide feature transmission across multi-degradation stages, enhancing multi-scale detail refinement and minimizing the interference from irrelevant information in the low-degradation stage. In the multi-degradation encoder-decoder framework of SMDR-IS, we introduce the Bifocal Intrinsic-Context Attention Module (BICA). Based on the intrinsic supervision principles, BICA efficiently exploits multi-scale scene information in images. BICA directs higher-resolution spaces by tapping into the insights of lower-resolution ones, underscoring the pivotal role of spatial contextual relationships in underwater image restoration. Throughout training, the inclusion of a multi-degradation loss function can enhance the network, allowing it to adeptly extract information across diverse scales. When benchmarked against state-of-the-art methods, SMDR-IS consistently showcases superior performance. Our code is available at https://github.com/zhoujingchun03/SMDR-IS \ No newline at end of file diff --git a/data/2024/aaai/T-NET: Weakly Supervised Graph Learning for Combatting Human Trafficking b/data/2024/aaai/T-NET: Weakly Supervised Graph Learning for Combatting Human Trafficking new file mode 100644 index 0000000000..0323f4a4c1 --- /dev/null +++ b/data/2024/aaai/T-NET: Weakly Supervised Graph Learning for Combatting Human Trafficking @@ -0,0 +1,3 @@ +Human trafficking (HT) for forced sexual exploitation, often described as modern-day slavery, is a pervasive problem that affects millions of people worldwide. Perpetrators of this crime post advertisements (ads) on behalf of their victims on adult service websites (ASW). These websites typically contain hundreds of thousands of ads including those posted by independent escorts, massage parlor agencies and spammers (fake ads). Detecting suspicious activity in these ads is difficult and developing data-driven methods is challenging due to the hard-to-label, complex and sensitive nature of the data. + +In this paper, we propose T-Net, which, unlike previous solutions, formulates this problem as weakly supervised classification. Since it takes several months to years to investigate a case and obtain a single definitive label, we design domain-specific signals or indicators that provide weak labels. T-Net also looks into connections between ads and models the problem as a graph learning task instead of classifying ads independently. We show that T-Net outperforms all baselines on a real-world dataset of ads by 7% in average weighted F1 score. Given that this data contains personally identifiable information, we also present a realistic data generator and provide the first publicly available dataset in this domain, which may be leveraged by the wider research community. 
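The weak-supervision step described in the T-Net abstract above can be illustrated with a minimal sketch: several indicator functions vote on an ad and their votes are averaged into a soft weak label. The indicator functions shown are hypothetical placeholders, not the domain-specific signals actually used by the authors:

```python
def weak_label(ad, labeling_functions):
    """Combine domain-specific indicator functions into a soft weak label for
    one ad. Each function returns +1 (suspicious), -1 (benign) or 0 (abstain);
    the weak label is the mean of the non-abstaining votes, mapped to [0, 1]."""
    votes = [lf(ad) for lf in labeling_functions]
    votes = [v for v in votes if v != 0]
    if not votes:
        return None                              # unlabeled; excluded from the weak loss
    return (sum(votes) / len(votes) + 1) / 2

# Hypothetical indicators for illustration only.
lfs = [
    lambda ad: 1 if ad.get("shared_phone_count", 0) > 5 else 0,
    lambda ad: 1 if ad.get("third_person_language") else 0,
    lambda ad: -1 if ad.get("posted_by_verified_agency") else 0,
]

ad = {"shared_phone_count": 9, "third_person_language": True}
print(weak_label(ad, lfs))   # 1.0
```

In T-Net these weak labels supervise a graph model over connected ads rather than a per-ad classifier; the graph-learning part is not shown here.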
\ No newline at end of file diff --git a/data/2024/aaai/T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models b/data/2024/aaai/T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models new file mode 100644 index 0000000000..fe111098ba --- /dev/null +++ b/data/2024/aaai/T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models @@ -0,0 +1 @@ +The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated a strong capability for learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate control (e.g., over structure and color) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn low-cost T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications. Our code is available at https://github.com/TencentARC/T2I-Adapter. \ No newline at end of file diff --git a/data/2024/aaai/T2MAC: Targeted and Trusted Multi-Agent Communication through Selective Engagement and Evidence-Driven Integration b/data/2024/aaai/T2MAC: Targeted and Trusted Multi-Agent Communication through Selective Engagement and Evidence-Driven Integration new file mode 100644 index 0000000000..e329f53cec --- /dev/null +++ b/data/2024/aaai/T2MAC: Targeted and Trusted Multi-Agent Communication through Selective Engagement and Evidence-Driven Integration @@ -0,0 +1 @@ +Communication stands as a potent mechanism to harmonize the behaviors of multiple agents. However, existing work primarily concentrates on broadcast communication, which not only lacks practicality, but also leads to information redundancy. This surplus, one-size-fits-all information could adversely impact communication efficiency. Furthermore, existing works often resort to basic mechanisms to integrate observed and received information, impairing the learning process. To tackle these difficulties, we propose Targeted and Trusted Multi-Agent Communication (T2MAC), a straightforward yet effective method that enables agents to learn selective engagement and evidence-driven integration. With T2MAC, agents have the capability to craft individualized messages, pinpoint ideal communication windows, and engage with reliable partners, thereby refining communication efficiency. Following the reception of messages, the agents integrate information observed and received from different sources at an evidence level. This process enables agents to collectively use evidence garnered from multiple perspectives, fostering trusted and cooperative behaviors. We evaluate our method on a diverse set of cooperative multi-agent tasks, with varying difficulties, involving different scales and ranging from Hallway, MPE to SMAC. 
The experiments indicate that the proposed model not only surpasses the state-of-the-art methods in terms of cooperative performance and communication efficiency, but also exhibits impressive generalization. \ No newline at end of file diff --git a/data/2024/aaai/TA&AT: Enhancing Task-Oriented Dialog with Turn-Level Auxiliary Tasks and Action-Tree Based Scheduled Sampling b/data/2024/aaai/TA&AT: Enhancing Task-Oriented Dialog with Turn-Level Auxiliary Tasks and Action-Tree Based Scheduled Sampling new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/TACIT: A Target-Agnostic Feature Disentanglement Framework for Cross-Domain Text Classification b/data/2024/aaai/TACIT: A Target-Agnostic Feature Disentanglement Framework for Cross-Domain Text Classification new file mode 100644 index 0000000000..7c2da786c0 --- /dev/null +++ b/data/2024/aaai/TACIT: A Target-Agnostic Feature Disentanglement Framework for Cross-Domain Text Classification @@ -0,0 +1 @@ +Cross-domain text classification aims to transfer models from label-rich source domains to label-poor target domains, giving it a wide range of practical applications. Many approaches promote cross-domain generalization by capturing domain-invariant features. However, these methods rely on unlabeled samples provided by the target domains, which renders the model ineffective when the target domain is agnostic (i.e., unavailable during training). Furthermore, the models are easily disturbed by shortcut learning in the source domain, which also hinders the improvement of domain generalization ability. To solve the aforementioned issues, this paper proposes TACIT, a target-domain-agnostic feature disentanglement framework that adaptively decouples robust and unrobust features by Variational Auto-Encoders. Additionally, to encourage the separation of unrobust features from robust features, we design a feature distillation task that compels unrobust features to approximate the output of the teacher. The teacher model is trained with a few easy samples, which are likely to carry potential unknown shortcuts. Experimental results verify that our framework achieves comparable results to state-of-the-art baselines while utilizing only source domain data. 
We propose Topology-based multi-Agent Policy gradiEnt (TAPE) for both stochastic and deterministic MAPG methods. We prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation for the improved cooperation among agents. Experiment results on several benchmarks show the agent topology is able to facilitate agent cooperation and alleviate CDM issue respectively to improve performance of TAPE. Finally, multiple ablation studies and a heuristic graph search algorithm are devised to show the efficacy of the agent topology. \ No newline at end of file diff --git a/data/2024/aaai/TAU: Trajectory Data Augmentation with Uncertainty for Next POI Recommendation b/data/2024/aaai/TAU: Trajectory Data Augmentation with Uncertainty for Next POI Recommendation new file mode 100644 index 0000000000..daff4f13cc --- /dev/null +++ b/data/2024/aaai/TAU: Trajectory Data Augmentation with Uncertainty for Next POI Recommendation @@ -0,0 +1,2 @@ +Next Point-of-Interest (POI) recommendation has been proven effective at utilizing sparse, intricate spatial-temporal trajectory data to recommend subsequent POIs to users. While existing methods commonly alleviate the problem of data sparsity by integrating spatial-temporal context information, POI category features, and social relationships, they largely overlook the fact that the trajectory sequences collected in the datasets are often incomplete. This oversight limits the model’s potential to fully leverage historical context. In light of this background, we propose Trajectory Data Augmentation with Uncertainty (TAU) for Next POI Recommendation. TAU is a general graph-based trajectory data augmentation method designed to complete user mobility patterns by marrying uncertainty estimation into the next POI recommendation task. More precisely, TAU taps into the global transition pattern graph to identify sets of intermediate nodes located between every pair of locations, effectively +leveraging edge weights as transition probabilities. During trajectory sequence construction, TAU selectively prompts intermediate nodes, chosen based on their likelihood of occurrence as pseudo-labels, to establish comprehensive trajectory sequences. Furthermore, to gauge the certainty and impact of pseudo-labels on the target location, we introduce a novel confidence-aware calibration strategy using evidence deep learning (EDL) for improved performance and reliability. The experimental results clearly indicate that our TAU method achieves consistent performance improvements over existing techniques across two real-world datasets, verifying its effectiveness as the state-of-the-art approach to the task. \ No newline at end of file diff --git a/data/2024/aaai/TC-LIF: A Two-Compartment Spiking Neuron Model for Long-Term Sequential Modelling b/data/2024/aaai/TC-LIF: A Two-Compartment Spiking Neuron Model for Long-Term Sequential Modelling new file mode 100644 index 0000000000..5b5c37ac79 --- /dev/null +++ b/data/2024/aaai/TC-LIF: A Two-Compartment Spiking Neuron Model for Long-Term Sequential Modelling @@ -0,0 +1 @@ +The identification of sensory cues associated with potential opportunities and dangers is frequently complicated by unrelated events that separate useful cues by long delays. As a result, it remains a challenging task for state-of-the-art spiking neural networks (SNNs) to establish long-term temporal dependency between distant cues. 
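A minimal sketch of the graph-based augmentation described in the TAU abstract above: edge counts of a global transition graph are normalized into transition probabilities, and the most likely intermediate POI between two consecutive check-ins is inserted as a pseudo-label. The function names, the probability product score, and the threshold are illustrative assumptions rather than the paper's exact policy:

```python
from collections import defaultdict

def transition_probs(graph):
    """graph[u][v] = observed transition count u -> v; returns P(v | u)."""
    probs = defaultdict(dict)
    for u, nbrs in graph.items():
        total = sum(nbrs.values())
        for v, c in nbrs.items():
            probs[u][v] = c / total
    return probs

def augment_trajectory(traj, graph, threshold=0.05):
    """Insert the most likely intermediate POI between consecutive check-ins
    (a, b) as a pseudo-label when P(v|a) * P(b|v) clears the threshold."""
    probs = transition_probs(graph)
    out = [traj[0]]
    for a, b in zip(traj, traj[1:]):
        candidates = {
            v: probs[a][v] * probs[v].get(b, 0.0)
            for v in probs.get(a, {})
            if v != b and probs[v].get(b, 0.0) > 0.0
        }
        if candidates:
            v, score = max(candidates.items(), key=lambda kv: kv[1])
            if score >= threshold:
                out.append(v)            # pseudo-label completing the pattern
        out.append(b)
    return out

graph = {"home": {"cafe": 8, "office": 2}, "cafe": {"office": 9, "gym": 1}}
print(augment_trajectory(["home", "office"], graph))  # ['home', 'cafe', 'office']
```

The confidence-aware calibration of these pseudo-labels via evidential deep learning, also mentioned in the abstract, is not covered by this sketch.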
To address this challenge, we propose a novel biologically inspired Two-Compartment Leaky Integrate-and-Fire spiking neuron model, dubbed TC-LIF. The proposed model incorporates carefully designed somatic and dendritic compartments that are tailored to facilitate learning long-term temporal dependencies. Furthermore, the theoretical analysis is provided to validate the effectiveness of TC-LIF in propagating error gradients over an extended temporal duration. Our experimental results, on a diverse range of temporal classification tasks, demonstrate superior temporal classification capability, rapid training convergence, and high energy efficiency of the proposed TC-LIF model. Therefore, this work opens up a myriad of opportunities for solving challenging temporal processing tasks on emerging neuromorphic computing systems. Our code is publicly available at https://github.com/ZhangShimin1/TC-LIF. \ No newline at end of file diff --git a/data/2024/aaai/TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions b/data/2024/aaai/TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions new file mode 100644 index 0000000000..6de077258a --- /dev/null +++ b/data/2024/aaai/TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions @@ -0,0 +1 @@ +A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that effectively models spatio-temporal information from Trajectories and Correlated regions. TCNet's trajectory module transforms frames into aligned trajectories composed of continuous visual tokens. This facilitates extracting region trajectory patterns. In addition, for a query token, self-attention is learned along the trajectory. As such, our network can also focus on fine-grained spatio-temporal patterns, such as finger movement, of a region in motion. TCNet's correlation module utilizes a novel dynamic attention mechanism that filters out irrelevant frame regions. Additionally, it assigns dynamic key-value tokens from correlated regions to each query. Both innovations significantly reduce the computation cost and memory. We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily. Our results demonstrate that TCNet consistently achieves state-of-the-art performance. For example, we improve over the previous state-of-the-art by 1.5\% and 1.0\% word error rate on PHOENIX14 and PHOENIX14-T, respectively. Code is available at https://github.com/hotfinda/TCNet \ No newline at end of file diff --git a/data/2024/aaai/TDeLTA: A Light-Weight and Robust Table Detection Method Based on Learning Text Arrangement b/data/2024/aaai/TDeLTA: A Light-Weight and Robust Table Detection Method Based on Learning Text Arrangement new file mode 100644 index 0000000000..fd720fc4fa --- /dev/null +++ b/data/2024/aaai/TDeLTA: A Light-Weight and Robust Table Detection Method Based on Learning Text Arrangement @@ -0,0 +1 @@ +The diversity of tables makes table detection a great challenge, leading to existing models becoming more tedious and complex. Despite achieving high performance, they often overfit to the table style in training set, and suffer from significant performance degradation when encountering out-of-distribution tables in other domains. 
To tackle this problem, we start from the essence of the table, which is a set of text arranged in rows and columns. Based on this, we propose a novel, lightweight and robust Table Detection method based on Learning Text Arrangement, namely TDeLTA. TDeLTA takes the text blocks as input, and then models their arrangement with a sequential encoder and an attention module. To locate the tables precisely, we design a text-classification task, classifying the text blocks into 4 categories according to their semantic roles in the tables. Experiments are conducted on text blocks both parsed from PDFs and extracted by open-source OCR tools. Compared to several state-of-the-art methods, TDeLTA achieves competitive results with only 3.1M model parameters on the large-scale public datasets. Moreover, when faced with cross-domain data under the zero-shot setting, TDeLTA outperforms baselines by a large margin of nearly 7%, which shows the strong robustness and transferability of the proposed model. \ No newline at end of file diff --git "a/data/2024/aaai/TD\302\262-Net: Toward Denoising and Debiasing for Video Scene Graph Generation" "b/data/2024/aaai/TD\302\262-Net: Toward Denoising and Debiasing for Video Scene Graph Generation" new file mode 100644 index 0000000000..01e3bab2e8 --- /dev/null +++ "b/data/2024/aaai/TD\302\262-Net: Toward Denoising and Debiasing for Video Scene Graph Generation" @@ -0,0 +1,2 @@ +Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues, including 1) Contextual noise, as some frames might contain occluded and blurred objects. 2) Label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones. Additionally, the distribution of relationships exhibits a long-tailed pattern. To address the above problems, in this paper, we introduce a network named TD2-Net that aims at denoising and debiasing for dynamic SGG. Specifically, we first propose a denoising spatio-temporal transformer module that enhances object representation with robust contextual information. This is achieved by designing a differentiable Top-K object selector that utilizes the Gumbel-Softmax sampling strategy to select the relevant neighborhood for each object. +Second, we introduce an asymmetrical reweighting loss to relieve the issue of label bias. This loss function integrates asymmetry focusing factors and the volume of samples to adjust the weights assigned to individual samples. Systematic experimental results demonstrate the superiority of our proposed TD2-Net over existing state-of-the-art approaches on Action Genome databases. In more detail, TD2-Net outperforms the second-best competitors by 12.7% on mean-Recall@10 for predicate classification. \ No newline at end of file diff --git a/data/2024/aaai/TEAMSTER: Model-Based Reinforcement Learning for Ad Hoc Teamwork (Abstract Reprint) b/data/2024/aaai/TEAMSTER: Model-Based Reinforcement Learning for Ad Hoc Teamwork (Abstract Reprint) new file mode 100644 index 0000000000..3cda98bf77 --- /dev/null +++ b/data/2024/aaai/TEAMSTER: Model-Based Reinforcement Learning for Ad Hoc Teamwork (Abstract Reprint) @@ -0,0 +1 @@ +This paper investigates the use of model-based reinforcement learning in the context of ad hoc teamwork. 
We introduce a novel approach, named TEAMSTER, where we learn the environment's model and the model of the teammates' behavior separately. Compared to the state-of-the-art PLASTIC algorithms, our results in four different domains from the multi-agent systems literature show that TEAMSTER is more flexible than the PLASTIC-Model, by learning the environment's model instead of assuming a perfect hand-coded model, and more robust/efficient than PLASTIC-Policy, by being able to continuously adapt to newly encountered teams, without implicitly learning a new environment model from scratch. \ No newline at end of file diff --git a/data/2024/aaai/TETRIS: Towards Exploring the Robustness of Interactive Segmentation b/data/2024/aaai/TETRIS: Towards Exploring the Robustness of Interactive Segmentation new file mode 100644 index 0000000000..247d4f90bf --- /dev/null +++ b/data/2024/aaai/TETRIS: Towards Exploring the Robustness of Interactive Segmentation @@ -0,0 +1 @@ +Interactive segmentation methods rely on user inputs to iteratively update the selection mask. A click specifying the object of interest is arguably the simplest and most intuitive interaction type, and therefore the most common choice for interactive segmentation. However, user clicking patterns in the interactive segmentation context remain unexplored. Accordingly, interactive segmentation evaluation strategies rely more on intuition and common sense than on empirical studies (e.g., assuming that users tend to click in the center of the area with the largest error). In this work, we conduct a real-user study to investigate actual clicking patterns. This study reveals that the intuitive assumption made in the common evaluation strategy may not hold. As a result, interactive segmentation models may show high scores on the standard benchmarks, but this does not imply that they would perform well in a real-world scenario. To assess the applicability of interactive segmentation methods, we propose a novel evaluation strategy providing a more comprehensive analysis of a model's performance. To this end, we propose a methodology for finding extreme user inputs by direct optimization in a white-box adversarial attack on the interactive segmentation model. Based on the performance with such adversarial user inputs, we assess the robustness of interactive segmentation models w.r.t. click positions. Besides, we introduce a novel benchmark for measuring the robustness of interactive segmentation, and report the results of an extensive evaluation of dozens of models. \ No newline at end of file diff --git a/data/2024/aaai/THGFormer: Time-Aware Hypergraph Learning for Multimodal Social Media Popularity Prediction (Student Abstract) b/data/2024/aaai/THGFormer: Time-Aware Hypergraph Learning for Multimodal Social Media Popularity Prediction (Student Abstract) new file mode 100644 index 0000000000..e595fc021a --- /dev/null +++ b/data/2024/aaai/THGFormer: Time-Aware Hypergraph Learning for Multimodal Social Media Popularity Prediction (Student Abstract) @@ -0,0 +1 @@ +Social media popularity prediction of multimodal user-generated content (UGC) is a crucial task for many real-world applications. However, existing efforts are often limited by missing inter-instance correlations and UGC temporal patterns. To address these issues, we propose a novel time-aware hypergraph Transformer framework, THGFormer. 
It fully represents inter-instance and intra-instance relations by hypergraphs, captures temporal dependencies with a time encoder, and enhances UGC representations via neighborhood knowledge aggregation. Extensive experiments conducted on two real-world datasets demonstrate that THGFormer outperforms state-of-the-art popularity prediction models across several settings. \ No newline at end of file diff --git a/data/2024/aaai/TIKP: Text-to-Image Knowledge Preservation for Continual Semantic Segmentation b/data/2024/aaai/TIKP: Text-to-Image Knowledge Preservation for Continual Semantic Segmentation new file mode 100644 index 0000000000..e7aa5d3db8 --- /dev/null +++ b/data/2024/aaai/TIKP: Text-to-Image Knowledge Preservation for Continual Semantic Segmentation @@ -0,0 +1 @@ +Continual Semantic Segmentation (CSS) is an emerging trend in which catastrophic forgetting has been a perplexing problem. In this paper, we propose a Text-to-Image Knowledge Preservation (TIKP) framework to address this issue. TIKP applies Text-to-Image techniques to CSS through automatic prompt generation and content adaptation. It extracts associations between the labels of seen data and constructs text-level prompts based on these associations, which are preserved and maintained at each incremental step. During training, these prompts generate correlated images to mitigate catastrophic forgetting. Particularly, as the generated images may have different distributions from the original data, TIKP transfers knowledge via a content adaptation loss, which determines the role played by the generated images in incremental training based on their similarity. In addition, for the classifier, we use the previous model from a different perspective: misclassifying new classes into old objects instead of the background. We propose a knowledge distillation loss based on wrong labels, enabling us to attribute varying weights to individual objects during the distillation process. Extensive experiments conducted in the same setting show that TIKP outperforms state-of-the-art methods by a large margin on benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/TMFormer: Token Merging Transformer for Brain Tumor Segmentation with Missing Modalities b/data/2024/aaai/TMFormer: Token Merging Transformer for Brain Tumor Segmentation with Missing Modalities new file mode 100644 index 0000000000..321297d8ea --- /dev/null +++ b/data/2024/aaai/TMFormer: Token Merging Transformer for Brain Tumor Segmentation with Missing Modalities @@ -0,0 +1 @@ +Numerous techniques excel in brain tumor segmentation using multi-modal magnetic resonance imaging (MRI) sequences, delivering exceptional results. However, the prevalent absence of modalities in clinical scenarios hampers performance. Current approaches frequently resort to zero maps as substitutes for missing modalities, inadvertently introducing feature bias and redundant computations. To address these issues, we present the Token Merging transFormer (TMFormer) for robust brain tumor segmentation with missing modalities. TMFormer tackles these challenges by extracting and merging accessible modalities into more compact token sequences. The architecture comprises two core components: the Uni-modal Token Merging Block (UMB) and the Multi-modal Token Merging Block (MMB). 
The UMB enhances individual modality representation by adaptively consolidating spatially redundant tokens within and outside tumor-related regions, thereby refining token sequences for augmented representational capacity. Meanwhile, the MMB mitigates multi-modal feature fusion bias, exclusively leveraging tokens from present modalities and merging them into a unified multi-modal representation to accommodate varying modality combinations. Extensive experimental results on the BraTS 2018 and 2020 datasets demonstrate the superiority and efficacy of TMFormer compared to state-of-the-art methods when dealing with missing modalities. \ No newline at end of file diff --git a/data/2024/aaai/TNPAR: Topological Neural Poisson Auto-Regressive Model for Learning Granger Causal Structure from Event Sequences b/data/2024/aaai/TNPAR: Topological Neural Poisson Auto-Regressive Model for Learning Granger Causal Structure from Event Sequences new file mode 100644 index 0000000000..ca7ea10efd --- /dev/null +++ b/data/2024/aaai/TNPAR: Topological Neural Poisson Auto-Regressive Model for Learning Granger Causal Structure from Event Sequences @@ -0,0 +1 @@ +Learning Granger causality from event sequences is a challenging but essential task across various applications. Most existing methods rely on the assumption that event sequences are independent and identically distributed (i.i.d.). However, this i.i.d. assumption is often violated due to the inherent dependencies among the event sequences. Fortunately, in practice, we find these dependencies can be modeled by a topological network, suggesting a potential solution to the non-i.i.d. problem by introducing the prior topological network into Granger causal discovery. This observation prompts us to tackle two ensuing challenges: 1) how to model the event sequences while incorporating both the prior topological network and the latent Granger causal structure, and 2) how to learn the Granger causal structure. To this end, we devise a unified topological neural Poisson auto-regressive model with two processes. In the generation process, we employ a variant of the neural Poisson process to model the event sequences, considering influences from both the topological network and the Granger causal structure. In the inference process, we formulate an amortized inference algorithm to infer the latent Granger causal structure. We encapsulate these two processes within a unified likelihood function, providing an end-to-end framework for this task. Experiments on simulated and real-world data demonstrate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation b/data/2024/aaai/TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation new file mode 100644 index 0000000000..87923f212c --- /dev/null +++ b/data/2024/aaai/TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation @@ -0,0 +1 @@ +Multi-spectral object Re-identification (ReID) aims to retrieve specific objects by leveraging complementary information from different image spectra. It delivers great advantages over traditional single-spectral ReID in complex visual environment. However, the significant distribution gap among different image spectra poses great challenges for effective multi-spectral feature representations. 
In addition, most current Transformer-based ReID methods only utilize the global feature of class tokens for holistic retrieval, ignoring the locally discriminative ones. To address the above issues, we go a step further and utilize all the tokens of Transformers, proposing a cyclic token permutation framework for multi-spectral object ReID, dubbed TOP-ReID. More specifically, we first deploy a multi-stream deep network based on vision Transformers to preserve distinct information from different image spectra. Then, we propose a Token Permutation Module (TPM) for cyclic multi-spectral feature aggregation. It not only facilitates spatial feature alignment across different image spectra, but also allows the class token of each spectrum to perceive the local details of other spectra. Meanwhile, we propose a Complementary Reconstruction Module (CRM), which introduces dense token-level reconstruction constraints to reduce the distribution gap across different image spectra. With the above modules, our proposed framework can generate more discriminative multi-spectral features for robust object ReID. Extensive experiments on three ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) verify the effectiveness of our method. The code is available at https://github.com/924973292/TOP-ReID. \ No newline at end of file diff --git a/data/2024/aaai/TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection b/data/2024/aaai/TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection new file mode 100644 index 0000000000..73c9291076 --- /dev/null +++ b/data/2024/aaai/TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection @@ -0,0 +1 @@ +Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks, which aim to obtain relevant moments within videos and highlight scores for each video clip. Recently, several methods have been devoted to building DETR-based networks to solve both MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, achieving good performance. Nevertheless, these approaches underutilize the reciprocal relationship between the two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement step is designed to eliminate query-irrelevant information from visual features for modal interaction. Finally, a task cooperation module is constructed to refine the retrieval pipeline and the highlight score prediction process by utilizing the reciprocity between MR and HD. Comprehensive experiments on the QVHighlights, Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Codes are available at https://github.com/mingyao1120/TR-DETR. 
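As a brief aside on the TOP-ReID entry above: the following is a minimal, hypothetical Python sketch of what cyclic token permutation across image spectra can look like mechanically, i.e., rotating per-spectrum class tokens so each one is paired with another spectrum's patch tokens. The stream names, tensor shapes, and the simple concatenation below are illustrative assumptions, not the paper's implementation.

import torch

B, N, D = 2, 16, 64                      # batch size, patches per spectrum, embedding dim
spectra = ["rgb", "nir", "tir"]          # assumed three image spectra
cls_tokens = {s: torch.randn(B, 1, D) for s in spectra}    # per-spectrum class tokens
patch_tokens = {s: torch.randn(B, N, D) for s in spectra}  # per-spectrum patch tokens

def cyclic_permute(tokens, shift=1):
    # rotate the spectrum-to-class-token assignment by `shift` positions
    order = spectra[shift:] + spectra[:shift]
    return {dst: tokens[src] for dst, src in zip(spectra, order)}

permuted_cls = cyclic_permute(cls_tokens)   # rgb receives nir's class token, nir receives tir's, ...
fused = {s: torch.cat([permuted_cls[s], patch_tokens[s]], dim=1) for s in spectra}
for s in spectra:
    print(s, fused[s].shape)                # each spectrum now carries (B, N + 1, D) tokens

In a full model the fused sequences would be processed by further attention blocks; only the permutation step itself is shown here.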
\ No newline at end of file diff --git a/data/2024/aaai/TREE-G: Decision Trees Contesting Graph Neural Networks b/data/2024/aaai/TREE-G: Decision Trees Contesting Graph Neural Networks new file mode 100644 index 0000000000..e3dac626cc --- /dev/null +++ b/data/2024/aaai/TREE-G: Decision Trees Contesting Graph Neural Networks @@ -0,0 +1,21 @@ +When dealing with tabular data, models based on decision trees are a popular choice due to their high accuracy on these data types, their ease of application, and explainability properties. However, when it comes to graph-structured data, it is not clear how to apply them effectively, in a way that incorporates the topological information with the tabular data available on the vertices of the graph. To address this challenge, we introduce TREE-G. TREE-G modifies standard decision trees by introducing a novel split function that is specialized for graph data. Not only does this split function incorporate the node features and the topological information, but it also uses a novel pointer mechanism that allows split nodes to use information computed in previous splits. Therefore, the split function adapts to the predictive task and the graph at hand. We analyze the theoretical properties of TREE-G and demonstrate its benefits empirically on multiple graph and vertex prediction benchmarks. In these experiments, TREE-G consistently outperforms other tree-based models and often outperforms other graph-learning algorithms such as Graph Neural Networks (GNNs) and Graph Kernels, sometimes by large margins. Moreover, TREE-G's models and their predictions can be explained and visualized. \ No newline at end of file diff --git a/data/2024/aaai/TTTS: Tree Test Time Simulation for Enhancing Decision Tree Robustness against Adversarial Examples b/data/2024/aaai/TTTS: Tree Test Time Simulation for Enhancing Decision Tree Robustness against Adversarial Examples new file mode 100644 index 0000000000..9307d59588 --- /dev/null +++ b/data/2024/aaai/TTTS: Tree Test Time Simulation for Enhancing Decision Tree Robustness against Adversarial Examples @@ -0,0 +1 @@ +Decision trees are widely used for addressing learning tasks involving tabular data. Yet, they are susceptible to adversarial attacks. In this paper, we present Tree Test Time Simulation (TTTS), a novel inference-time methodology that incorporates Monte Carlo simulations into decision trees to enhance their robustness. TTTS introduces a probabilistic modification to the decision path, without altering the underlying tree structure. Our comprehensive empirical analysis of 50 datasets yields promising results. Without the presence of any attacks, TTTS has successfully improved model performance from an AUC of 0.714 to 0.773. Under the challenging conditions of white-box attacks, TTTS demonstrated its robustness by boosting performance from an AUC of 0.337 to 0.680. Even when subjected to black-box attacks, TTTS maintains high accuracy and enhances the model's performance from an AUC of 0.628 to 0.719. Compared to defenses such as Feature Squeezing, TTTS proves to be much more effective. We also found that TTTS exhibits similar robustness in decision forest settings across different attacks. 
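To make the TTTS idea above concrete, here is a minimal sketch (plain Python, under assumed simplifications) of a test-time probabilistic traversal of a fixed decision tree: each split is followed as usual, but the opposite branch is taken with a small flip probability, and many simulated root-to-leaf paths are averaged. The tiny hand-built tree and the constant flip probability are illustrative assumptions; the paper's exact probability schedule may differ.

import random

# internal node: (feature_index, threshold, left_child, right_child); leaf: positive-class probability
TREE = (0, 0.5,
        (1, 0.3, 0.1, 0.7),
        (1, 0.6, 0.4, 0.9))

def stochastic_predict(node, x, flip_p=0.05):
    if not isinstance(node, tuple):          # leaf reached
        return node
    feature, threshold, left, right = node
    go_left = x[feature] <= threshold
    if random.random() < flip_p:             # probabilistic modification of the decision path
        go_left = not go_left
    return stochastic_predict(left if go_left else right, x, flip_p)

def ttts_style_predict(x, n_simulations=200, flip_p=0.05):
    # Monte Carlo average over perturbed paths; the tree structure itself is never changed
    return sum(stochastic_predict(TREE, x, flip_p) for _ in range(n_simulations)) / n_simulations

print(ttts_style_predict([0.48, 0.29]))      # smoothed score for a point near the split boundaries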
\ No newline at end of file diff --git a/data/2024/aaai/Tackling Vision Language Tasks through Learning Inner Monologues b/data/2024/aaai/Tackling Vision Language Tasks through Learning Inner Monologues new file mode 100644 index 0000000000..c613b13bfc --- /dev/null +++ b/data/2024/aaai/Tackling Vision Language Tasks through Learning Inner Monologues @@ -0,0 +1,2 @@ +Visual language tasks such as Visual Question Answering (VQA) or Visual Entailment (VE) require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach offers low training costs and interpretability but is hard to optimize in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. +To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating Inner Monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. More specifically, we enable LLMs and VLMs to interact through natural language conversation (i.e., Inner Monologue) and propose to use a two-stage training process to learn how to perform Inner Monologue (self-asking and answering questions). IMMO is evaluated on two popular tasks and achieves competitive performance with less training data when compared with state-of-the-art models, while preserving interpretability. The results suggest that by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, broadening its potential applications across various AI challenges beyond vision and language tasks. \ No newline at end of file diff --git a/data/2024/aaai/TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training b/data/2024/aaai/TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training new file mode 100644 index 0000000000..0716d31b1e --- /dev/null +++ b/data/2024/aaai/TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training @@ -0,0 +1,2 @@ +Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture global features that distinguish different text descriptions under a contrastive loss, making it highly effective for single-label classification. However, it shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class, and the contrastive nature of the softmax operation aggravates this effect. 
+In this study, we observe that multi-label classification results heavily rely on discriminative local features, which are overlooked by CLIP. As a result, we dissect the preservation of patch-wise spatial information in CLIP and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. This framework is solely based on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. Besides, to comprehensively assess the quality and practicality of generated tags, we extend their application to a downstream task, i.e., weakly supervised semantic segmentation (WSSS), with generated tags as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validates the effectiveness of generated tags. Our code is available at https://github.com/linyq2117/TagCLIP. \ No newline at end of file diff --git a/data/2024/aaai/TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection b/data/2024/aaai/TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection new file mode 100644 index 0000000000..b9e76e8b0c --- /dev/null +++ b/data/2024/aaai/TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection @@ -0,0 +1 @@ +Out-of-distribution (OOD) detection is crucial in many real-world applications. However, intelligent models are often trained solely on in-distribution (ID) data, leading to overconfidence when misclassifying OOD data as ID classes. In this study, we propose a new learning framework which leverages simple Jigsaw-based fake OOD data and rich semantic embeddings (`anchors') from the ChatGPT description of ID knowledge to help guide the training of the image encoder. The learning framework can be flexibly combined with existing post-hoc approaches to OOD detection, and extensive empirical evaluations on multiple OOD detection benchmarks demonstrate that rich textual representations of ID knowledge and fake OOD knowledge can effectively help train a visual encoder for OOD detection. With the learning framework, new state-of-the-art performance is achieved on all the benchmarks. The code is available at https://github.com/Cverchen/TagFog. \ No newline at end of file diff --git a/data/2024/aaai/Tail-STEAK: Improve Friend Recommendation for Tail Users via Self-Training Enhanced Knowledge Distillation b/data/2024/aaai/Tail-STEAK: Improve Friend Recommendation for Tail Users via Self-Training Enhanced Knowledge Distillation new file mode 100644 index 0000000000..287a575896 --- /dev/null +++ b/data/2024/aaai/Tail-STEAK: Improve Friend Recommendation for Tail Users via Self-Training Enhanced Knowledge Distillation @@ -0,0 +1 @@ +Graph neural networks (GNNs) are commonly employed in collaborative friend recommendation systems. Nevertheless, recent studies reveal a notable performance gap, particularly for users with limited connections, commonly known as tail users, in contrast to their counterparts with abundant connections (head users). 
Uniformly treating head and tail users poses two challenges for tail user preference learning: (C1) Label Sparsity, as tail users typically possess limited labels; and (C2) Neighborhood Sparsity, where tail users exhibit sparse observable friendships, leading to distinct preference distributions and performance degradation compared to head users. In response to these challenges, we introduce Tail-STEAK, an innovative framework that combines self-training with enhanced knowledge distillation for tail user representation learning. To address (C1), we present Tail-STEAK-base, a two-stage self-training framework. In the first stage, only head users and their accurate connections are utilized for training, while pseudo links are generated for tail users in the second stage. To tackle (C2), we propose two data augmentation-based self-knowledge distillation pretext tasks. These tasks are seamlessly integrated into different stages of Tail-STEAK-base, culminating in the comprehensive Tail-STEAK framework. Extensive experiments, conducted on state-of-the-art GNN-based friend recommendation models, substantiate the efficacy of Tail-STEAK in significantly improving tail user performance. Our code and data are publicly available at https://github.com/antman9914/Tail-STEAK. \ No newline at end of file diff --git a/data/2024/aaai/Talk Funny! A Large-Scale Humor Response Dataset with Chain-of-Humor Interpretation b/data/2024/aaai/Talk Funny! A Large-Scale Humor Response Dataset with Chain-of-Humor Interpretation new file mode 100644 index 0000000000..65946442cb --- /dev/null +++ b/data/2024/aaai/Talk Funny! A Large-Scale Humor Response Dataset with Chain-of-Humor Interpretation @@ -0,0 +1,6 @@ +Humor is a crucial part of human communication. Understanding humor and generating humorous responses in dialogue can provide natural and empathic human-computer interactions. +However, most existing pre-trained language models (PLMs) perform unsatisfactorily in humor generation. +On the one hand, the serious shortage of humor corpora and datasets poses challenges for constructing models that can understand and generate humorous expressions. On the other hand, humor generation relies on rich knowledge and commonsense, which is often tacit and unspoken. +In this paper, we construct the largest Chinese Explainable Humor Response Dataset to date with chain-of-humor and humor mind map annotations, which can be used to comprehensively evaluate as well as improve the humorous response ability of PLMs. +We also design humor-related auxiliary tasks to further enhance PLMs' humorous response performance. +Extensive evaluations demonstrate that our proposed dataset and auxiliary tasks effectively help PLMs to generate humorous responses, laying the groundwork for future humor research. \ No newline at end of file diff --git a/data/2024/aaai/Taming Binarized Neural Networks and Mixed-Integer Programs b/data/2024/aaai/Taming Binarized Neural Networks and Mixed-Integer Programs new file mode 100644 index 0000000000..16a3a533ff --- /dev/null +++ b/data/2024/aaai/Taming Binarized Neural Networks and Mixed-Integer Programs @@ -0,0 +1,5 @@ +There has been a great deal of recent interest in binarized neural networks, especially because of their explainability. At the same time, automatic differentiation algorithms such as backpropagation fail for binarized neural networks, which limits their applicability. 
+We show that binarized neural networks admit a tame representation by reformulating the problem of training binarized neural networks as a subadditive dual of a mixed-integer program, which we show to have nice properties. This makes it possible to use the framework of Bolte et al. for implicit differentiation, which offers the possibility of a practical implementation of backpropagation in the context of binarized neural networks. + +This approach could also be used for a broader class of mixed-integer programs, beyond the training of binarized neural networks, as encountered in symbolic approaches to AI and beyond. \ No newline at end of file diff --git a/data/2024/aaai/Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification b/data/2024/aaai/Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification new file mode 100644 index 0000000000..2c6ee513b0 --- /dev/null +++ b/data/2024/aaai/Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification @@ -0,0 +1,2 @@ +Sigmoid output layers are widely used in multi-label classification (MLC) tasks, in which multiple labels can be assigned to any input. In many practical MLC tasks, the number of possible labels is in the thousands, often exceeding the number of input features and resulting in a low-rank output layer. In multi-class classification, it is known that such a low-rank output layer is a bottleneck that can result in unargmaxable classes: classes which cannot be predicted for any input. +In this paper, we show that for MLC tasks, the analogous sigmoid bottleneck results in exponentially many unargmaxable label combinations. We explain how to detect these unargmaxable outputs and demonstrate their presence in three widely used MLC datasets. We then show that they can be prevented in practice by introducing a Discrete Fourier Transform (DFT) output layer, which guarantees that all sparse label combinations with up to k active labels are argmaxable. Our DFT layer trains faster and is more parameter efficient, matching the F1@k score of a sigmoid layer while using up to 50% fewer trainable parameters. Our code is publicly available at https://github.com/andreasgrv/sigmoid-bottleneck. \ No newline at end of file diff --git a/data/2024/aaai/Target Focused Shallow Transformer Framework for Efficient Visual Tracking b/data/2024/aaai/Target Focused Shallow Transformer Framework for Efficient Visual Tracking new file mode 100644 index 0000000000..5a59c5934b --- /dev/null +++ b/data/2024/aaai/Target Focused Shallow Transformer Framework for Efficient Visual Tracking @@ -0,0 +1 @@ +Template learning transformer trackers have achieved significant performance improvements recently due to long-dependency learning using the self-attention (SA) mechanism. However, the typical SA mechanisms in transformers adopt a less discriminative design approach which is inadequate for focusing on the most important target information during tracking. Therefore, existing trackers are easily distracted by background information and have constraints in handling tracking challenges. The focus of our research is to develop a target-focused discriminative shallow transformer tracking framework that can learn to distinguish the target from the background and enable accurate tracking at high speed. Extensive experiments will be performed on several popular benchmarks, including OTB100, UAV123, GOT10k, LaSOT, and TrackingNet, to demonstrate the effectiveness of the proposed framework. 
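Returning briefly to the sigmoid-bottleneck entry above: the snippet below is a toy numerical illustration (not the paper's DFT construction) of why a low-rank sigmoid output layer leaves most label combinations unargmaxable. With L labels but a rank-1 layer, each logit crosses zero at most once as the single latent feature varies, so at most L + 1 of the 2^L label combinations can ever be predicted. The dimensions and the brute-force sweep are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
L = 4                                              # number of labels
w, b = rng.normal(size=L), rng.normal(size=L)      # rank-1 output layer: logits = w * z + b

patterns = set()
for z in np.linspace(-50.0, 50.0, 20001):          # sweep the one-dimensional feature space
    logits = w * z + b
    patterns.add(tuple((logits > 0).astype(int)))  # label set predicted by thresholding sigmoid at 0.5

print(f"reachable label combinations: {len(patterns)} out of {2 ** L}")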
\ No newline at end of file diff --git a/data/2024/aaai/Target-Free Domain Adaptation through Cross-Adaptation (Student Abstract) b/data/2024/aaai/Target-Free Domain Adaptation through Cross-Adaptation (Student Abstract) new file mode 100644 index 0000000000..7100607905 --- /dev/null +++ b/data/2024/aaai/Target-Free Domain Adaptation through Cross-Adaptation (Student Abstract) @@ -0,0 +1 @@ +The population characteristics of the datasets related to the same task may vary significantly and merging them may harm performance. In this paper, we propose a novel method of domain adaptation called "cross-adaptation". It allows for implicit adaptation to the target domain without the need for any labeled examples across this domain. We test our approach on 9 datasets for SARS-CoV-2 detection from complete blood count from different hospitals around the world. Results show that our solution is universal with respect to various classification algorithms and allows for up to a 10pp increase in F1 score on average. \ No newline at end of file diff --git a/data/2024/aaai/Targeted Activation Penalties Help CNNs Ignore Spurious Signals b/data/2024/aaai/Targeted Activation Penalties Help CNNs Ignore Spurious Signals new file mode 100644 index 0000000000..61b10243b1 --- /dev/null +++ b/data/2024/aaai/Targeted Activation Penalties Help CNNs Ignore Spurious Signals @@ -0,0 +1 @@ +Neural networks (NNs) can learn to rely on spurious signals in the training data, leading to poor generalisation. Recent methods tackle this problem by training NNs with additional ground-truth annotations of such signals. These methods may, however, let spurious signals re-emerge in deep convolutional NNs (CNNs). We propose Targeted Activation Penalty (TAP), a new method tackling the same problem by penalising activations to control the re-emergence of spurious signals in deep CNNs, while also lowering training times and memory usage. In addition, ground-truth annotations can be expensive to obtain. We show that TAP still works well with annotations generated by pre-trained models as effective substitutes of ground-truth annotations. We demonstrate the power of TAP against two state-of-the-art baselines on the MNIST benchmark and on two clinical image datasets, using four different CNN architectures. \ No newline at end of file diff --git a/data/2024/aaai/Task Contamination: Language Models May Not Be Few-Shot Anymore b/data/2024/aaai/Task Contamination: Language Models May Not Be Few-Shot Anymore new file mode 100644 index 0000000000..7a7189c6d7 --- /dev/null +++ b/data/2024/aaai/Task Contamination: Language Models May Not Be Few-Shot Anymore @@ -0,0 +1 @@ +Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot or few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over datasets released over time, and over LLMs released over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that datasets released prior to the LLM training data creation date perform surprisingly better than datasets released post the LLM training data creation date. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets prior to the LLMs' training data creation date. 
Additionally, we utilize training data inspection, training data extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings. \ No newline at end of file diff --git a/data/2024/aaai/Task Planning for Object Rearrangement in Multi-Room Environments b/data/2024/aaai/Task Planning for Object Rearrangement in Multi-Room Environments new file mode 100644 index 0000000000..a782715d5a --- /dev/null +++ b/data/2024/aaai/Task Planning for Object Rearrangement in Multi-Room Environments @@ -0,0 +1 @@ +Object rearrangement in a multi-room setup should produce a reasonable plan that reduces the agent's overall travel and the number of steps. Recent state-of-the-art methods fail to produce such plans because they rely on explicit exploration for discovering unseen objects due to partial observability and a heuristic planner to sequence the actions for rearrangement. This paper proposes a novel task planner to efficiently plan a sequence of actions to discover unseen objects and rearrange misplaced objects within an untidy house to achieve a desired tidy state. The proposed method introduces several innovative techniques, including (i) a method for discovering unseen objects using commonsense knowledge from large language models, (ii) a collision resolution and buffer prediction method based on Cross-Entropy Method to handle blocked goal and swap cases, (iii) a directed spatial graph-based state space for scalability, and (iv) deep reinforcement learning (RL) for producing an efficient plan to simultaneously discover unseen objects and rearrange the visible misplaced ones to minimize the overall traversal. The paper also presents new metrics and a benchmark dataset called MoPOR to evaluate the effectiveness of the rearrangement planning in a multi-room setting. The experimental results demonstrate that the proposed method effectively addresses the multi-room rearrangement problem. \ No newline at end of file diff --git a/data/2024/aaai/Task-Adaptive Prompted Transformer for Cross-Domain Few-Shot Learning b/data/2024/aaai/Task-Adaptive Prompted Transformer for Cross-Domain Few-Shot Learning new file mode 100644 index 0000000000..3152b72448 --- /dev/null +++ b/data/2024/aaai/Task-Adaptive Prompted Transformer for Cross-Domain Few-Shot Learning @@ -0,0 +1 @@ +Cross-Domain Few-Shot Learning (CD-FSL) aims at recognizing samples in novel classes from unseen domains that are vastly different from training classes, with few labeled samples. However, the large domain gap between training and novel classes makes previous FSL methods perform poorly. To address this issue, we propose MetaPrompt, a Task-adaptive Prompted Transformer model for CD-FSL, by jointly exploiting prompt learning and the parameter generation framework. The proposed MetaPrompt enjoys several merits. First, a task-conditioned prompt generator is established upon attention mechanisms. It can flexibly produce a task-adaptive prompt with arbitrary length for unseen tasks, by selectively gathering task characteristics from the contextualized support embeddings. Second, the task-adaptive prompt is attached to Vision Transformer to facilitate fast task adaptation, steering the task-agnostic representation to incorporate task knowledge. 
To the best of our knowledge, this is the first work to exploit a prompt-based parameter generation mechanism for CD-FSL. Extensive experimental results on the Meta-Dataset benchmark demonstrate that our method achieves superior results compared to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning against Attribute Inference Attacks b/data/2024/aaai/Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning against Attribute Inference Attacks new file mode 100644 index 0000000000..596aee0465 --- /dev/null +++ b/data/2024/aaai/Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning against Attribute Inference Attacks @@ -0,0 +1,5 @@ +Federated learning (FL) has been widely studied recently due to its ability to collaboratively train models on data from different devices without sharing the raw data. Nevertheless, recent studies show that it may still be possible for an adversary to infer private information about devices' data, e.g., sensitive attributes such as income, race, and sexual orientation. To mitigate attribute inference attacks, various existing privacy-preserving FL methods can be adopted/adapted. However, all these existing methods have key limitations: they need to know the FL task in advance, or have intolerable computational overheads or utility losses, or do not have provable privacy guarantees. + +We address these issues and design a task-agnostic privacy-preserving representation learning method for FL (TAPPFL) against attribute inference attacks. TAPPFL is formulated via information theory. Specifically, TAPPFL has two mutual information goals, where one goal learns task-agnostic data representations that contain the least information about the private attribute in each device's data, and the other goal ensures that the learnt data representations include as much information as possible about the device data to maintain FL utility. We also derive privacy guarantees of TAPPFL against worst-case attribute inference attacks, as well as the inherent tradeoff between utility preservation and privacy protection. Extensive results on multiple datasets and applications validate the effectiveness of TAPPFL in protecting data privacy and maintaining FL utility while remaining efficient. +Experimental results also show that TAPPFL outperforms the existing defenses. \ No newline at end of file diff --git a/data/2024/aaai/Task-Disruptive Background Suppression for Few-Shot Segmentation b/data/2024/aaai/Task-Disruptive Background Suppression for Few-Shot Segmentation new file mode 100644 index 0000000000..36a2e52f65 --- /dev/null +++ b/data/2024/aaai/Task-Disruptive Background Suppression for Few-Shot Segmentation @@ -0,0 +1 @@ +Few-shot segmentation aims to accurately segment novel target objects within query images using only a limited number of annotated support images. Recent works exploit the support background as well as its foreground to precisely compute the dense correlations between query and support. However, they overlook the characteristics of the background, which generally contains various types of objects. In this paper, we highlight this characteristic of the background, which can lead to the following problematic cases: (1) when the query and support backgrounds are dissimilar and (2) when objects in the support background are similar to the target object in the query. 
Without any consideration of the above cases, adopting the entire support background leads to a misprediction of the query foreground as background. To address this issue, we propose Task-disruptive Background Suppression (TBS), a module to suppress those disruptive support background features based on two spatial-wise scores: query-relevant and target-relevant scores. The former aims to mitigate the impact of unshared features solely existing in the support background, while the latter aims to reduce the influence of target-similar support background features. Based on these two scores, we define a query background relevant score that captures the similarity between the backgrounds of the query and the support, and utilize it to scale support background features to adaptively restrict the impact of disruptive support backgrounds. Our proposed method achieves state-of-the-art performance on standard few-shot segmentation benchmarks. Our official code is available at github.com/SuhoPark0706/TBSNet. \ No newline at end of file diff --git a/data/2024/aaai/Task-Driven Causal Feature Distillation: Towards Trustworthy Risk Prediction b/data/2024/aaai/Task-Driven Causal Feature Distillation: Towards Trustworthy Risk Prediction new file mode 100644 index 0000000000..f0bdf473b2 --- /dev/null +++ b/data/2024/aaai/Task-Driven Causal Feature Distillation: Towards Trustworthy Risk Prediction @@ -0,0 +1 @@ +The tremendous recent successes of artificial intelligence in many areas have sparked great interest in its potential for trustworthy and interpretable risk prediction. However, most models lack causal reasoning and struggle with class imbalance, leading to poor precision and recall. To address this, we propose a Task-Driven Causal Feature Distillation model (TDCFD) to transform original feature values into causal feature attributions for the specific risk prediction task. The causal feature attribution describes how much the value of each feature contributes to the risk prediction result. After the causal feature distillation, a deep neural network is applied to produce trustworthy prediction results with causal interpretability and high precision/recall. We evaluate the performance of our TDCFD method on several synthetic and real datasets, and the results demonstrate its superiority over the state-of-the-art methods regarding precision, recall, interpretability, and causality. \ No newline at end of file diff --git a/data/2024/aaai/Task-Free Continual Generation and Representation Learning via Dynamic Expansionable Memory Cluster b/data/2024/aaai/Task-Free Continual Generation and Representation Learning via Dynamic Expansionable Memory Cluster new file mode 100644 index 0000000000..7a15ac70a6 --- /dev/null +++ b/data/2024/aaai/Task-Free Continual Generation and Representation Learning via Dynamic Expansionable Memory Cluster @@ -0,0 +1 @@ +Human brains can continually acquire and learn new skills and knowledge over time from a dynamically changing environment without forgetting previously learnt information. Such a capacity can selectively transfer some important and recently seen information to the persistent knowledge regions of the brain. Inspired by this intuition, we propose a new memory-based approach for image reconstruction and generation in continual learning, consisting of a temporary and an evolving memory, with two different storage strategies corresponding to temporary and permanent memorisation. 
The temporary memory aims to preserve up-to-date information, while the evolving memory can dynamically increase its capacity in order to preserve permanent knowledge. This is achieved by the proposed memory expansion mechanism that selectively transfers data samples deemed important from the temporary memory to new clusters defined within the evolved memory according to an information novelty criterion. Such a mechanism promotes knowledge diversity among clusters in the evolved memory, allowing more diverse information to be captured within a compact memory capacity. Furthermore, we propose a two-step optimization strategy for training a Variational Autoencoder (VAE) to implement generation and representation learning tasks, which updates the generator and inference models separately using two optimisation paths. This approach leads to a better trade-off between generation and reconstruction performance. We show empirically and theoretically that the proposed approach can learn meaningful latent representations while generating diverse images from different domains. The source code and supplementary material (SM) are available at https://github.com/dtuzi123/DEMC. \ No newline at end of file diff --git a/data/2024/aaai/Task-Free Dynamic Sparse Vision Transformer for Continual Learning b/data/2024/aaai/Task-Free Dynamic Sparse Vision Transformer for Continual Learning new file mode 100644 index 0000000000..316d1ae3f8 --- /dev/null +++ b/data/2024/aaai/Task-Free Dynamic Sparse Vision Transformer for Continual Learning @@ -0,0 +1 @@ +Vision Transformers (ViTs) are self-attention-based network backbones shown to be efficient in many individual tasks, but they have not been explored in Task-Free Continual Learning (TFCL) so far. Most existing ViT-based approaches for Continual Learning (CL) rely on task information. In this study, we explore the advantages of the ViT in a more challenging CL scenario where the task boundaries are unavailable during training. To address this learning paradigm, we propose the Task-Free Dynamic Sparse Vision Transformer (TFDSViT), which can dynamically build new sparse experts, where each expert leverages sparsity to allocate the model's capacity for capturing different information categories over time. To avoid forgetting and ensure efficiency in reusing previously learned knowledge in subsequent learning, we propose a new dynamic dual attention mechanism consisting of the Sparse Attention (SA') and Knowledge Transfer Attention (KTA) modules. The SA' refrains from updating some previously learned attention blocks to preserve prior knowledge. The KTA uses and regulates the information flow of all previously learned experts for learning new patterns. The proposed dual attention mechanism can simultaneously relieve forgetting and promote knowledge transfer for a dynamic expansion model in a task-free manner. We also propose an energy-based dynamic expansion mechanism that uses energy as a measure of novelty for incoming samples, which provides appropriate expansion signals leading to a compact network architecture for TFDSViT. Extensive empirical studies demonstrate the effectiveness of TFDSViT. The code and supplementary material (SM) are available at https://github.com/dtuzi123/TFDSViT. 
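As an illustrative aside on the dynamic expansionable memory cluster entry above: the Python sketch below shows one simple reading of novelty-driven memory expansion, where a new cluster is created only when an incoming sample is sufficiently far from every existing cluster centre. The distance-based novelty score and the fixed threshold are assumptions for illustration, not the paper's criterion.

import numpy as np

class EvolvingMemory:
    def __init__(self, dim, novelty_threshold=4.0):
        self.centers = np.empty((0, dim))          # one centre per cluster in the evolved memory
        self.threshold = novelty_threshold

    def maybe_expand(self, sample):
        # add a new cluster only when the sample is novel w.r.t. all existing clusters
        if len(self.centers) == 0:
            self.centers = sample[None, :]
            return True
        novelty = np.min(np.linalg.norm(self.centers - sample, axis=1))
        if novelty > self.threshold:
            self.centers = np.vstack([self.centers, sample])
            return True
        return False

memory = EvolvingMemory(dim=8)
for x in np.random.default_rng(1).normal(scale=2.0, size=(200, 8)):
    memory.maybe_expand(x)
print("clusters kept in the evolved memory:", len(memory.centers))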
\ No newline at end of file diff --git a/data/2024/aaai/Taxonomy Driven Fast Adversarial Training b/data/2024/aaai/Taxonomy Driven Fast Adversarial Training new file mode 100644 index 0000000000..209cd373bb --- /dev/null +++ b/data/2024/aaai/Taxonomy Driven Fast Adversarial Training @@ -0,0 +1 @@ +Adversarial training (AT) is an effective defense method against gradient-based attacks that enhances the robustness of neural networks. Among AT methods, single-step AT has emerged as a hot topic due to its simplicity and efficiency, requiring only one gradient propagation to generate adversarial examples. Nonetheless, the problem of catastrophic overfitting (CO) that causes training collapse remains poorly understood, and there exists a gap between the robust accuracy achieved through single- and multi-step AT. In this paper, we present a surprising finding that the taxonomy of adversarial examples reveals the truth of CO. Based on this conclusion, we propose taxonomy driven fast adversarial training (TDAT), which jointly optimizes the learning objective, loss function, and initialization method, and can thereby be regarded as a new paradigm of single-step AT. Compared with other fast AT methods, TDAT can boost the robustness of neural networks, alleviate the influence of misclassified examples, and prevent CO during the training process while requiring almost no additional computational and memory resources. Our method achieves robust accuracy improvements of 1.59%, 1.62%, 0.71%, and 1.26% on the CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet-100 datasets against the projected gradient descent (PGD10) attack with a perturbation budget of 8/255. Furthermore, our proposed method also achieves state-of-the-art robust accuracy against other attacks. Code is available at https://github.com/bookman233/TDAT. \ No newline at end of file diff --git a/data/2024/aaai/Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge Distillation b/data/2024/aaai/Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge Distillation new file mode 100644 index 0000000000..43c3c0ee21 --- /dev/null +++ b/data/2024/aaai/Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge Distillation @@ -0,0 +1 @@ +Data-free knowledge distillation (DFKD) aims to distill pretrained knowledge to a student model with the help of a generator without using the original data. In such data-free scenarios, achieving stable performance of DFKD is essential due to the unavailability of validation data. Unfortunately, we find that existing DFKD methods are quite sensitive to different teacher models, occasionally showing catastrophic failures of distillation, even when using well-trained teacher models. Our observation is that the generator in DFKD is not always guaranteed to produce precise yet diverse samples using the existing representative strategy of minimizing both class-prior and adversarial losses. Through our empirical study, we focus on the fact that class-prior not only decreases the diversity of generated samples, but also cannot completely address the problem of generating unexpectedly low-quality samples depending on teacher models. In this paper, we propose the teacher-agnostic data-free knowledge distillation (TA-DFKD) method, with the goal of more robust and stable performance regardless of teacher models. Our basic idea is to assign the teacher model a lenient expert role for evaluating samples, rather than a strict supervisor that enforces its class-prior on the generator. 
Specifically, we design a sample selection approach that takes only clean samples verified by the teacher model without imposing restrictions on the power of generating diverse samples. Through extensive experiments, we show that our method successfully achieves both robustness and training stability across various teacher models, while outperforming the existing DFKD methods. \ No newline at end of file diff --git a/data/2024/aaai/Teaching Large Language Models to Translate with Comparison b/data/2024/aaai/Teaching Large Language Models to Translate with Comparison new file mode 100644 index 0000000000..e479917636 --- /dev/null +++ b/data/2024/aaai/Teaching Large Language Models to Translate with Comparison @@ -0,0 +1,11 @@ +Open-sourced large language models (LLMs) have demonstrated remarkable efficacy in various tasks with instruction tuning. +However, these models can sometimes struggle with tasks that require more specialized knowledge such as translation. +One possible reason for such deficiency is that instruction tuning aims to generate fluent and coherent text that continues from a given instruction without being constrained by any task-specific requirements. +Moreover, it can be more challenging to tune smaller LLMs with lower-quality training data. +To address this issue, we propose a novel framework that uses examples in comparison to teach LLMs translation. +Our approach involves output comparison and preference comparison, presenting the model with carefully designed examples of correct and incorrect translations and an additional preference loss for better regularization. +Empirical evaluation on four language directions of the WMT2022 and FLORES-200 benchmarks shows the superiority of our proposed method over existing methods. +Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations. +Please refer to GitHub for more details: +https://github.com/lemon0830/TIM. \ No newline at end of file diff --git a/data/2024/aaai/TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling b/data/2024/aaai/TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling new file mode 100644 index 0000000000..4f86bc1424 --- /dev/null +++ b/data/2024/aaai/TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling @@ -0,0 +1 @@ +To address the limitations of traffic prediction from location-bound detectors, we present Geographical Cellular Traffic (GCT) flow, a novel data source that leverages the extensive coverage of cellular traffic to capture mobility patterns. Our extensive analysis validates its potential for transportation. Focusing on vehicle-related GCT flow prediction, we propose a graph neural network that integrates multivariate, temporal, and spatial facets for improved accuracy. Experiments reveal our model's superiority over baselines, especially in long-term predictions. We also highlight the potential for GCT flow integration into transportation systems. 
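One more illustrative aside, on the preference comparison mentioned in the translation-tuning entry above: a common way to realize such a preference signal is a pairwise logistic loss over the scores of a preferred versus a dispreferred translation. The PyTorch sketch below shows that generic formulation under assumed inputs (e.g., length-normalized sequence log-probabilities); the paper's exact objective may differ.

import torch
import torch.nn.functional as F

def preference_loss(score_preferred, score_dispreferred, margin=0.0):
    # -log sigmoid(s_preferred - s_dispreferred - margin), averaged over the batch
    return -F.logsigmoid(score_preferred - score_dispreferred - margin).mean()

# assumed scores for three sentence pairs, e.g. length-normalized log-probabilities from the model
s_good = torch.tensor([-1.2, -0.8, -2.0])
s_bad = torch.tensor([-1.9, -1.5, -1.7])
print(preference_loss(s_good, s_bad))       # loss shrinks as preferred translations score higher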
\ No newline at end of file diff --git a/data/2024/aaai/Tell Me What Is Good about This Property: Leveraging Reviews for Segment-Personalized Image Collection Summarization b/data/2024/aaai/Tell Me What Is Good about This Property: Leveraging Reviews for Segment-Personalized Image Collection Summarization new file mode 100644 index 0000000000..85a8d3b2c7 --- /dev/null +++ b/data/2024/aaai/Tell Me What Is Good about This Property: Leveraging Reviews for Segment-Personalized Image Collection Summarization @@ -0,0 +1 @@ +Image collection summarization techniques aim to present a compact representation of an image gallery through a carefully selected subset of images that captures its semantic content. When it comes to web content, however, the ideal selection can vary based on the user's specific intentions and preferences. This is particularly relevant at Booking.com, where presenting properties and their visual summaries that align with users' expectations is crucial. To address this challenge, in this work, we consider user intentions in the summarization of property visuals by analyzing property reviews and extracting the most significant aspects mentioned by users. By incorporating insights from reviews into our visual summaries, we present the content most relevant to a user. Moreover, we achieve this without the need for costly annotations. Our experiments, including human perceptual studies, demonstrate the superiority of our cross-modal approach, which we coin CrossSummarizer, over the no-personalization and image-based clustering baselines. \ No newline at end of file diff --git a/data/2024/aaai/Temporal Adaptive RGBT Tracking with Modality Prompt b/data/2024/aaai/Temporal Adaptive RGBT Tracking with Modality Prompt new file mode 100644 index 0000000000..366635f6bb --- /dev/null +++ b/data/2024/aaai/Temporal Adaptive RGBT Tracking with Modality Prompt @@ -0,0 +1 @@ +RGBT tracking has been widely used in various fields such as robotics, surveillance processing, and autonomous driving. Existing RGBT trackers fully explore the spatial information between the template and the search region and locate the target based on the appearance matching results. However, these RGBT trackers have very limited exploitation of temporal information, either ignoring temporal information or exploiting it through online sampling and training. The former struggles to cope with object state changes, while the latter neglects the correlation between spatial and temporal information. To alleviate these limitations, we propose a novel Temporal Adaptive RGBT Tracking framework, named TATrack. TATrack has a spatio-temporal two-stream structure and captures temporal information through an online updated template, where the two-stream structure refers to the multi-modal feature extraction and cross-modal interaction for the initial template and the online updated template, respectively. TATrack comprehensively exploits spatio-temporal and multi-modal information for target localization. In addition, we design a spatio-temporal interaction (STI) mechanism that bridges the two branches and enables cross-modal interaction to span longer time scales. Extensive experiments on three popular RGBT tracking benchmarks show that our method achieves state-of-the-art performance, while running at real-time speed. 
\ No newline at end of file diff --git a/data/2024/aaai/Temporal Correlation Vision Transformer for Video Person Re-Identification b/data/2024/aaai/Temporal Correlation Vision Transformer for Video Person Re-Identification new file mode 100644 index 0000000000..73b1b3788b --- /dev/null +++ b/data/2024/aaai/Temporal Correlation Vision Transformer for Video Person Re-Identification @@ -0,0 +1 @@ +Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID. \ No newline at end of file diff --git a/data/2024/aaai/Temporal Dependencies and Spatio-Temporal Patterns of Time Series Models b/data/2024/aaai/Temporal Dependencies and Spatio-Temporal Patterns of Time Series Models new file mode 100644 index 0000000000..4001d44777 --- /dev/null +++ b/data/2024/aaai/Temporal Dependencies and Spatio-Temporal Patterns of Time Series Models @@ -0,0 +1 @@ +The widespread use of Artificial Intelligence (AI) has highlighted the importance of understanding AI model behavior. This understanding is crucial for practical decision-making, assessing model reliability, and ensuring trustworthiness. Interpreting time series forecasting models faces unique challenges compared to image and text data. These challenges arise from the temporal dependencies between time steps and the evolving importance of input features over time. My thesis focuses on addressing these challenges by aiming for more precise explanations of feature interactions, uncovering spatiotemporal patterns, and demonstrating the practical applicability of these interpretability techniques using real-world datasets and state-of-the-art deep learning models. \ No newline at end of file diff --git a/data/2024/aaai/Temporal Graph Contrastive Learning for Sequential Recommendation b/data/2024/aaai/Temporal Graph Contrastive Learning for Sequential Recommendation new file mode 100644 index 0000000000..24a9dbf289 --- /dev/null +++ b/data/2024/aaai/Temporal Graph Contrastive Learning for Sequential Recommendation @@ -0,0 +1 @@ +Sequential recommendation is a crucial task in understanding users' evolving interests and predicting their future behaviors. While existing approaches on sequence or graph modeling to learn interaction sequences of users have shown promising performance, how to effectively exploit temporal information and deal with the uncertainty noise in evolving user behaviors is still quite challenging. 
To this end, in this paper, we propose a Temporal Graph Contrastive Learning method for Sequential Recommendation (TGCL4SR) which leverages not only local interaction sequences but also global temporal graphs to comprehend item correlations and analyze user behaviors from a temporal perspective. Specifically, we first devise a Temporal Item Transition Graph (TITG) to fully leverage global interactions to understand item correlations, and augment this graph by dual transformations based on neighbor sampling and time disturbance. Accordingly, we design a Temporal item Transition graph Convolutional network (TiTConv) to capture temporal item transition patterns in TITG. Then, a novel Temporal Graph Contrastive Learning (TGCL) mechanism is designed to enhance the uniformity of representations between augmented graphs from identical sequences. For local interaction sequences, we design a temporal sequence encoder to incorporate time interval embeddings into the architecture of Transformer. At the training stage, we take maximum mean discrepancy and TGCL losses as auxiliary objectives. Extensive experiments on several real-world datasets show the effectiveness of TGCL4SR against state-of-the-art baselines of sequential recommendation. \ No newline at end of file diff --git a/data/2024/aaai/Temporal Logic Explanations for Dynamic Decision Systems Using Anchors and Monte Carlo Tree Search (Abstract Reprint) b/data/2024/aaai/Temporal Logic Explanations for Dynamic Decision Systems Using Anchors and Monte Carlo Tree Search (Abstract Reprint) new file mode 100644 index 0000000000..e4f1999557 --- /dev/null +++ b/data/2024/aaai/Temporal Logic Explanations for Dynamic Decision Systems Using Anchors and Monte Carlo Tree Search (Abstract Reprint) @@ -0,0 +1 @@ +For many automated perception and decision tasks, state-of-the-art performance may be obtained by algorithms that are too complex for their behavior to be completely understandable or predictable by human users, e.g., because they employ large machine learning models. To integrate these algorithms into safety-critical decision and control systems, it is particularly important to develop methods that can promote trust into their decisions and help explore their failure modes. In this article, we combine the anchors methodology with Monte Carlo Tree Search to provide local model-agnostic explanations for the behaviors of a given black-box model making decisions by processing time-varying input signals. Our approach searches for descriptive explanations for these decisions in the form of properties of the input signals, expressed in Signal Temporal Logic, which are highly likely to reproduce the observed behavior. To illustrate the methodology, we apply it in simulations to the analysis of a hybrid (continuous-discrete) control system and a collision avoidance system for unmanned aircraft (ACAS Xu) implemented by a neural network. \ No newline at end of file diff --git a/data/2024/aaai/Temporal-Distributed Backdoor Attack against Video Based Action Recognition b/data/2024/aaai/Temporal-Distributed Backdoor Attack against Video Based Action Recognition new file mode 100644 index 0000000000..dfe550647e --- /dev/null +++ b/data/2024/aaai/Temporal-Distributed Backdoor Attack against Video Based Action Recognition @@ -0,0 +1 @@ +Deep neural networks (DNNs) have achieved tremendous success in various applications including video action recognition, yet remain vulnerable to backdoor attacks (Trojans). 
The backdoor-compromised model will misclassify a test instance (from a non-target class) to the target class chosen by the attacker when the instance is embedded with a specific trigger, while maintaining high accuracy on attack-free instances. Although there are extensive studies on backdoor attacks against image data, the susceptibility of video-based systems to backdoor attacks remains largely unexplored. Current studies are direct extensions of approaches proposed for image data, e.g., the triggers are independently embedded within the frames, which tend to be detectable by existing defenses. In this paper, we introduce a simple yet effective backdoor attack against video data. Our proposed attack, adding perturbations in a transformed domain, plants an imperceptible, temporally distributed trigger across the video frames, and is shown to be resilient to existing defensive strategies. The effectiveness of the proposed attack is demonstrated by extensive experiments with various well-known models on two video recognition benchmarks, UCF101 and HMDB51, and a sign language recognition benchmark, the Greek Sign Language (GSL) dataset. We delve into the impact of several influential factors on our proposed attack and identify an intriguing effect termed "collateral damage" through extensive studies. \ No newline at end of file diff --git a/data/2024/aaai/Temporally and Distributionally Robust Optimization for Cold-Start Recommendation b/data/2024/aaai/Temporally and Distributionally Robust Optimization for Cold-Start Recommendation new file mode 100644 index 0000000000..79523c4b66 --- /dev/null +++ b/data/2024/aaai/Temporally and Distributionally Robust Optimization for Cold-Start Recommendation @@ -0,0 +1,2 @@ +Collaborative Filtering (CF) recommender models highly depend on user-item interactions to learn CF representations, thus falling short of recommending cold-start items. To address this issue, prior studies mainly introduce item features (e.g., thumbnails) for cold-start item recommendation. They learn a feature extractor on warm-start items to align feature representations with interactions, and then leverage the feature extractor to extract the feature representations of cold-start items for interaction prediction. Unfortunately, the features of cold-start items, especially the popular ones, tend to diverge from those of warm-start ones due to temporal feature shifts, preventing the feature extractor from accurately learning feature representations of cold-start items. +To alleviate the impact of temporal feature shifts, we consider using Distributionally Robust Optimization (DRO) to enhance the generalization ability of the feature extractor. Nonetheless, existing DRO methods face an inconsistency issue: the worst-case warm-start items emphasized during DRO training might not align well with the cold-start item distribution. To capture the temporal feature shifts and combat this inconsistency issue, we propose a novel temporal DRO with new optimization objectives, namely, 1) to integrate a worst-case factor to improve the worst-case performance, and 2) to devise a shifting factor to capture the shifting trend of item features and enhance the optimization of the potentially popular groups in cold-start items. Substantial experiments on three real-world datasets validate the superiority of our temporal DRO in enhancing the generalization ability of cold-start recommender models. 
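To make the DRO idea referenced above concrete, the following is a minimal sketch of a generic worst-case-weighted (group-DRO-style) objective that up-weights poorly performing item groups. It is only an illustration under assumed placeholder names (per_item_loss, group_ids, temperature), not the paper's temporal DRO with its shifting factor.

```python
import torch

def worst_case_weighted_loss(per_item_loss: torch.Tensor,
                             group_ids: torch.Tensor,
                             num_groups: int,
                             temperature: float = 1.0) -> torch.Tensor:
    """Average the loss inside each item group, then re-weight groups by a softmax
    over their losses so that harder (worse-performing) groups dominate the objective."""
    group_losses = []
    for g in range(num_groups):
        mask = group_ids == g
        if mask.any():
            group_losses.append(per_item_loss[mask].mean())
    group_losses = torch.stack(group_losses)
    # A softmax weighting smoothly approximates the worst-case (max) group loss.
    weights = torch.softmax(group_losses.detach() / temperature, dim=0)
    return (weights * group_losses).sum()

# Usage with hypothetical per-interaction losses and item-group assignments.
loss = worst_case_weighted_loss(
    torch.rand(8), torch.tensor([0, 0, 1, 1, 1, 2, 2, 2]), num_groups=3)
```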
\ No newline at end of file diff --git a/data/2024/aaai/Tensorized Label Learning on Anchor Graph b/data/2024/aaai/Tensorized Label Learning on Anchor Graph new file mode 100644 index 0000000000..3badcb521e --- /dev/null +++ b/data/2024/aaai/Tensorized Label Learning on Anchor Graph @@ -0,0 +1 @@ +Graph-based multimedia data clustering has attracted much attention due to the impressive clustering performance for arbitrarily shaped multimedia data. However, existing graph-based clustering methods need post-processing to get labels for multimedia data with high computational complexity. Moreover, it is sub-optimal for label learning due to the fact that they exploit the complementary information embedded in data with different types pixel by pixel. To handle these problems, we present a novel label learning model with good interpretability for clustering. To be specific, our model decomposes anchor graph into the products of two matrices with orthogonal non-negative constraint to directly get soft label without any post-processing, which remarkably reduces the computational complexity. To well exploit the complementary information embedded in multimedia data, we introduce tensor Schatten p-norm regularization on the label tensor which is composed of soft labels of multimedia data. The solution can be obtained by iteratively optimizing four decoupled sub-problems, which can be solved more efficiently with good convergence. Experimental results on various datasets demonstrate the efficiency of our model. \ No newline at end of file diff --git a/data/2024/aaai/Test-Time Adaptation via Style and Structure Guidance for Histological Image Registration b/data/2024/aaai/Test-Time Adaptation via Style and Structure Guidance for Histological Image Registration new file mode 100644 index 0000000000..870dd3a0ff --- /dev/null +++ b/data/2024/aaai/Test-Time Adaptation via Style and Structure Guidance for Histological Image Registration @@ -0,0 +1,10 @@ +Image registration plays a crucial role in histological image analysis, encompassing tasks like multi-modality fusion and disease grading. +Traditional registration methods optimize objective functions for each image pair, yielding reliable accuracy but demanding heavy inference burdens. +Recently, learning-based registration methods utilize networks to learn the optimization process during training and apply a one-step forward process during testing. +While these methods offer promising registration performance with reduced inference time, they remain sensitive to appearance variances and local structure changes commonly encountered in histological image registration scenarios. +In this paper, for the first time, we propose a novel test-time adaptation method for histological image registration, aiming to improve the generalization ability of learning-based methods. +Specifically, we design two operations, style guidance and shape guidance, for the test-time adaptation process. +The former leverages style representations encoded by feature statistics to address the issue of appearance variances, while the latter incorporates shape representations encoded by HOG features to improve registration accuracy in regions with structural changes. +Furthermore, we consider the continuity of the model during the test-time adaptation process. +Different from the previous methods initialized by a given trained model, we introduce a smoothing strategy to leverage historical models for better generalization. 
+We conduct experiments with several representative learning-based backbones on a public histological dataset, demonstrating the superior registration performance of our test-time adaptation method. \ No newline at end of file diff --git a/data/2024/aaai/Test-Time Personalization with Meta Prompt for Gaze Estimation b/data/2024/aaai/Test-Time Personalization with Meta Prompt for Gaze Estimation new file mode 100644 index 0000000000..9158adf2ac --- /dev/null +++ b/data/2024/aaai/Test-Time Personalization with Meta Prompt for Gaze Estimation @@ -0,0 +1 @@ +Despite the recent remarkable achievements in gaze estimation, efficient and accurate personalization of gaze estimation without labels is a practical problem but is rarely touched on in the literature. To achieve efficient personalization, we take inspiration from recent advances in Natural Language Processing (NLP) by updating a negligible number of parameters, "prompts", at test time. Specifically, the prompt is additionally attached without perturbing the original network and can contain less than 1% of a ResNet-18's parameters. Our experiments show the high efficiency of the prompt tuning approach: it can be 10 times faster in terms of adaptation speed than the compared methods. However, it is non-trivial to update the prompt for personalized gaze estimation without labels. At test time, it is essential to ensure that minimizing a particular unsupervised loss leads to the goal of minimizing the gaze estimation error. To address this difficulty, we propose to meta-learn the prompt to ensure that its updates align with this goal. Our experiments show that the meta-learned prompt can be effectively adapted even with a simple symmetry loss. In addition, we experiment on four cross-dataset validations to show the remarkable advantages of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Testing Self-Reducible Samplers b/data/2024/aaai/Testing Self-Reducible Samplers new file mode 100644 index 0000000000..e50499f1af --- /dev/null +++ b/data/2024/aaai/Testing Self-Reducible Samplers @@ -0,0 +1,6 @@ +Samplers are the backbone of the implementations of any randomized algorithm. Unfortunately, an efficient algorithm to test the correctness of samplers is very hard to find. Recently, a series of works obtained testers such as Barbarik, Teq, and Flash for particular kinds of samplers, like CNF-samplers and Horn-samplers. However, their techniques have a significant limitation, because one cannot expect to use their methods to test other samplers, such as perfect matching samplers or samplers for sampling linear extensions in posets. +In this paper, we present a new testing algorithm that works for such samplers and can estimate the distance of a new sampler from a known sampler (say, the uniform sampler). + +Testing the identity of distributions is at the heart of testing the correctness of samplers. This paper's main technical contribution is developing a new distance estimation algorithm for distributions over high-dimensional cubes using the recently proposed subcube conditioning sampling model. Given subcube conditioning access to an unknown distribution P, and a known distribution Q defined over an n-dimensional Boolean hypercube, our algorithm CubeProbeEst estimates the variation distance between P and Q within additive error using subcube conditional samples from P. 
Following the testing-via-learning paradigm, we also get a tester that distinguishes between the cases when P and Q are close or far in variation distance with high probability using subcube conditional samples. + +This estimation algorithm CubeProbeEst in the subcube conditioning sampling model helps us to design the first tester for self-reducible samplers. The correctness of the tester is formally proved. Moreover, we implement CubeProbeEst to test the quality of three samplers for sampling linear extensions in posets. \ No newline at end of file diff --git a/data/2024/aaai/TexFit: Text-Driven Fashion Image Editing with Diffusion Models b/data/2024/aaai/TexFit: Text-Driven Fashion Image Editing with Diffusion Models new file mode 100644 index 0000000000..789ef25232 --- /dev/null +++ b/data/2024/aaai/TexFit: Text-Driven Fashion Image Editing with Diffusion Models @@ -0,0 +1 @@ +Fashion image editing aims to edit an input image to obtain richer or distinct visual clothing matching effects. Existing global fashion image editing methods are difficult to achieve rich outfit combination effects while local fashion image editing is more in line with the needs of diverse and personalized outfit matching. The local editing techniques typically depend on text and auxiliary modalities (e.g., human poses, human keypoints, garment sketches, etc.) for image manipulation, where the auxiliary modalities essentially assist in locating the editing region. Since these auxiliary modalities usually involve additional efforts in practical application scenarios, text-driven fashion image editing shows high flexibility. In this paper, we propose TexFit, a Text-driven Fashion image Editing method using diffusion models, which performs the local image editing only with the easily accessible text. Our approach employs a text-based editing region location module to predict precise editing region in the fashion image. Then, we take the predicted region as the generation condition of diffusion models together with the text prompt to achieve precise local editing of fashion images while keeping the rest part intact. In addition, previous fashion datasets usually focus on global description, lacking local descriptive information that can guide the precise local editing. Therefore, we develop a new DFMM-Spotlight dataset by using region extraction and attribute combination strategies. It focuses locally on clothes and accessories, enabling local editing with text input. Experimental results on the DFMM-Spotlight dataset demonstrate the effectiveness of our model. Code and Datasets are available at https://texfit.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/Text Diffusion with Reinforced Conditioning b/data/2024/aaai/Text Diffusion with Reinforced Conditioning new file mode 100644 index 0000000000..bb409c4370 --- /dev/null +++ b/data/2024/aaai/Text Diffusion with Reinforced Conditioning @@ -0,0 +1 @@ +Diffusion models have demonstrated exceptional capability in generating high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they provide a strong potential for achieving better non-autoregressive sequence generation. However, existing text diffusion models still fall short in their performance due to a challenge in handling the discreteness of language. This paper thoroughly analyzes text diffusion models and uncovers two significant limitations: degradation of self-conditioning during training and misalignment between training and sampling. 
Motivated by our findings, we propose a novel Text Diffusion model called TReC, which mitigates the degradation with Reinforced Conditioning and the misalignment by Time-Aware Variance Scaling. Our extensive experiments demonstrate the competitiveness of TReC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples. \ No newline at end of file diff --git a/data/2024/aaai/Text Image Inpainting via Global Structure-Guided Diffusion Models b/data/2024/aaai/Text Image Inpainting via Global Structure-Guided Diffusion Models new file mode 100644 index 0000000000..4b550013c1 --- /dev/null +++ b/data/2024/aaai/Text Image Inpainting via Global Structure-Guided Diffusion Models @@ -0,0 +1 @@ +Real-world text can be damaged by corrosion issues caused by environmental or human factors, which hinder the preservation of the complete styles of texts, e.g., texture and structure. These corrosion issues, such as graffiti signs and incomplete signatures, bring difficulties in understanding the texts, thereby posing significant challenges to downstream applications, e.g., scene text recognition and signature identification. Notably, current inpainting techniques often fail to adequately address this problem and have difficulties restoring accurate text images along with reasonable and consistent styles. Formulating this as an open problem of text image inpainting, this paper aims to build a benchmark to facilitate its study. In doing so, we establish two specific text inpainting datasets which contain scene text images and handwritten text images, respectively. Each of them includes images revamped by real-life and synthetic datasets, featuring pairs of original images, corrupted images, and other assistant information. On top of the datasets, we further develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution. Leveraging the global structure of the text as a prior, the proposed GSDM develops an efficient diffusion model to recover clean texts. The efficacy of our approach is demonstrated by thorough empirical study, including a substantial boost in both recognition accuracy and image quality. These findings not only highlight the effectiveness of our method but also underscore its potential to enhance the broader field of text image understanding and processing. Code and datasets are available at: https://github.com/blackprotoss/GSDM. \ No newline at end of file diff --git a/data/2024/aaai/Text-Based Occluded Person Re-identification via Multi-Granularity Contrastive Consistency Learning b/data/2024/aaai/Text-Based Occluded Person Re-identification via Multi-Granularity Contrastive Consistency Learning new file mode 100644 index 0000000000..635c1edf6a --- /dev/null +++ b/data/2024/aaai/Text-Based Occluded Person Re-identification via Multi-Granularity Contrastive Consistency Learning @@ -0,0 +1 @@ +Text-based Person Re-identification (T-ReID), which aims at retrieving a specific pedestrian image from a collection of images via text-based information, has received significant attention. However, previous research has overlooked a challenging yet practical form of T-ReID: dealing with image galleries mixed with occluded and inconsistent personal visuals, instead of ideal visuals with a full-body and clear view. 
Its major challenges lie in the insufficiency of benchmark datasets and the enlarged semantic gap incurred by arbitrary occlusions and the modality gap between the text description and the visual representation of the target person. To alleviate these issues, we first design an Occlusion Generator (OGor) for the automatic generation of artificially occluded images from generic surveillance images. Then, a fine-granularity token selection mechanism is proposed to minimize the negative impact of occlusion for robust feature learning, and a novel multi-granularity contrastive consistency alignment framework is designed to leverage intra-/inter-granularity of visual-text representations for semantic alignment of occluded visuals and query texts. Experimental results demonstrate that our method exhibits superior performance. We believe this work could inspire the community to investigate more dedicated designs for implementing T-ReID in real-world scenarios. The source code is available at https://github.com/littlexinyi/MGCC. \ No newline at end of file diff --git a/data/2024/aaai/Text-to-Image Generation for Abstract Concepts b/data/2024/aaai/Text-to-Image Generation for Abstract Concepts new file mode 100644 index 0000000000..85d35155db --- /dev/null +++ b/data/2024/aaai/Text-to-Image Generation for Abstract Concepts @@ -0,0 +1 @@ +Recent years have witnessed substantial progress of large-scale models across various domains, such as natural language processing and computer vision, facilitating the expression of concrete concepts. Unlike concrete concepts that are usually directly associated with physical objects, expressing abstract concepts through natural language requires considerable effort, since they are characterized by intricate semantics and connotations. An alternative approach is to leverage images to convey rich visual information as a supplement. Nevertheless, existing Text-to-Image (T2I) models are primarily trained on concrete physical objects and often struggle to visualize abstract concepts. Inspired by the three-layer artwork theory, which identifies the critical factors of intent, object and form during artistic creation, we propose a framework of Text-to-Image generation for Abstract Concepts (TIAC). The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity. LLMs then transform it into semantically related physical objects, and the concept-dependent form is retrieved from an LLM-extracted form pattern set. Information from these three aspects is integrated to generate prompts for T2I models via an LLM. Evaluation results from human assessments and our newly designed metric, concept score, demonstrate the effectiveness of our framework in creating images that can sufficiently express abstract concepts. \ No newline at end of file diff --git a/data/2024/aaai/Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries b/data/2024/aaai/Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries new file mode 100644 index 0000000000..2437af71d9 --- /dev/null +++ b/data/2024/aaai/Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries @@ -0,0 +1 @@ +Tabular data analysis is crucial in various fields, and large language models show promise in this area. However, current research mostly focuses on rudimentary tasks like Text2SQL and TableQA, neglecting advanced analysis like forecasting and chart generation. 
To address this gap, we developed the Text2Analysis benchmark, incorporating advanced analysis tasks that go beyond SQL-compatible operations and require more in-depth analysis. We also develop five innovative and effective annotation methods, harnessing the capabilities of large language models to enhance data quality and quantity. Additionally, we include unclear queries that resemble real-world user questions to test how well models can understand and tackle such challenges. Finally, we collect 2249 query-result pairs with 347 tables. We evaluate five state-of-the-art models using three different metrics, and the results show that our benchmark introduces a considerable challenge in the field of tabular data analysis, paving the way for more advanced research opportunities. \ No newline at end of file diff --git a/data/2024/aaai/Text2City: One-Stage Text-Driven Urban Layout Regeneration b/data/2024/aaai/Text2City: One-Stage Text-Driven Urban Layout Regeneration new file mode 100644 index 0000000000..2a49174958 --- /dev/null +++ b/data/2024/aaai/Text2City: One-Stage Text-Driven Urban Layout Regeneration @@ -0,0 +1 @@ +Regenerating the urban layout is an essential process for urban regeneration. In this paper, we propose a new task called text-driven urban layout regeneration, which provides an intuitive input modality - text - for users to specify the regeneration, instead of designing complex rules. Given the target region to be regenerated, we propose a one-stage text-driven urban layout regeneration model, Text2City, to jointly and progressively regenerate the urban layout (i.e., road and building layouts) based on textual layout descriptions and the surrounding context (i.e., urban layouts and functions of the surrounding regions). Text2City first extracts road and building attributes from the textual layout description to guide the regeneration. It includes a novel one-stage joint regenerator network based on conditioned denoising diffusion probabilistic models (DDPMs) and prior knowledge exchange. To harmonize the regenerated layouts through joint optimization, we propose an interactive & enhanced guidance module for self-enhancement and prior knowledge exchange between road and building layouts during the regeneration. We also design a series of constraints at the attribute, geometry and pixel levels to ensure rational urban layout generation. To train our model, we build a large-scale dataset containing urban layouts and layout descriptions, covering 147K regions. Qualitative and quantitative evaluations show that our proposed method outperforms the baseline methods in regenerating desirable urban layouts that meet the textual descriptions. \ No newline at end of file diff --git a/data/2024/aaai/TextGT: A Double-View Graph Transformer on Text for Aspect-Based Sentiment Analysis b/data/2024/aaai/TextGT: A Double-View Graph Transformer on Text for Aspect-Based Sentiment Analysis new file mode 100644 index 0000000000..34d3429664 --- /dev/null +++ b/data/2024/aaai/TextGT: A Double-View Graph Transformer on Text for Aspect-Based Sentiment Analysis @@ -0,0 +1 @@ +Aspect-based sentiment analysis (ABSA) is aimed at predicting the sentiment polarities of the aspects included in a sentence instead of the whole sentence itself, and is a fine-grained learning task compared to conventional text classification. 
In recent years, owing to their ability to model the connectivity relationships between the words in a sentence, graph neural networks have become increasingly popular for handling natural language processing tasks, and meanwhile many works have emerged for the ABSA task. However, most of the works utilizing graph convolution easily incur the over-smoothing problem, while graph Transformers for ABSA have not been explored yet. In addition, although some previous works are dedicated to using both GNNs and Transformers to handle text, how to tightly combine the graph view and the sequence view of text remains open to research. To address the above issues, we propose a double-view graph Transformer on text (TextGT) for ABSA. In TextGT, the graph view of text is handled by GNN layers, while Transformer layers deal with the sequence view, and these two processes are tightly coupled, alleviating the over-smoothing problem. Moreover, we propose an algorithm for implementing a kind of densely message-passing graph convolution, called TextGINConv, to employ edge features in graphs. Extensive experiments demonstrate the effectiveness of our TextGT over state-of-the-art approaches and validate the TextGINConv module. The source code is available at https://github.com/shuoyinn/TextGT. \ No newline at end of file diff --git a/data/2024/aaai/The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models b/data/2024/aaai/The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models new file mode 100644 index 0000000000..43dd21276c --- /dev/null +++ b/data/2024/aaai/The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models @@ -0,0 +1,9 @@ +Thompson sampling (TS) has been known for its outstanding empirical performance supported by theoretical guarantees across various reward models in the classical stochastic multi-armed bandit problems. +Nonetheless, its optimality is often restricted to specific priors due to the common observation that TS is fairly insensitive to the choice of the prior when it comes to asymptotic regret bounds. +However, when the model contains multiple parameters, the optimality of TS highly depends on the choice of priors, which casts doubt on the generalizability of previous findings to other models. +To address this gap, this study explores the impact of selecting noninformative priors, offering insights into the performance of TS when dealing with new models that lack theoretical understanding. +We first extend the regret analysis of TS to the model of uniform distributions with unknown supports, which would be the simplest non-regular model. +Our findings reveal that changing noninformative priors can significantly affect the expected regret, aligning with previously known results in other multiparameter bandit models. +Although the uniform prior is shown to be optimal, we highlight the inherent limitation of this optimality, which holds only for specific parameterizations and thus emphasizes the significance of the invariance property of priors. +In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which can achieve asymptotic optimality for the Gaussian and uniform models by using the reference prior and the Jeffreys prior, which are invariant under one-to-one reparameterizations. 
+This policy provides an alternative approach to achieving optimality by employing fine-tuned truncation, which would be much easier than hunting for optimal priors in practice. \ No newline at end of file diff --git a/data/2024/aaai/The CoachAI Badminton Environment: A Novel Reinforcement Learning Environment with Realistic Opponents (Student Abstract) b/data/2024/aaai/The CoachAI Badminton Environment: A Novel Reinforcement Learning Environment with Realistic Opponents (Student Abstract) new file mode 100644 index 0000000000..6b5457f998 --- /dev/null +++ b/data/2024/aaai/The CoachAI Badminton Environment: A Novel Reinforcement Learning Environment with Realistic Opponents (Student Abstract) @@ -0,0 +1 @@ +The growing demand for precise sports analysis has been explored to improve athlete performance in various sports (e.g., basketball, soccer). However, existing methods for different sports face challenges in validating strategies in environments due to simple rule-based opponents leading to performance gaps when deployed in real-world matches. In this paper, we propose the CoachAI Badminton Environment, a novel reinforcement learning (RL) environment with realistic opponents for badminton, which serves as a compelling example of a turn-based game. It supports researchers in exploring various RL algorithms with the badminton context by integrating state-of-the-art tactical-forecasting models and real badminton game records. The Badminton Benchmarks are proposed with multiple widely adopted RL algorithms to benchmark the performance of simulating matches against real players. To advance novel algorithms and developments in badminton analytics, we make our environment open-source, enabling researchers to simulate more complex badminton sports scenarios based on this foundation. Our code is available at https://github.com/wywyWang/CoachAI-Projects/tree/main/CoachAI%20Badminton%20Environment. \ No newline at end of file diff --git a/data/2024/aaai/The CoachAI Badminton Environment: Bridging the Gap between a Reinforcement Learning Environment and Real-World Badminton Games b/data/2024/aaai/The CoachAI Badminton Environment: Bridging the Gap between a Reinforcement Learning Environment and Real-World Badminton Games new file mode 100644 index 0000000000..26c241b848 --- /dev/null +++ b/data/2024/aaai/The CoachAI Badminton Environment: Bridging the Gap between a Reinforcement Learning Environment and Real-World Badminton Games @@ -0,0 +1 @@ +We present the CoachAI Badminton Environment, a reinforcement learning (RL) environment tailored for AI-driven sports analytics. In contrast to traditional environments using rule-based opponents or simplistic physics-based randomness, our environment integrates authentic opponent AIs and realistic randomness derived from real-world matches data to bridge the performance gap encountered in real-game deployments. This novel feature enables RL agents to seamlessly adapt to genuine scenarios. The CoachAI Badminton Environment empowers researchers to validate strategies in intricate real-world settings, offering: i) Realistic opponent simulation for RL training; ii) Visualizations for evaluation; and iii) Performance benchmarks for assessing agent capabilities. By bridging the RL environment with actual badminton games, our environment is able to advance the discovery of winning strategies for players. Our code is available at https://github.com/wywyWang/CoachAI-Projects/tree/main/Strategic%20Environment. 
\ No newline at end of file diff --git a/data/2024/aaai/The Complexity of Computing Robust Mediated Equilibria in Ordinal Games b/data/2024/aaai/The Complexity of Computing Robust Mediated Equilibria in Ordinal Games new file mode 100644 index 0000000000..2e5abc0e8f --- /dev/null +++ b/data/2024/aaai/The Complexity of Computing Robust Mediated Equilibria in Ordinal Games @@ -0,0 +1,17 @@ +Usually, to apply game-theoretic methods, we must specify utilities +precisely, and we run the risk that the solutions we compute are not +robust to errors in this specification. Ordinal games provide an +attractive alternative: they require specifying only which outcomes +are preferred to which other ones. Unfortunately, they provide little +guidance for how to play unless there are pure Nash equilibria; +evaluating mixed strategies appears to fundamentally require cardinal +utilities. + +In this paper, we observe that we can in fact make good use of mixed +strategies in ordinal games if we consider settings that allow for +folk theorems. These allow us to find equilibria that are robust, in +the sense that they remain equilibria no matter which cardinal +utilities are the correct ones -- as long as they are consistent with +the specified ordinal preferences. We analyze this concept and study +the computational complexity of finding such equilibria in a range of +settings. \ No newline at end of file diff --git a/data/2024/aaai/The Complexity of Fair Division of Indivisible Items with Externalities b/data/2024/aaai/The Complexity of Fair Division of Indivisible Items with Externalities new file mode 100644 index 0000000000..e230f02ab8 --- /dev/null +++ b/data/2024/aaai/The Complexity of Fair Division of Indivisible Items with Externalities @@ -0,0 +1,2 @@ +We study the computational complexity of fairly allocating a set of indivisible items under externalities. In this recently-proposed setting, in addition to the utility the agent gets from their bundle, they also receive utility from items allocated to other agents. +We focus on the extended definitions of envy-freeness up to one item (EF1) and of envy-freeness up to any item (EFX), and we provide the landscape of their complexity for several different scenarios. We prove that it is NP-complete to decide whether there exists an EFX allocation, even when there are only three agents, or even when there are only six different values for the items. We complement these negative results by showing that when both the number of agents and the number of different values for items are bounded by a parameter the problem becomes fixed-parameter tractable. Furthermore, we prove that two-valued and binary-valued instances are equivalent and that EFX and EF1 allocations coincide for this class of instances. Finally, motivated from real-life scenarios, we focus on a class of structured valuation functions, which we term agent/item-correlated. We prove their equivalence to the "standard" setting without externalities. Therefore, all previous results for EF1 and EFX apply immediately for these valuations. 
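For concreteness, the following minimal sketch checks the standard EF1 condition for additive valuations without externalities; it is a generic illustration of the fairness notion, not the extended externality-aware EF1/EFX definitions studied in the paper.

```python
def is_ef1(valuations, allocation):
    """valuations[i][g]: agent i's value for item g; allocation[i]: set of items held by agent i.
    Returns True iff every agent's envy toward every other agent vanishes after
    removing some single item from the envied agent's bundle (additive EF1)."""
    agents = range(len(valuations))
    for i in agents:
        own = sum(valuations[i][g] for g in allocation[i])
        for j in agents:
            if i == j or not allocation[j]:
                continue
            other = sum(valuations[i][g] for g in allocation[j])
            best_removal = max(valuations[i][g] for g in allocation[j])
            if own < other - best_removal:
                return False
    return True

# Two agents, three items: allocation {0, 2} / {1} is EF1 under these toy valuations.
print(is_ef1([[3, 5, 1], [2, 4, 6]], [{0, 2}, {1}]))
```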
\ No newline at end of file diff --git a/data/2024/aaai/The Complexity of Optimizing Atomic Congestion b/data/2024/aaai/The Complexity of Optimizing Atomic Congestion new file mode 100644 index 0000000000..b6fa0de86f --- /dev/null +++ b/data/2024/aaai/The Complexity of Optimizing Atomic Congestion @@ -0,0 +1 @@ +Atomic congestion games are a classic topic in network design, routing, and algorithmic game theory, and are capable of modeling congestion and flow optimization tasks in various application areas. While both the price of anarchy for such games as well as the computational complexity of computing their Nash equilibria are by now well-understood, the computational complexity of computing a system-optimal set of strategies - that is, a centrally planned routing that minimizes the average cost of agents - is severely understudied in the literature. We close this gap by identifying the exact boundaries of tractability for the problem through the lens of the parameterized complexity paradigm. After showing that the problem remains highly intractable even on extremely simple networks, we obtain a set of results which demonstrate that the structural parameters which control the computational (in)tractability of the problem are not vertex-separator based in nature (such as, e.g., treewidth), but rather based on edge separators. We conclude by extending our analysis towards the (even more challenging) min-max variant of the problem. \ No newline at end of file diff --git a/data/2024/aaai/The Defeat of the Winograd Schema Challenge (Abstract Reprint) b/data/2024/aaai/The Defeat of the Winograd Schema Challenge (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/The Evidence Contraction Issue in Deep Evidential Regression: Discussion and Solution b/data/2024/aaai/The Evidence Contraction Issue in Deep Evidential Regression: Discussion and Solution new file mode 100644 index 0000000000..68277053fe --- /dev/null +++ b/data/2024/aaai/The Evidence Contraction Issue in Deep Evidential Regression: Discussion and Solution @@ -0,0 +1 @@ +Deep Evidential Regression (DER) places a prior on the original Gaussian likelihood and treats learning as an evidence acquisition process to quantify uncertainty. For the validity of the evidence theory, DER requires specialized activation functions to ensure that the prior parameters remain non-negative. However, such constraints will trigger evidence contraction, causing sub-optimal performance. In this paper, we analyse DER theoretically, revealing the intrinsic limitations for sub-optimal performance: the non-negativity constraints on the Normal Inverse-Gamma (NIG) prior parameter trigger the evidence contraction under the specialized activation function, which hinders the optimization of DER performance. On this basis, we design a Non-saturating Uncertainty Regularization term, which effectively ensures that the performance is further optimized in the right direction. Experiments on real-world datasets show that our proposed approach improves the performance of DER while maintaining the ability to quantify uncertainty. 
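As an illustration of the non-negativity constraints discussed above, the following sketch shows a standard DER output head that maps network features to the four Normal Inverse-Gamma parameters via softplus activations. It follows the commonly used DER parameterization and is only a sketch; the paper's Non-saturating Uncertainty Regularization term is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Maps features to the NIG parameters (gamma, nu, alpha, beta) of evidential regression."""
    def __init__(self, in_features: int):
        super().__init__()
        self.out = nn.Linear(in_features, 4)  # raw (gamma, nu, alpha, beta)

    def forward(self, h):
        gamma, raw_nu, raw_alpha, raw_beta = self.out(h).chunk(4, dim=-1)
        nu = F.softplus(raw_nu)              # nu > 0
        alpha = F.softplus(raw_alpha) + 1.0  # alpha > 1
        beta = F.softplus(raw_beta)          # beta > 0
        return gamma, nu, alpha, beta

head = EvidentialHead(16)
gamma, nu, alpha, beta = head(torch.randn(2, 16))
# Under the usual NIG reading: aleatoric ~ beta / (alpha - 1), epistemic ~ beta / (nu * (alpha - 1)).
```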
\ No newline at end of file diff --git a/data/2024/aaai/The Expected Loss of Preconditioned Langevin Dynamics Reveals the Hessian Rank b/data/2024/aaai/The Expected Loss of Preconditioned Langevin Dynamics Reveals the Hessian Rank new file mode 100644 index 0000000000..7017fe7f81 --- /dev/null +++ b/data/2024/aaai/The Expected Loss of Preconditioned Langevin Dynamics Reveals the Hessian Rank @@ -0,0 +1 @@ +Langevin dynamics (LD) is widely used for sampling from distributions and for optimization. In this work, we derive a closed-form expression for the expected loss of preconditioned LD near stationary points of the objective function. We use the fact that in the vicinity of such points, LD reduces to an Ornstein–Uhlenbeck process, which is amenable to convenient mathematical treatment. Our analysis reveals that when the preconditioning matrix satisfies a particular relation with respect to the noise covariance, LD's expected loss becomes proportional to the rank of the objective's Hessian. We illustrate the applicability of this result in the context of neural networks, where the Hessian rank has been shown to capture the complexity of the predictor function but is usually computationally hard to probe. Finally, we use our analysis to compare SGD-like and Adam-like preconditioners and identify the regimes under which each of them leads to a lower expected loss. \ No newline at end of file diff --git a/data/2024/aaai/The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning b/data/2024/aaai/The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning new file mode 100644 index 0000000000..4c8932cc6c --- /dev/null +++ b/data/2024/aaai/The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning @@ -0,0 +1 @@ +The advent of powerful transformer-based discriminative language models and, more recently, generative GPT-family models has led to notable advancements in a range of natural language processing (NLP) tasks. One such task is commonsense reasoning, where performance is usually evaluated through multiple-choice question-answering benchmarks. To date, many such benchmarks have been proposed, and `leaderboards' tracking state-of-the-art performance on those benchmarks suggest that transformer-based models are approaching human-like performance. However, due to documented problems such as hallucination and bias, the research focus is shifting from merely quantifying accuracy on the task to an in-depth, context-sensitive probing of LLMs' generalization and robustness. To gain deeper insight into diagnosing these models' performance in commonsense reasoning scenarios, this thesis addresses three main studies: the generalization ability of transformer-based language models on commonsense reasoning, the trend in the confidence distribution of these language models when confronted with ambiguous inference tasks, and a proposed risk-centric evaluation framework for both discriminative and generative language models. 
\ No newline at end of file diff --git a/data/2024/aaai/The Inter-batch Diversity of Samples in Experience Replay for Continual Learning b/data/2024/aaai/The Inter-batch Diversity of Samples in Experience Replay for Continual Learning new file mode 100644 index 0000000000..6964f6b357 --- /dev/null +++ b/data/2024/aaai/The Inter-batch Diversity of Samples in Experience Replay for Continual Learning @@ -0,0 +1 @@ +In a Continual Learning setting, models are trained on data with occasional distribution shifts, resulting in forgetting the information learned before each shift. Experience Replay (ER) addresses this challenge by retaining part of the old training samples and replaying them alongside current data, improving the model's understanding of the overall distribution in training batches. The crucial factor in ER performance is the diversity of samples within batches. The impact of sample diversity across a sequence of batches is investigated, introducing a new metric and an associated approach to assess and leverage this diversity. This exploration opens up significant potential for future work, as various strategies can be devised to ensure inter-batch diversity. Achieving optimal results may involve striking a balance between this novel metric and other inherent properties of a batch or sequence. \ No newline at end of file diff --git a/data/2024/aaai/The Irrelevance of Influencers: Information Diffusion with Re-Activation and Immunity Lasts Exponentially Long on Social Network Models b/data/2024/aaai/The Irrelevance of Influencers: Information Diffusion with Re-Activation and Immunity Lasts Exponentially Long on Social Network Models new file mode 100644 index 0000000000..efc16a2beb --- /dev/null +++ b/data/2024/aaai/The Irrelevance of Influencers: Information Diffusion with Re-Activation and Immunity Lasts Exponentially Long on Social Network Models @@ -0,0 +1,3 @@ +Information diffusion models on networks are at the forefront of AI research. The dynamics of such models typically follow stochastic models from epidemiology, used to model not only infections but various phenomena, including the behavior of computer viruses and viral marketing campaigns. A core question in this setting is how to efficiently detect the most influential vertices in the host graph such that the infection survives the longest. In processes that incorporate re-infection of the vertices, such as the SIS process, theoretical studies identify parameter thresholds where the survival time of the process rapidly transitions from logarithmic to super-polynomial. These results contradict the intuition that the starting configuration is relevant, since the process will always either die out fast or survive almost indefinitely. A shortcoming of these results is that models incorporating short-term immunity (or creative advertisement fatigue) have not been subjected to such a theoretical analysis so far. + +We reduce this gap in the literature by studying the SIRS process, a more realistic model, which besides re-infection additionally incorporates short-term immunity. On complex network models, we identify parameter regimes for which the process survives exponentially long, and we get a tight threshold for random graphs. Underlying these results is our main technical contribution, showing a threshold behavior for the survival time of the SIRS process on graphs with large expander subgraphs, such as social network models. 
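To make the SIRS dynamics referenced above concrete, the following is a minimal discrete-time simulation sketch on an arbitrary graph. The infection, recovery, and immunity-loss probabilities and the toy graph are placeholders, and the paper's analysis concerns survival times on specific network models rather than this simulation.

```python
import random

def sirs_step(graph, state, beta=0.3, gamma=0.2, xi=0.05):
    """graph: dict node -> list of neighbors; state: dict node -> 'S' | 'I' | 'R'."""
    new_state = dict(state)
    for v, s in state.items():
        if s == 'S':
            # Each infected neighbor independently infects v with probability beta.
            if any(state[u] == 'I' and random.random() < beta for u in graph[v]):
                new_state[v] = 'I'
        elif s == 'I' and random.random() < gamma:
            new_state[v] = 'R'   # recover, gaining short-term immunity
        elif s == 'R' and random.random() < xi:
            new_state[v] = 'S'   # immunity wears off, allowing re-infection
    return new_state

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
state = {0: 'I', 1: 'S', 2: 'S', 3: 'S'}
for _ in range(50):
    state = sirs_step(graph, state)
print(state)
```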
\ No newline at end of file diff --git a/data/2024/aaai/The Language Model Can Have the Personality: Joint Learning for Personality Enhanced Language Model (Student Abstract) b/data/2024/aaai/The Language Model Can Have the Personality: Joint Learning for Personality Enhanced Language Model (Student Abstract) new file mode 100644 index 0000000000..dcd7dd936d --- /dev/null +++ b/data/2024/aaai/The Language Model Can Have the Personality: Joint Learning for Personality Enhanced Language Model (Student Abstract) @@ -0,0 +1,2 @@ +With the introduction of large language models, chatbots are becoming more conversational to communicate effectively and capable of handling increasingly complex tasks. To make a chatbot more relatable and engaging, we propose a new language model idea that maps the human-like personality. +In this paper, we propose a systematic Personality-Enhanced Language Model (PELM) approach by using a joint learning mechanism of personality classification and language generation tasks. The proposed PELM leverages a dataset of defined personality typology, Myers-Briggs Type Indicator, and produces a Personality-Enhanced Language Model by using a joint learning and cross-teaching structure consisting of a classification and language modelling to incorporate personalities via both distinctive types and textual information. The results show that PELM can generate better personality-based outputs than baseline models. \ No newline at end of file diff --git a/data/2024/aaai/The Logic of Doxastic Strategies b/data/2024/aaai/The Logic of Doxastic Strategies new file mode 100644 index 0000000000..8ce2c7c15a --- /dev/null +++ b/data/2024/aaai/The Logic of Doxastic Strategies @@ -0,0 +1,3 @@ +In many real-world situations, there is often not enough information to know that a certain strategy will succeed in achieving the goal, but there is a good reason to believe that it will. The paper introduces the term "doxastic" for such strategies. + +The main technical contribution is a sound and complete logical system that describes the interplay between doxastic strategy and belief modalities. \ No newline at end of file diff --git a/data/2024/aaai/The Moderating Effect of Instant Runoff Voting b/data/2024/aaai/The Moderating Effect of Instant Runoff Voting new file mode 100644 index 0000000000..4149271843 --- /dev/null +++ b/data/2024/aaai/The Moderating Effect of Instant Runoff Voting @@ -0,0 +1 @@ +Instant runoff voting (IRV) has recently gained popularity as an alternative to plurality voting for political elections, with advocates claiming a range of advantages, including that it produces more moderate winners than plurality and could thus help address polarization. However, there is little theoretical backing for this claim, with existing evidence focused on case studies and simulations. In this work, we prove that IRV has a moderating effect relative to plurality voting in a precise sense, developed in a 1-dimensional Euclidean model of voter preferences. We develop a theory of exclusion zones, derived from properties of the voter distribution, which serve to show how moderate and extreme candidates interact during IRV vote tabulation. The theory allows us to prove that if voters are symmetrically distributed and not too concentrated at the extremes, IRV cannot elect an extreme candidate over a moderate. In contrast, we show plurality can and validate our results computationally. 
Our methods provide new frameworks for the analysis of voting systems, deriving exact winner distributions geometrically and establishing a connection between plurality voting and stick-breaking processes. \ No newline at end of file diff --git a/data/2024/aaai/The Promise of Serverless Computing within Peer-to-Peer Architectures for Distributed ML Training b/data/2024/aaai/The Promise of Serverless Computing within Peer-to-Peer Architectures for Distributed ML Training new file mode 100644 index 0000000000..dc6f6b0d36 --- /dev/null +++ b/data/2024/aaai/The Promise of Serverless Computing within Peer-to-Peer Architectures for Distributed ML Training @@ -0,0 +1 @@ +My thesis focuses on the integration of serverless computing with Peer-to-Peer (P2P) architectures in distributed Machine Learning (ML). This research aims to harness the decentralized, resilient nature of P2P systems, combined with the scalability and automation of serverless platforms. We explore using databases not just for communication but also for in-database model updates and gradient averaging, addressing the challenges of statelessness in serverless environments. \ No newline at end of file diff --git a/data/2024/aaai/The Role of Over-Parameterization in Machine Learning - the Good, the Bad, the Ugly b/data/2024/aaai/The Role of Over-Parameterization in Machine Learning - the Good, the Bad, the Ugly new file mode 100644 index 0000000000..2f3053bdc6 --- /dev/null +++ b/data/2024/aaai/The Role of Over-Parameterization in Machine Learning - the Good, the Bad, the Ugly @@ -0,0 +1,3 @@ +The conventional wisdom of simple models in machine learning misses the bigger picture, especially for over-parameterized neural networks (NNs), where the number of parameters is much larger than the number of training samples. Our goal is to explore the mystery behind over-parameterized models from a theoretical side. + +In this talk, I will discuss the role of over-parameterization in neural networks, to theoretically understand why they can perform well. First, I will discuss the role of over-parameterization in neural networks from the perspective of models, to theoretically understand why they can generalize well. Second, the effects of over-parameterization on robustness and privacy are discussed. Third, I will talk about over-parameterization from kernel methods to neural networks in a function space theory view. Besides, moving from classical statistical learning to sequential decision making, I will talk about the benefits of over-parameterization in explaining why deep reinforcement learning works well with function approximation. Potential future directions on the theory of over-parameterized ML will also be discussed. \ No newline at end of file diff --git a/data/2024/aaai/The Virtual Driving Instructor: Multi-Agent System Collaborating via Knowledge Graph for Scalable Driver Education b/data/2024/aaai/The Virtual Driving Instructor: Multi-Agent System Collaborating via Knowledge Graph for Scalable Driver Education new file mode 100644 index 0000000000..85f20c03ba --- /dev/null +++ b/data/2024/aaai/The Virtual Driving Instructor: Multi-Agent System Collaborating via Knowledge Graph for Scalable Driver Education @@ -0,0 +1,6 @@ +This paper introduces the design, development, and deployment of a Virtual Driving Instructor (VDI) for enhanced driver education. +The VDI provides personalized, real-time feedback to students in a driving simulator, addressing some of the limitations of traditional driver instruction. 
+Employing a hybrid AI system, the VDI combines rule-based agents, learning-based agents, knowledge graphs, and Bayesian networks to assess and monitor student performance in a comprehensive manner. +Implemented in multiple simulators at a driving school in Norway, the system aims to leverage AI and driving simulation to improve both the learning experience and the efficiency of instruction. +Initial feedback from students has been largely positive, highlighting the effectiveness of this integration while also pointing to areas for further improvement. +This work marks a significant stride in infusing technology into driver education, offering a scalable and efficient approach to instruction. \ No newline at end of file diff --git a/data/2024/aaai/Theoretical Aspects of Generating Instances with Unique Solutions: Pre-assignment Models for Unique Vertex Cover b/data/2024/aaai/Theoretical Aspects of Generating Instances with Unique Solutions: Pre-assignment Models for Unique Vertex Cover new file mode 100644 index 0000000000..8a01163ff6 --- /dev/null +++ b/data/2024/aaai/Theoretical Aspects of Generating Instances with Unique Solutions: Pre-assignment Models for Unique Vertex Cover @@ -0,0 +1,3 @@ +The uniqueness of an optimal solution to a combinatorial optimization problem attracts the attention of researchers in many fields because it has a wide range of applications, it is related to important classes in computational complexity, and the existence of only one solution is often critical for algorithm designs in theory. However, to the best of the authors' knowledge, there is no major benchmark set consisting only of instances with unique solutions, and no algorithm for generating instances with unique solutions is known; a systematic approach to obtaining a problem instance guaranteed to have a unique solution would therefore be helpful. A possible approach is as follows: given a problem instance, we specify a small part of a solution in advance so that only one optimal solution meets the specification. This paper formulates such a "pre-assignment" approach for the vertex cover problem, as a typical combinatorial optimization problem, and discusses its computational complexity. +First, we show that the problem is Σ^P_2-complete in general, while the problem becomes NP-complete when an input graph is bipartite. +We then present an O(2.1996^n)-time algorithm for general graphs and an O(1.9181^n)-time algorithm for bipartite graphs, where n is the number of vertices. The latter is based on an FPT algorithm with O*(3.6791^τ) time for vertex cover number τ. Furthermore, we show that the problem for trees can be solved in O(1.4143^n) time. \ No newline at end of file diff --git a/data/2024/aaai/Theoretical and Empirical Analysis of Cost-Function Merging for Implicit Hitting Set WCSP Solving b/data/2024/aaai/Theoretical and Empirical Analysis of Cost-Function Merging for Implicit Hitting Set WCSP Solving new file mode 100644 index 0000000000..ba2cc44307 --- /dev/null +++ b/data/2024/aaai/Theoretical and Empirical Analysis of Cost-Function Merging for Implicit Hitting Set WCSP Solving @@ -0,0 +1,2 @@ +The Implicit Hitting Set (HS) approach has proven very effective for MaxSAT solving. However, only preliminary promising results have been obtained for the very similar Weighted CSP framework. In this paper we contribute towards both a better theoretical understanding of the HS approach and more effective HS-based solvers for WCSP. First, we bound the minimum number of iterations of HS thanks to what we call distinguished cores.
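(Editor's illustration, not part of the paper: a minimal, self-contained sketch of the generic implicit hitting set loop, shown here on a toy weighted MaxSAT instance rather than a WCSP, with brute-force routines standing in for the SAT/CSP oracle and the optimal hitting-set solver used in practice. The cost-function merging studied in the paper is not shown, and the example instance and all names are illustrative.)

```python
from itertools import combinations, product

def implicit_hitting_set_maxsat(n_vars, soft_clauses, weights):
    """Toy IHS loop for weighted MaxSAT (all clauses soft). A clause is a tuple of
    literals; literal +i / -i means variable i is True / False; soft_clauses[k] may
    be left unsatisfied at cost weights[k]."""
    def satisfied(clause, assignment):
        return any((lit > 0) == assignment[abs(lit) - 1] for lit in clause)

    def min_cost_hitting_set(cores):
        # Brute force: cheapest subset of soft-clause indices intersecting every core.
        universe = sorted(set().union(*cores)) if cores else []
        best, best_cost = set(), float("inf") if cores else 0
        if not cores:
            return best, 0
        for r in range(len(universe) + 1):
            for subset in combinations(universe, r):
                s, cost = set(subset), sum(weights[i] for i in subset)
                if cost < best_cost and all(s & core for core in cores):
                    best, best_cost = s, cost
        return best, best_cost

    def find_core(relaxed):
        # Oracle: can all soft clauses outside `relaxed` be satisfied simultaneously?
        active = [k for k in range(len(soft_clauses)) if k not in relaxed]
        for bits in product([False, True], repeat=n_vars):
            if all(satisfied(soft_clauses[k], bits) for k in active):
                return bits, None
        return None, set(active)  # unsatisfiable subset: a (non-minimal) core

    cores = []
    while True:
        hs, cost = min_cost_hitting_set(cores)   # lower bound from collected cores
        assignment, core = find_core(hs)
        if core is None:
            return assignment, cost              # lower bound attained: optimal
        cores.append(core)                       # otherwise add the new core and iterate

if __name__ == "__main__":
    # x1..x3; the conflicting soft clauses force a minimum violation cost of 3.
    clauses = [(1,), (-1,), (2, 3), (-2,), (-3,)]
    print(implicit_hitting_set_maxsat(3, clauses, weights=[3, 2, 1, 1, 1]))
```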
Then, we show a source of inefficiency by +introducing two simple problems where HS is unfeasible. Next, we propose two reformulation methods that merge cost-functions to overcome the problem. We provide a theoretical analysis that quantifies the magnitude of the improvement of each method with respect to the number of iterations of the algorithm. In particular, we show that the reformulations can bring an exponential number of iterations down to a constant number in our working examples. Finally, we complement our theoretical analysis with two sets of experiments. First, we show that our results are aligned with real executions. Second, and most importantly, we conduct experiments on typical benchmark problems and show that cost-function merging may be heuristically applied and it may accelerate HS algorithms by several orders of magnitude. In some cases, it even outperforms state-of-the-art solvers. \ No newline at end of file diff --git a/data/2024/aaai/Thesis Summary: Operationalizing User-Inclusive Transparency in Artificial Intelligence Systems b/data/2024/aaai/Thesis Summary: Operationalizing User-Inclusive Transparency in Artificial Intelligence Systems new file mode 100644 index 0000000000..46d6427714 --- /dev/null +++ b/data/2024/aaai/Thesis Summary: Operationalizing User-Inclusive Transparency in Artificial Intelligence Systems @@ -0,0 +1 @@ +Artificial intelligence system architects can increase user trust by designing systems that are inherently transparent. We propose the idea of representing an AI system as an amalgamation of the AI Model (algorithms), data (input and output, including outcomes), and the user interface with visual interpretations (e.g. graphs, Venn diagrams). By designing human controls and feedback mechanisms for AI systems that allow users to exert control over them we can integrate transparency into existing user interfaces. Our plan is to design prototypes of transparent user interfaces for AI systems using well-known usability principles. By conducting surveys we will study their impact to see if these principles help the user to work with the AI system with confidence and if the user perceives the system to be adequately transparent. \ No newline at end of file diff --git a/data/2024/aaai/Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit b/data/2024/aaai/Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit new file mode 100644 index 0000000000..a581b30e24 --- /dev/null +++ b/data/2024/aaai/Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit @@ -0,0 +1 @@ +We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given stochastic arms, and the reward of each arm follows an unknown distribution. In each time step, a player pulls a single arm and observes its reward. The player's goal is to identify the optimal action from a finite-sized real-valued action set with as few arm pulls as possible. Previous methods in the R-CPE-MAB require enumerating all of the feasible actions of the combinatorial optimization problem one is considering. In general, since the size of the action set grows exponentially large with respect to the number of arms, this is almost practically impossible when the number of arms is large. 
We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large with respect to the number of arms. We also introduce a novel problem-dependent sample complexity lower bound of the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor. \ No newline at end of file diff --git a/data/2024/aaai/Three Heads Are Better than One: Complementary Experts for Long-Tailed Semi-supervised Learning b/data/2024/aaai/Three Heads Are Better than One: Complementary Experts for Long-Tailed Semi-supervised Learning new file mode 100644 index 0000000000..9e698bebe9 --- /dev/null +++ b/data/2024/aaai/Three Heads Are Better than One: Complementary Experts for Long-Tailed Semi-supervised Learning @@ -0,0 +1 @@ +We address the challenging problem of Long-Tailed Semi-Supervised Learning (LTSSL) where labeled data exhibit imbalanced class distribution and unlabeled data follow an unknown distribution. Unlike in balanced SSL, the generated pseudo-labels are skewed towards head classes, intensifying the training bias. Such a phenomenon is even amplified as more unlabeled data will be mislabeled as head classes when the class distribution of labeled and unlabeled datasets are mismatched. To solve this problem, we propose a novel method named ComPlementary Experts (CPE). Specifically, we train multiple experts to model various class distributions, each of them yielding high-quality pseudo-labels within one form of class distribution. Besides, we introduce Classwise Batch Normalization for CPE to avoid performance degradation caused by feature distribution mismatch between head and non-head classes. CPE achieves state-of-the-art performances on CIFAR-10-LT, CIFAR-100-LT, and STL-10-LT dataset benchmarks. For instance, on CIFAR-10-LT, CPE improves test accuracy by over >2.22% compared to baselines. Code is available at https://github.com/machengcheng2016/CPE-LTSSL. \ No newline at end of file diff --git a/data/2024/aaai/Three Heads Are Better than One: Improving Cross-Domain NER with Progressive Decomposed Network b/data/2024/aaai/Three Heads Are Better than One: Improving Cross-Domain NER with Progressive Decomposed Network new file mode 100644 index 0000000000..896312c5b6 --- /dev/null +++ b/data/2024/aaai/Three Heads Are Better than One: Improving Cross-Domain NER with Progressive Decomposed Network @@ -0,0 +1 @@ +Cross-domain named entity recognition (NER) tasks encourage NER models to transfer knowledge from data-rich source domains to sparsely labeled target domains. Previous works adopt the paradigms of pre-training on the source domain followed by fine-tuning on the target domain. However, these works ignore that general labeled NER source domain data can be easily retrieved in the real world, and soliciting more source domains could bring more benefits. Unfortunately, previous paradigms cannot efficiently transfer knowledge from multiple source domains. In this work, to transfer multiple source domains' knowledge, we decouple the NER task into the pipeline tasks of mention detection and entity typing, where the mention detection unifies the training object across domains, thus providing the entity typing with higher-quality entity mentions. 
Additionally, we request multiple general source domain models to suggest the potential named entities for sentences in the target domain explicitly, and transfer their knowledge to the target domain models through the knowledge progressive networks implicitly. Furthermore, we propose two methods to analyze in which source domain knowledge transfer occurs, thus helping us judge which source domain brings the greatest benefit. In our experiment, we develop a Chinese cross-domain NER dataset. Our model improved the F1 score by an average of 12.50% across 8 Chinese and English datasets compared to models without source domain data. \ No newline at end of file diff --git a/data/2024/aaai/Threshold-Based Responsive Simulated Annealing for Directed Feedback Vertex Set Problem b/data/2024/aaai/Threshold-Based Responsive Simulated Annealing for Directed Feedback Vertex Set Problem new file mode 100644 index 0000000000..1141c54bd6 --- /dev/null +++ b/data/2024/aaai/Threshold-Based Responsive Simulated Annealing for Directed Feedback Vertex Set Problem @@ -0,0 +1 @@ +As a classical NP-hard problem and the topic of the PACE 2022 competition, the directed feedback vertex set problem (DFVSP) aims to find a minimum subset of vertices such that, when vertices in the subset and all their adjacent edges are removed from the directed graph, the remainder graph is acyclic. In this paper, we propose a threshold-based responsive simulated annealing algorithm called TRSA for solving DFVSP. First, we simplify the problem instances with two new reduction rules proposed in this paper and eight reduction rules from the literature. Then, based on a new solution representation, TRSA solves DFVSP with a fast local search procedure featured by a swap-based neighborhood structure and three neighborhood acceleration strategies. Finally, all these strategies are incorporated into a threshold-based responsive simulated annealing framework. Computational experiments on 140 benchmark instances show that TRSA is highly competitive compared to the state-of-the-art methods. Specifically, TRSA can improve the best known results for 53 instances, while matching the best known results for 79 ones. Furthermore, some important features of TRSA are analyzed to identify its success factors. \ No newline at end of file diff --git a/data/2024/aaai/TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training b/data/2024/aaai/TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training new file mode 100644 index 0000000000..82075c485e --- /dev/null +++ b/data/2024/aaai/TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training @@ -0,0 +1 @@ +Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noises in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. 
The experimental results demonstrate that TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios. Our code is available on https://github.com/chaoyajiang/TiMiX/tree/main. \ No newline at end of file diff --git a/data/2024/aaai/Tiered Coalition Formation Game Stability and Simulation b/data/2024/aaai/Tiered Coalition Formation Game Stability and Simulation new file mode 100644 index 0000000000..6e1fbffe9c --- /dev/null +++ b/data/2024/aaai/Tiered Coalition Formation Game Stability and Simulation @@ -0,0 +1 @@ +Expanding on a 2017 paper by Siler that introduced tiered coalition formation games, I have introduced a variant game and examined the stabilizability of both the original game and its variant. My thesis will contain further theoretical stability findings and the results and interpretation of a simulation based upon real data from video game matchups. \ No newline at end of file diff --git a/data/2024/aaai/Time-Aware Knowledge Representations of Dynamic Objects with Multidimensional Persistence b/data/2024/aaai/Time-Aware Knowledge Representations of Dynamic Objects with Multidimensional Persistence new file mode 100644 index 0000000000..b1942a0d48 --- /dev/null +++ b/data/2024/aaai/Time-Aware Knowledge Representations of Dynamic Objects with Multidimensional Persistence @@ -0,0 +1,3 @@ +Learning time-evolving objects such as multivariate time series and dynamic networks requires the development of novel knowledge representation mechanisms and neural network architectures, which allow for capturing implicit time-dependent information contained in the data. Such information is typically not directly observed but plays a key role in the learning task performance. In turn, lack of time dimension in knowledge encoding mechanisms for time-dependent data leads to frequent model updates, poor learning performance, and, as a result, subpar decision-making. Here we propose a new approach to a time-aware knowledge representation mechanism that notably focuses on implicit time-dependent topological information along multiple geometric dimensions. In particular, we propose a new approach, named Temporal MultiPersistence (TMP), which produces multidimensional topological fingerprints of the data by using the existing single parameter topological summaries. The main idea behind TMP is to merge the two newest directions in topological representation learning, that is, multi-persistence which simultaneously describes data shape evolution along multiple key parameters, and zigzag persistence to enable us to extract the most salient data shape information over time. + +We derive theoretical guarantees of TMP vectorizations and show its utility, in application to forecasting on benchmark traffic flow, Ethereum blockchain, and electrocardiogram datasets, demonstrating the competitive performance, especially, in scenarios of limited data records. In addition, our TMP method improves the computational efficiency of the state-of-the-art multipersistence summaries up to 59.5 times. 
\ No newline at end of file diff --git a/data/2024/aaai/To Know the Causes of Things: Text Mining for Causal Relations b/data/2024/aaai/To Know the Causes of Things: Text Mining for Causal Relations new file mode 100644 index 0000000000..92515b0a07 --- /dev/null +++ b/data/2024/aaai/To Know the Causes of Things: Text Mining for Causal Relations @@ -0,0 +1 @@ +Causality expresses the relation between two arguments, one of which represents the cause and the other the effect (or consequence). Causal text mining refers to the extraction and usage of causal information from text. Given an input sequence, we are interested to know if and where causal information occurs. My research is focused on the end-to-end challenges of causal text mining. This involves extracting, representing, and applying causal knowledge from unstructured text. The corresponding research questions are: (1) How to extract causal information from unstructured text effectively? (2) How to represent extracted causal relationships in a graph that is interpretable and useful for some application? (3) How can we capitalize on extracted causal knowledge for downstream tasks? What tasks or fields will benefit from such knowledge? In this paper, I outline past and on-going works, and highlight future research challenges. \ No newline at end of file diff --git a/data/2024/aaai/Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition b/data/2024/aaai/Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition new file mode 100644 index 0000000000..939f9f3594 --- /dev/null +++ b/data/2024/aaai/Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition @@ -0,0 +1 @@ +Multimodal intent recognition aims to leverage diverse modalities such as expressions, body movements and tone of speech to comprehend user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and own limitations in effectively learning semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address the above challenges. To establish an optimal multimodal semantic environment for text modality, we develop a modality-aware prompting module (MAP), which effectively aligns and fuses features from text, video and audio modalities with similarity-based modality alignment and cross-modality attention mechanism. Based on the modality-aware prompt and ground truth labels, the proposed token-level contrastive learning framework (TCL) constructs augmented samples and employs NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal textual semantic insights derived from intent labels to guide the learning processes of other modalities in return. Extensive experiments show that our method achieves remarkable improvements compared to state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over the handcrafted prompt, which holds substantial significance for multimodal prompt learning. The codes are released at https://github.com/thuiar/TCL-MAP. 
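(Editor's illustration, not part of the paper: NT-Xent is the standard normalized temperature-scaled cross-entropy loss popularized by SimCLR, which the abstract above says is applied on the label token. The sketch below shows a generic batch-level NT-Xent between embeddings of original and augmented samples; the token-level construction, modality-aware prompts, and label-token specifics of TCL-MAP are not reproduced, and the tensor shapes and temperature are illustrative.)

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent loss: z1[i] and z2[i] are embeddings of two views of the same sample;
    every other embedding in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # (2N, d)
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))           # exclude self-similarity
    # The positive of sample i is i + n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

if __name__ == "__main__":
    torch.manual_seed(0)
    anchor = torch.randn(8, 64)                          # e.g., label-token embeddings
    augmented = anchor + 0.05 * torch.randn(8, 64)       # embeddings of augmented samples
    print(nt_xent_loss(anchor, augmented).item())
```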
\ No newline at end of file diff --git a/data/2024/aaai/Tools Identification By On-Board Adaptation of Vision-and-Language Models b/data/2024/aaai/Tools Identification By On-Board Adaptation of Vision-and-Language Models new file mode 100644 index 0000000000..6dd1233ed7 --- /dev/null +++ b/data/2024/aaai/Tools Identification By On-Board Adaptation of Vision-and-Language Models @@ -0,0 +1 @@ +A robotic workshop assistant has been a long-standing grand challenge for robotics, speech, computer vision, and artificial intelligence (AI) research. We revisit the goal of visual identification of tools from human queries in the current era of Large Vision-and-Language models (like GPT-4). We find that current off-the-shelf models (that are trained on internet images) are unable to overcome the domain shift and unable to identify small, obscure tools in cluttered environments. Furthermore, these models are unable to match tools to their intended purpose or affordances. We present a novel system for online domain adaptation that can be run directly on a small on-board processor. The system uses Hyperdimensional Computing (HD), a fast and efficient neuromorphic method. We adapted CLIP to work with explicit ("I need the hammer") and implicit purpose-driven queries ("Drive these nails"), and even with depth images as input. This demo allows the user to try out various real tools and interact via free-form audio. \ No newline at end of file diff --git a/data/2024/aaai/Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation b/data/2024/aaai/Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation new file mode 100644 index 0000000000..bffacc6381 --- /dev/null +++ b/data/2024/aaai/Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation @@ -0,0 +1,5 @@ +This paper introduces a novel approach for topic modeling utilizing latent codebooks from Vector-Quantized Variational Auto-Encoder~(VQ-VAE), discretely encapsulating the rich information of the pre-trained embeddings such as the pre-trained language model. +From the novel interpretation of the latent codebooks and embeddings as conceptual bag-of-words, we propose a new generative topic model called Topic-VQ-VAE~(TVQ-VAE) which inversely generates the original documents related to the respective latent codebook. +The TVQ-VAE can visualize the topics with various generative distributions including the traditional BoW distribution and the autoregressive image generation. +Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures the topic context which reveals the underlying structures of the dataset and supports flexible forms of document generation. +Official implementation of the proposed TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE. \ No newline at end of file diff --git a/data/2024/aaai/TopoGCL: Topological Graph Contrastive Learning b/data/2024/aaai/TopoGCL: Topological Graph Contrastive Learning new file mode 100644 index 0000000000..a884cfc2f3 --- /dev/null +++ b/data/2024/aaai/TopoGCL: Topological Graph Contrastive Learning @@ -0,0 +1 @@ +Graph contrastive learning (GCL) has recently emerged as a new concept which allows for capitalizing on the strengths of graph neural networks (GNNs) to learn rich representations in a wide variety of applications which involve abundant unlabeled information. 
However, existing GCL approaches largely tend to overlook the important latent information on higher-order graph substructures. We address this limitation by introducing the concepts of topological invariance and extended persistence on graphs to GCL. In particular, we propose a new contrastive mode which targets topological representations of the two augmented views from the same graph, yielded by extracting latent shape properties of the graph at multiple resolutions. Along with the extended topological layer, we introduce a new extended persistence summary, namely, extended persistence landscapes (EPL) and derive its theoretical stability guarantees. Our extensive numerical results on biological, chemical, and social interaction graphs show that the new Topological Graph Contrastive Learning (TopoGCL) model delivers significant performance gains in unsupervised graph classification for 8 out of 12 considered datasets and also exhibits robustness under noisy scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Topological and Node Noise Filtering on 3D Meshes Using Graph Neural Networks (Student Abstract) b/data/2024/aaai/Topological and Node Noise Filtering on 3D Meshes Using Graph Neural Networks (Student Abstract) new file mode 100644 index 0000000000..990c732f75 --- /dev/null +++ b/data/2024/aaai/Topological and Node Noise Filtering on 3D Meshes Using Graph Neural Networks (Student Abstract) @@ -0,0 +1 @@ +Topological and node noise filtration are typically considered separately. Graph Neural Networks (GNN) are commonly used for node noise filtration, as they offer high efficiency and low exploitation costs. This paper explores the solution of joint node and topological noise filtration through the use of graph neural networks. Since treating a 3D mesh as a graph is challenging, an indicator function grid representation is employed as input for GNNs to perform the joint filtering. The resulting machine learning model is inspired by point cloud to mesh reconstruction algorithms and demonstrates low computational requirements during inference, producing successful results for smooth, watertight 3D models. \ No newline at end of file diff --git a/data/2024/aaai/Toward More Generalized Malicious URL Detection Models b/data/2024/aaai/Toward More Generalized Malicious URL Detection Models new file mode 100644 index 0000000000..ca4e0f1de9 --- /dev/null +++ b/data/2024/aaai/Toward More Generalized Malicious URL Detection Models @@ -0,0 +1 @@ +This paper reveals a data bias issue that can profoundly hinder the performance of machine learning models in malicious URL detection. We describe how such bias can be diagnosed using interpretable machine learning techniques and further argue that such biases naturally exist in the real world security data for training a classification model. To counteract these challenges, we propose a debiased training strategy that can be applied to most deep-learning based models to alleviate the negative effects of the biased features. The solution is based on the technique of adversarial training to train deep neural networks learning invariant embedding from biased data. Through extensive experimentation, we substantiate that our innovative strategy fosters superior generalization capabilities across both CNN-based and RNN-based detection models. 
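(Editor's illustration, not part of the paper: the abstract above describes adversarial training for bias-invariant embeddings but does not spell out the architecture; one common realization of that idea is a gradient-reversal adversary that tries to predict the biased attribute from the shared embedding. The sketch below uses purely illustrative feature dimensions, attribute names, and random toy data, and is not the authors' model.)

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DebiasedURLModel(nn.Module):
    """Main head predicts malicious vs. benign; the adversary head tries to predict a
    biased attribute (e.g., which source a URL came from) from the same embedding.
    Reversed gradients push the encoder toward bias-invariant embeddings."""
    def __init__(self, in_dim=32, hid=64, n_sources=3, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                     nn.Linear(hid, hid), nn.ReLU())
        self.classifier = nn.Linear(hid, 2)          # malicious / benign
        self.adversary = nn.Linear(hid, n_sources)   # predicts the biased attribute

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.adversary(GradReverse.apply(z, self.lam))

if __name__ == "__main__":
    torch.manual_seed(0)
    model = DebiasedURLModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    x = torch.randn(256, 32)                # toy URL feature vectors (illustrative)
    y = torch.randint(0, 2, (256,))         # malicious / benign labels
    b = torch.randint(0, 3, (256,))         # biased attribute (e.g., data source)
    for _ in range(5):
        logits, bias_logits = model(x)
        loss = ce(logits, y) + ce(bias_logits, b)   # adversary gradient reversed inside
        opt.zero_grad()
        loss.backward()
        opt.step()
    print("final loss:", loss.item())
```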
The findings presented in this work not only expose a latent issue in the field but also provide an actionable remedy, marking a significant step forward in the pursuit of more reliable and robust malicious URL detection. \ No newline at end of file diff --git a/data/2024/aaai/Toward Open-Set Human Object Interaction Detection b/data/2024/aaai/Toward Open-Set Human Object Interaction Detection new file mode 100644 index 0000000000..eab6227d80 --- /dev/null +++ b/data/2024/aaai/Toward Open-Set Human Object Interaction Detection @@ -0,0 +1 @@ +This work is oriented toward the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to in-domain ones which have seen improvements in zero-shot HOI detection. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We utilize a disentangled image-text contrastive learning metric for training and connect the bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model can benefit from the open-set object detector and the VLM to detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes. \ No newline at end of file diff --git a/data/2024/aaai/Toward Robustness in Multi-Label Classification: A Data Augmentation Strategy against Imbalance and Noise b/data/2024/aaai/Toward Robustness in Multi-Label Classification: A Data Augmentation Strategy against Imbalance and Noise new file mode 100644 index 0000000000..8670d1967c --- /dev/null +++ b/data/2024/aaai/Toward Robustness in Multi-Label Classification: A Data Augmentation Strategy against Imbalance and Noise @@ -0,0 +1 @@ +Multi-label classification poses challenges due to imbalanced and noisy labels in training data. In this paper, we propose a unified data augmentation method, named BalanceMix, to address these challenges. Our approach includes two samplers for imbalanced labels, generating minority-augmented instances with high diversity. It also refines multi-labels at the label-wise granularity, categorizing noisy labels as clean, re-labeled, or ambiguous for robust optimization. Extensive experiments on three benchmark datasets demonstrate that BalanceMix outperforms existing state-of-the-art methods. We release the code at https://github.com/DISL-Lab/BalanceMix. \ No newline at end of file diff --git a/data/2024/aaai/Towards Automated Chinese Ancient Character Restoration: A Diffusion-Based Method with a New Dataset b/data/2024/aaai/Towards Automated Chinese Ancient Character Restoration: A Diffusion-Based Method with a New Dataset new file mode 100644 index 0000000000..5fa0ba5d50 --- /dev/null +++ b/data/2024/aaai/Towards Automated Chinese Ancient Character Restoration: A Diffusion-Based Method with a New Dataset @@ -0,0 +1 @@ +Automated Chinese ancient character restoration (ACACR) remains a challenging task due to its historical significance and aesthetic complexity. 
Existing methods are constrained by non-professional masks and even overfitting when training on small-scale datasets, which hinder their interdisciplinary application to traditional fields. In this paper, we are proud to introduce the Chinese Ancient Rubbing and Manuscript Character Dataset (ARMCD), which consists of 15,553 real-world ancient single-character images with 42 rubbings and manuscripts, covering the works of over 200 calligraphy artists spanning from 200 to 1,800 AD. We are also dedicated to providing professional synthetic masks by extracting localized erosion from real eroded images. Moreover, we propose DiffACR (Diffusion model for automated Chinese Ancient Character Restoration), a diffusion-based method for the ACACR task. Specifically, we regard the synthesis of eroded images as a special form of cold diffusion on uneroded ones and extract the prior mask directly from the eroded images. Our experiments demonstrate that our method comprehensively outperforms most existing methods on the proposed ARMCD. Dataset and code are available at https://github.com/lhl322001/DiffACR. \ No newline at end of file diff --git a/data/2024/aaai/Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education b/data/2024/aaai/Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education new file mode 100644 index 0000000000..1d919e0891 --- /dev/null +++ b/data/2024/aaai/Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education @@ -0,0 +1 @@ +The recent large language models (LLMs), e.g., ChatGPT, have been able to generate human-like and fluent responses when provided with specific instructions. While admitting the convenience brought by technological advancement, educators also have concerns that students might leverage LLMs to complete their writing assignments and pass them off as their original work. Although many AI content detection studies have been conducted as a result of such concerns, most of these prior studies modeled AI content detection as a classification problem, assuming that a text is either entirely human-written or entirely AI-generated. In this study, we investigated AI content detection in a rarely explored yet realistic setting where the text to be detected is collaboratively written by human and generative LLMs (termed as hybrid text for simplicity). We first formalized the detection task as identifying the transition points between human-written content and AI-generated content from a given hybrid text (boundary detection). We constructed a hybrid essay dataset by partially and randomly removing sentences from the original student-written essays and then instructing ChatGPT to fill in for the incomplete essays. Then we proposed a two-step detection approach where we (1) separated AI-generated content from human-written content during the encoder training process; and (2) calculated the distances between every two adjacent prototypes (a prototype is the mean of a set of consecutive sentences from the hybrid text in the embedding space) and assumed that the boundaries exist between the two adjacent prototypes that have the furthest distance from each other. 
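(Editor's illustration, not part of the paper: a minimal sketch of the second step described above, placing the boundary between the two adjacent prototypes that are furthest apart. The encoder-training step is omitted, sentence embeddings are assumed to be given by some off-the-shelf encoder, and the window size, cosine distance, and toy data are illustrative assumptions.)

```python
import numpy as np

def detect_boundary(sentence_embeddings, proto_size=3):
    """Form prototypes as means of consecutive windows of `proto_size` sentence
    embeddings, then place the boundary between the two adjacent prototypes with
    the largest cosine distance. Returns the index of the first sentence after
    the predicted boundary."""
    E = np.asarray(sentence_embeddings, dtype=float)
    protos, spans = [], []
    for start in range(0, len(E), proto_size):
        window = E[start:start + proto_size]
        protos.append(window.mean(axis=0))
        spans.append((start, min(start + proto_size, len(E))))
    protos = np.stack(protos)
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    dists = 1.0 - np.sum(protos[:-1] * protos[1:], axis=1)   # adjacent cosine distances
    k = int(np.argmax(dists))
    return spans[k][1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 16
    human_center, ai_center = rng.normal(size=dim), rng.normal(size=dim)
    human = human_center + 0.3 * rng.normal(size=(9, dim))   # stand-in human-written embeddings
    ai = ai_center + 0.3 * rng.normal(size=(6, dim))         # stand-in AI-generated embeddings
    essay = np.vstack([human, ai])
    print("predicted index of the first AI-written sentence:", detect_boundary(essay, proto_size=3))
```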
Through extensive experiments, we observed the following main findings: (1) the proposed approach consistently outperformed the baseline methods across different experiment settings; (2) the encoder training process (i.e., step 1 of the above two-step approach) can significantly boost the performance of the proposed approach; (3) when detecting boundaries for single-boundary hybrid essays, the proposed approach could be enhanced by adopting a relatively large prototype size (i.e., the number of sentences needed to calculate a prototype), leading to a 22% improvement (against the best baseline method) in the In-Domain evaluation and an 18% improvement in the Out-of-Domain evaluation. \ No newline at end of file diff --git a/data/2024/aaai/Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval b/data/2024/aaai/Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval new file mode 100644 index 0000000000..bfba9c9721 --- /dev/null +++ b/data/2024/aaai/Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval @@ -0,0 +1 @@ +Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, these existing strategies are often sub-optimal since they ignore the modality imbalance problem, i.e., the semantic richness inherent in videos far exceeds that of a given limited-length sentence. Therefore, in pursuit of better alignment, a natural idea is enhancing the video modality to filter out query-irrelevant semantics, and enhancing the text modality to capture more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment through enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions associated with query words in frame-level features while suppressing irrelevant parts. Therefore, the enhanced video contains less redundant semantics and is more balanced with the textual modality. Second, we enhance the textual modality at the segment-sentence level by learning complementary knowledge from context sentences and ground-truth segments. With the knowledge added to the query, the textual modality thus maintains more meaningful semantics and is more balanced with the video modality. By implementing two levels of MESM, the semantic information from both modalities is more balanced to align, thereby bridging the modality gap. Experiments on three widely used benchmarks, including the out-of-distribution settings, show that the proposed framework achieves a new state-of-the-art performance with notable generalization ability (e.g., 4.42% and 7.69% average gains of R1@0.7 on Charades-STA and Charades-CG). The code will be available at https://github.com/lntzm/MESM.
However, the scarcity of human-labeled data for languages beyond English poses a significant challenge in developing such systems. In this work, we propose a Language-Independent scoring approach to evaluate speech without relying on labeled data in the target language. We introduce a multilingual speech scoring system that leverages representations from the wav2vec 2.0 XLSR model and a forced-alignment technique based on CTC-Segmentation to construct speech features. These features are used to train a machine learning model to predict pronunciation and fluency scores. We demonstrate the potential of our method by predicting expert ratings on a speech dataset spanning five languages (English, French, Spanish, German, and Portuguese), and comparing its performance against Language-Specific models trained individually on each language, as well as a jointly-trained model on all languages. Results indicate that our approach shows promise as an initial step towards universal, language-independent speech scoring. \ No newline at end of file diff --git a/data/2024/aaai/Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders b/data/2024/aaai/Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders new file mode 100644 index 0000000000..fb448fb76f --- /dev/null +++ b/data/2024/aaai/Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders @@ -0,0 +1 @@ +Learning 3D representations plays a critical role in masked autoencoder (MAE) based pre-training methods for point clouds, including both single-modal and cross-modal MAE. Specifically, although cross-modal MAE methods learn strong 3D representations with the aid of knowledge from other modalities, they often suffer from heavy computational burdens and heavily rely on massive cross-modal data pairs that are often unavailable, which hinders their applications in practice. Instead, single-modal methods with solely point clouds as input are preferred in real applications due to their simplicity and efficiency. However, such methods easily learn limited 3D representations when only global random masking is applied to the input. To learn compact 3D representations, we propose a simple yet effective Point Feature Enhancement Masked Autoencoder (Point-FEMAE), which mainly consists of a global branch and a local branch to capture latent semantic features. Specifically, to learn more compact features, a shared-parameter Transformer encoder is introduced to extract point features from the global and local unmasked patches obtained by global random and local block mask strategies, followed by a specific decoder for reconstruction. Meanwhile, to further enhance features in the local branch, we propose a Local Enhancement Module with local patch convolution to perceive fine-grained local context at larger scales. Our method significantly improves the pre-training efficiency compared to cross-modal alternatives, and extensive downstream experiments underscore the state-of-the-art effectiveness, particularly outperforming our baseline (Point-MAE) by 5.16%, 5.00%, and 5.04% in three variants of ScanObjectNN, respectively. Code is available at https://github.com/zyh16143998882/AAAI24-PointFEMAE.
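(Editor's illustration, not part of the paper: a minimal sketch contrasting the two masking strategies named above, global random masking versus local block masking over point-cloud patches. The mask ratio, patch count, and use of patch-centre nearest neighbours are illustrative assumptions; the shared-parameter encoder, decoder, and Local Enhancement Module are not shown.)

```python
import numpy as np

def global_random_mask(n_patches, mask_ratio=0.6, rng=None):
    """Global branch: mask a random subset of patches anywhere in the cloud."""
    rng = rng or np.random.default_rng()
    n_mask = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_mask, replace=False)] = True
    return mask

def local_block_mask(patch_centers, mask_ratio=0.6, rng=None):
    """Local branch: mask a contiguous spatial block by hiding the patches nearest
    to a random seed patch, so an entire local region must be reconstructed."""
    rng = rng or np.random.default_rng()
    n = len(patch_centers)
    n_mask = int(round(n * mask_ratio))
    seed = rng.integers(n)
    d = np.linalg.norm(patch_centers - patch_centers[seed], axis=1)
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(d)[:n_mask]] = True   # the n_mask patches closest to the seed
    return mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.uniform(-1, 1, size=(64, 3))   # centres of 64 point patches (illustrative)
    g, l = global_random_mask(64, rng=rng), local_block_mask(centers, rng=rng)
    print("global masked:", int(g.sum()), "local masked:", int(l.sum()))
    # The unmasked patches from each branch would feed a shared-parameter encoder.
```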
\ No newline at end of file diff --git a/data/2024/aaai/Towards Continual Knowledge Graph Embedding via Incremental Distillation b/data/2024/aaai/Towards Continual Knowledge Graph Embedding via Incremental Distillation new file mode 100644 index 0000000000..779535a007 --- /dev/null +++ b/data/2024/aaai/Towards Continual Knowledge Graph Embedding via Incremental Distillation @@ -0,0 +1 @@ +Traditional knowledge graph embedding (KGE) methods typically require preserving the entire knowledge graph (KG) with significant training costs when new knowledge emerges. To address this issue, the continual knowledge graph embedding (CKGE) task has been proposed to train the KGE model by learning emerging knowledge efficiently while simultaneously preserving decent old knowledge. However, the explicit graph structure in KGs, which is critical for the above goal, has been heavily ignored by existing CKGE methods. On the one hand, existing methods usually learn new triples in a random order, destroying the inner structure of new KGs. On the other hand, old triples are preserved with equal priority, failing to alleviate catastrophic forgetting effectively. In this paper, we propose a competitive method for CKGE based on incremental distillation (IncDE), which considers the full use of the explicit graph structure in KGs. First, to optimize the learning order, we introduce a hierarchical strategy, ranking new triples for layer-by-layer learning. By employing the inter- and intra-hierarchical orders together, new triples are grouped into layers based on the graph structure features. Secondly, to preserve the old knowledge effectively, we devise a novel incremental distillation mechanism, which facilitates the seamless transfer of entity representations from the previous layer to the next one, promoting old knowledge preservation. Finally, we adopt a two-stage training paradigm to avoid the over-corruption of old knowledge influenced by under-trained new knowledge. Experimental results demonstrate the superiority of IncDE over state-of-the-art baselines. Notably, the incremental distillation mechanism contributes to improvements of 0.2%-6.5% in the mean reciprocal rank (MRR) score. More exploratory experiments validate the effectiveness of IncDE in proficiently learning new knowledge while preserving old knowledge across all time steps. \ No newline at end of file diff --git a/data/2024/aaai/Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding b/data/2024/aaai/Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding new file mode 100644 index 0000000000..fbd186dc5d --- /dev/null +++ b/data/2024/aaai/Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding @@ -0,0 +1 @@ +Deep neural networks are susceptible to catastrophic forgetting when trained on sequential tasks. Various continual learning (CL) methods often rely on exemplar buffers or/and network expansion for balancing model stability and plasticity, which, however, compromises their practical value due to privacy and memory concerns. Instead, this paper considers a strict yet realistic setting, where the training data from previous tasks is unavailable and the model size remains relatively constant during sequential training. To achieve such desiderata, we propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion. 
This is achieved by the synergy between two key components: HSIC-Bottleneck Orthogonalization (HBO) implements non-overwritten parameter updates mediated by the Hilbert-Schmidt independence criterion in an orthogonal space, and EquiAngular Embedding (EAE) enhances decision boundary adaptation between old and new tasks with predefined basis vectors. Extensive experiments demonstrate that our method achieves competitive accuracy performance, even while using no exemplar buffer and a model only 1.02x the size of the base model. \ No newline at end of file diff --git a/data/2024/aaai/Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model b/data/2024/aaai/Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model new file mode 100644 index 0000000000..96fcb70ee1 --- /dev/null +++ b/data/2024/aaai/Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model @@ -0,0 +1 @@ +Text-guided motion synthesis aims to generate 3D human motion that not only precisely reflects the textual description but also reveals the motion details as much as possible. Pioneering methods explore the diffusion model for text-to-motion synthesis and obtain significant superiority. However, these methods conduct diffusion processes either on the raw data distribution or the low-dimensional latent space, which typically suffer from modality inconsistency or a scarcity of detail. To tackle this problem, we propose a novel Basic-to-Advanced Hierarchical Diffusion Model, named B2A-HDM, to collaboratively exploit low-dimensional and high-dimensional diffusion models for high-quality, detailed motion synthesis. Specifically, the basic diffusion model in low-dimensional latent space provides an intermediate denoising result that is consistent with the textual description, while the advanced diffusion model in high-dimensional latent space focuses on the following detail-enhancing denoising process. Besides, we introduce a multi-denoiser framework for the advanced diffusion model to ease the learning of the high-dimensional model and fully explore the generative potential of the diffusion model. Quantitative and qualitative experiment results on two text-to-motion benchmarks (HumanML3D and KIT-ML) demonstrate that B2A-HDM can outperform existing state-of-the-art methods in terms of fidelity, modality consistency, and diversity. \ No newline at end of file diff --git a/data/2024/aaai/Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings b/data/2024/aaai/Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings new file mode 100644 index 0000000000..fcbea7b291 --- /dev/null +++ b/data/2024/aaai/Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings @@ -0,0 +1 @@ +In Time Series Classification (TSC), temporal pooling methods that consider sequential information have been proposed. However, we found that each temporal pooling has a distinct mechanism, and can perform better or worse depending on the time series data. We term this fixed pooling mechanism a single perspective of temporal poolings. In this paper, we propose a novel temporal pooling method with diverse perspective learning: Selection over Multiple Temporal Poolings (SoM-TP). SoM-TP dynamically selects the optimal temporal pooling among multiple methods for each data instance via attention.
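(Editor's illustration, not part of the paper: a minimal sketch of attention-based selection over several temporal poolings of per-timestep features. The pooling candidates (max, average, last step), layer sizes, and soft weighting are illustrative assumptions; SoM-TP's actual pooling set, DPLN, perspective loss, and MCL-style selection are not reproduced.)

```python
import torch
import torch.nn as nn

class SelectionOverPoolings(nn.Module):
    """Compute several temporal poolings of per-timestep features and let a per-sample
    attention over the pooling outputs softly pick the most suitable one."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)          # scores each pooling candidate
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, h):                           # h: (batch, time, feat_dim)
        pooled = torch.stack(
            [h.max(dim=1).values, h.mean(dim=1), h[:, -1, :]], dim=1
        )                                           # (batch, 3 poolings, feat_dim)
        weights = torch.softmax(self.attn(pooled).squeeze(-1), dim=1)   # (batch, 3)
        selected = (weights.unsqueeze(-1) * pooled).sum(dim=1)          # weighted mix
        return self.classifier(selected), weights

if __name__ == "__main__":
    torch.manual_seed(0)
    model = SelectionOverPoolings(feat_dim=32, n_classes=5)
    features = torch.randn(4, 100, 32)              # e.g., CNN features of 4 time series
    logits, pooling_weights = model(features)
    print(logits.shape, pooling_weights[0])         # which pooling each series leans on
```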
The dynamic pooling selection is motivated by the ensemble concept of Multiple Choice Learning (MCL), which selects the best among multiple outputs. The pooling selection by SoM-TP's attention enables a non-iterative pooling ensemble within a single classifier. Additionally, we define a perspective loss and Diverse Perspective Learning Network (DPLN). The loss works as a regularizer to reflect all the pooling perspectives from DPLN. Our perspective analysis using Layer-wise Relevance Propagation (LRP) reveals the limitation of a single perspective and ultimately demonstrates diverse perspective learning of SoM-TP. We also show that SoM-TP outperforms CNN models based on other temporal poolings and state-of-the-art models in TSC with extensive UCR/UEA repositories. \ No newline at end of file diff --git a/data/2024/aaai/Towards Dynamic Spatial-Temporal Graph Learning: A Decoupled Perspective b/data/2024/aaai/Towards Dynamic Spatial-Temporal Graph Learning: A Decoupled Perspective new file mode 100644 index 0000000000..104b852200 --- /dev/null +++ b/data/2024/aaai/Towards Dynamic Spatial-Temporal Graph Learning: A Decoupled Perspective @@ -0,0 +1 @@ +With the progress of urban transportation systems, a significant amount of high-quality traffic data is continuously collected through streaming manners, which has propelled the prosperity of the field of spatial-temporal graph prediction. In this paper, rather than solely focusing on designing powerful models for static graphs, we shift our focus to spatial-temporal graph prediction in the dynamic scenario, which involves a continuously expanding and evolving underlying graph. To address inherent challenges, a decoupled learning framework (DLF) is proposed in this paper, which consists of a spatial-temporal graph learning network (DSTG) with a specialized decoupling training strategy. Incorporating inductive biases of time-series structures, DSTG can interpret time dependencies into latent trend and seasonal terms. To enable prompt adaptation to the evolving distribution of the dynamic graph, our decoupling training strategy is devised to iteratively update these two types of patterns. Specifically, for learning seasonal patterns, we conduct thorough training for the model using a long time series (e.g., three months of data). To enhance the learning ability of the model, we also introduce the masked auto-encoding mechanism. During this period, we frequently update trend patterns to expand new information from dynamic graphs. Considering both effectiveness and efficiency, we develop a subnet sampling strategy to select a few representative nodes for fine-tuning the weights of the model. These sampled nodes cover unseen patterns and previously learned patterns. Experiments on dynamic spatial-temporal graph datasets further demonstrate the competitive performance, superior efficiency, and strong scalability of the proposed framework. \ No newline at end of file diff --git a/data/2024/aaai/Towards Effective and General Graph Unlearning via Mutual Evolution b/data/2024/aaai/Towards Effective and General Graph Unlearning via Mutual Evolution new file mode 100644 index 0000000000..dbcbe6ac69 --- /dev/null +++ b/data/2024/aaai/Towards Effective and General Graph Unlearning via Mutual Evolution @@ -0,0 +1 @@ +With the rapid advancement of AI applications, the growing needs for data privacy and model robustness have highlighted the importance of machine unlearning, especially in thriving graph-based scenarios. 
However, most existing graph unlearning strategies primarily rely on well-designed architectures or manual processes, rendering them less user-friendly and posing challenges in terms of deployment efficiency. Furthermore, striking a balance between unlearning performance and framework generalization is also a pivotal concern. To address the above issues, we propose Mutual Evolution Graph Unlearning (MEGU), a new mutual evolution paradigm that simultaneously evolves the predictive and unlearning capacities of graph unlearning. By incorporating the aforementioned two components, MEGU ensures complementary optimization in a unified training framework that aligns with the prediction and unlearning requirements. Extensive experiments on 9 graph benchmark datasets demonstrate the superior performance of MEGU in addressing unlearning requirements at the feature, node, and edge levels. Specifically, MEGU achieves average performance improvements of 2.7%, 2.5%, and 3.2% across these three levels of unlearning tasks when compared to state-of-the-art baselines. Furthermore, MEGU exhibits satisfactory training efficiency, reducing time and space overhead by an average of 159.8x and 9.6x, respectively, in comparison to retraining the GNN from scratch. \ No newline at end of file diff --git a/data/2024/aaai/Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks b/data/2024/aaai/Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks new file mode 100644 index 0000000000..7b68fdd1db --- /dev/null +++ b/data/2024/aaai/Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks @@ -0,0 +1 @@ +Diffusion-based Image Editing (DIE) is an emerging research hotspot, which often applies a semantic mask to control the target area for diffusion-based editing. However, most existing solutions obtain these masks via manual operations or off-line processing, greatly reducing their efficiency. In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps. To reduce the noise of attention maps and achieve full automation, we equip InstDiffEdit with a training-free refinement scheme to adaptively aggregate the attention distributions for automatic yet accurate mask generation. Meanwhile, to supplement the existing evaluations of DIE, we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods. To validate InstDiffEdit, we also conduct extensive experiments on ImageNet and Imagen, and compare it with a range of SOTA methods. The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed, i.e., +5 to +6 times.
Our code is available at https://anonymous.4open.science/r/InstDiffEdit-C306 \ No newline at end of file diff --git a/data/2024/aaai/Towards Efficient Verification of Quantized Neural Networks b/data/2024/aaai/Towards Efficient Verification of Quantized Neural Networks new file mode 100644 index 0000000000..299bcd85c1 --- /dev/null +++ b/data/2024/aaai/Towards Efficient Verification of Quantized Neural Networks @@ -0,0 +1 @@ +Quantization replaces floating point arithmetic with integer arithmetic in deep neural network models, providing more efficient on-device inference with less power and memory. In this work, we propose a framework for formally verifying the properties of quantized neural networks. Our baseline technique is based on integer linear programming which guarantees both soundness and completeness. We then show how efficiency can be improved by utilizing gradient-based heuristic search methods and also bound-propagation techniques. We evaluate our approach on perception networks quantized with PyTorch. Our results show that we can verify quantized networks with better scalability and efficiency than the previous state of the art. \ No newline at end of file diff --git a/data/2024/aaai/Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning b/data/2024/aaai/Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning new file mode 100644 index 0000000000..bf0c116a85 --- /dev/null +++ b/data/2024/aaai/Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning @@ -0,0 +1 @@ +In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, those methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture in the retrieval phase. This solution ingeniously balances the coarse and fine granularity of retrieval content. Moreover, it also strikes a harmonious equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate its efficiency and effectiveness. Notably, our method achieves comparable performance with the current state-of-the-art methods while being nearly 50 times faster.
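(Editor's illustration, not part of the paper: a minimal sketch of the two-stage retrieval flow described above, namely fast top-k recall with coarse video-level embeddings followed by reranking of only those candidates with a more expensive fine-grained scorer. The embedding dimensions, candidate count k, and the stand-in fine scorer are illustrative assumptions; the TIB and Pearson Constraint are not shown.)

```python
import numpy as np

def retrieve_two_stage(text_emb, coarse_video_embs, fine_score_fn, k=50):
    """Stage 1: rank all videos by a cheap coarse similarity and keep the top-k.
    Stage 2: rerank only those candidates with an expensive fine-grained scorer."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    t = normalize(text_emb)
    v = normalize(coarse_video_embs)
    coarse_scores = v @ t                            # cosine similarity to every video
    candidates = np.argsort(-coarse_scores)[:k]      # fast recall of top-k candidates
    fine_scores = np.array([fine_score_fn(i) for i in candidates])
    return candidates[np.argsort(-fine_scores)]      # reranked candidate indices

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=128)
    videos = rng.normal(size=(10_000, 128))          # coarse (video-level) embeddings
    # Stand-in for a fine-grained frame/word interaction score (illustrative only).
    fine = lambda idx: float(videos[idx] @ text) + rng.normal(scale=0.01)
    top = retrieve_two_stage(text, videos, fine, k=50)
    print("top-5 after reranking:", top[:5])
```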
\ No newline at end of file diff --git a/data/2024/aaai/Towards Epistemic-Doxastic Planning with Observation and Revision b/data/2024/aaai/Towards Epistemic-Doxastic Planning with Observation and Revision new file mode 100644 index 0000000000..ff9debf5e9 --- /dev/null +++ b/data/2024/aaai/Towards Epistemic-Doxastic Planning with Observation and Revision @@ -0,0 +1 @@ +Epistemic planning is useful in situations where multiple agents have different knowledge and beliefs about the world, such as in robot-human interaction. One aspect that has been largely neglected in the literature is planning with observations in the presence of false beliefs. This is a particularly challenging problem because it requires belief revision. We introduce a simple specification language for reasoning about actions with knowledge and belief. We demonstrate our approach on well-known false-belief tasks such as the Sally-Anne Task and compare it to other action languages. Our logic leads to an epistemic planning formalism that is expressive enough to model second-order false-belief tasks, yet has the same computational complexity as classical planning. \ No newline at end of file diff --git a/data/2024/aaai/Towards Equipping Transformer with the Ability of Systematic Compositionality b/data/2024/aaai/Towards Equipping Transformer with the Ability of Systematic Compositionality new file mode 100644 index 0000000000..e89a33b095 --- /dev/null +++ b/data/2024/aaai/Towards Equipping Transformer with the Ability of Systematic Compositionality @@ -0,0 +1 @@ +One of the key factors in language productivity and human cognition is the ability of Systematic Compositionality, which refers to understanding composed, unseen examples of seen primitives. However, recent evidence reveals that the Transformers have difficulty in generalizing the composed context based on the seen primitives. To this end, we take the first step to propose a compositionality-aware Transformer called CAT and two novel pre-training tasks to facilitate the systematic compositionality. We tentatively provide a successful implementation of a multi-layer CAT on the basis of the especially popular BERT. The experimental results demonstrate that CAT outperforms baselines on compositionality-aware tasks with minimal impact on effectiveness on standardized language understanding tasks. \ No newline at end of file diff --git a/data/2024/aaai/Towards Evidential and Class Separable Open Set Object Detection b/data/2024/aaai/Towards Evidential and Class Separable Open Set Object Detection new file mode 100644 index 0000000000..b1df3e2818 --- /dev/null +++ b/data/2024/aaai/Towards Evidential and Class Separable Open Set Object Detection @@ -0,0 +1 @@ +Detecting in open-world scenarios poses a formidable challenge for models intended for real-world deployment. The advanced closed set object detectors achieve impressive performance under the closed set setting, but often produce overconfident misprediction on unknown objects due to the lack of supervision. In this paper, we propose a novel Evidential Object Detector (EOD) to formulate the Open Set Object Detection (OSOD) problem from the perspective of Evidential Deep Learning (EDL) theory, which quantifies classification uncertainty by placing the Dirichlet Prior over the categorical distribution parameters. The task-specific customized evidential framework, equipped with meticulously designed model architecture and loss function, effectively bridges the gap between EDL theory and detection tasks. 
Moreover, we utilize contrastive learning as an implicit means of evidential regularization and to encourage class separation in the latent space. In addition, we innovatively model the background uncertainty to further improve the unknown discovery ability. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms existing ones. \ No newline at end of file diff --git a/data/2024/aaai/Towards Explainable Joint Models via Information Theory for Multiple Intent Detection and Slot Filling b/data/2024/aaai/Towards Explainable Joint Models via Information Theory for Multiple Intent Detection and Slot Filling new file mode 100644 index 0000000000..8657fdb5de --- /dev/null +++ b/data/2024/aaai/Towards Explainable Joint Models via Information Theory for Multiple Intent Detection and Slot Filling @@ -0,0 +1 @@ +Recent joint models for multi-intent detection and slot filling have obtained promising results through modeling the unidirectional or bidirectional guidance between intent and slot. However, existing works design joint models heuristically and lack theoretical exploration, including (1) theoretical measurement of the joint-interaction quality; (2) explainability of design and optimization methods of joint models, which may limit the performance and efficiency of designs. In this paper, we mathematically define the cross-task information gain (CIG) to measure the quality of joint processes from an information-theoretic perspective and discover an implicit optimization of CIG in previous models. Based on this, we propose a novel multi-stage iterative framework with theoretical effectiveness, explainability, and convergence, which can explicitly optimize information for cross-task interactions. Further, we devise an information-based joint model (InfoJoint) that conforms to this theoretical framework to gradually reduce the cross-task propagation of erroneous semantics through CIG iterative maximization. Extensive experimental results on two public datasets show that InfoJoint outperforms the state-of-the-art models by a large margin. \ No newline at end of file diff --git a/data/2024/aaai/Towards Fair Graph Federated Learning via Incentive Mechanisms b/data/2024/aaai/Towards Fair Graph Federated Learning via Incentive Mechanisms new file mode 100644 index 0000000000..901d7d5b63 --- /dev/null +++ b/data/2024/aaai/Towards Fair Graph Federated Learning via Incentive Mechanisms @@ -0,0 +1,2 @@ +Graph federated learning (FL) has emerged as a pivotal paradigm enabling multiple agents to collaboratively train a graph model while preserving local data privacy. Yet, current efforts overlook a key issue: agents are self-interested and would be hesitant to share data without fair and satisfactory incentives. This paper is the first endeavor to address this issue by studying the incentive mechanism for graph federated learning. We identify a unique phenomenon in graph federated learning: the presence of agents posing potential harm to the federation and agents contributing with delays. This stands in contrast to previous FL incentive mechanisms that assume all agents contribute positively and in a timely manner. +In view of this, this paper presents a novel incentive mechanism tailored for fair graph federated learning, integrating incentives derived from both model gradient and payoff.
To achieve this, we first introduce an agent valuation function aimed at quantifying agent contributions through the introduction of two criteria: gradient alignment and graph diversity. Moreover, due to the high heterogeneity in graph federated learning, striking a balance between accuracy and fairness becomes particularly crucial. We introduce motif prototypes to enhance accuracy, communicated between the server and agents, enhancing global model aggregation and aiding agents in local model optimization. Extensive experiments show that our model achieves the best trade-off between accuracy and the fairness of model gradient, as well as superior payoff fairness. \ No newline at end of file diff --git a/data/2024/aaai/Towards Fairer Centroids in K-means Clustering b/data/2024/aaai/Towards Fairer Centroids in K-means Clustering new file mode 100644 index 0000000000..f83632eae9 --- /dev/null +++ b/data/2024/aaai/Towards Fairer Centroids in K-means Clustering @@ -0,0 +1 @@ +There has been much recent interest in developing fair clustering algorithms that seek to do justice to the representation of groups defined along sensitive attributes such as race and sex. Within the centroid clustering paradigm, these algorithms are seen to generate clusterings where different groups are disadvantaged within different clusters with respect to their representativity, i.e., distance to centroid. In view of this deficiency, we propose a novel notion of cluster-level centroid fairness that targets the representativity unfairness borne by groups within each cluster, along with a metric to quantify the same. Towards operationalising this notion, we draw on ideas from political philosophy aligned with consideration for the worst-off group to develop Fair-Centroid; a new clustering method that focusses on enhancing the representativity of the worst-off group within each cluster. Our method uses an iterative optimisation paradigm wherein an initial cluster assignment is refined by reassigning objects to clusters such that the worst-off group in each cluster is benefitted. We compare our notion with a related fairness notion and show through extensive empirical evaluations on real-world datasets that our method significantly enhances cluster-level centroid fairness at low impact on cluster coherence. \ No newline at end of file diff --git a/data/2024/aaai/Towards Fairness in Online Service with K Servers and Its Application on Fair Food Delivery b/data/2024/aaai/Towards Fairness in Online Service with K Servers and Its Application on Fair Food Delivery new file mode 100644 index 0000000000..56ead50a57 --- /dev/null +++ b/data/2024/aaai/Towards Fairness in Online Service with K Servers and Its Application on Fair Food Delivery @@ -0,0 +1 @@ +The k-SERVER problem is one of the most prominent problems in online algorithms with several variants and extensions. However, simplifying assumptions like instantaneous server movements and zero service time has hitherto limited its applicability to real-world problems. In this paper, we introduce a realistic generalization of k-SERVER without such assumptions – the k-FOOD problem, where requests with source-destination locations and an associated pickup time window arrive in an online fashion, and each has to be served by exactly one of the available k servers. The k-FOOD problem offers the versatility to model a variety of real-world use cases such as food delivery, ride sharing, and quick commerce. 
Moreover, motivated by the need for fairness in online platforms, we introduce the FAIR k-FOOD problem with the max-min objective. We establish that both k-FOOD and FAIR k-FOOD problems are strongly NP-hard and develop an optimal offline algorithm that arises naturally from a time-expanded flow network. Subsequently, we propose an online algorithm DOC4FOOD involving virtual movements of servers to the nearest request location. Experiments on a real-world food-delivery dataset, alongside synthetic datasets, establish the efficacy of the proposed algorithm against state-of-the-art fair food delivery algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Towards Fine-Grained HBOE with Rendered Orientation Set and Laplace Smoothing b/data/2024/aaai/Towards Fine-Grained HBOE with Rendered Orientation Set and Laplace Smoothing new file mode 100644 index 0000000000..e5f2ef7c95 --- /dev/null +++ b/data/2024/aaai/Towards Fine-Grained HBOE with Rendered Orientation Set and Laplace Smoothing @@ -0,0 +1 @@ +Human body orientation estimation (HBOE) aims to estimate the orientation of a human body relative to the camera’s frontal view. Despite recent advancements in this field, there still exist limitations in achieving fine-grained results. We identify certain defects and propose corresponding approaches as follows: 1). Existing datasets suffer from non-uniform angle distributions, resulting in sparse image data for certain angles. To provide comprehensive and high-quality data, we introduce RMOS (Rendered Model Orientation Set), a rendered dataset comprising 150K accurately labeled human instances with a wide range of orientations. 2). Directly using one-hot vector as labels may overlook the similarity between angle labels, leading to poor supervision. And converting the predictions from radians to degrees enlarges the regression error. To enhance supervision, we employ Laplace smoothing to vectorize the label, which contains more information. For fine-grained predictions, we adopt weighted Smooth-L1-loss to align predictions with the smoothed-label, thus providing robust supervision. 3). Previous works ignore body-part-specific information, resulting in coarse predictions. By employing local-window self-attention, our model could utilize different body part information for more precise orientation estimations. We validate the effectiveness of our method in the benchmarks with extensive experiments and show that our method outperforms state-of-the-art. Project is available at: https://github.com/Whalesong-zrs/Towards-Fine-grained-HBOE. \ No newline at end of file diff --git a/data/2024/aaai/Towards Holistic, Pragmatic and Multimodal Conversational Systems b/data/2024/aaai/Towards Holistic, Pragmatic and Multimodal Conversational Systems new file mode 100644 index 0000000000..b2f629534f --- /dev/null +++ b/data/2024/aaai/Towards Holistic, Pragmatic and Multimodal Conversational Systems @@ -0,0 +1 @@ +Language acquisition and utilization transcend the mere exchange of lexical units. Visual cues, prosody, gestures, body movements, and context play an undeniably crucial role. Humans naturally communicate multimodally, employing multiple channels and synthesizing information from diverse modalities. My research delves into the characterization and construction of multimodal models that seamlessly integrate data from multiple independent modalities. I will cover recent work that highlights the challenges, achievements, and opportunities towards developing capable multimodal discursive models. 
\ No newline at end of file diff --git a/data/2024/aaai/Towards Human-like Learning from Relational Structured Data b/data/2024/aaai/Towards Human-like Learning from Relational Structured Data new file mode 100644 index 0000000000..ef355ec653 --- /dev/null +++ b/data/2024/aaai/Towards Human-like Learning from Relational Structured Data @@ -0,0 +1 @@ +Relational structured data is a way of representing knowledge using nodes and edges, while also capturing the meaning of that knowledge in a structured form that can be used for machine learning. Compared with vision and natural language data, relational structured data represents and manipulates structured knowledge, which can be beneficial for tasks that involve reasoning or inference. On the other hand, vision and NLP deal more with unstructured data (like images and text), and they often require different types of models and algorithms to extract useful information or features from the data. Human-like Learning develops methods that can harness relational structures and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. With Human-like Learning, the learning algorithm is efficient and can adapt to new or unseen situations, which is crucial in real-world applications where environments may change unpredictably. Moreover, the models are easier for humans to understand and interpret, which is important for transparency and trust in AI systems. In this talk, we present our recent attempts towards human-like learning from relational structured data. \ No newline at end of file diff --git a/data/2024/aaai/Towards Large Certified Radius in Randomized Smoothing Using Quasiconcave Optimization b/data/2024/aaai/Towards Large Certified Radius in Randomized Smoothing Using Quasiconcave Optimization new file mode 100644 index 0000000000..d1f95379ec --- /dev/null +++ b/data/2024/aaai/Towards Large Certified Radius in Randomized Smoothing Using Quasiconcave Optimization @@ -0,0 +1 @@ +Randomized smoothing is currently the state-of-the-art method that provides certified robustness for deep neural networks. However, due to its excessively conservative nature, this method of incomplete verification often cannot achieve an adequate certified radius on real-world datasets. One way to obtain a larger certified radius is to use an input-specific algorithm instead of using a fixed Gaussian filter for all data points. Several methods based on this idea have been proposed, but they either suffer from high computational costs or gain marginal improvement in certified radius. In this work, we show that by exploiting the quasiconvex problem structure, we can find the optimal certified radii for most data points with slight computational overhead. This observation leads to an efficient and effective input-specific randomized smoothing algorithm. We conduct extensive experiments and empirical analysis on CIFAR-10 and ImageNet. The results show that the proposed method significantly enhances the certified radii with low computational overhead. 
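For context on the quantity that the randomized-smoothing abstract above seeks to enlarge, a minimal sketch of the standard certificate of Cohen et al. (2019) follows; the paper's input-specific, quasiconvex optimization of the noise level is not reproduced here, and the variable names are ours.

    from scipy.stats import norm

    def certified_radius(p_a, p_b, sigma):
        # Standard randomized-smoothing certificate: with Gaussian noise
        # N(0, sigma^2 I), if p_a lower-bounds the top-class probability and
        # p_b upper-bounds the runner-up probability, the smoothed prediction
        # is constant within an L2 ball of this radius.
        return 0.5 * sigma * (norm.ppf(p_a) - norm.ppf(p_b))

    # e.g. certified_radius(0.9, 0.05, 0.5) is roughly 0.73

An input-specific method, as described in the abstract, would then choose the noise level per sample to maximize this radius while keeping the smoothed classifier's prediction correct.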
\ No newline at end of file diff --git a/data/2024/aaai/Towards Learning and Explaining Indirect Causal Effects in Neural Networks b/data/2024/aaai/Towards Learning and Explaining Indirect Causal Effects in Neural Networks new file mode 100644 index 0000000000..60a2e6d1a9 --- /dev/null +++ b/data/2024/aaai/Towards Learning and Explaining Indirect Causal Effects in Neural Networks @@ -0,0 +1 @@ +Recently, there has been a growing interest in learning and explaining causal effects within Neural Network (NN) models. By virtue of NN architectures, previous approaches consider only direct and total causal effects assuming independence among input variables. We view an NN as a structural causal model (SCM) and extend our focus to include indirect causal effects by introducing feedforward connections among input neurons. We propose an ante-hoc method that captures and maintains direct, indirect, and total causal effects during NN model training. We also propose an algorithm for quantifying learned causal effects in an NN model and efficient approximation strategies for quantifying causal effects in high-dimensional data. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the causal effects learned by our ante-hoc method better approximate the ground truth effects compared to existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Towards Making Learnware Specification and Market Evolvable b/data/2024/aaai/Towards Making Learnware Specification and Market Evolvable new file mode 100644 index 0000000000..5e078ee6d4 --- /dev/null +++ b/data/2024/aaai/Towards Making Learnware Specification and Market Evolvable @@ -0,0 +1 @@ +The learnware paradigm aims to establish a market of numerous well-performed machine learning models, enabling users to leverage existing helpful models for their tasks instead of starting from scratch. Each learnware in the market is a model submitted by its developer, associated with a specification generated with the help of learnware market, representing the model's specialty and utility and enabling it to be identified for new user tasks. As the market continuously scales up, accommodating an ever-increasing number of learnwares, the critical challenge of the learnware paradigm is to effectively and efficiently identify the most helpful learnware(s) for a new user task without accessing the user's raw data. In this paper, to achieve increasingly accurate learnware characterization and identification along with a growing number of learnwares in the market, we propose an approach called Evolvable Learnware Specification with Index (ELSI). Specifically, based on the key idea of leveraging the task information within learnware specifications, we tackle the challenge of ascertaining the capabilities of models beyond their original training tasks, thereby enabling learnware specifications and the entire market to evolve continuously. Furthermore, through organizing learnwares and constructing specification indexes, we design a practical procedure to accurately and efficiently identify helpful learnwares without examining the entire market. Theoretical analysis and extensive experiments on a learnware market prototype encompassing thousands of models and covering six real-world scenarios validate the effectiveness and efficiency of our approach. 
\ No newline at end of file diff --git a/data/2024/aaai/Towards Model Extraction Attacks in GAN-Based Image Translation via Domain Shift Mitigation b/data/2024/aaai/Towards Model Extraction Attacks in GAN-Based Image Translation via Domain Shift Mitigation new file mode 100644 index 0000000000..845d110de0 --- /dev/null +++ b/data/2024/aaai/Towards Model Extraction Attacks in GAN-Based Image Translation via Domain Shift Mitigation @@ -0,0 +1 @@ +Model extraction attacks (MEAs) enable an attacker to replicate the functionality of a victim deep neural network (DNN) model by only querying its API service remotely, posing a severe threat to the security and integrity of pay-per-query DNN-based services. Although the majority of current research on MEAs has primarily concentrated on neural classifiers, there is a growing prevalence of image-to-image translation (I2IT) tasks in our everyday activities. However, techniques developed for MEA of DNN classifiers cannot be directly transferred to the case of I2IT, rendering the vulnerability of I2IT models to MEA attacks often underestimated. This paper unveils the threat of MEA in I2IT tasks from a new perspective. Diverging from the traditional approach of bridging the distribution gap between attacker queries and victim training samples, we opt to mitigate the effect caused by the different distributions, known as the domain shift. This is achieved by introducing a new regularization term that penalizes high-frequency noise, and seeking a flatter minimum to avoid overfitting to the shifted distribution. Extensive experiments on different image translation tasks, including image super-resolution and style transfer, are performed on different backbone victim models, and the new design consistently outperforms the baseline by a large margin across all metrics. A few real-life I2IT APIs are also verified to be extremely vulnerable to our attack, emphasizing the need for enhanced defenses and potentially revised API publishing policies. \ No newline at end of file diff --git a/data/2024/aaai/Towards Modeling Uncertainties of Self-Explaining Neural Networks via Conformal Prediction b/data/2024/aaai/Towards Modeling Uncertainties of Self-Explaining Neural Networks via Conformal Prediction new file mode 100644 index 0000000000..62bbdaaa7e --- /dev/null +++ b/data/2024/aaai/Towards Modeling Uncertainties of Self-Explaining Neural Networks via Conformal Prediction @@ -0,0 +1 @@ +Despite the recent progress in deep neural networks (DNNs), it remains challenging to explain the predictions made by DNNs. Existing explanation methods for DNNs mainly focus on post-hoc explanations where another explanatory model is employed to provide explanations. The fact that post-hoc methods can fail to reveal the actual original reasoning process of DNNs raises the need to build DNNs with built-in interpretability. Motivated by this, many self-explaining neural networks have been proposed to generate not only accurate predictions but also clear and intuitive insights into why a particular decision was made. However, existing self-explaining networks are limited in providing distribution-free uncertainty quantification for the two simultaneously generated prediction outcomes (i.e., a sample's final prediction and its corresponding explanations for interpreting that prediction). 
Importantly, they also fail to establish a connection between the confidence values assigned to the generated explanations in the interpretation layer and those allocated to the final predictions in the ultimate prediction layer. To tackle the aforementioned challenges, in this paper, we design a novel uncertainty modeling framework for self-explaining networks, which not only demonstrates strong distribution-free uncertainty modeling performance for the generated explanations in the interpretation layer but also excels in producing efficient and effective prediction sets for the final predictions based on the informative high-level basis explanations. We perform the theoretical analysis for the proposed framework. Extensive experimental evaluation demonstrates the effectiveness of the proposed uncertainty framework. \ No newline at end of file diff --git a/data/2024/aaai/Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA b/data/2024/aaai/Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA new file mode 100644 index 0000000000..c4bf22aa17 --- /dev/null +++ b/data/2024/aaai/Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA @@ -0,0 +1 @@ +Natural language explanation in visual question answer (VQA-NLE) aims to explain the decision-making process of models by generating natural language sentences to increase users' trust in the black-box systems. Existing post-hoc methods have achieved significant progress in obtaining a plausible explanation. However, such post-hoc explanations are not always aligned with human logical inference, suffering from the issues on: 1) Deductive unsatisfiability, the generated explanations do not logically lead to the answer; 2) Factual inconsistency, the model falsifies its counterfactual explanation for answers without considering the facts in images; and 3) Semantic perturbation insensitivity, the model can not recognize the semantic changes caused by small perturbations. These problems reduce the faithfulness of explanations generated by models. To address the above issues, we propose a novel self-supervised Multi-level Contrastive Learning based natural language Explanation model (MCLE) for VQA with semantic-level, image-level, and instance-level factual and counterfactual samples. MCLE extracts discriminative features and aligns the feature spaces from explanations with visual question and answer to generate more consistent explanations. We conduct extensive experiments, ablation analysis, and case study to demonstrate the effectiveness of our method on two VQA-NLE benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Towards Multi-Intent Spoken Language Understanding via Hierarchical Attention and Optimal Transport b/data/2024/aaai/Towards Multi-Intent Spoken Language Understanding via Hierarchical Attention and Optimal Transport new file mode 100644 index 0000000000..77be97afc3 --- /dev/null +++ b/data/2024/aaai/Towards Multi-Intent Spoken Language Understanding via Hierarchical Attention and Optimal Transport @@ -0,0 +1 @@ +Multi-Intent spoken language understanding (SLU) can handle complicated utterances expressing multiple intents, which has attracted increasing attention from researchers. 
Although existing models have achieved promising performance, most of them still suffer from two leading problems: (1) each intent has its specific scope and the semantic information outside the scope might potentially hinder accurate predictions, i.e., the scope barrier; (2) only the guidance from intent to slot is modeled but the guidance from slot to intent is often neglected, i.e., unidirectional guidance. In this paper, we propose a novel Multi-Intent SLU framework termed HAOT, which utilizes hierarchical attention to divide the scopes of each intent and applies optimal transport to achieve the mutual guidance between slot and intent. Experiments demonstrate that our model achieves state-of-the-art performance on two public Multi-Intent SLU datasets, obtaining a 3.4 improvement in overall accuracy on the MixATIS dataset compared to the previous best models. \ No newline at end of file diff --git a/data/2024/aaai/Towards Multi-Mode Outlier Robust Tensor Ring Decomposition b/data/2024/aaai/Towards Multi-Mode Outlier Robust Tensor Ring Decomposition new file mode 100644 index 0000000000..c72203c7f9 --- /dev/null +++ b/data/2024/aaai/Towards Multi-Mode Outlier Robust Tensor Ring Decomposition @@ -0,0 +1 @@ +Conventional Outlier Robust Tensor Decomposition (ORTD) approaches generally represent sparse outlier corruption within a specific mode. However, such an assumption, which may hold for matrices, proves inadequate when applied to high-order tensors. In the tensor domain, outlier corruption is prone to occur in multiple modes simultaneously. Addressing this limitation, this study proposes a novel ORTD approach by recovering low-rank tensors contaminated by outliers spanning multiple modes. In particular, we conceptualize outliers within high-order tensors as latent tensor group sparsity by decomposing the corrupted tensor into a sum of multiple latent components, where each latent component is exclusive to outliers within a particular direction. Thus, it can effectively mitigate the outlier corruptions prevalent in high-order tensors across multiple modes. To theoretically guarantee recovery performance, we rigorously analyze a non-asymptotic upper bound of the estimation error for the proposed ORTD approach. In the optimization process, we develop an efficient alternating direction method of multipliers (ADMM) algorithm. Empirical validation of the approach's efficacy is undertaken through comprehensive experimentation. \ No newline at end of file diff --git a/data/2024/aaai/Towards Real-World Test-Time Adaptation: Tri-net Self-Training with Balanced Normalization b/data/2024/aaai/Towards Real-World Test-Time Adaptation: Tri-net Self-Training with Balanced Normalization new file mode 100644 index 0000000000..b1eb9a5dad --- /dev/null +++ b/data/2024/aaai/Towards Real-World Test-Time Adaptation: Tri-net Self-Training with Balanced Normalization @@ -0,0 +1,2 @@ +Test-Time Adaptation aims to adapt a source-domain model to testing data at inference stage, with success demonstrated in adapting to unseen corruptions. However, these attempts may fail under more challenging real-world scenarios. Existing works mainly consider real-world test-time adaptation under non-i.i.d. data streams and continual domain shift. In this work, we first complement the existing real-world TTA protocol with a globally class imbalanced testing set. We demonstrate that combining all settings together poses new challenges to existing methods.
We argue the failure of state-of-the-art methods is first caused by indiscriminately adapting normalization layers to imbalanced testing data. To remedy this shortcoming, we propose a balanced batchnorm layer to swap out the regular batchnorm at inference stage. The new batchnorm layer is capable of adapting without biasing towards majority classes. We are further inspired by the success of self-training (ST) in learning from unlabeled data and adapt ST for test-time adaptation. However, ST alone is prone to over adaption which is responsible for the poor performance under continual domain shift. Hence, we propose to improve self-training under continual domain shift by regularizing model updates with an anchored loss. The final TTA model, termed as TRIBE, is built upon a tri-net architecture with balanced batchnorm layers. We evaluate TRIBE on four datasets representing real-world TTA settings. TRIBE consistently achieves the state-of-the-art performance across multiple evaluation protocols. +The code is available at https://github.com/Gorilla-Lab-SCUT/TRIBE. \ No newline at end of file diff --git a/data/2024/aaai/Towards Reliable Learning in the Wild: Generalization and Adaptation b/data/2024/aaai/Towards Reliable Learning in the Wild: Generalization and Adaptation new file mode 100644 index 0000000000..a57e36c02c --- /dev/null +++ b/data/2024/aaai/Towards Reliable Learning in the Wild: Generalization and Adaptation @@ -0,0 +1 @@ +The real-world deployment of machine learning algorithms often poses challenges due to shifts in data distributions and tasks. These shifts can lead to a degradation in model performance, as the model may not have encountered such changes during training. Additionally, they can make it difficult for the model to generalize to new scenarios and can result in poor performance in real-world applications. In this talk, I will present our research on building machine learning models that are highly generalizable and easily adaptable to different shifts. Specifically, I will first discuss our approach to improving out-of-distribution robustness and mitigating spurious correlations by training environment-invariant models through selective augmentation and post-hoc rectification. Second, I will present our techniques for continuous and rapid adaptation of models to new tasks and environments. This includes methods to facilitate compositional generalization and adaptation by extracting relationships from historical observations and to enhance reliable adaptation even in the face of imperfect observations. Additionally, I will showcase our successful practices for addressing shifts in real-world applications, such as in the healthcare, e-commerce, and transportation industries. The talk will also touch upon the remaining challenges and outline future research directions in this area. \ No newline at end of file diff --git a/data/2024/aaai/Towards Reproducible, Automated, and Scalable Anomaly Detection b/data/2024/aaai/Towards Reproducible, Automated, and Scalable Anomaly Detection new file mode 100644 index 0000000000..4eba02e9b4 --- /dev/null +++ b/data/2024/aaai/Towards Reproducible, Automated, and Scalable Anomaly Detection @@ -0,0 +1 @@ +Anomaly detection (AD), often termed outlier detection, is a key machine learning (ML) task, aiming to identify uncommon yet crucial patterns in data. 
With the increasing complexity of the modern world, the applications of AD span widely, from NASA's spacecraft monitoring to early patient prioritization at the University of Pittsburgh Medical Center. Technology giants like Google and Amazon also leverage AD for service disruption identification. Here, I will traverse my AD work along with promising new directions, particularly emphasizing reproducible benchmarks (Part 1), automated algorithms (Part 2), and scalable systems (Part 3). \ No newline at end of file diff --git a/data/2024/aaai/Towards Robust Image Stitching: An Adaptive Resistance Learning against Compatible Attacks b/data/2024/aaai/Towards Robust Image Stitching: An Adaptive Resistance Learning against Compatible Attacks new file mode 100644 index 0000000000..cb020fd040 --- /dev/null +++ b/data/2024/aaai/Towards Robust Image Stitching: An Adaptive Resistance Learning against Compatible Attacks @@ -0,0 +1 @@ +Image stitching seamlessly integrates images captured from varying perspectives into a single wide field-of-view image. Such integration not only broadens the captured scene but also augments holistic perception in computer vision applications. Given a pair of captured images, subtle perturbations and distortions that go unnoticed by the human visual system tend to attack the correspondence matching, impairing the performance of image stitching algorithms. In light of this challenge, this paper presents the first attempt to improve the robustness of image stitching against adversarial attacks. Specifically, we introduce a stitching-oriented attack (SoA), tailored to amplify the alignment loss within overlapping regions, thereby targeting the feature matching procedure. To establish an attack-resistant model, we delve into the robustness of the stitching architecture and develop adaptive adversarial training (AAT) to balance attack resistance with stitching precision. In this way, we narrow the gap between routine adversarial training and benign models, ensuring resilience without quality compromise. Comprehensive evaluations across real-world and synthetic datasets validate that SoA degrades stitching performance. Furthermore, AAT emerges as a more robust solution against adversarial perturbations, delivering superior stitching results. Code is available at: https://github.com/Jzy2017/TRIS. \ No newline at end of file diff --git a/data/2024/aaai/Towards Robust Visual Understanding: from Recognition to Reasoning b/data/2024/aaai/Towards Robust Visual Understanding: from Recognition to Reasoning new file mode 100644 index 0000000000..69c6581a8b --- /dev/null +++ b/data/2024/aaai/Towards Robust Visual Understanding: from Recognition to Reasoning @@ -0,0 +1 @@ +Models that learn from data are widely and rapidly being deployed today for real-world use, but they suffer from unforeseen failures due to distribution shift, adversarial attacks, noise and corruption, and data scarcity. However, many failures also occur because many modern AI tasks require reasoning beyond pattern matching, and such reasoning abilities are difficult to formulate as data-based input-output function fitting. The reliability problem has become increasingly important under the new paradigm of semantic ``multimodal'' learning. My research provides avenues to develop robust and reliable computer vision systems, particularly by leveraging the interactions between vision and language.
In this AAAI New Faculty Highlights talk, I will cover three thematic areas of my research, spanning robustness in computer vision, open-domain reliability in visual reasoning, and challenges and opportunities in the evaluation of generative models. Readers are encouraged to refer to my website (www.tejasgokhale.com) for more details and updates from my lab's activities towards the goal of robust visual understanding. \ No newline at end of file diff --git a/data/2024/aaai/Towards Robustness to Natural Variations and Distribution Shift (Student Abstract) b/data/2024/aaai/Towards Robustness to Natural Variations and Distribution Shift (Student Abstract) new file mode 100644 index 0000000000..335a074efb --- /dev/null +++ b/data/2024/aaai/Towards Robustness to Natural Variations and Distribution Shift (Student Abstract) @@ -0,0 +1 @@ +This research focuses on improving the robustness of machine learning systems to natural variations and distribution shifts. A design trade space is presented, and various methods are compared, including adversarial training, data augmentation techniques, and novel approaches inspired by model-based robust optimization formulations. \ No newline at end of file diff --git a/data/2024/aaai/Towards Running Time Analysis of Interactive Multi-Objective Evolutionary Algorithms b/data/2024/aaai/Towards Running Time Analysis of Interactive Multi-Objective Evolutionary Algorithms new file mode 100644 index 0000000000..46a1d14733 --- /dev/null +++ b/data/2024/aaai/Towards Running Time Analysis of Interactive Multi-Objective Evolutionary Algorithms @@ -0,0 +1 @@ +Evolutionary algorithms (EAs) are widely used for multi-objective optimization due to their population-based nature. Traditional multi-objective EAs (MOEAs) generate a large set of solutions to approximate the Pareto front, leaving a decision maker (DM) with the task of selecting a preferred solution. However, this process can be inefficient and time-consuming, especially when there are many objectives or the DM has subjective preferences. To address this issue, interactive MOEAs (iMOEAs) integrate decision making into the optimization process, i.e., they update the population with the help of the DM. In contrast to their wide application, only two theoretical works on iMOEAs have existed, and they only considered interactive variants of the two simple single-objective algorithms, RLS and (1+1)-EA. This paper provides the first running time analysis (the essential theoretical aspect of EAs) for practical iMOEAs. Specifically, we prove that the expected running times of the well-developed interactive NSGA-II (called R-NSGA-II) for solving the OneMinMax and OneJumpZeroJump problems are asymptotically smaller than those of the traditional NSGA-II. Meanwhile, we present a variant of OneMinMax, and prove that R-NSGA-II can be exponentially slower than NSGA-II. These results provide theoretical justification for the effectiveness of iMOEAs while identifying situations where they may fail. Experiments are also conducted to validate the theoretical results.
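For reference, the OneMinMax benchmark named in the running-time analysis abstract above is commonly defined as the bi-objective pseudo-Boolean problem below (the exact variant analyzed in the paper may differ):

\[
\mathrm{OneMinMax}(x) \;=\; \Bigl(\, n - \sum_{i=1}^{n} x_i,\;\; \sum_{i=1}^{n} x_i \,\Bigr), \qquad x \in \{0,1\}^n,
\]

where both objectives are maximized and every search point is Pareto-optimal, so the difficulty lies in covering the whole front (or, for an interactive algorithm, the DM's preferred region of it).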
\ No newline at end of file diff --git a/data/2024/aaai/Towards Safe Policy Learning under Partial Identifiability: A Causal Approach b/data/2024/aaai/Towards Safe Policy Learning under Partial Identifiability: A Causal Approach new file mode 100644 index 0000000000..cffcc220fc --- /dev/null +++ b/data/2024/aaai/Towards Safe Policy Learning under Partial Identifiability: A Causal Approach @@ -0,0 +1 @@ +Learning personalized treatment policies is a formative challenge in many real-world applications, including healthcare, econometrics, and artificial intelligence. However, the effectiveness of candidate policies is not always identifiable, i.e., it is not uniquely computable from the combination of the available data and assumptions about the generating mechanisms. This paper studies policy learning from data collected in various non-identifiable settings, i.e., (1) observational studies with unobserved confounding; (2) randomized experiments with partial observability; and (3) their combinations. We derive sharp, closed-form bounds over the conditional treatment effects from observational and experimental data. Based on these novel bounds, we further characterize the problem of safe policy learning and develop an algorithm that trains a policy from data that is guaranteed to achieve at least the performance of the baseline policy currently deployed. Finally, we validate our proposed algorithm on synthetic data and a large clinical trial, demonstrating that it guarantees safe behaviors and robust performance. \ No newline at end of file diff --git a/data/2024/aaai/Towards Squeezing-Averse Virtual Try-On via Sequential Deformation b/data/2024/aaai/Towards Squeezing-Averse Virtual Try-On via Sequential Deformation new file mode 100644 index 0000000000..a87d40e42b --- /dev/null +++ b/data/2024/aaai/Towards Squeezing-Averse Virtual Try-On via Sequential Deformation @@ -0,0 +1 @@ +In this paper, we first investigate a visual quality degradation problem observed in recent high-resolution virtual try-on approaches. We empirically find that the textures of clothes tend to be squeezed at the sleeve, as visualized in the upper row of Fig.1(a). A main reason for the issue is a gradient conflict between two popular losses, the Total Variation (TV) and adversarial losses. Specifically, the TV loss aims to disconnect boundaries between the sleeve and torso in a warped clothing mask, whereas the adversarial loss aims to combine them. Such contrary objectives feed misaligned gradients back into the cascaded appearance flow estimation, resulting in undesirable squeezing artifacts. To reduce this, we propose a Sequential Deformation (SD-VITON) that disentangles the appearance flow prediction layers into TV objective-dominant (TVOB) layers and a task-coexistence (TACO) layer. Specifically, we coarsely fit the clothes onto a human body via the TVOB layers, and then keep refining them via the TACO layer. In addition, the bottom row of Fig.1(a) shows a different type of squeezing artifact around the waist. To address it, we further propose to first warp the clothes into a tucked-out shirt style, and then partially erase the texture from the warped clothes without hurting the smoothness of the appearance flows. Experimental results show that our SD-VITON successfully resolves both types of artifacts and outperforms the baseline methods. Source code will be available at https://github.com/SHShim0513/SD-VITON.
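For readers unfamiliar with the Total Variation loss referenced in the SD-VITON abstract above, its standard anisotropic form over a warped clothing mask M is (the exact variant used in the paper may differ):

\[
\mathcal{L}_{\mathrm{TV}}(M) \;=\; \sum_{i,j} \bigl(\, \lvert M_{i+1,j} - M_{i,j} \rvert + \lvert M_{i,j+1} - M_{i,j} \rvert \,\bigr),
\]

which penalizes spatial variation of the mask; as the abstract explains, this objective pulls the warped mask in a different direction than the adversarial loss, producing the conflicting gradients that cause the squeezing artifacts.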
\ No newline at end of file diff --git a/data/2024/aaai/Towards Stability and Generalization Bounds in Decentralized Minibatch Stochastic Gradient Descent b/data/2024/aaai/Towards Stability and Generalization Bounds in Decentralized Minibatch Stochastic Gradient Descent new file mode 100644 index 0000000000..b7b11a1738 --- /dev/null +++ b/data/2024/aaai/Towards Stability and Generalization Bounds in Decentralized Minibatch Stochastic Gradient Descent @@ -0,0 +1 @@ +Decentralized Stochastic Gradient Descent (D-SGD) represents a communication-efficient approach tailored for extracting insights from vast, distributed datasets. Inspired by parallel optimization paradigms, the incorporation of minibatches serves to diminish variance, consequently expediting the optimization process. Nevertheless, to the best of our knowledge, the existing literature has not thoroughly explored the learning theory foundation of Decentralized Minibatch Stochastic Gradient Descent (DM-SGD). In this paper, we try to address this theoretical gap by investigating the generalization properties of DM-SGD. We establish sharper generalization bounds for the DM-SGD algorithm with (and without) replacement in (non)convex and (non)smooth cases. Moreover, our results consistently recover the results of Centralized Stochastic Gradient Descent (C-SGD). In addition, we derive generalization analysis for the Zero-Order (ZO) version of DM-SGD. \ No newline at end of file diff --git a/data/2024/aaai/Towards Transferable Adversarial Attacks with Centralized Perturbation b/data/2024/aaai/Towards Transferable Adversarial Attacks with Centralized Perturbation new file mode 100644 index 0000000000..2acd56a885 --- /dev/null +++ b/data/2024/aaai/Towards Transferable Adversarial Attacks with Centralized Perturbation @@ -0,0 +1 @@ +Adversarial transferability enables black-box attacks on unknown victim deep neural networks (DNNs), rendering attacks viable in real-world scenarios. Current transferable attacks create adversarial perturbation over the entire image, resulting in excessive noise that overfits the source model. Concentrating perturbation on dominant image regions that are model-agnostic is crucial to improving adversarial efficacy. However, limiting perturbation to local regions in the spatial domain proves inadequate in augmenting transferability. To this end, we propose a transferable adversarial attack with fine-grained perturbation optimization in the frequency domain, creating centralized perturbation. We devise a systematic pipeline to dynamically constrain perturbation optimization to dominant frequency coefficients. The constraint is optimized in parallel at each iteration, ensuring the directional alignment of perturbation optimization with model prediction. Our approach allows us to centralize perturbation towards sample-specific important frequency features, which are shared by DNNs, effectively mitigating source model overfitting. Experiments demonstrate that by dynamically centralizing perturbation on dominating frequency coefficients, crafted adversarial examples exhibit stronger transferability, allowing them to bypass various defenses.
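As a loose illustration of restricting a perturbation to dominant frequency coefficients, as described in the centralized-perturbation abstract above, the sketch below keeps only the largest-magnitude DCT coefficients per channel and zeroes the rest. The paper's dynamic, per-iteration optimization of this constraint is not modeled; the names and the keep_ratio value are assumptions.

    import numpy as np
    from scipy.fft import dctn, idctn

    def centralize(perturbation, keep_ratio=0.1):
        # perturbation: H x W x C array; keep the top fraction of DCT
        # coefficients by magnitude in each channel, discard the rest.
        out = np.zeros_like(perturbation, dtype=float)
        for c in range(perturbation.shape[-1]):
            coef = dctn(perturbation[..., c], norm="ortho")
            k = max(1, int(keep_ratio * coef.size))
            thresh = np.sort(np.abs(coef), axis=None)[-k]
            mask = np.abs(coef) >= thresh
            out[..., c] = idctn(coef * mask, norm="ortho")
        return out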
\ No newline at end of file diff --git a/data/2024/aaai/Towards Trustworthy Autonomous Systems via Conversations and Explanations b/data/2024/aaai/Towards Trustworthy Autonomous Systems via Conversations and Explanations new file mode 100644 index 0000000000..679769456d --- /dev/null +++ b/data/2024/aaai/Towards Trustworthy Autonomous Systems via Conversations and Explanations @@ -0,0 +1 @@ +Autonomous systems fulfil an increasingly important role in our societies, however, AI-powered systems have seen less success over the years, as they are expected to tackle a range of social, legal, or technological challenges and modern neural network-based AI systems cannot yet provide guarantees to many of these challenges. Particularly important is that these systems are black box decision makers, eroding human oversight, contestation, and agency. To address this particular concern, my thesis focuses on integrating social explainable AI with cognitive methods and natural language processing to shed light on the internal processes of autonomous systems in a way accessible to lay users. I propose a causal explanation generation model for decision-making called CEMA based on counterfactual simulations in multi-agent systems. I also plan to integrate CEMA with a broader natural language processing pipeline to support targeted and personalised explanations that address people's cognitive biases. I hope that my research will have a positive impact on the public acceptance of autonomous agents by building towards more trustworthy AI. \ No newline at end of file diff --git a/data/2024/aaai/Towards Trustworthy Deep Learning b/data/2024/aaai/Towards Trustworthy Deep Learning new file mode 100644 index 0000000000..2223dd605f --- /dev/null +++ b/data/2024/aaai/Towards Trustworthy Deep Learning @@ -0,0 +1,7 @@ +Deep neural networks (DNNs) have achieved unprecedented success across many scientific and engineering fields in the last decades. Despite its empirical success, unfortunately, recent studies have shown that there are various failure modes and blindspots in DNN models which may result in unexpected serious failures and potential harms, e.g. the existence of adversarial examples and small perturbations. This is not acceptable especially for safety critical and high stakes applications in the real-world, including healthcare, self-driving cars, aircraft control systems, hiring and malware detection protocols. Moreover, it has been challenging to understand why and when DNNs will fail due to their complicated structures and black-box behaviors. Lacking interpretability is one critical issue that may seriously hinder the deployment of DNNs in high-stake applications, which need interpretability to trust the prediction, to understand potential failures, and to be able to mitigate harms and eliminate biases in the model. + + +To make DNNs trustworthy and reliable for deployment, it is necessary and urgent to develop methods and tools that can (i) quantify and improve their robustness against adversarial and natural perturbations, and (ii) understand their underlying behaviors and further correct errors to prevent injuries and damages. These are the important first steps to enable Trustworthy AI and Trustworthy Machine Learning. In this talk, I will survey a series of research efforts in my lab contributed to tackling the grand challenges in (i) and (ii). 
In the first part of my talk, I will overview our research effort in Robust Machine Learning since 2017, where we have proposed the first attack-agnostic robustness evaluation metric, the first efficient robustness certification algorithms for various types of perturbations, and efficient robust learning algorithms ranging from supervised learning to deep reinforcement learning. + + +In the second part of my talk, I will survey a series of exciting results in my lab on accelerating interpretable machine learning and explainable AI. Specifically, I will show how we could bring interpretability into deep learning by leveraging recent advances in multi-modal models. I'll present recent works in our group on automatically dissecting neural networks with open-vocabulary concepts and designing interpretable neural networks without concept labels, and briefly overview our recent efforts on demystifying the black-box DNN training process, automated neuron explanations for Large Language Models, and the first robustness evaluation of a family of neuron-level interpretation techniques. \ No newline at end of file diff --git a/data/2024/aaai/Towards Understanding Future: Consistency Guided Probabilistic Modeling for Action Anticipation b/data/2024/aaai/Towards Understanding Future: Consistency Guided Probabilistic Modeling for Action Anticipation new file mode 100644 index 0000000000..6547f5f913 --- /dev/null +++ b/data/2024/aaai/Towards Understanding Future: Consistency Guided Probabilistic Modeling for Action Anticipation @@ -0,0 +1,7 @@ +Action anticipation aims to infer the action in the unobserved segment (future segment) from the observed segment (past segment). +Existing methods focus on learning key past semantics to predict the future, but they do not model the temporal continuity between the past and the future. However, past actions are always highly uncertain in anticipating the unobserved future. +The absence of temporal continuity smoothing in the video's past-and-future segments may result in an inconsistent anticipation of future action. +In this work, we aim to smooth the global semantic changes in the past and future segments. We propose a Consistency-guided Probabilistic Model (CPM), which focuses on learning the global temporal probabilistic consistency to inhibit unexpected temporal inconsistency. +The CPM is deployed on the Transformer architecture and includes three modules: future semantics estimation, global semantics estimation, and global distribution estimation, involving the learning of past-to-future semantics, past-and-future semantics, and semantic probabilistic distributions. +To achieve the smoothness of temporal continuity, we follow the principle of variational analysis and describe two probabilistic distributions, i.e., a past-aware distribution and a global-aware distribution, which help to estimate the evidence lower bound of future anticipation. +In this study, we maximize the evidence lower bound of future semantics by reducing the distribution distance between the above two distributions for model optimization. Extensive experiments demonstrate the effectiveness of our method, and CPM achieves state-of-the-art performance on Epic-Kitchen100, Epic-Kitchen55, and EGTEA-GAZE.
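One standard way to write the evidence lower bound that the CPM abstract above refers to, with a global-aware posterior q and a past-aware prior p over a latent variable z, is sketched below; the notation is ours rather than the paper's:

\[
\log p\bigl(y_{\text{future}} \mid x_{\text{past}}\bigr) \;\ge\; \mathbb{E}_{q(z \mid x_{\text{past}}, x_{\text{future}})}\bigl[\log p(y_{\text{future}} \mid z)\bigr] \;-\; \mathrm{KL}\bigl(q(z \mid x_{\text{past}}, x_{\text{future}}) \,\big\|\, p(z \mid x_{\text{past}})\bigr).
\]

Maximizing such a bound simultaneously improves future prediction and shrinks the distance between the two distributions, which matches the optimization described in the abstract.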
\ No newline at end of file diff --git a/data/2024/aaai/Towards a More Burkean Approach to Computational Social Choice b/data/2024/aaai/Towards a More Burkean Approach to Computational Social Choice new file mode 100644 index 0000000000..0edcb1d3bf --- /dev/null +++ b/data/2024/aaai/Towards a More Burkean Approach to Computational Social Choice @@ -0,0 +1 @@ +In the last few years, a lot of the activity of the computational social choice community has focused on novel mechanisms for reaching decisions by large groups of people. While this research makes meaningful scientific contributions, many of these mechanisms are not quite useful in realistic decision-making settings. Moreover, their radicalism ignores the centuries-old experience we have with large-scale human decision-making, and what it teaches us about what works. We believe it is important the community engage with mechanisms which are widely-used in the real world, as they may hold a key to a deeper understanding of how people reach decisions and the way that helps them do that productively. Moreover, letting the community bring its analysis and understanding to these will allow for algorithmic suggestions that have some chance of being implemented (and, thus, can contribute to the public debate on these topics). In particular, we highlight the relatively less-investigated role of parties and grouping of voters and candidates, and the role of executive capacity in analyzing decision-making structures. \ No newline at end of file diff --git a/data/2024/aaai/Towards a Theoretical Understanding of Why Local Search Works for Clustering with Fair-Center Representation b/data/2024/aaai/Towards a Theoretical Understanding of Why Local Search Works for Clustering with Fair-Center Representation new file mode 100644 index 0000000000..bc509ed872 --- /dev/null +++ b/data/2024/aaai/Towards a Theoretical Understanding of Why Local Search Works for Clustering with Fair-Center Representation @@ -0,0 +1,4 @@ +The representative k-median problem generalizes the classical clustering formulations in that it partitions the data points into several disjoint demographic groups and poses a lower-bound constraint on the number of opened facilities from each group, such that all the groups are fairly represented by the opened facilities. Due to its simplicity, the local-search heuristic that optimizes an initial solution by iteratively swapping at most a constant number of closed facilities for the same number of opened ones (denoted by the O(1)-swap heuristic) has been frequently used in the representative k-median problem. Unfortunately, despite its good performance exhibited in experiments, whether the O(1)-swap heuristic has provable approximation guarantees for the case where the number of groups is more than 2 remains an open question for a long time. As an answer to this question, we show that the O(1)-swap heuristic +(1) is guaranteed to yield a constant-factor approximation solution if the number of groups is a constant, and +(2) has an unbounded approximation ratio otherwise. +Our main technical contribution is a new approach for theoretically analyzing local-search heuristics, which derives the approximation ratio of the O(1)-swap heuristic via linearly combining the increased clustering costs induced by a set of hierarchically organized swaps. 
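A minimal sketch of the swap-based local search analyzed in the clustering abstract above is given below for the single-swap case; the clustering cost and the group lower-bound (fair-representation) constraints are left abstract, and all names are hypothetical.

    def local_search_1_swap(points, initial_centers, cost, feasible):
        # Repeatedly swap one opened facility for one closed facility whenever
        # the swap keeps the solution feasible (e.g. respects per-group lower
        # bounds) and strictly lowers the clustering cost.
        centers = set(initial_centers)
        improved = True
        while improved:
            improved = False
            for out in list(centers):
                for cand in points:
                    if cand in centers:
                        continue
                    trial = (centers - {out}) | {cand}
                    if feasible(trial) and cost(trial) < cost(centers):
                        centers, improved = trial, True
                        break
                if improved:
                    break
        return centers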
\ No newline at end of file diff --git a/data/2024/aaai/Towards a Transformer-Based Reverse Dictionary Model for Quality Estimation of Definitions (Student Abstract) b/data/2024/aaai/Towards a Transformer-Based Reverse Dictionary Model for Quality Estimation of Definitions (Student Abstract) new file mode 100644 index 0000000000..45eba6f9de --- /dev/null +++ b/data/2024/aaai/Towards a Transformer-Based Reverse Dictionary Model for Quality Estimation of Definitions (Student Abstract) @@ -0,0 +1 @@ +In the last years, several variants of transformers have emerged. In this paper, we compare different transformer-based models for solving the reverse dictionary task and explore their use in the context of a serious game called The Dictionary Game. \ No newline at end of file diff --git a/data/2024/aaai/Towards the Disappearing Truth: Fine-Grained Joint Causal Influences Learning with Hidden Variable-Driven Causal Hypergraphs in Time Series b/data/2024/aaai/Towards the Disappearing Truth: Fine-Grained Joint Causal Influences Learning with Hidden Variable-Driven Causal Hypergraphs in Time Series new file mode 100644 index 0000000000..33b819b654 --- /dev/null +++ b/data/2024/aaai/Towards the Disappearing Truth: Fine-Grained Joint Causal Influences Learning with Hidden Variable-Driven Causal Hypergraphs in Time Series @@ -0,0 +1 @@ +Causal discovery under Granger causality framework has yielded widespread concerns in time series analysis task. Nevertheless, most previous methods are unaware of the underlying causality disappearing problem, that is, certain weak causalities are less focusable and may be lost during the modeling process, thus leading to biased causal conclusions. Therefore, we propose to introduce joint causal influences (i.e., causal influences from the union of multiple variables) as additional causal indication information to help identify weak causalities. Further, to break the limitation of existing methods that implicitly and coarsely model joint causal influences, we propose a novel hidden variable-driven causal hypergraph neural network to meticulously explore the locality and diversity of joint causal influences, and realize its explicit and fine-grained modeling. Specifically, we introduce hidden variables to construct a causal hypergraph for explicitly characterizing various fine-grained joint causal influences. Then, we customize a dual causal information transfer mechanism (encompassing a multi-level causal path and an information aggregation path) to realize the free diffusion and meticulous aggregation of joint causal influences and facilitate its adaptive learning. Finally, we design a multi-view collaborative optimization constraint to guarantee the characterization diversity of causal hypergraph and capture remarkable forecasting relationships (i.e., causalities). Experiments are conducted to demonstrate the superiority of the proposed model. \ No newline at end of file diff --git a/data/2024/aaai/Towards the Robustness of Differentially Private Federated Learning b/data/2024/aaai/Towards the Robustness of Differentially Private Federated Learning new file mode 100644 index 0000000000..68118aa7ff --- /dev/null +++ b/data/2024/aaai/Towards the Robustness of Differentially Private Federated Learning @@ -0,0 +1 @@ +Robustness and privacy protection are two important factors of trustworthy federated learning (FL). 
Existing FL works usually secure data privacy by perturbing local model gradients via the differential privacy (DP) technique, or defend against poisoning attacks by filtering out local gradients that lie in the outlier region of the gradient distribution before aggregation. However, these two issues are often addressed independently in existing works, and how to secure federated learning in terms of both privacy and robustness still needs further exploration. In this paper, we unveil that although DP noisy perturbation can improve learning robustness, DP-FL frameworks are not inherently robust and are vulnerable to a carefully designed attack method. Furthermore, we reveal that it is challenging for existing robust FL methods to defend against attacks on DP-FL. This can be attributed to the fact that the local gradients of DP-FL are perturbed by random noise, and the selected central gradients inevitably incorporate a higher proportion of poisoned gradients compared to conventional FL. To address this problem, we further propose a new defense method for DP-FL (named Robust-DPFL), which can effectively distinguish poisoned and clean local gradients in DP-FL and robustly update the global model. Experiments on three benchmark datasets demonstrate that baseline methods cannot ensure task accuracy, data privacy, and robustness simultaneously, while Robust-DPFL can effectively enhance the privacy protection and robustness of federated learning while maintaining task performance. \ No newline at end of file diff --git a/data/2024/aaai/TraceEvader: Making DeepFakes More Untraceable via Evading the Forgery Model Attribution b/data/2024/aaai/TraceEvader: Making DeepFakes More Untraceable via Evading the Forgery Model Attribution new file mode 100644 index 0000000000..eece5fe980 --- /dev/null +++ b/data/2024/aaai/TraceEvader: Making DeepFakes More Untraceable via Evading the Forgery Model Attribution @@ -0,0 +1 @@ +In recent years, DeepFakes have posed severe threats and concerns to both individuals and celebrities, as realistic DeepFakes facilitate the spread of disinformation. Model attribution techniques aim at attributing the adopted forgery models of DeepFakes for provenance purposes and providing explainable results for DeepFake forensics. However, existing model attribution techniques rely on the traces left during DeepFake creation, and can become futile if such traces are disrupted. We observe that certain traces used for model attribution appear in both the high-frequency and low-frequency domains and play divergent roles in model attribution. Motivated by this observation, in this work we propose, for the first time, a novel training-free evasion attack, TraceEvader, in the most practical non-box setting. Specifically, TraceEvader injects universal imitated traces learned from wild DeepFakes into the high-frequency component and introduces adversarial blur into the low-frequency component, where the added distortion confuses the extraction of the traces used for model attribution. A comprehensive evaluation on 4 state-of-the-art (SOTA) model attribution techniques and fake images generated by 8 generative models, including generative adversarial networks (GANs) and diffusion models (DMs), demonstrates the effectiveness of our method. Overall, our TraceEvader achieves the highest average attack success rate of 79% and is also robust against image transformations and dedicated denoising techniques, where the average attack success rate remains around 75%. 
Our TraceEvader confirms the limitations of current model attribution techniques and calls the attention of DeepFake researchers and practitioners to the need for more robust model attribution techniques. \ No newline at end of file diff --git a/data/2024/aaai/Trade-Offs in Fine-Tuned Diffusion Models between Accuracy and Interpretability b/data/2024/aaai/Trade-Offs in Fine-Tuned Diffusion Models between Accuracy and Interpretability new file mode 100644 index 0000000000..0d1b61550c --- /dev/null +++ b/data/2024/aaai/Trade-Offs in Fine-Tuned Diffusion Models between Accuracy and Interpretability @@ -0,0 +1 @@ +Recent advancements in diffusion models have significantly impacted the trajectory of generative machine learning research, with many adopting the strategy of fine-tuning pre-trained models using domain-specific text-to-image datasets. Notably, this method has been readily employed for medical applications, such as X-ray image synthesis, leveraging the plethora of associated radiology reports. Yet, a prevailing concern is the lack of assurance on whether these models genuinely comprehend their generated content. With the evolution of text-conditional image generation, these models have grown potent enough to facilitate object localization scrutiny. Our research underscores this advancement in the critical realm of medical imaging, emphasizing the crucial role of interpretability. We further unravel a consequential trade-off between image fidelity – as gauged by conventional metrics – and model interpretability in generative diffusion models. Specifically, the adoption of learnable text encoders when fine-tuning results in diminished interpretability. Our in-depth exploration uncovers the underlying factors responsible for this divergence. Consequently, we present a set of design principles for the development of truly interpretable generative models. Code is available at https://github.com/MischaD/chest-distillation. \ No newline at end of file diff --git a/data/2024/aaai/Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding b/data/2024/aaai/Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding new file mode 100644 index 0000000000..aebf3fb5e6 --- /dev/null +++ b/data/2024/aaai/Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding @@ -0,0 +1 @@ +Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics that asks us to compute collision-free paths for a team of agents, all moving across a shared map. Although many works appear on this topic, all current algorithms struggle as the number of agents grows. The principal reason is that existing approaches typically plan free-flow optimal paths, which creates congestion. To tackle this issue, we propose a new approach for MAPF where agents are guided to their destinations by following congestion-avoiding paths. We evaluate the idea in two large-scale settings: one-shot MAPF, where each agent has a single destination, and lifelong MAPF, where agents are continuously assigned new destinations. Empirically, we report large improvements in solution quality for one-shot MAPF and in overall throughput for lifelong MAPF. 
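The congestion-avoiding guidance described in the MAPF abstract above can be illustrated with a single-agent search whose edge costs are inflated by an estimate of traffic on each edge, so agents are steered away from crowded corridors rather than along free-flow shortest paths. This is a minimal Python sketch of the general idea, not the paper's algorithm; graph, congestion, and alpha are assumed inputs.

import heapq

def congestion_aware_path(graph, start, goal, congestion, alpha=1.0):
    # graph[u] maps neighbour v -> base edge length; congestion[(u, v)] is an
    # assumed per-edge traffic estimate, weighted by alpha.
    dist = {start: 0.0}
    parent = {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, length in graph[u].items():
            nd = d + length + alpha * congestion.get((u, v), 0.0)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(pq, (nd, v))
    if goal != start and goal not in parent:
        return None  # goal unreachable in the given graph
    path, node = [], goal
    while node != start:
        path.append(node)
        node = parent[node]
    path.append(start)
    return list(reversed(path))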
\ No newline at end of file diff --git a/data/2024/aaai/Training-Free Quantum Architecture Search b/data/2024/aaai/Training-Free Quantum Architecture Search new file mode 100644 index 0000000000..a178b1d20a --- /dev/null +++ b/data/2024/aaai/Training-Free Quantum Architecture Search @@ -0,0 +1 @@ +Variational quantum algorithm (VQA) derives advantages from its error resilience and high flexibility in quantum resource requirements, rendering it broadly applicable in the noisy intermediate-scale quantum era. As the performance of VQA highly relies on the structure of the parameterized quantum circuit, it is worthwhile to propose quantum architecture search (QAS) algorithms to automatically search for high-performance circuits. Nevertheless, existing QAS methods are time-consuming, requiring circuit training to assess circuit performance. This study pioneers training-free QAS by utilizing two training-free proxies to rank quantum circuits, in place of the expensive circuit training employed in conventional QAS. Taking into account the precision and computational overhead of the path-based and expressibility-based proxies, we devise a two-stage progressive training-free QAS (TF-QAS). Initially, directed acyclic graphs (DAGs) are employed for circuit representation, and a zero-cost proxy based on the number of paths in the DAG is designed to filter out a substantial portion of unpromising circuits. Subsequently, an expressibility-based proxy, finely reflecting circuit performance, is employed to identify high-performance circuits from the remaining candidates. These proxies evaluate circuit performance without circuit training, resulting in a remarkable reduction in computational cost compared to current training-based QAS methods. Simulations on three VQE tasks demonstrate that TF-QAS achieves a substantial enhancement of sampling efficiency ranging from 5 to 57 times compared to state-of-the-art QAS, while also being 6 to 17 times faster. \ No newline at end of file diff --git a/data/2024/aaai/TransGOP: Transformer-Based Gaze Object Prediction b/data/2024/aaai/TransGOP: Transformer-Based Gaze Object Prediction new file mode 100644 index 0000000000..1bab2ee1a8 --- /dev/null +++ b/data/2024/aaai/TransGOP: Transformer-Based Gaze Object Prediction @@ -0,0 +1 @@ +Gaze object prediction aims to predict the location and category of the object that is watched by a human. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors can predict more accurate object location for dense objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer can help to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces Transformer into the fields of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the location of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism to let the queries of the gaze autoencoder learn the global-memory position knowledge from the object detector. 
Finally, to make the whole framework end-to-end trained, we propose a Gaze Box loss to jointly optimize the object detector and gaze regressor by enhancing the gaze heatmap energy in the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that our TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git. \ No newline at end of file diff --git a/data/2024/aaai/Transfer and Alignment Network for Generalized Category Discovery b/data/2024/aaai/Transfer and Alignment Network for Generalized Category Discovery new file mode 100644 index 0000000000..b9a3849b21 --- /dev/null +++ b/data/2024/aaai/Transfer and Alignment Network for Generalized Category Discovery @@ -0,0 +1,3 @@ +Generalized Category Discovery (GCD) is a crucial real-world task that aims to recognize both known and novel categories from an unlabeled dataset by leveraging another labeled dataset with only known categories. Despite the improved performance on known categories, current methods perform poorly on novel categories. We attribute the poor performance to two reasons: biased knowledge transfer between labeled and unlabeled data and noisy representation learning on the unlabeled data. The former leads to unreliable estimation of learning targets for novel categories and the latter hinders models from learning discriminative features. To mitigate these two issues, we propose a Transfer and Alignment Network (TAN), which incorporates two knowledge transfer mechanisms to calibrate the biased knowledge and two feature alignment mechanisms to learn discriminative features. +Specifically, we model different categories with prototypes and transfer the prototypes in labeled data to correct model bias towards known categories. On the one hand, we pull instances with known categories in unlabeled data closer to these prototypes to form more compact clusters and avoid boundary overlap between known and novel categories. On the other hand, we use these prototypes to calibrate noisy prototypes estimated from unlabeled data based on category similarities, which allows for more accurate estimation of prototypes for novel categories that can be used as reliable learning targets later. After knowledge transfer, we further propose two feature alignment mechanisms to acquire both instance- and category-level knowledge from unlabeled data by aligning instance features with both augmented features and the calibrated prototypes, which can boost model performance on both known and novel categories with less noise. Experiments on three benchmark datasets show that our model outperforms SOTA methods, especially on novel categories. Theoretical analysis is provided for an in-depth understanding of our model in general. +Our code and data are available at https://github.com/Lackel/TAN. 
\ No newline at end of file diff --git a/data/2024/aaai/Transferable Adversarial Attacks for Object Detection Using Object-Aware Significant Feature Distortion b/data/2024/aaai/Transferable Adversarial Attacks for Object Detection Using Object-Aware Significant Feature Distortion new file mode 100644 index 0000000000..3db4bf5999 --- /dev/null +++ b/data/2024/aaai/Transferable Adversarial Attacks for Object Detection Using Object-Aware Significant Feature Distortion @@ -0,0 +1 @@ +Transferable black-box adversarial attacks against classifiers by disturbing the intermediate-layer features have been extensively studied in recent years. However, these methods have not yet achieved satisfactory performances when directly applied to object detectors. This is largely because the features of detectors are fundamentally different from that of the classifiers. In this study, we propose a simple but effective method to improve the transferability of adversarial examples for object detectors by leveraging the properties of spatial consistency and limited equivariance of object detectors’ features. Specifically, we combine a novel loss function and deliberately designed data augmentation to distort the backbone features of object detectors by suppressing significant features corresponding to objects and amplifying the surrounding vicinal features corresponding to object boundaries. As such the target object and background area on the generated adversarial samples are more likely to be confused by other detectors. Extensive experimental results show that our proposed method achieves state-of-the-art black-box transferability for untargeted attacks on various models, including one/two-stage, CNN/Transformer-based, and anchor-free/anchor-based detectors. \ No newline at end of file diff --git a/data/2024/aaai/Transferable Video Moment Localization by Moment-Guided Query Prompting b/data/2024/aaai/Transferable Video Moment Localization by Moment-Guided Query Prompting new file mode 100644 index 0000000000..1c61fba4fb --- /dev/null +++ b/data/2024/aaai/Transferable Video Moment Localization by Moment-Guided Query Prompting @@ -0,0 +1 @@ +Video moment localization stands as a crucial task within the realm of computer vision, entailing the identification of temporal moments in untrimmed videos that bear semantic relevance to the supplied natural language queries. This work delves into a relatively unexplored facet of the task: the transferability of video moment localization models. This concern is addressed by evaluating moment localization models within a cross-domain transfer setting. In this setup, we curate multiple datasets distinguished by substantial domain gaps. The model undergoes training on one of these datasets, while validation and testing are executed using the remaining datasets. To confront the challenges inherent in this scenario, we draw inspiration from the recently introduced large-scale pre-trained vision-language models. Our focus is on exploring how the strategic utilization of these resources can bolster the capabilities of a model designed for video moment localization. Nevertheless, the distribution of language queries in video moment localization usually diverges from the text used by pre-trained models, exhibiting distinctions in aspects such as length, content, expression, and more. To mitigate the gap, this work proposes a Moment-Guided Query Prompting (MGQP) method for video moment localization. 
Our key idea is to generate multiple distinct and complementary prompt primitives through stratification of the original queries. Our approach is comprised of a prompt primitive constructor, a multimodal prompt refiner, and a holistic prompt incorporator. We carry out extensive experiments on Charades-STA, TACoS, DiDeMo, and YouCookII datasets, and investigate the efficacy of the proposed method using various pre-trained models, such as CLIP, ActionCLIP, CLIP4Clip, and VideoCLIP. The experimental results demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Transformer-Based No-Reference Image Quality Assessment via Supervised Contrastive Learning b/data/2024/aaai/Transformer-Based No-Reference Image Quality Assessment via Supervised Contrastive Learning new file mode 100644 index 0000000000..c32fadd9c1 --- /dev/null +++ b/data/2024/aaai/Transformer-Based No-Reference Image Quality Assessment via Supervised Contrastive Learning @@ -0,0 +1 @@ +Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Due to the powerful feature extraction ability, existing Convolution Neural Network (CNN) and Transformers based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic distortion datasets. To further improve NR-IQA performance, in this paper, a novel supervised contrastive learning (SCL) and Transformer-based NR-IQA model SaTQA is proposed. We first train a model on a large-scale synthetic dataset by SCL (no image subjective score is required) to extract degradation features of images with various distortion types and levels. To further extract distortion information from images, we propose a backbone network incorporating the Multi-Stream Block (MSB) by combining the CNN inductive bias and Transformer long-term dependence modeling capability. Finally, we propose the Patch Attention Block (PAB) to obtain the final distorted image quality score by fusing the degradation features learned from contrastive learning with the perceptual distortion information extracted by the backbone network. Experimental results on six standard IQA datasets show that SaTQA outperforms the state-of-the-art methods for both synthetic and authentic datasets. Code is available at https://github.com/I2-Multimedia-Lab/SaTQA. \ No newline at end of file diff --git a/data/2024/aaai/Transformer-Based Selective Super-resolution for Efficient Image Refinement b/data/2024/aaai/Transformer-Based Selective Super-resolution for Efficient Image Refinement new file mode 100644 index 0000000000..cb67a8b04a --- /dev/null +++ b/data/2024/aaai/Transformer-Based Selective Super-resolution for Efficient Image Refinement @@ -0,0 +1 @@ +Conventional super-resolution methods suffer from two drawbacks: substantial computational cost in upscaling an entire large image, and the introduction of extraneous or potentially detrimental information for downstream computer vision tasks during the refinement of the background. To solve these issues, we propose a novel transformer-based algorithm, Selective Super-Resolution (SSR), which partitions images into non-overlapping tiles, selects tiles of interest at various scales with a pyramid architecture, and exclusively reconstructs these selected tiles with deep features. 
Experimental results on three datasets demonstrate the efficiency and robust performance of our approach for super-resolution. Compared to the state-of-the-art methods, the FID score is reduced from 26.78 to 10.41 with 40% reduction in computation cost for the BDD100K dataset. \ No newline at end of file diff --git a/data/2024/aaai/Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification b/data/2024/aaai/Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification new file mode 100644 index 0000000000..ffe93e9cd0 --- /dev/null +++ b/data/2024/aaai/Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification @@ -0,0 +1 @@ +Pathological images play a vital role in clinical cancer diagnosis. Computer-aided diagnosis utilized on digital Whole Slide Images (WSIs) has been widely studied. The major challenge of using deep learning models for WSI analysis is the huge size of WSI images and existing methods struggle between end-to-end learning and proper modeling of contextual information. Most state-of-the-art methods utilize a two-stage strategy, in which they use a pre-trained model to extract features of small patches cut from a WSI and then input these features into a classification model. These methods can not perform end-to-end learning and consider contextual information at the same time. To solve this problem, we propose a framework that models a WSI as a pathologist's observing video and utilizes Transformer to process video clips with a divide-and-conquer strategy, which helps achieve both context-awareness and end-to-end learning. Extensive experiments on three public WSI datasets show that our proposed method outperforms existing SOTA methods in both WSI classification and positive region detection. \ No newline at end of file diff --git a/data/2024/aaai/Transformer-Empowered Multi-Modal Item Embedding for Enhanced Image Search in E-commerce b/data/2024/aaai/Transformer-Empowered Multi-Modal Item Embedding for Enhanced Image Search in E-commerce new file mode 100644 index 0000000000..4472aeb16b --- /dev/null +++ b/data/2024/aaai/Transformer-Empowered Multi-Modal Item Embedding for Enhanced Image Search in E-commerce @@ -0,0 +1 @@ +Over the past decade, significant advances have been made in the field of image search for e-commerce applications. Traditional image-to-image retrieval models, which focus solely on image details such as texture, tend to overlook useful semantic information contained within the images. As a result, the retrieved products might possess similar image details, but fail to fulfil the user's search goals. Moreover, the use of image-to-image retrieval models for products containing multiple images results in significant online product feature storage overhead and complex mapping implementations. In this paper, we report the design and deployment of the proposed Multi-modal Item Embedding Model (MIEM) to address these limitations. It is capable of utilizing both textual information and multiple images about a product to construct meaningful product features. By leveraging semantic information from images, MIEM effectively supplements the image search process, improving the overall accuracy of retrieval results. MIEM has become an integral part of the Shopee image search platform. 
Since its deployment in March 2023, it has achieved a remarkable 9.90% increase in terms of clicks per user and a 4.23% boost in terms of orders per user for the image search feature on the Shopee e-commerce platform. \ No newline at end of file diff --git a/data/2024/aaai/Transforming Healthcare: A Comprehensive Approach to Mitigating Bias and Fostering Empathy through AI-Driven Augmented Reality b/data/2024/aaai/Transforming Healthcare: A Comprehensive Approach to Mitigating Bias and Fostering Empathy through AI-Driven Augmented Reality new file mode 100644 index 0000000000..99e5871bb5 --- /dev/null +++ b/data/2024/aaai/Transforming Healthcare: A Comprehensive Approach to Mitigating Bias and Fostering Empathy through AI-Driven Augmented Reality @@ -0,0 +1 @@ +The integration of Artificial Intelligence (AI) into Augmented Reality (AR) for medical applications is propelled by the aim to address evident healthcare disparities. Certain communities have encountered disparities in medical diagnoses, exemplified by Black individuals exhibiting a 2.4 times higher likelihood of schizophrenia diagnosis compared to their white counterparts (Faber et al., 2023). These disparities often arise from structured interview assessments overlooking cultural nuances, resulting in increased misdiagnosis rates. This study leverages AI and AR to develop unbiased diagnostic tools and enhance empathy in healthcare professionals' training. Uniquely prioritizing the reduction of biased language and the fostering of empathy through AI-driven Natural Language Processing (NLP) and AI-driven virtual patients, the research aims to enhance diagnostic accuracy while promoting cultural sensitivity among healthcare professionals. Aligned with broader goals of achieving equitable healthcare and reducing disparities, the evaluation involves pre- and post-training assessments to measure language improvements and empathy enhancements. Successful implementation could lead to a more equitable healthcare landscape, fostering trust in AI-driven systems and ensuring fairer medical care for diverse communities. \ No newline at end of file diff --git a/data/2024/aaai/Transient Glimpses: Unveiling Occluded Backgrounds through the Spike Camera b/data/2024/aaai/Transient Glimpses: Unveiling Occluded Backgrounds through the Spike Camera new file mode 100644 index 0000000000..92e594769d --- /dev/null +++ b/data/2024/aaai/Transient Glimpses: Unveiling Occluded Backgrounds through the Spike Camera @@ -0,0 +1 @@ +The de-occlusion problem, involving extracting clear background images by removing foreground occlusions, holds significant practical importance but poses considerable challenges. Most current research predominantly focuses on generating discrete images from calibrated camera arrays, but this approach often struggles with dense occlusions and fast motions due to limited perspectives and motion blur. To overcome these limitations, an effective solution requires the integration of multi-view visual information. The spike camera, as an innovative neuromorphic sensor, shows promise with its ultra-high temporal resolution and dynamic range. In this study, we propose a novel approach that utilizes a single spike camera for continuous multi-view imaging to address occlusion removal. By rapidly moving the spike camera, we capture a dense stream of spikes from occluded scenes. 
Our model, SpkOccNet, processes these spikes by integrating multi-view spatial-temporal information via long-short-window feature extractor (LSW) and employs a novel cross-view mutual attention-based module (CVA) for effective fusion and refinement. Additionally, to facilitate research in occlusion removal, we introduce the S-OCC dataset, which consists of real-world spike-based data. Experimental results demonstrate the efficiency and generalization capabilities of our model in effectively removing dense occlusions across diverse scenes. Public project page: https://github.com/Leozhangjiyuan/SpikeDeOcclusion. \ No newline at end of file diff --git a/data/2024/aaai/Transition-Informed Reinforcement Learning for Large-Scale Stackelberg Mean-Field Games b/data/2024/aaai/Transition-Informed Reinforcement Learning for Large-Scale Stackelberg Mean-Field Games new file mode 100644 index 0000000000..6d455e4b6b --- /dev/null +++ b/data/2024/aaai/Transition-Informed Reinforcement Learning for Large-Scale Stackelberg Mean-Field Games @@ -0,0 +1 @@ +Many real-world scenarios including fleet management and Ad auctions can be modeled as Stackelberg mean-field games (SMFGs) where a leader aims to incentivize a large number of homogeneous self-interested followers to maximize her utility. Existing works focus on cases with a small number of heterogeneous followers, e.g., 5-10, and suffer from scalability issue when the number of followers increases. There are three major challenges in solving large-scale SMFGs: i) classical methods based on solving differential equations fail as they require exact dynamics parameters, ii) learning by interacting with environment is data-inefficient, and iii) complex interaction between the leader and followers makes the learning performance unstable. We address these challenges through transition-informed reinforcement learning. Our main contributions are threefold: i) we first propose an RL framework, the Stackelberg mean-field update, to learn the leader's policy without priors of the environment, ii) to improve the data efficiency and accelerate the learning process, we then propose the Transition-Informed Reinforcement Learning (TIRL) by leveraging the instantiated empirical Fokker-Planck equation, and iii) we develop a regularized TIRL by employing various regularizers to alleviate the sensitivity of the learning performance to the initialization of the leader's policy. Extensive experiments on fleet management and food gathering demonstrate that our approach can scale up to 100,000 followers and significantly outperform existing baselines. \ No newline at end of file diff --git a/data/2024/aaai/Transitivity-Preserving Graph Representation Learning for Bridging Local Connectivity and Role-Based Similarity b/data/2024/aaai/Transitivity-Preserving Graph Representation Learning for Bridging Local Connectivity and Role-Based Similarity new file mode 100644 index 0000000000..d11c9cf3aa --- /dev/null +++ b/data/2024/aaai/Transitivity-Preserving Graph Representation Learning for Bridging Local Connectivity and Role-Based Similarity @@ -0,0 +1 @@ +Graph representation learning (GRL) methods, such as graph neural networks and graph transformer models, have been successfully used to analyze graph-structured data, mainly focusing on node classification and link prediction tasks. However, the existing studies mostly only consider local connectivity while ignoring long-range connectivity and the roles of nodes. 
In this paper, we propose Unified Graph Transformer Networks (UGT) that effectively integrate local and global structural information into fixed-length vector representations. First, UGT learns local structure by identifying the local sub-structures and aggregating features of the k-hop neighborhoods of each node. Second, we construct virtual edges, bridging distant nodes with structural similarity to capture the long-range dependencies. Third, UGT learns unified representations through self-attention, encoding structural distance and p-step transition probability between node pairs. Furthermore, we propose a self-supervised learning task that effectively learns transition probability to fuse local and global structural features, which could then be transferred to other downstream tasks. Experimental results on real-world benchmark datasets over various downstream tasks showed that UGT significantly outperformed baselines that consist of state-of-the-art models. In addition, UGT reaches the third-order Weisfeiler-Lehman power to distinguish non-isomorphic graph pairs. \ No newline at end of file diff --git a/data/2024/aaai/Translate Meanings, Not Just Words: IdiomKB's Role in Optimizing Idiomatic Translation with Language Models b/data/2024/aaai/Translate Meanings, Not Just Words: IdiomKB's Role in Optimizing Idiomatic Translation with Language Models new file mode 100644 index 0000000000..227601bca7 --- /dev/null +++ b/data/2024/aaai/Translate Meanings, Not Just Words: IdiomKB's Role in Optimizing Idiomatic Translation with Language Models @@ -0,0 +1 @@ +To translate well, machine translation (MT) systems and general-purposed language models (LMs) need a deep understanding of both source and target languages and cultures. Therefore, idioms, with their non-compositional nature, pose particular challenges for Transformer-based systems, as literal translations often miss the intended meaning. Traditional methods, which replace idioms using existing knowledge bases (KBs), often lack scale and context-awareness. Addressing these challenges, our approach prioritizes context-awareness and scalability, allowing for offline storage of idioms in a manageable KB size. This ensures efficient serving with smaller models and provides a more comprehensive understanding of idiomatic expressions. We introduce a multilingual idiom KB (IdiomKB) developed using large LMs to address this. This KB facilitates better translation by smaller models, such as BLOOMZ (7.1B), Alpaca (7B), and InstructGPT (6.7B), by retrieving idioms' figurative meanings. We present a novel, GPT-4-powered metric for human-aligned evaluation, demonstrating that IdiomKB considerably boosts model performance. Human evaluations further validate our KB's quality. \ No newline at end of file diff --git a/data/2024/aaai/Transportable Representations for Domain Generalization b/data/2024/aaai/Transportable Representations for Domain Generalization new file mode 100644 index 0000000000..88c10dc57e --- /dev/null +++ b/data/2024/aaai/Transportable Representations for Domain Generalization @@ -0,0 +1 @@ +One key assumption in machine learning literature is that the testing and training data come from the same distribution, which is often violated in practice. The anchors that allow generalizations to take place are causal, and provenient in terms of the stability and modularity of the mechanisms underlying the system of variables. 
Building on the theory of causal transportability, we define the notion of ``transportable representations", and show that these representations are suitable candidates for the domain generalization task. Specifically, considering that the graphical assumptions about the underlying system are provided, the transportable representations can be characterized accordingly, and the distribution of label conditioned on the representation can be computed in terms of the source distributions. Finally, we relax the assumption of having access to the underlying graph by proving a graphical-invariance duality theorem, which delineates certain probabilistic invariances present in the source data as a sound and complete criterion for generalizable classification. Our findings provide a unifying theoretical basis for several existing approaches to the domain generalization problem. \ No newline at end of file diff --git a/data/2024/aaai/Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation b/data/2024/aaai/Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation new file mode 100644 index 0000000000..3fdedad5f7 --- /dev/null +++ b/data/2024/aaai/Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation @@ -0,0 +1 @@ +Object detection in low-light scenarios has attracted much attention in the past few years. A mainstream and representative scheme introduces enhancers as the pre-processing for regular detectors. However, because of the disparity in task objectives between the enhancer and detector, this paradigm cannot shine at its best ability. In this work, we try to arouse the potential of enhancer + detector. Different from existing works, we extend the illumination-based enhancers (our newly designed or existing) as a scene decomposition module, whose removed illumination is exploited as the auxiliary in the detector for extracting detection-friendly features. A semantic aggregation module is further established for integrating multi-scale scene-related semantic information in the context space. Actually, our built scheme successfully transforms the "trash" (i.e., the ignored illumination in the detector) into the "treasure" for the detector. Plenty of experiments are conducted to reveal our superiority against other state-of-the-art methods. The code will be public if it is accepted. \ No newline at end of file diff --git a/data/2024/aaai/Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization b/data/2024/aaai/Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization new file mode 100644 index 0000000000..6b89fa62f3 --- /dev/null +++ b/data/2024/aaai/Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization @@ -0,0 +1 @@ +While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein. Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences, recombination of mutations, and running new rounds of screening. To enhance the efficiency of such a process, we propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence with the guidance of a bandit machine learning model. 
Under simplified assumptions and a Gaussian Process prior, we provide theoretical analysis and a Bayesian regret bound, demonstrating that the method can efficiently discover a near-optimal design. The full algorithm is compatible with a suite of randomized tree search heuristics, machine learning models, pre-trained embeddings, and bandit techniques. We test various instances of the algorithm across benchmark protein datasets using simulated screens. Experimental results demonstrate that the algorithm is sample-efficient and diversity-promoting, and is able to find top designs using reasonably small mutation counts. \ No newline at end of file diff --git a/data/2024/aaai/Tree-of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models b/data/2024/aaai/Tree-of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models new file mode 100644 index 0000000000..598fd8f25d --- /dev/null +++ b/data/2024/aaai/Tree-of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) have recently demonstrated remarkable performance across various Natural Language Processing tasks. In the field of multi-hop reasoning, the Chain-of-Thought (CoT) prompting method has emerged as a paradigm, using curated stepwise reasoning demonstrations to enhance LLMs' ability to reason and produce coherent reasoning paths. To ensure the accuracy, reliability, and traceability of the generated answers, many studies have incorporated information retrieval (IR) to provide LLMs with external knowledge. However, existing CoT-with-IR methods decompose questions into sub-questions based on a single compositionality type, which limits their effectiveness for questions involving multiple compositionality types. Additionally, these methods suffer from inefficient retrieval, as complex questions often contain abundant information, leading to the retrieval of irrelevant information inconsistent with the query's intent. In this work, we propose a novel question decomposition framework called TRQA for multi-hop question answering, which addresses these limitations. Our framework introduces a reasoning tree (RT) to represent the structure of complex questions. It consists of four components: the Reasoning Tree Constructor (RTC), the Question Generator (QG), the Retrieval and LLM Interaction Module (RAIL), and the Answer Aggregation Module (AAM). Specifically, the RTC predicts diverse sub-question structures to construct the reasoning tree, allowing a more comprehensive representation of complex questions. The QG generates sub-questions for the leaf nodes in the reasoning tree, and we explore two methods for QG: prompt-based and T5-based approaches. The IR module retrieves documents aligned with the sub-questions, while the LLM formulates answers based on the retrieved information. Finally, the AAM aggregates answers along the reasoning tree, producing a definitive response from bottom to top. 
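A small Python sketch of the bottom-up traversal implied by the TRQA abstract above: leaf sub-questions are answered from retrieved documents, and each internal node aggregates its children's answers into the parent question. The callables retrieve and ask_llm are hypothetical stand-ins for the retrieval component and the LLM; this is an editorial illustration, not the authors' implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class RTNode:
    question: str
    children: List["RTNode"] = field(default_factory=list)

def answer_tree(node, retrieve, ask_llm):
    if not node.children:
        # Leaf: retrieve evidence for the sub-question, then answer it.
        docs = retrieve(node.question)
        return ask_llm(question=node.question, context=docs)
    # Internal node: answer the children first, then aggregate bottom-up.
    child_answers = [answer_tree(c, retrieve, ask_llm) for c in node.children]
    context = "\n".join(f"{c.question} -> {a}" for c, a in zip(node.children, child_answers))
    return ask_llm(question=node.question, context=context)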
\ No newline at end of file diff --git a/data/2024/aaai/Trend-Aware Supervision: On Learning Invariance for Semi-supervised Facial Action Unit Intensity Estimation b/data/2024/aaai/Trend-Aware Supervision: On Learning Invariance for Semi-supervised Facial Action Unit Intensity Estimation new file mode 100644 index 0000000000..afe2663971 --- /dev/null +++ b/data/2024/aaai/Trend-Aware Supervision: On Learning Invariance for Semi-supervised Facial Action Unit Intensity Estimation @@ -0,0 +1 @@ +With the increasing need for facial behavior analysis, semi-supervised AU intensity estimation using only keyframe annotations has emerged as a practical and effective solution to relieve the burden of annotation. However, the lack of annotations makes the spurious correlation problem caused by AU co-occurrences and subject variation much more prominent, leading to non-robust intensity estimation that is entangled among AUs and biased among subjects. We observe that trend information inherent in keyframe annotations could act as extra supervision, and that raising awareness of AU-specific facial appearance changing trends during training is the key to learning invariant AU-specific features. To this end, we propose Trend-Aware Supervision (TAS), which pursues three kinds of trend awareness, including intra-trend ranking awareness, intra-trend speed awareness, and inter-trend subject awareness. TAS alleviates the spurious correlation problem by raising trend awareness during training to learn AU-specific features that represent the corresponding facial appearance changes, thereby achieving intensity estimation invariance. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of each kind of awareness. Moreover, under trend-aware supervision, performance can be improved without extra computational or storage costs during inference. \ No newline at end of file diff --git a/data/2024/aaai/TriSampler: A Better Negative Sampling Principle for Dense Retrieval b/data/2024/aaai/TriSampler: A Better Negative Sampling Principle for Dense Retrieval new file mode 100644 index 0000000000..56f5d32700 --- /dev/null +++ b/data/2024/aaai/TriSampler: A Better Negative Sampling Principle for Dense Retrieval @@ -0,0 +1 @@ +Negative sampling stands as a pivotal technique in dense retrieval, essential for training effective retrieval models and significantly impacting retrieval performance. While existing negative sampling methods have made commendable progress by leveraging hard negatives, a comprehensive guiding principle for constructing negative candidates and designing negative sampling distributions is still lacking. To bridge this gap, we embark on a theoretical analysis of negative sampling in dense retrieval. This exploration culminates in the unveiling of the quasi-triangular principle, a novel framework that elucidates the triangular-like interplay between query, positive document, and negative document. Fueled by this guiding principle, we introduce TriSampler, a straightforward yet highly effective negative sampling method. The key point of TriSampler lies in its ability to selectively sample more informative negatives within a prescribed constrained region. Experimental evaluation shows that TriSampler consistently attains superior retrieval performance across a diverse set of representative retrieval models. 
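In the spirit of the constrained-region sampling that the TriSampler abstract above describes, the following Python sketch keeps only candidate negatives whose query similarity falls just below that of the positive document and samples from that pool. The region definition and the sim_margin parameter are assumptions for illustration; the paper's exact quasi-triangular rule may differ.

import numpy as np

def constrained_negative_sampling(q, pos, candidates, k, sim_margin=0.1, rng=None):
    rng = rng or np.random.default_rng()

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    s_pos = cos(q, pos)
    # Constrained region: informative negatives close to, but not above, the positive's similarity.
    region = [n for n in candidates if s_pos - sim_margin <= cos(q, n) < s_pos]
    pool = region if len(region) >= k else candidates  # fall back if the region is too small
    idx = rng.choice(len(pool), size=min(k, len(pool)), replace=False)
    return [pool[i] for i in idx]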
\ No newline at end of file diff --git a/data/2024/aaai/Triple Feature Disentanglement for One-Stage Adaptive Object Detection b/data/2024/aaai/Triple Feature Disentanglement for One-Stage Adaptive Object Detection new file mode 100644 index 0000000000..af8a1cebce --- /dev/null +++ b/data/2024/aaai/Triple Feature Disentanglement for One-Stage Adaptive Object Detection @@ -0,0 +1 @@ +In recent advancements concerning Domain Adaptive Object Detection (DAOD), unsupervised domain adaptation techniques have proven instrumental. These methods enable enhanced detection capabilities within unlabeled target domains by mitigating distribution differences between source and target domains. A subset of DAOD methods employs disentangled learning to segregate Domain-Specific Representations (DSR) and Domain-Invariant Representations (DIR), with ultimate predictions relying on the latter. Current practices in disentanglement, however, often lead to DIR containing residual domain-specific information. To address this, we introduce the Multi-level Disentanglement Module (MDM) that progressively disentangles DIR, enhancing comprehensive disentanglement. Additionally, our proposed Cyclic Disentanglement Module (CDM) facilitates DSR separation. To refine the process further, we employ the Categorical Features Disentanglement Module (CFDM) to isolate DIR and DSR, coupled with category alignment across scales for improved source-target domain alignment. Given its practical suitability, our model is constructed upon the foundational framework of the Single Shot MultiBox Detector (SSD), which is a one-stage object detection approach. Experimental validation highlights the effectiveness of our method, demonstrating its state-of-the-art performance across three benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Trust Region Methods for Nonconvex Stochastic Optimization beyond Lipschitz Smoothness b/data/2024/aaai/Trust Region Methods for Nonconvex Stochastic Optimization beyond Lipschitz Smoothness new file mode 100644 index 0000000000..06c5b6f01f --- /dev/null +++ b/data/2024/aaai/Trust Region Methods for Nonconvex Stochastic Optimization beyond Lipschitz Smoothness @@ -0,0 +1,4 @@ +In many important machine learning applications, the standard assumption of having a globally Lipschitz continuous gradient may fail to hold. This paper delves into a more general (L0, L1)-smoothness setting, which gains particular significance within the realms of deep neural networks and distributionally robust optimization (DRO). We demonstrate the significant advantage of trust region methods for stochastic nonconvex optimization under such generalized smoothness assumption. + We show that first-order trust region methods can recover the normalized and clipped stochastic gradient as special cases and then provide a unified analysis to show their convergence to first-order stationary conditions. + Motivated by the important application of DRO, we propose a generalized high-order smoothness condition, under which second-order trust region methods can achieve a complexity of O(epsilon(-3.5)) for convergence to second-order stationary points. By incorporating variance reduction, the second-order trust region method obtains an even better complexity of O(epsilon(-3)), matching the optimal bound for standard smooth optimization. To our best knowledge, this is the first work to show convergence beyond the first-order stationary condition for generalized smooth optimization. 
+ Preliminary experiments show that our proposed algorithms perform favorably compared with existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning b/data/2024/aaai/Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning new file mode 100644 index 0000000000..1109a82835 --- /dev/null +++ b/data/2024/aaai/Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning @@ -0,0 +1 @@ +Despite the great success of large language models (LLMs) in various tasks, they suffer from generating hallucinations. We introduce Truth Forest, a method that enhances truthfulness in LLMs by uncovering hidden truth representations using multi-dimensional orthogonal probes. Specifically, it creates multiple orthogonal bases for modeling truth by incorporating orthogonal constraints into the probes. Moreover, we introduce Random Peek, a systematic technique considering an extended range of positions within the sequence, reducing the gap between discerning and generating truth features in LLMs. By employing this approach, we improved the truthfulness of Llama-2-7B from 40.8% to 74.5% on TruthfulQA. Likewise, significant improvements are observed in fine-tuned models. We conducted a thorough analysis of truth features using probes. Our visualization results show that orthogonal probes capture complementary truth-related features, forming well-defined clusters that reveal the inherent structure of the dataset. \ No newline at end of file diff --git a/data/2024/aaai/Tuning-Free Inversion-Enhanced Control for Consistent Image Editing b/data/2024/aaai/Tuning-Free Inversion-Enhanced Control for Consistent Image Editing new file mode 100644 index 0000000000..199d25cdd0 --- /dev/null +++ b/data/2024/aaai/Tuning-Free Inversion-Enhanced Control for Consistent Image Editing @@ -0,0 +1 @@ +Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes. To guarantee consistent attributes, some existing methods fine-tune the entire model or the textual embedding for structural consistency, but they are time-consuming and fail to perform non-rigid edits. Other works are tuning-free, but their performances are weakened by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction. Specifically, our method effectively obtains inversion features from the key and value features in the self-attention layers, and enhances the sampling process by these inversion features, thus achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces impressive results in various settings. 
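The core mechanism in the TIC abstract above, correlating features saved during inversion with the sampling process, can be pictured as self-attention whose keys and values are augmented with the stored inversion features so that queries can attend to both. The snippet below is a generic Python/PyTorch sketch of that idea under assumed (batch, tokens, dim) shapes; it is not the TIC implementation.

import torch

def attention_with_inversion_kv(q, k_sample, v_sample, k_inv, v_inv):
    # Concatenate sampling-time and stored inversion keys/values along the token axis.
    k = torch.cat([k_sample, k_inv], dim=1)
    v = torch.cat([v_sample, v_inv], dim=1)
    # Standard scaled dot-product attention over the enlarged key/value set.
    attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    return attn @ v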
\ No newline at end of file diff --git a/data/2024/aaai/TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients b/data/2024/aaai/TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients new file mode 100644 index 0000000000..7b0df057a9 --- /dev/null +++ b/data/2024/aaai/TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients @@ -0,0 +1 @@ +Federated learning is a distributed collaborative machine learning paradigm that has gained strong momentum in recent years. In federated learning, a central server periodically coordinates models with clients and aggregates the models trained locally by clients without necessitating access to local data. Despite its potential, the implementation of federated learning continues to encounter several challenges, predominantly the slow convergence that is largely due to data heterogeneity. The slow convergence becomes particularly problematic in cross-device federated learning scenarios where clients may be strongly limited by computing power and storage space, and hence counteracting methods that induce additional computation or memory cost on the client side such as auxiliary objective terms and larger training iterations can be impractical. In this paper, we propose a novel federated aggregation strategy, TurboSVM-FL, that poses no additional computation burden on the client side and can significantly accelerate convergence for federated classification task, especially when clients are "lazy" and train their models solely for few epochs for next global aggregation. TurboSVM-FL extensively utilizes support vector machine to conduct selective aggregation and max-margin spread-out regularization on class embeddings. We evaluate TurboSVM-FL on multiple datasets including FEMNIST, CelebA, and Shakespeare using user-independent validation with non-iid data distribution. Our results show that TurboSVM-FL can significantly outperform existing popular algorithms on convergence rate and reduce communication rounds while delivering better test metrics including accuracy, F1 score, and MCC. \ No newline at end of file diff --git a/data/2024/aaai/Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data b/data/2024/aaai/Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data new file mode 100644 index 0000000000..04376a53d3 --- /dev/null +++ b/data/2024/aaai/Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data @@ -0,0 +1 @@ +Large Language Models (LLMs) have performed well on various reasoning tasks, but their inaccessibility and numerous parameters hinder wide application in practice. One promising way is distilling the reasoning ability from LLMs to small models by the generated chain-of-thought reasoning paths. In some cases, however, LLMs may produce incorrect reasoning chains, especially when facing complex mathematical problems. Previous studies only transfer knowledge from positive samples and drop the synthesized data with wrong answers. In this work, we illustrate the merit of negative data and propose a model specialization framework to distill LLMs with negative samples besides positive ones. The framework consists of three progressive steps, covering from training to inference stages, to absorb knowledge from negative data. 
We conduct extensive experiments across arithmetic reasoning tasks to demonstrate the role of negative data in distillation from LLM. \ No newline at end of file diff --git a/data/2024/aaai/Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks b/data/2024/aaai/Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks new file mode 100644 index 0000000000..0ad1cd9526 --- /dev/null +++ b/data/2024/aaai/Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks @@ -0,0 +1 @@ +Continuous Conditional Generative Adversarial Networks (CcGANs) enable generative modeling conditional on continuous scalar variables (termed regression labels). However, they can produce subpar fake images due to limited training data. Although Negative Data Augmentation (NDA) effectively enhances unconditional and class-conditional GANs by introducing anomalies into real training images, guiding the GANs away from low-quality outputs, its impact on CcGANs is limited, as it fails to replicate negative samples that may occur during the CcGAN sampling. We present a novel NDA approach called Dual-NDA specifically tailored for CcGANs to address this problem. Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN and label-inconsistent images created by manipulating real images' labels. Leveraging these negative samples, we introduce a novel discriminator objective alongside a modified CcGAN training algorithm. Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA consistently enhances the visual fidelity and label consistency of fake images generated by CcGANs, exhibiting a substantial performance gain over the vanilla NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable advancement beyond the capabilities of state-of-the-art conditional GANs and diffusion models, establishing a new pinnacle of performance. Our codes can be found at https://github.com/UBCDingXin/Dual-NDA. \ No newline at end of file diff --git a/data/2024/aaai/Two-Stage Evolutionary Reinforcement Learning for Enhancing Exploration and Exploitation b/data/2024/aaai/Two-Stage Evolutionary Reinforcement Learning for Enhancing Exploration and Exploitation new file mode 100644 index 0000000000..1aa6c0a38a --- /dev/null +++ b/data/2024/aaai/Two-Stage Evolutionary Reinforcement Learning for Enhancing Exploration and Exploitation @@ -0,0 +1 @@ +The integration of Evolutionary Algorithm (EA) and Reinforcement Learning (RL) has emerged as a promising approach for tackling some challenges in RL, such as sparse rewards, lack of exploration, and brittle convergence properties. However, existing methods often employ actor networks as individuals of EA, which may constrain their exploratory capabilities, as the entire actor population will stop evolution when the critic network in RL falls into local optimal. To alleviate this issue, this paper introduces a Two-stage Evolutionary Reinforcement Learning (TERL) framework that maintains a population containing both actor and critic networks. TERL divides the learning process into two stages. In the initial stage, individuals independently learn actor-critic networks, which are optimized alternatively by RL and Particle Swarm Optimization (PSO). 
This dual optimization fosters greater exploration, curbing susceptibility to local optima. Shared information from a common replay buffer and the PSO algorithm substantially mitigates the computational load of training multiple agents. In the subsequent stage, TERL shifts to a refined exploitation phase. Here, only the best individual undergoes further refinement, while the remaining individuals continue PSO-based optimization. This allocates more computational resources to the best individual, yielding superior performance. Empirical assessments, conducted across a range of continuous control problems, validate the efficacy of the proposed TERL paradigm. \ No newline at end of file diff --git a/data/2024/aaai/U-Mixer: An Unet-Mixer Architecture with Stationarity Correction for Time Series Forecasting b/data/2024/aaai/U-Mixer: An Unet-Mixer Architecture with Stationarity Correction for Time Series Forecasting new file mode 100644 index 0000000000..f67c4b2455 --- /dev/null +++ b/data/2024/aaai/U-Mixer: An Unet-Mixer Architecture with Stationarity Correction for Time Series Forecasting @@ -0,0 +1 @@ +Time series forecasting is a crucial task in various domains. Caused by factors such as trends, seasonality, or irregular fluctuations, time series often exhibit non-stationarity. This non-stationarity obstructs stable feature propagation through deep layers, disrupts feature distributions, and complicates learning data distribution changes. As a result, many existing models struggle to capture the underlying patterns, leading to degraded forecasting performance. In this study, we tackle the challenge of non-stationarity in time series forecasting with our proposed framework called U-Mixer. By combining Unet and Mixer, U-Mixer effectively captures local temporal dependencies between different patches and channels separately to avoid the influence of distribution variations among channels, and merges low- and high-level features to obtain comprehensive data representations. The key contribution is a novel stationarity correction method, which explicitly restores the data distribution by constraining the difference in stationarity between the data before and after model processing to recover the non-stationarity information, while ensuring the temporal dependencies are preserved. Through extensive experiments on various real-world time series datasets, U-Mixer demonstrates its effectiveness and robustness, and achieves 14.5% and 7.7% improvements over state-of-the-art (SOTA) methods. \ No newline at end of file diff --git a/data/2024/aaai/U-trustworthy Models. Reliability, Competence, and Confidence in Decision-Making b/data/2024/aaai/U-trustworthy Models. Reliability, Competence, and Confidence in Decision-Making new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/UCMCTrack: Multi-Object Tracking with Uniform Camera Motion Compensation b/data/2024/aaai/UCMCTrack: Multi-Object Tracking with Uniform Camera Motion Compensation new file mode 100644 index 0000000000..8bd819b790 --- /dev/null +++ b/data/2024/aaai/UCMCTrack: Multi-Object Tracking with Uniform Camera Motion Compensation @@ -0,0 +1 @@ +Multi-object tracking (MOT) in video sequences remains a challenging task, especially in scenarios with significant camera movements. This is because targets can drift considerably on the image plane, leading to erroneous tracking outcomes. Addressing such challenges typically requires supplementary appearance cues or Camera Motion Compensation (CMC). 
While these strategies are effective, they also introduce a considerable computational burden, posing challenges for real-time MOT. In response to this, we introduce UCMCTrack, a novel motion model-based tracker robust to camera movements. Unlike conventional CMC that computes compensation parameters frame-by-frame, UCMCTrack consistently applies the same compensation parameters throughout a video sequence. It employs a Kalman filter on the ground plane and introduces the Mapped Mahalanobis Distance (MMD) as an alternative to the traditional Intersection over Union (IoU) distance measure. By leveraging projected probability distributions on the ground plane, our approach efficiently captures motion patterns and adeptly manages uncertainties introduced by homography projections. Remarkably, UCMCTrack, relying solely on motion cues, achieves state-of-the-art performance across a variety of challenging datasets, including MOT17, MOT20, DanceTrack and KITTI. More details and code are available at https://github.com/corfyi/UCMCTrack. \ No newline at end of file diff --git a/data/2024/aaai/UFDA: Universal Federated Domain Adaptation with Practical Assumptions b/data/2024/aaai/UFDA: Universal Federated Domain Adaptation with Practical Assumptions new file mode 100644 index 0000000000..c156452912 --- /dev/null +++ b/data/2024/aaai/UFDA: Universal Federated Domain Adaptation with Practical Assumptions @@ -0,0 +1 @@ +Conventional Federated Domain Adaptation (FDA) approaches usually demand an abundance of assumptions, which makes them significantly less feasible for real-world situations and introduces security hazards. This paper relaxes the assumptions from previous FDAs and studies a more practical scenario named Universal Federated Domain Adaptation (UFDA). It only requires the black-box model and the label set information of each source domain, while the label sets of different source domains could be inconsistent, and the target-domain label set is totally blind. Towards a more effective solution for our newly proposed UFDA scenario, we propose a corresponding methodology called Hot-Learning with Contrastive Label Disambiguation (HCLD). It particularly tackles UFDA's domain shifts and category gaps problems by using one-hot outputs from the black-box models of various source domains. Moreover, to better distinguish the shared and unknown classes, we further present a cluster-level strategy named Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer classes from both source and target domains. Extensive experiments on three benchmark datasets demonstrate that our method achieves comparable performance for our UFDA scenario with much fewer assumptions, compared to previous methodologies with comprehensive additional assumptions. \ No newline at end of file diff --git a/data/2024/aaai/UMA: Facilitating Backdoor Scanning via Unlearning-Based Model Ablation b/data/2024/aaai/UMA: Facilitating Backdoor Scanning via Unlearning-Based Model Ablation new file mode 100644 index 0000000000..6bb068b99f --- /dev/null +++ b/data/2024/aaai/UMA: Facilitating Backdoor Scanning via Unlearning-Based Model Ablation @@ -0,0 +1 @@ +Recent advances in backdoor attacks, like leveraging complex triggers or stealthy implanting techniques, have introduced new challenges in backdoor scanning, limiting the usability of Deep Neural Networks (DNNs) in various scenarios. 
In this paper, we propose Unlearning-based Model Ablation (UMA), a novel approach to facilitate backdoor scanning and defend against advanced backdoor attacks. UMA filters out backdoor-irrelevant features by ablating the inherent features of the target class within the model and subsequently reveals the backdoor through dynamic trigger optimization. We evaluate our method on 1700 models (700 benign and 1000 trojaned) with 6 model structures, 7 different backdoor attacks and 4 datasets. Our results demonstrate that the proposed methodology effectively detects these advanced backdoors. Specifically, our method can achieve 91% AUC-ROC and 86.6% detection accuracy on average, which outperforms the baselines, including Neural Cleanse, ABS, K-Arm and MNTD. \ No newline at end of file diff --git a/data/2024/aaai/UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer b/data/2024/aaai/UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer new file mode 100644 index 0000000000..986f0a74ed --- /dev/null +++ b/data/2024/aaai/UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer @@ -0,0 +1 @@ +Traditional channel-wise pruning methods, which reduce network channels, struggle to effectively prune efficient CNN models with depth-wise convolutional layers and certain efficient modules, such as popular inverted residual blocks. Prior depth pruning methods, which reduce network depth, are not suitable for pruning some efficient models due to the existence of some normalization layers. Moreover, fine-tuning a subnet after directly removing activation layers would corrupt the original model weights, hindering the pruned model from achieving high performance. To address these issues, we propose a novel depth pruning method for efficient models. Our approach proposes a novel block pruning strategy and progressive training method for the subnet. Additionally, we extend our pruning method to vision transformer models. Experimental results demonstrate that our method consistently outperforms existing depth pruning methods across various pruning configurations. By applying our method to ConvNeXtV1, we obtained three pruned ConvNeXtV1 models that surpass most SOTA efficient models with comparable inference performance. Our method also achieves state-of-the-art pruning performance on the vision transformer model. \ No newline at end of file diff --git a/data/2024/aaai/UV-SAM: Adapting Segment Anything Model for Urban Village Identification b/data/2024/aaai/UV-SAM: Adapting Segment Anything Model for Urban Village Identification new file mode 100644 index 0000000000..f14f62ddf3 --- /dev/null +++ b/data/2024/aaai/UV-SAM: Adapting Segment Anything Model for Urban Village Identification @@ -0,0 +1 @@ +Urban villages, defined as informal residential areas in or around urban centers, are characterized by inadequate infrastructures and poor living conditions, closely related to the Sustainable Development Goals (SDGs) on poverty, adequate housing, and sustainable cities. Traditionally, governments heavily depend on field survey methods to monitor the urban villages, which, however, are time-consuming, labor-intensive, and possibly delayed. Thanks to widely available and timely updated satellite images, recent studies develop computer vision techniques to detect urban villages efficiently. However, existing studies either focus on simple urban village image classification or fail to provide accurate boundary information. 
To accurately identify urban village boundaries from satellite images, we harness the power of a vision foundation model and adapt the Segment Anything Model (SAM) to urban village segmentation, yielding UV-SAM. Specifically, UV-SAM first leverages a small-sized semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification. Extensive experimental results on two datasets in China demonstrate that UV-SAM outperforms existing baselines, and identification results over multiple years show that both the number and area of urban villages are decreasing over time, providing deeper insights into the development trends of urban villages and shedding light on vision foundation models for sustainable cities. The dataset and codes of this study are available at https://github.com/tsinghua-fib-lab/UV-SAM. \ No newline at end of file diff --git a/data/2024/aaai/UVAGaze: Unsupervised 1-to-2 Views Adaptation for Gaze Estimation b/data/2024/aaai/UVAGaze: Unsupervised 1-to-2 Views Adaptation for Gaze Estimation new file mode 100644 index 0000000000..e73cbd5e65 --- /dev/null +++ b/data/2024/aaai/UVAGaze: Unsupervised 1-to-2 Views Adaptation for Gaze Estimation @@ -0,0 +1 @@ +Gaze estimation has become a subject of growing interest in recent research. Most of the current methods rely on single-view facial images as input. Yet, it is hard for these approaches to handle large head angles, leading to potential inaccuracies in the estimation. To address this issue, adding a second-view camera can help better capture eye appearance. However, existing multi-view methods have two limitations. 1) They require multi-view annotations for training, which are expensive. 2) More importantly, during testing, the exact positions of the multiple cameras must be known and match those used in training, which limits the application scenario. To address these challenges, we propose a novel 1-view-to-2-views (1-to-2 views) adaptation solution in this paper, the Unsupervised 1-to-2 Views Adaptation framework for Gaze estimation (UVAGaze). Our method adapts a traditional single-view gaze estimator for flexibly placed dual cameras. Here, "flexibly" means we place the dual cameras in arbitrary places regardless of the training data, without knowing their extrinsic parameters. Specifically, UVAGaze builds a dual-view mutual supervision adaptation strategy, which takes advantage of the intrinsic consistency of gaze directions between both views. In this way, our method can not only benefit from common single-view pre-training, but also achieve more advanced dual-view gaze estimation. The experimental results show that a single-view estimator, when adapted for dual views, can achieve much higher accuracy, especially in cross-dataset settings, with a substantial improvement of 47.0%. Project page: https://github.com/MickeyLLG/UVAGaze. 
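The dual-view mutual supervision in the UVAGaze abstract can be pictured with a small sketch: each view's prediction, mapped through a jointly learned relative rotation, supervises the other view. The rotation parameterization, the stop-gradient placement, and the cosine form of the loss below are assumptions of this sketch rather than the released method.

import torch
import torch.nn.functional as F

def mutual_consistency_loss(gaze_a, gaze_b, rot_ab):
    """gaze_a, gaze_b: (B, 3) unit gaze vectors predicted from views A and B.
    rot_ab: (3, 3) learnable rotation mapping view-A coordinates into view-B coordinates."""
    gaze_a_in_b = F.normalize(gaze_a @ rot_ab.T, dim=-1)
    # Symmetric supervision: each view is pulled toward the (detached) prediction of the other.
    loss_a = (1 - F.cosine_similarity(gaze_a_in_b, gaze_b.detach(), dim=-1)).mean()
    loss_b = (1 - F.cosine_similarity(gaze_b, gaze_a_in_b.detach(), dim=-1)).mean()
    return loss_a + loss_b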
\ No newline at end of file diff --git a/data/2024/aaai/Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation b/data/2024/aaai/Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation new file mode 100644 index 0000000000..9e72a117b0 --- /dev/null +++ b/data/2024/aaai/Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation @@ -0,0 +1 @@ +Accurately detecting multiple change-points is critical for various applications, but determining the optimal number of change-points remains a challenge. Existing approaches based on information criteria attempt to balance goodness-of-fit and model complexity, but their performance varies depending on the model. Recently, data-driven selection criteria based on cross-validation have been proposed, but these methods can be prone to slight overfitting in finite samples. In this paper, we introduce a method that controls the probability of overestimation and provides uncertainty quantification for learning multiple change-points via cross-validation. We frame this problem as a sequence of model comparison problems and leverage high-dimensional inferential procedures. We demonstrate the effectiveness of our approach through experiments on finite-sample data, showing superior uncertainty quantification for overestimation compared to existing methods. Our approach has broad applicability and can be used in diverse change-point models. \ No newline at end of file diff --git a/data/2024/aaai/Uncertainty Quantification for Forward and Inverse Problems of PDEs via Latent Global Evolution b/data/2024/aaai/Uncertainty Quantification for Forward and Inverse Problems of PDEs via Latent Global Evolution new file mode 100644 index 0000000000..62862be902 --- /dev/null +++ b/data/2024/aaai/Uncertainty Quantification for Forward and Inverse Problems of PDEs via Latent Global Evolution @@ -0,0 +1 @@ +Deep learning-based surrogate models have demonstrated remarkable advantages over classical solvers in terms of speed, often achieving speedups of 10 to 1000 times over traditional partial differential equation (PDE) solvers. However, a significant challenge hindering their widespread adoption in both scientific and industrial domains is the lack of understanding about their prediction uncertainties, particularly in scenarios that involve critical decision making. To address this limitation, we propose a method that integrates efficient and precise uncertainty quantification into a deep learning-based surrogate model. Our method, termed Latent Evolution of PDEs with Uncertainty Quantification (LE-PDE-UQ), endows deep learning-based surrogate models with robust and efficient uncertainty quantification capabilities for both forward and inverse problems. LE-PDE-UQ leverages latent vectors within a latent space to evolve both the system's state and its corresponding uncertainty estimation. The latent vectors are decoded to provide predictions for the system's state as well as estimates of its uncertainty. In extensive experiments, we demonstrate the accurate uncertainty quantification performance of our approach, surpassing that of strong baselines including deep ensembles, Bayesian neural network layers, and dropout. Our method excels at propagating uncertainty over extended auto-regressive rollouts, making it suitable for scenarios involving long-term predictions. Our code is available at: https://github.com/AI4Science-WestlakeU/le-pde-uq. 
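A minimal sketch of the latent-evolution idea in the LE-PDE-UQ abstract: the state and its uncertainty are carried as two halves of a latent vector, rolled out autoregressively in latent space, and decoded separately into a prediction and an uncertainty estimate. Layer choices and shapes are illustrative assumptions, not the released architecture.

import torch
import torch.nn as nn

class ToyLatentEvolverUQ(nn.Module):
    def __init__(self, field_dim=64, latent_dim=128):
        super().__init__()
        self.encode = nn.Linear(field_dim, 2 * latent_dim)        # state latent + uncertainty latent
        self.evolve = nn.Sequential(nn.Linear(2 * latent_dim, 2 * latent_dim), nn.Tanh(),
                                    nn.Linear(2 * latent_dim, 2 * latent_dim))
        self.decode_mean = nn.Linear(latent_dim, field_dim)
        self.decode_logvar = nn.Linear(latent_dim, field_dim)

    def forward(self, u0, steps):
        z = self.encode(u0)                                        # (B, 2*latent_dim)
        means, logvars = [], []
        for _ in range(steps):                                     # autoregressive rollout in latent space
            z = z + self.evolve(z)                                 # residual latent update
            z_state, z_unc = z.chunk(2, dim=-1)
            means.append(self.decode_mean(z_state))
            logvars.append(self.decode_logvar(z_unc))
        return torch.stack(means, dim=1), torch.stack(logvars, dim=1)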
\ No newline at end of file diff --git a/data/2024/aaai/Uncertainty Quantification in Heterogeneous Treatment Effect Estimation with Gaussian-Process-Based Partially Linear Model b/data/2024/aaai/Uncertainty Quantification in Heterogeneous Treatment Effect Estimation with Gaussian-Process-Based Partially Linear Model new file mode 100644 index 0000000000..f158d6d351 --- /dev/null +++ b/data/2024/aaai/Uncertainty Quantification in Heterogeneous Treatment Effect Estimation with Gaussian-Process-Based Partially Linear Model @@ -0,0 +1 @@ +Estimating heterogeneous treatment effects across individuals has attracted growing attention as a statistical tool for performing critical decision-making. We propose a Bayesian inference framework that quantifies the uncertainty in treatment effect estimation to support decision-making in a relatively small sample size setting. Our proposed model places Gaussian process priors on the nonparametric components of a semiparametric model called a partially linear model. This model formulation has three advantages. First, we can analytically compute the posterior distribution of a treatment effect without relying on the computationally demanding posterior approximation. Second, we can guarantee that the posterior distribution concentrates around the true one as the sample size goes to infinity. Third, we can incorporate prior knowledge about a treatment effect into the prior distribution, improving the estimation efficiency. Our experimental results show that even in the small sample size setting, our method can accurately estimate the heterogeneous treatment effects and effectively quantify its estimation uncertainty. \ No newline at end of file diff --git a/data/2024/aaai/Uncertainty Regularized Evidential Regression b/data/2024/aaai/Uncertainty Regularized Evidential Regression new file mode 100644 index 0000000000..6401340c28 --- /dev/null +++ b/data/2024/aaai/Uncertainty Regularized Evidential Regression @@ -0,0 +1 @@ +The Evidential Regression Network (ERN) represents a novel approach that integrates deep learning with Dempster-Shafer's theory to predict a target and quantify the associated uncertainty. Guided by the underlying theory, specific activation functions must be employed to enforce non-negative values, which is a constraint that compromises model performance by limiting its ability to learn from all samples. This paper provides a theoretical analysis of this limitation and introduces an improvement to overcome it. Initially, we define the region where the models can't effectively learn from the samples. Following this, we thoroughly analyze the ERN and investigate this constraint. Leveraging the insights from our analysis, we address the limitation by introducing a novel regularization term that empowers the ERN to learn from the whole training set. Our extensive experiments substantiate our theoretical findings and demonstrate the effectiveness of the proposed solution. \ No newline at end of file diff --git a/data/2024/aaai/Uncertainty-Aware GAN for Single Image Super Resolution b/data/2024/aaai/Uncertainty-Aware GAN for Single Image Super Resolution new file mode 100644 index 0000000000..3f777da83a --- /dev/null +++ b/data/2024/aaai/Uncertainty-Aware GAN for Single Image Super Resolution @@ -0,0 +1 @@ +Generative adversarial network (GAN) has become a popular tool in the perceptual-oriented single image super-resolution (SISR) for its excellent capability to hallucinate details. 
However, the performance of most GAN-based SISR methods is impeded due to the limited discriminative ability of their discriminators. Specifically, these discriminators only focus on the global image reconstruction quality and ignore the more fine-grained reconstruction quality for constraining the generator, as they predict the overall realness of an image instead of the pixel-level realness. Here, we first introduce uncertainty into the GAN and propose an Uncertainty-aware GAN (UGAN) to regularize SISR solutions, where the challenging pixels with large reconstruction uncertainty and importance (e.g., texture and edge) are prioritized for optimization. The uncertainty-aware adversarial training strategy enables the discriminator to capture the pixel-level SR uncertainty, which constrains the generator to focus on image areas with high reconstruction difficulty, while also improving the interpretability of the SR. To balance the weights of multiple training losses, we introduce an uncertainty-aware loss weighting strategy to adaptively learn the optimal loss weights. Extensive experiments demonstrate the effectiveness of our approach in extracting the SR uncertainty and the superiority of UGAN over state-of-the-art methods in terms of reconstruction accuracy and perceptual quality. \ No newline at end of file diff --git a/data/2024/aaai/Uncertainty-Aware Yield Prediction with Multimodal Molecular Features b/data/2024/aaai/Uncertainty-Aware Yield Prediction with Multimodal Molecular Features new file mode 100644 index 0000000000..44912e005e --- /dev/null +++ b/data/2024/aaai/Uncertainty-Aware Yield Prediction with Multimodal Molecular Features @@ -0,0 +1,2 @@ +Predicting chemical reaction yields is pivotal for efficient chemical synthesis, an area that focuses on the creation of novel compounds for diverse uses. +Yield prediction demands accurate representations of reactions for forecasting practical transformation rates. Yet, the uncertainty that pervades real-world situations prevents current models from excelling at this task, owing to the high sensitivity of yield activities and the uncertainty in yield measurements. Existing models often utilize single-modal feature representations, such as molecular fingerprints, SMILES sequences, or molecular graphs, which are not sufficient to capture the complex interactions and dynamic behavior of molecules in reactions. In this paper, we present an advanced Uncertainty-Aware Multimodal model (UAM) to tackle these challenges. Our approach seamlessly integrates data sources from multiple modalities by encompassing sequence representations, molecular graphs, and expert-defined chemical reaction features for a comprehensive representation of reactions. Additionally, we address both the model and data-based uncertainty, refining the model's predictive capability. Extensive experiments on three datasets, including two high-throughput experiment (HTE) datasets and one chemist-constructed Amide coupling reaction dataset, demonstrate that UAM outperforms the state-of-the-art methods. The code and used datasets are available at https://github.com/jychen229/Multimodal-reaction-yield-prediction. 
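As a toy illustration of the two ingredients named in the UAM abstract, multimodal fusion and data uncertainty, the sketch below fuses three modality embeddings and predicts a yield together with a log-variance trained by a Gaussian negative log-likelihood. The dimensions, the fusion layer, and the specific loss are assumptions of this sketch, not the UAM architecture.

import torch
import torch.nn as nn

class ToyUncertainYieldModel(nn.Module):
    def __init__(self, dim_seq=256, dim_graph=128, dim_expert=32, hidden=128):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(dim_seq + dim_graph + dim_expert, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.logvar_head = nn.Linear(hidden, 1)    # data (aleatoric) uncertainty

    def forward(self, seq_emb, graph_emb, expert_feat):
        h = self.fuse(torch.cat([seq_emb, graph_emb, expert_feat], dim=-1))
        return self.mean_head(h).squeeze(-1), self.logvar_head(h).squeeze(-1)

def heteroscedastic_nll(mean, logvar, target):
    # Gaussian negative log-likelihood: noisy yield measurements are down-weighted automatically.
    return (0.5 * torch.exp(-logvar) * (target - mean) ** 2 + 0.5 * logvar).mean()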
\ No newline at end of file diff --git a/data/2024/aaai/Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification b/data/2024/aaai/Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification new file mode 100644 index 0000000000..755ccc440e --- /dev/null +++ b/data/2024/aaai/Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification @@ -0,0 +1 @@ +Euphemisms are commonly used on social media and darknet marketplaces to evade platform regulations by masking their true meanings with innocent ones. For instance, “weed” is used instead of “marijuana” for illicit transactions. Thus, euphemism identification, i.e., mapping a given euphemism (“weed”) to its specific target word (“marijuana”), is essential for improving content moderation and combating underground markets. Existing methods employ self-supervised schemes to automatically construct labeled training datasets for euphemism identification. However, they overlook the text-text domain gap caused by the discrepancy between the constructed training data and the test data, leading to performance deterioration. In this paper, we present the text-text domain gap and explain how it forms in terms of the data distribution and the cone effect. Moreover, to bridge this gap, we introduce a feature alignment network (FA-Net), which can both align the in-domain and cross-domain features, thus mitigating the domain gap from training data to test data and improving the performance of the base models for euphemism identification. We apply this FA-Net to the base models, obtaining markedly better results, and creating a state-of-the-art model which beats the large language models. \ No newline at end of file diff --git a/data/2024/aaai/Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution b/data/2024/aaai/Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution new file mode 100644 index 0000000000..453035342a --- /dev/null +++ b/data/2024/aaai/Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution @@ -0,0 +1 @@ +Modern language modeling tasks are often underspecified: for a given token prediction, many words may satisfy the user’s intent of producing natural language at inference time, however only one word will minimize the task’s loss function at training time. We introduce a simple causal mechanism to describe the role underspecification plays in the generation of spurious correlations. Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods, that we apply to gendered pronoun resolution tasks on a wide range of LLMs to 1) aid in the detection of inference-time task underspecification by exploiting 2) previously unreported gender vs. time and gender vs. location spurious correlations on LLMs with a range of A) sizes: from BERT-base to GPT-3.5, B) pre-training objectives: from masked & autoregressive language modeling to a mixture of these objectives, and C) training stages: from pre-training only to reinforcement learning from human feedback (RLHF). Code and open-source demos available at https://github.com/2dot71mily/uspec. 
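The black-box probing style described in the underspecification abstract above can be illustrated with a toy template that varies an attribute that should be irrelevant (a year) and records how a masked language model's preference between gendered pronouns shifts. The model choice and template here are this sketch's own; the paper's evaluation code is linked in the abstract.

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")  # illustrative model choice

def pronoun_scores(year):
    out = fill(f"In {year}, the doctor said [MASK] would arrive soon.", targets=["he", "she"])
    return {o["token_str"]: o["score"] for o in out}

for year in (1920, 2020):
    # A large shift between years hints at a spurious gender-vs-time correlation.
    print(year, pronoun_scores(year))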
\ No newline at end of file diff --git a/data/2024/aaai/Understanding Distributed Representations of Concepts in Deep Neural Networks without Supervision b/data/2024/aaai/Understanding Distributed Representations of Concepts in Deep Neural Networks without Supervision new file mode 100644 index 0000000000..15004644fc --- /dev/null +++ b/data/2024/aaai/Understanding Distributed Representations of Concepts in Deep Neural Networks without Supervision @@ -0,0 +1 @@ +Understanding intermediate representations of the concepts learned by deep learning classifiers is indispensable for interpreting general model behaviors. Existing approaches to reveal learned concepts often rely on human supervision, such as pre-defined concept sets or segmentation processes. In this paper, we propose a novel unsupervised method for discovering distributed representations of concepts by selecting a principal subset of neurons. Our empirical findings demonstrate that instances with similar neuron activation states tend to share coherent concepts. Based on the observations, the proposed method selects principal neurons that construct an interpretable region, namely a Relaxed Decision Region (RDR), encompassing instances with coherent concepts in the feature space. It can be utilized to identify unlabeled subclasses within data and to detect the causes of misclassifications. Furthermore, the applicability of our method across various layers discloses distinct distributed representations over the layers, which provides deeper insights into the internal mechanisms of the deep learning model. \ No newline at end of file diff --git a/data/2024/aaai/Understanding Likelihood of Normalizing Flow and Image Complexity through the Lens of Out-of-Distribution Detection b/data/2024/aaai/Understanding Likelihood of Normalizing Flow and Image Complexity through the Lens of Out-of-Distribution Detection new file mode 100644 index 0000000000..bd9070eabc --- /dev/null +++ b/data/2024/aaai/Understanding Likelihood of Normalizing Flow and Image Complexity through the Lens of Out-of-Distribution Detection @@ -0,0 +1,8 @@ +Out-of-distribution (OOD) detection is crucial to safety-critical machine learning applications and has been extensively studied. +While recent studies have predominantly focused on classifier-based methods, research on deep generative model (DGM)-based methods has lagged behind. +This disparity may be attributed to a perplexing phenomenon: DGMs often assign higher likelihoods to unknown OOD inputs than to their known training data. +This paper focuses on explaining the underlying mechanism of this phenomenon. +We propose a hypothesis that less complex images concentrate in high-density regions in the latent space, resulting in a higher likelihood assignment in the Normalizing Flow (NF). +We experimentally demonstrate its validity for five NF architectures, concluding that their likelihood is untrustworthy. +Additionally, we show that this problem can be alleviated by treating image complexity as an independent variable. +Finally, we provide evidence of the potential applicability of our hypothesis in another DGM, PixelCNN++. 
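One concrete way to "treat image complexity as an independent variable", in the spirit of the abstract above and of earlier complexity-aware OOD scores, is to estimate complexity with a lossless compressor and examine the flow's negative log-likelihood relative to it. The PNG proxy and the simple subtraction below are assumptions of this sketch, not necessarily the paper's exact correction.

import io
import numpy as np
from PIL import Image

def complexity_bits_per_dim(img_uint8):
    """Proxy for image complexity: PNG-compressed size in bits per pixel-channel."""
    buf = io.BytesIO()
    Image.fromarray(img_uint8).save(buf, format="PNG")
    return 8.0 * buf.getbuffer().nbytes / img_uint8.size

def complexity_adjusted_score(nll_bits_per_dim, img_uint8):
    # Subtracting the complexity proxy removes the trend that makes simple
    # images look deceptively likely under the flow.
    return nll_bits_per_dim - complexity_bits_per_dim(img_uint8)

# Example with a random "image"; in practice nll_bits_per_dim comes from the trained flow.
dummy = (np.random.rand(32, 32, 3) * 255).astype(np.uint8)
print(complexity_adjusted_score(nll_bits_per_dim=3.2, img_uint8=dummy))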
\ No newline at end of file diff --git a/data/2024/aaai/Understanding Surprising Generalization Phenomena in Deep Learning b/data/2024/aaai/Understanding Surprising Generalization Phenomena in Deep Learning new file mode 100644 index 0000000000..22ee72c690 --- /dev/null +++ b/data/2024/aaai/Understanding Surprising Generalization Phenomena in Deep Learning @@ -0,0 +1 @@ +Deep learning has exhibited a number of surprising generalization phenomena that are not captured by classical statistical learning theory. This talk will survey some of my work on the theoretical characterizations of several such intriguing phenomena: (1) Implicit regularization: A major mystery in deep learning is that deep neural networks can often generalize well despite their excessive expressive capacity. Towards explaining this mystery, it has been suggested that commonly used gradient-based optimization algorithms enforce certain implicit regularization which effectively constrains the model capacity. (2) Benign overfitting: In certain scenarios, a model can perfectly fit noisily labeled training data, but still achieves near-optimal test error at the same time, which is very different from the classical notion of overfitting. (3) Grokking: In certain scenarios, a model initially achieves perfect training accuracy but no generalization (i.e., no better than a random predictor), and upon further training, transitions to almost perfect generalization. Theoretically establishing these properties often involves making appropriate high-dimensional assumptions on the problem as well as a careful analysis of the training dynamics. \ No newline at end of file diff --git a/data/2024/aaai/Understanding and Improving Optimization in Predictive Coding Networks b/data/2024/aaai/Understanding and Improving Optimization in Predictive Coding Networks new file mode 100644 index 0000000000..069f9c9e6b --- /dev/null +++ b/data/2024/aaai/Understanding and Improving Optimization in Predictive Coding Networks @@ -0,0 +1 @@ +Backpropagation (BP), the standard learning algorithm for artificial neural networks, is often considered biologically implausible. In contrast, the standard learning algorithm for predictive coding (PC) models in neuroscience, known as the inference learning algorithm (IL), is a promising, bio-plausible alternative. However, several challenges and questions hinder IL's application to real-world problems. For example, IL is computationally demanding, and without memory-intensive optimizers like Adam, IL may converge to poor local minima. Moreover, although IL can reduce loss more quickly than BP, the reasons for these speedups or their robustness remain unclear. In this paper, we tackle these challenges by 1) altering the standard implementation of PC circuits to substantially reduce computation, 2) developing a novel optimizer that improves the convergence of IL without increasing memory usage, and 3) establishing theoretical results that help elucidate the conditions under which IL is sensitive to second and higher-order information. \ No newline at end of file diff --git a/data/2024/aaai/Understanding and Leveraging the Learning Phases of Neural Networks b/data/2024/aaai/Understanding and Leveraging the Learning Phases of Neural Networks new file mode 100644 index 0000000000..2ab55a1dc7 --- /dev/null +++ b/data/2024/aaai/Understanding and Leveraging the Learning Phases of Neural Networks @@ -0,0 +1 @@ +The learning dynamics of deep neural networks are not well understood. 
The information bottleneck (IB) theory proclaimed separate fitting and compression phases. But they have since been heavily debated. We comprehensively analyze the learning dynamics by investigating a layer's reconstruction ability of the input and prediction performance based on the evolution of parameters during training. We empirically show the existence of three phases using common datasets and architectures such as ResNet and VGG: (i) near constant reconstruction loss, (ii) decrease, and (iii) increase. We also derive an empirically grounded data model and prove the existence of phases for single-layer networks. Technically, our approach leverages classical complexity analysis. It differs from IB by relying on measuring reconstruction loss rather than information theoretic measures to relate information of intermediate layers and inputs. Our work implies a new best practice for transfer learning: We show empirically that the pre-training of a classifier should stop well before its performance is optimal. \ No newline at end of file diff --git a/data/2024/aaai/Understanding the Generalization of Pretrained Diffusion Models on Out-of-Distribution Data b/data/2024/aaai/Understanding the Generalization of Pretrained Diffusion Models on Out-of-Distribution Data new file mode 100644 index 0000000000..cef45332b9 --- /dev/null +++ b/data/2024/aaai/Understanding the Generalization of Pretrained Diffusion Models on Out-of-Distribution Data @@ -0,0 +1 @@ +This work tackles the important task of understanding out-of-distribution behavior in two prominent types of generative models, i.e., GANs and Diffusion models. Understanding this behavior is crucial in understanding their broader utility and risks as these systems are increasingly deployed in our daily lives. Our first contribution is demonstrating that diffusion spaces outperform GANs' latent spaces in inverting high-quality OOD images. We also provide a theoretical analysis attributing this to the lack of prior holes in diffusion spaces. Our second significant contribution is to provide a theoretical hypothesis that diffusion spaces can be projected onto a bounded hypersphere, enabling image manipulation through geodesic traversal between inverted images. Our analysis shows that different geodesics share common attributes for the same manipulation, which we leverage to perform various image manipulations. We conduct thorough empirical evaluations to support and validate our claims. Finally, our third and final contribution introduces a novel approach to the few-shot sampling for out-of-distribution data by inverting a few images to sample from the cluster formed by the inverted latents. The proposed technique achieves state-of-the-art results for the few-shot generation task in terms of image quality. Our research underscores the promise of diffusion spaces in out-of-distribution imaging and offers avenues for further exploration. Please find more details about our project at \url{http://cvit.iiit.ac.in/research/projects/cvit-projects/diffusionOOD} \ No newline at end of file diff --git a/data/2024/aaai/Understanding the Role of the Projector in Knowledge Distillation b/data/2024/aaai/Understanding the Role of the Projector in Knowledge Distillation new file mode 100644 index 0000000000..079a32024c --- /dev/null +++ b/data/2024/aaai/Understanding the Role of the Projector in Knowledge Distillation @@ -0,0 +1 @@ +In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. 
In doing so we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet. Code and models are publicly available. \ No newline at end of file diff --git a/data/2024/aaai/Underwater Organism Color Fine-Tuning via Decomposition and Guidance b/data/2024/aaai/Underwater Organism Color Fine-Tuning via Decomposition and Guidance new file mode 100644 index 0000000000..a4b9ef4038 --- /dev/null +++ b/data/2024/aaai/Underwater Organism Color Fine-Tuning via Decomposition and Guidance @@ -0,0 +1 @@ +Due to wavelength-dependent light attenuation and scattering, the color of underwater organisms usually appears distorted. The existing underwater image enhancement methods mainly focus on designing networks capable of generating enhanced underwater organisms with fixed color. Due to the complexity of the underwater environment, ground truth labels are difficult to obtain, which means that no perfect enhancement effect exists. Different from the existing methods, this paper proposes an algorithm with color enhancement and color fine-tuning (CECF) capabilities. The color enhancement behavior of CECF is the same as that of existing methods, aiming to restore the color of distorted underwater organisms. Beyond this general purpose, the color fine-tuning behavior of CECF can adjust the color of organisms in a controlled manner, which can generate enhanced organisms with diverse colors. To achieve this purpose, four processes are used in CECF. A supervised enhancement process learns the mapping from a distorted image to an enhanced image by the decomposition of the color code. A self-reconstruction process and a cross-reconstruction process are used for content-invariant learning. A color fine-tuning process is designed based on guidance to obtain various enhanced results with different colors. Experimental results have proven the enhancement ability and color fine-tuning ability of the proposed CECF. The source code is provided at https://github.com/Xiaofeng-life/CECF. 
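A toy sketch of the decomposition idea behind CECF: an image is split into a spatial content code and a compact color code, and re-decoding the same content with an edited or interpolated color code yields organisms with adjusted colors. Module shapes and names are illustrative assumptions, not the released code linked above.

import torch
import torch.nn as nn

class ToyColorDecomposer(nn.Module):
    def __init__(self, ch=64, color_dim=8):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.color_enc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, color_dim))
        self.decoder = nn.Sequential(
            nn.Conv2d(ch + color_dim, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, img, color_code=None):
        content = self.content_enc(img)                    # (B, ch, H, W) content code
        if color_code is None:
            color_code = self.color_enc(img)               # (B, color_dim) color code
        c = color_code[:, :, None, None].expand(-1, -1, *content.shape[2:])
        return self.decoder(torch.cat([content, c], dim=1)), color_code

# Color fine-tuning then amounts to calling the model with the same image but a
# hand-edited or interpolated color_code instead of the one inferred from the input.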
\ No newline at end of file diff --git a/data/2024/aaai/Uni-MIS: United Multiple Intent Spoken Language Understanding via Multi-View Intent-Slot Interaction b/data/2024/aaai/Uni-MIS: United Multiple Intent Spoken Language Understanding via Multi-View Intent-Slot Interaction new file mode 100644 index 0000000000..6299a6ff9c --- /dev/null +++ b/data/2024/aaai/Uni-MIS: United Multiple Intent Spoken Language Understanding via Multi-View Intent-Slot Interaction @@ -0,0 +1 @@ +Multi-intent spoken language understanding (SLU) has become a research hotspot in the field of natural language processing (NLP) due to its ability to recognize and extract multiple expressed intents and annotate corresponding sequence slot tags within a single utterance. Previous research has primarily concentrated on the token-level intent-slot interaction to model joint intent detection and slot filling, which resulted in a failure to fully utilize anisotropic intent-guiding information during joint training. In this work, we present a novel architecture by modeling the multi-intent SLU as a multi-view intent-slot interaction. The architecture resolves the kernel bottleneck of unified multi-intent SLU by effectively modeling the intent-slot relations with utterance, chunk, and token-level interaction. We further develop a neural framework, namely Uni-MIS, in which the unified multi-intent SLU is modeled as a three-view intent-slot interaction fusion to better capture the interaction information after special encoding. A chunk-level intent detection decoder is used to sufficiently capture the multiple intents, and an adaptive intent-slot graph network is used to capture the fine-grained intent information to guide final slot filling. We perform extensive experiments on two widely used benchmark datasets for multi-intent SLU, where our model beats all the current strong baselines, pushing the state-of-the-art performance of unified multi-intent SLU. Additionally, the ChatGPT benchmark that we have developed demonstrates that there is a considerable amount of potential research value in the field of multi-intent SLU. \ No newline at end of file diff --git a/data/2024/aaai/UniADS: Universal Architecture-Distiller Search for Distillation Gap b/data/2024/aaai/UniADS: Universal Architecture-Distiller Search for Distillation Gap new file mode 100644 index 0000000000..7c64e42b33 --- /dev/null +++ b/data/2024/aaai/UniADS: Universal Architecture-Distiller Search for Distillation Gap @@ -0,0 +1 @@ +In this paper, we present UniADS, the first Universal Architecture-Distiller Search framework for co-optimizing student architecture and distillation policies. The teacher-student distillation gap limits distillation gains. Previous approaches seek to discover the ideal student architecture while ignoring distillation settings. In UniADS, we construct a comprehensive search space encompassing an architectural search for student models, knowledge transformations in distillation strategies, distance functions, loss weights, and other vital settings. To efficiently explore the search space, we utilize the NSGA-II genetic algorithm for better crossover and mutation configurations and employ the Successive Halving algorithm for search space pruning, resulting in improved search efficiency and promising results. Extensive experiments are performed on different teacher-student pairs using CIFAR-100 and ImageNet datasets. The experimental results consistently demonstrate the superiority of our method over existing approaches. 
Furthermore, we provide a detailed analysis of the search results, examining the impact of each variable and extracting valuable insights and practical guidance for distillation design and implementation. \ No newline at end of file diff --git a/data/2024/aaai/UniAP: Towards Universal Animal Perception in Vision via Few-Shot Learning b/data/2024/aaai/UniAP: Towards Universal Animal Perception in Vision via Few-Shot Learning new file mode 100644 index 0000000000..693b9ecede --- /dev/null +++ b/data/2024/aaai/UniAP: Towards Universal Animal Perception in Vision via Few-Shot Learning @@ -0,0 +1 @@ +Animal visual perception is an important technique for automatically monitoring animal health, understanding animal behaviors, and assisting animal-related research. However, it is challenging to design a deep learning-based perception model that can freely adapt to different animals across various perception tasks, due to the varying poses of a large diversity of animals, lacking data on rare species, and the semantic inconsistency of different tasks. We introduce UniAP, a novel Universal Animal Perception model that leverages few-shot learning to enable cross-species perception among various visual tasks. Our proposed model takes support images and labels as prompt guidance for a query image. Images and labels are processed through a Transformer-based encoder and a lightweight label encoder, respectively. Then a matching module is designed for aggregating information between prompt guidance and the query image, followed by a multi-head label decoder to generate outputs for various tasks. By capitalizing on the shared visual characteristics among different animals and tasks, UniAP enables the transfer of knowledge from well-studied species to those with limited labeled data or even unseen species. We demonstrate the effectiveness of UniAP through comprehensive experiments in pose estimation, segmentation, and classification tasks on diverse animal species, showcasing its ability to generalize and adapt to new classes with minimal labeled examples. \ No newline at end of file diff --git a/data/2024/aaai/UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding b/data/2024/aaai/UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding new file mode 100644 index 0000000000..8b81da00d9 --- /dev/null +++ b/data/2024/aaai/UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding @@ -0,0 +1 @@ +The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models are restricted to generate speech only in a left-to-right direction, making them unsuitable for speech editing where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, which have audio quality limitations imposed by the performance of audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav. 
CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and maintain seamless concatenation with the surrounding context. Following that, CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into waveforms, taking into consideration the acoustic context. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing. Audio samples are available at https://cpdu.github.io/unicats. \ No newline at end of file diff --git a/data/2024/aaai/UniCell: Universal Cell Nucleus Classification via Prompt Learning b/data/2024/aaai/UniCell: Universal Cell Nucleus Classification via Prompt Learning new file mode 100644 index 0000000000..11f8b551b5 --- /dev/null +++ b/data/2024/aaai/UniCell: Universal Cell Nucleus Classification via Prompt Learning @@ -0,0 +1 @@ +The recognition of multi-class cell nuclei can significantly facilitate the process of histopathological diagnosis. Numerous pathological datasets are currently available, but their annotations are inconsistent. Most existing methods require individual training on each dataset to deduce the relevant labels and lack the use of common knowledge across datasets, consequently restricting the quality of recognition. In this paper, we propose a universal cell nucleus classification framework (UniCell), which employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains. In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets. Moreover, we develop a Dynamic Prompt Module (DPM) that exploits the properties of multiple datasets to enhance features. The DPM first integrates the embeddings of datasets and semantic categories, and then employs the integrated prompts to refine image representations, efficiently harvesting the shared knowledge among the related cell types and data sources. Experimental results demonstrate that the proposed method effectively achieves the state-of-the-art results on four nucleus detection and classification benchmarks. Code and models are available at https://github.com/lhaof/UniCell \ No newline at end of file diff --git a/data/2024/aaai/UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models b/data/2024/aaai/UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models new file mode 100644 index 0000000000..13ec965a5a --- /dev/null +++ b/data/2024/aaai/UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models @@ -0,0 +1 @@ +Generative information retrieval, encompassing two major tasks of Generative Document Retrieval (GDR) and Grounded Answer Generation (GAR), has gained significant attention in natural language processing. Existing methods for GDR and GAR rely on separate retrieval and reader modules, which hinder simultaneous optimization. To overcome this, we present UniGen, a Unified Generative framework for retrieval and question answering that integrates both tasks into a single generative model leveraging the capabilities of large language models. 
UniGen employs a shared encoder and two distinct decoders for generative retrieval and question answering. To facilitate the learning of both tasks, we introduce connectors, generated by large language models, to bridge the gaps between query inputs and generation targets, as well as between document identifiers and answers. Furthermore, we propose an iterative enhancement strategy that leverages generated answers and retrieved documents to iteratively improve both tasks. Through extensive experiments on the MS MARCO and NQ datasets, we demonstrate the effectiveness of UniGen, showcasing its superior performance in both retrieval and question answering tasks. \ No newline at end of file diff --git a/data/2024/aaai/Unified Framework for Diffusion Generative Models in SO(3): Applications in Computer Vision and Astrophysics b/data/2024/aaai/Unified Framework for Diffusion Generative Models in SO(3): Applications in Computer Vision and Astrophysics new file mode 100644 index 0000000000..c2b6fdc3af --- /dev/null +++ b/data/2024/aaai/Unified Framework for Diffusion Generative Models in SO(3): Applications in Computer Vision and Astrophysics @@ -0,0 +1 @@ +Diffusion-based generative models represent the current state-of-the-art for image generation. However, standard diffusion models are based on Euclidean geometry and do not translate directly to manifold-valued data. In this work, we develop extensions of both score-based generative models (SGMs) and Denoising Diffusion Probabilistic Models (DDPMs) to the Lie group of 3D rotations, SO(3). SO(3) is of particular interest in many disciplines such as robotics, biochemistry and astronomy/cosmology science. Contrary to more general Riemannian manifolds, SO(3) admits a tractable solution to heat diffusion, and allows us to implement efficient training of diffusion models. We apply both SO(3) DDPMs and SGMs to synthetic densities on SO(3) and demonstrate state-of-the-art results. Additionally, we demonstrate the practicality of our model on pose estimation tasks and in predicting correlated galaxy orientations for astrophysics/cosmology. \ No newline at end of file diff --git a/data/2024/aaai/Unify Named Entity Recognition Scenarios via Contrastive Real-Time Updating Prototype b/data/2024/aaai/Unify Named Entity Recognition Scenarios via Contrastive Real-Time Updating Prototype new file mode 100644 index 0000000000..a9b9e05fa5 --- /dev/null +++ b/data/2024/aaai/Unify Named Entity Recognition Scenarios via Contrastive Real-Time Updating Prototype @@ -0,0 +1 @@ +Supervised named entity recognition (NER) aims to classify entity mentions into a fixed number of pre-defined types. However, in real-world scenarios, unknown entity types continually emerge. Naive fine-tuning will result in catastrophic forgetting on old entity types. Existing continual methods usually depend on knowledge distillation to alleviate forgetting, which is less effective on long task sequences. Moreover, most of them are specific to the class-incremental scenario and cannot adapt to the online scenario, which is more common in practice. In this paper, we propose a unified framework called Contrastive Real-time Updating Prototype (CRUP) that can handle different scenarios for NER. Specifically, we train a Gaussian projection model with a regularized contrastive objective. After training on each batch, we store the mean vectors of representations belonging to new entity types as their prototypes. 
Meanwhile, we update existing prototypes belonging to old types based only on representations of the current batch. The final prototypes are used for nearest-class-mean classification. In this way, CRUP can handle different scenarios through its batch-wise learning. Moreover, CRUP can alleviate forgetting in continual scenarios only with current data instead of old data. To comprehensively evaluate CRUP, we construct extensive benchmarks based on various datasets. Experimental results show that CRUP significantly outperforms baselines in continual scenarios and is also competitive in the supervised scenario. \ No newline at end of file diff --git a/data/2024/aaai/Unifying Decision and Function Queries in Stochastic Boolean Satisfiability b/data/2024/aaai/Unifying Decision and Function Queries in Stochastic Boolean Satisfiability new file mode 100644 index 0000000000..e267c45067 --- /dev/null +++ b/data/2024/aaai/Unifying Decision and Function Queries in Stochastic Boolean Satisfiability @@ -0,0 +1 @@ +Stochastic Boolean satisfiability (SSAT) is a natural formalism for optimization under uncertainty. Its decision version implicitly imposes a final threshold quantification on an SSAT formula. However, the single threshold quantification restricts the expressive power of SSAT. In this work, we enrich SSAT with an additional threshold quantifier, resulting in a new formalism SSAT(θ). The increased expressiveness allows SSAT(θ), which remains in the PSPACE complexity class, to subsume and encode the languages in the counting hierarchy. An SSAT(θ) solver, ClauSSat(θ), is developed. Experiments show the applicability of the solver in uniquely solving complex SSAT(θ) instances of parameter synthesis and SSAT extension. \ No newline at end of file diff --git a/data/2024/aaai/Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification b/data/2024/aaai/Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification new file mode 100644 index 0000000000..8e8cb42a73 --- /dev/null +++ b/data/2024/aaai/Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification @@ -0,0 +1 @@ +Text-to-Image person re-identification (TI-ReID) aims to retrieve images of a target identity according to a given textual description. The existing methods in TI-ReID focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize the image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, leading to limited image-text relationship expression and semantic alignment. To address the above problem, in this paper, we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of pedestrians as Gaussian distributions, where the multi-granularity uncertainty of the distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. The multi-modal uncertainty modeling acts as a feature augmentation and provides a richer image-text semantic relationship. Then we present a bi-directional cross-modal circle loss to more effectively align the probabilistic features between image and text in a self-paced manner.
To further promote more comprehensive image-text semantic alignment, we design a task that complements the masked language modeling, focusing on the cross-modality semantic recovery of global masked token after cross-modal interaction. Extensive experiments conducted on three TI-ReID datasets highlight the effectiveness and superiority of our method over state-of-the-arts. \ No newline at end of file diff --git a/data/2024/aaai/Union Subgraph Neural Networks b/data/2024/aaai/Union Subgraph Neural Networks new file mode 100644 index 0000000000..f718c7734b --- /dev/null +++ b/data/2024/aaai/Union Subgraph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) are widely used for graph representation learning in many application domains. The expressiveness of vanilla GNNs is upper-bounded by 1-dimensional Weisfeiler-Leman (1-WL) test as they operate on rooted subtrees through iterative message passing. In this paper, we empower GNNs by injecting neighbor-connectivity information extracted from a new type of substructure. We first investigate different kinds of connectivities existing in a local neighborhood and identify a substructure called union subgraph, which is able to capture the complete picture of the 1-hop neighborhood of an edge. We then design a shortest-path-based substructure descriptor that possesses three nice properties and can effectively encode the high-order connectivities in union subgraphs. By infusing the encoded neighbor connectivities, we propose a novel model, namely Union Subgraph Neural Network (UnionSNN), which is proven to be strictly more powerful than 1-WL in distinguishing non-isomorphic graphs. Additionally, the local encoding from union subgraphs can also be injected into arbitrary message-passing neural networks (MPNNs) and Transformer-based models as a plugin. Extensive experiments on 18 benchmarks of both graph-level and node-level tasks demonstrate that UnionSNN outperforms state-of-the-art baseline models, with competitive computational efficiency. The injection of our local encoding to existing models is able to boost the performance by up to 11.09%. Our code is available at https://github.com/AngusMonroe/UnionSNN. \ No newline at end of file diff --git a/data/2024/aaai/Unit Selection with Nonbinary Treatment and Effect b/data/2024/aaai/Unit Selection with Nonbinary Treatment and Effect new file mode 100644 index 0000000000..d8828154ee --- /dev/null +++ b/data/2024/aaai/Unit Selection with Nonbinary Treatment and Effect @@ -0,0 +1 @@ +The unit selection problem aims to identify a set of individuals who are most likely to exhibit a desired mode of behavior or to evaluate the percentage of such individuals in a given population, for example, selecting individuals who would respond one way if encouraged and a different way if not encouraged. Using a combination of experimental and observational data, Li and Pearl solved the binary unit selection problem (binary treatment and effect) by deriving tight bounds on the "benefit function," which is the payoff/cost associated with selecting an individual with given characteristics. This paper extends the benefit function to the general form such that the treatment and effect are not restricted to binary. We then propose an algorithm to test the identifiability of the nonbinary benefit function and an algorithm to compute the bounds of the nonbinary benefit function using experimental and observational data. 
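To make the union-subgraph construction described above concrete, the following is a minimal Python sketch with networkx; the closed 1-hop induced subgraph and the shortest-path-length histogram used as a descriptor are illustrative assumptions, not the UnionSNN reference implementation.

```python
# Illustrative sketch (assumptions, not the UnionSNN reference code):
# for an edge (u, v), take the subgraph induced by the closed 1-hop
# neighborhoods of both endpoints, then describe it with a histogram
# of pairwise shortest-path lengths inside that subgraph.
import networkx as nx
from collections import Counter

def union_subgraph(G: nx.Graph, u, v) -> nx.Graph:
    nodes = set(G.neighbors(u)) | set(G.neighbors(v)) | {u, v}
    return G.subgraph(nodes).copy()

def shortest_path_descriptor(S: nx.Graph, max_len: int = 4) -> list:
    # Bin k counts node pairs at shortest-path distance k; distances larger
    # than max_len and disconnected pairs fall into the last bin.
    lengths = dict(nx.all_pairs_shortest_path_length(S))
    counts = Counter()
    nodes = list(S.nodes)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            d = lengths[a].get(b)
            counts[min(d, max_len) if d is not None else max_len] += 1
    return [counts.get(k, 0) for k in range(1, max_len + 1)]

G = nx.karate_club_graph()
S = union_subgraph(G, 0, 1)
print(S.number_of_nodes(), shortest_path_descriptor(S))
```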
\ No newline at end of file diff --git a/data/2024/aaai/United We Stand: Accelerating Privacy-Preserving Neural Inference by Conjunctive Optimization with Interleaved Nexus b/data/2024/aaai/United We Stand: Accelerating Privacy-Preserving Neural Inference by Conjunctive Optimization with Interleaved Nexus new file mode 100644 index 0000000000..f13ef5046b --- /dev/null +++ b/data/2024/aaai/United We Stand: Accelerating Privacy-Preserving Neural Inference by Conjunctive Optimization with Interleaved Nexus @@ -0,0 +1 @@ +Privacy-preserving Machine Learning as a Service (MLaaS) enables the powerful cloud server to run its well-trained neural model upon the input from resource-limited client, with both of server's model parameters and client's input data protected. While computation efficiency is critical for the practical implementation of privacy-preserving MLaaS and it is inspiring to witness recent advances towards efficiency improvement, there still exists a significant performance gap to real-world applications. In general, state-of-the-art frameworks perform function-wise efficiency optimization based on specific cryptographic primitives. Although it is logical, such independent optimization for each function makes noticeable amount of expensive operations unremovable and misses the opportunity to further accelerate the performance by jointly considering privacy-preserving computation among adjacent functions. As such, we propose COIN: Conjunctive Optimization with Interleaved Nexus, which remodels mainstream computation for each function to conjunctive counterpart for composite function, with a series of united optimization strategies. Specifically, COIN jointly computes a pair of consecutive nonlinear-linear functions in the neural model by reconstructing the intermediates throughout the whole procedure, which not only eliminates the most expensive crypto operations without invoking extra encryption enabler, but also makes the online crypto complexity independent of filter size. Experimentally, COIN demonstrates 11.2x to 29.6x speedup over various function dimensions from modern networks, and 6.4x to 12x speedup on the total computation time when applied in networks with model input from small-scale CIFAR10 to large-scale ImageNet. \ No newline at end of file diff --git a/data/2024/aaai/United We Stand: Using Epoch-Wise Agreement of Ensembles to Combat Overfit b/data/2024/aaai/United We Stand: Using Epoch-Wise Agreement of Ensembles to Combat Overfit new file mode 100644 index 0000000000..1e21f442e8 --- /dev/null +++ b/data/2024/aaai/United We Stand: Using Epoch-Wise Agreement of Ensembles to Combat Overfit @@ -0,0 +1,3 @@ +Deep neural networks have become the method of choice for solving many classification tasks, largely because they can fit very complex functions defined over raw data. The downside of such powerful learners is the danger of overfit. In this paper, we introduce a novel ensemble classifier for deep networks that effectively overcomes overfitting by combining models generated at specific intermediate epochs during training. Our method allows for the incorporation of useful knowledge obtained by the models during the overfitting phase without deterioration of the general performance, which is usually missed when early stopping is used. + +To motivate this approach, we begin with the theoretical analysis of a regression model, whose prediction - that the variance among classifiers increases when overfit occurs - is demonstrated empirically in deep networks in common use. 
Guided by these results, we construct a new ensemble-based prediction method, where the prediction is determined by the class that attains the most consensual prediction throughout the training epochs. Using multiple image and text classification datasets, we show that when regular ensembles suffer from overfit, our method eliminates the harmful reduction in generalization due to overfit, and often even surpasses the performance obtained by early stopping. Our method is easy to implement and can be integrated with any training scheme and architecture, without additional prior knowledge beyond the training set. It is thus a practical and useful tool to overcome overfit. \ No newline at end of file diff --git a/data/2024/aaai/Universal Weak Coreset b/data/2024/aaai/Universal Weak Coreset new file mode 100644 index 0000000000..1c2648d0bc --- /dev/null +++ b/data/2024/aaai/Universal Weak Coreset @@ -0,0 +1 @@ +Coresets for k-means and k-median problems yield a small summary of the data, which preserves the clustering cost with respect to any set of k centers. Recently coresets have also been constructed for constrained k-means and k-median problems. However, the notion of coresets has the drawback that (i) they can only be applied in settings where the input points are allowed to have weights, and (ii) in general metric spaces, the size of the coresets can depend logarithmically on the number of points. The notion of weak coresets, which has less stringent requirements than coresets, has been studied in the context of classical k-means and k-median problems. A weak coreset is a pair (J,S) of subsets of points, where S acts as a summary of the point set and J as a set of potential centers. This pair satisfies the properties that (i) S is a good summary of the data as long as the k centers are chosen from J only, and (ii) there is a good choice of k centers in J with a cost close to the optimal cost. We develop this framework, which we call universal weak coresets, for constrained clustering settings. In conjunction with recent coreset constructions for constrained settings, our designs give greater data compression, are conceptually simpler, and apply to a wide range of constrained k-median and k-means problems. \ No newline at end of file diff --git a/data/2024/aaai/Unknown-Aware Graph Regularization for Robust Semi-supervised Learning from Uncurated Data b/data/2024/aaai/Unknown-Aware Graph Regularization for Robust Semi-supervised Learning from Uncurated Data new file mode 100644 index 0000000000..32837c793f --- /dev/null +++ b/data/2024/aaai/Unknown-Aware Graph Regularization for Robust Semi-supervised Learning from Uncurated Data @@ -0,0 +1 @@ +Recent advances in semi-supervised learning (SSL) have relied on the optimistic assumption that labeled and unlabeled data share the same class distribution. However, this assumption is often violated in real-world scenarios, where unlabeled data may contain out-of-class samples. SSL with such uncurated unlabeled data leads training models to be corrupted. In this paper, we propose a robust SSL method for learning from uncurated real-world data within the context of open-set semi-supervised learning (OSSL). Unlike previous works that rely on feature similarity distance, our method exploits uncertainty in logits. By leveraging task-dependent predictions of logits, our method is capable of robust learning even in the presence of highly correlated outliers. 
Our key contribution is to present an unknown-aware graph regularization (UAG), a novel technique that enhances the performance of uncertainty-based OSSL frameworks. The technique addresses not only the conflict between training objectives for inliers and outliers but also the limitation of applying the same training rule to all outlier classes, both of which exist in previous uncertainty-based approaches. Extensive experiments demonstrate that UAG surpasses state-of-the-art OSSL methods by a large margin across various protocols. Code is available at https://github.com/heejokong/UAGreg. \ No newline at end of file diff --git a/data/2024/aaai/Unlocking the Power of Open Set: A New Perspective for Open-Set Noisy Label Learning b/data/2024/aaai/Unlocking the Power of Open Set: A New Perspective for Open-Set Noisy Label Learning new file mode 100644 index 0000000000..fb23300636 --- /dev/null +++ b/data/2024/aaai/Unlocking the Power of Open Set: A New Perspective for Open-Set Noisy Label Learning @@ -0,0 +1 @@ +Learning from noisy data has attracted much attention, where most methods focus on closed-set label noise. However, a more common scenario in the real world is the presence of both open-set and closed-set noise. Existing methods typically identify and handle these two types of label noise separately by designing a specific strategy for each type. However, in many real-world scenarios, it would be challenging to identify open-set examples, especially when the dataset has been severely corrupted. Unlike previous works, we explore how models behave when faced with open-set examples, and find that some open-set examples gradually get integrated into certain known classes, which is beneficial for the separation among known classes. Motivated by this phenomenon, we propose a novel two-step contrastive learning method, CECL (Class Expansion Contrastive Learning), which aims to deal with both types of label noise by exploiting the useful information of open-set examples. Specifically, we incorporate some open-set examples into closed-set classes to enhance performance while treating others as delimiters to improve representative ability. Extensive experiments on synthetic and real-world datasets with diverse label noise demonstrate the effectiveness of CECL. \ No newline at end of file diff --git a/data/2024/aaai/Unplugged K-12 AI Learning: Exploring Representation and Reasoning with a Facial Recognition Game b/data/2024/aaai/Unplugged K-12 AI Learning: Exploring Representation and Reasoning with a Facial Recognition Game new file mode 100644 index 0000000000..a38be68c43 --- /dev/null +++ b/data/2024/aaai/Unplugged K-12 AI Learning: Exploring Representation and Reasoning with a Facial Recognition Game @@ -0,0 +1 @@ +With the growing prevalence of AI, the need for K-12 AI education is becoming more crucial, which is prompting active research in developing engaging and age-appropriate AI learning activities. Efforts are underway, such as those by the AI4K12 initiative, to establish guidelines for organizing K-12 AI education; however, effective instructional resources are needed by educators. In this paper, we describe our work to design, develop, and implement an unplugged activity centered on facial recognition technology for middle school students. Facial recognition is integrated into a wide range of applications throughout daily life, which makes it a familiar and engaging tool for students and an effective medium for conveying AI concepts.
Our unplugged activity, “Guess Whose Face,” is designed as a board game that focuses on Representation and Reasoning from AI4K12’s 5 Big Ideas in AI. The game is crafted to enable students to develop AI competencies naturally through physical interaction. In the game, one student uses tracing paper to extract facial features from a familiar face shown on a card, such as a cartoon character or celebrity, and then other students try to guess the identity of the hidden face. We discuss details of the game, its iterative refinement, and initial findings from piloting the activity during a summer camp for rural middle school students. \ No newline at end of file diff --git a/data/2024/aaai/Unraveling Batch Normalization for Realistic Test-Time Adaptation b/data/2024/aaai/Unraveling Batch Normalization for Realistic Test-Time Adaptation new file mode 100644 index 0000000000..a45cc53975 --- /dev/null +++ b/data/2024/aaai/Unraveling Batch Normalization for Realistic Test-Time Adaptation @@ -0,0 +1 @@ +While recent test-time adaptations exhibit efficacy by adjusting batch normalization to narrow domain disparities, their effectiveness diminishes with realistic mini-batches due to inaccurate target estimation. As previous attempts merely introduce source statistics to mitigate this issue, the fundamental problem of inaccurate target estimation still persists, leaving the intrinsic test-time domain shifts unresolved. This paper delves into the problem of mini-batch degradation. By unraveling batch normalization, we discover that the inexact target statistics largely stem from the substantially reduced class diversity in batch. Drawing upon this insight, we introduce a straightforward tool, Test-time Exponential Moving Average (TEMA), to bridge the class diversity gap between training and testing batches. Importantly, our TEMA adaptively extends the scope of typical methods beyond the current batch to incorporate a diverse set of class information, which in turn boosts an accurate target estimation. Built upon this foundation, we further design a novel layer-wise rectification strategy to consistently promote test-time performance. Our proposed method enjoys a unique advantage as it requires neither training nor tuning parameters, offering a truly hassle-free solution. It significantly enhances model robustness against shifted domains and maintains resilience in diverse real-world scenarios with various batch sizes, achieving state-of-the-art performance on several major benchmarks. Code is available at https://github.com/kiwi12138/RealisticTTA. \ No newline at end of file diff --git a/data/2024/aaai/Unraveling Pain Levels: A Data-Uncertainty Guided Approach for Effective Pain Assessment b/data/2024/aaai/Unraveling Pain Levels: A Data-Uncertainty Guided Approach for Effective Pain Assessment new file mode 100644 index 0000000000..c3f185d5e4 --- /dev/null +++ b/data/2024/aaai/Unraveling Pain Levels: A Data-Uncertainty Guided Approach for Effective Pain Assessment @@ -0,0 +1 @@ +Pain, a primary reason for seeking medical help, requires essential pain assessment for effective management. Studies have recognized electrodermal activity (EDA) signaling's potential for automated pain assessment, but traditional algorithms often ignore the noise and uncertainty inherent in pain data. To address this, we propose a learning framework predicated on data uncertainty, introducing two forms: a) subject-level stimulation-reaction drift; b) ambiguity in self-reporting scores. 
We formulate an uncertainty assessment using Heart Rate Variability (HRV) features to guide the selection of responsive pain profiles and reweight subtask importance based on the vagueness of self-reported data. These methods are integrated within an end-to-end neural network learning paradigm, focusing the detector on more accurate insights within the uncertainty domain. Extensive experimentation on both the publicly available BioVid dataset and the proprietary Apon dataset demonstrates our approach's effectiveness. On the BioVid dataset, we achieved a 6% enhancement over the state-of-the-art methodology, and on the Apon dataset, our method outperformed baseline approaches by over 20%. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Action Segmentation via Fast Learning of Semantically Consistent Actoms b/data/2024/aaai/Unsupervised Action Segmentation via Fast Learning of Semantically Consistent Actoms new file mode 100644 index 0000000000..908285e0cd --- /dev/null +++ b/data/2024/aaai/Unsupervised Action Segmentation via Fast Learning of Semantically Consistent Actoms @@ -0,0 +1 @@ +Action segmentation serves as a pivotal component in comprehending videos, encompassing the learning of a sequence of semantically consistent action units known as actoms. Conventional methodologies tend to require significant time for both the training and learning phases. This paper introduces an innovative unsupervised framework for action segmentation in video, characterized by its fast learning capability and absence of mandatory training. The core idea involves splitting the video into distinct actoms, which are then merged based on shared actions. The key challenge here is to prevent the inadvertent creation of singular actoms that attempt to represent multiple actions during the splitting phase. Additionally, it is crucial to avoid situations where actoms associated with the same action are incorrectly grouped into multiple clusters during the merging phase. In this paper, we present a method for calculating the similarity between adjacent frames under a subspace assumption. Then, we employ a local minimum searching procedure, which effectively splits the video into coherent actoms aligned with their semantic meaning and provides an action segmentation proposal. Subsequently, we calculate a spatio-temporal similarity between actoms, followed by developing a merging process to merge actoms representing identical actions within the action segmentation proposals. Our approach is evaluated on four benchmark datasets, and the results demonstrate that our method achieves state-of-the-art performance. Moreover, our method achieves the best balance between accuracy and learning time compared to existing unsupervised techniques. Code is available at https://github.com/y66y/SaM. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Continual Anomaly Detection with Contrastively-Learned Prompt b/data/2024/aaai/Unsupervised Continual Anomaly Detection with Contrastively-Learned Prompt new file mode 100644 index 0000000000..a162a4114f --- /dev/null +++ b/data/2024/aaai/Unsupervised Continual Anomaly Detection with Contrastively-Learned Prompt @@ -0,0 +1 @@ +Unsupervised Anomaly Detection (UAD) with incremental training is crucial in industrial manufacturing, as unpredictable defects make obtaining sufficient labeled data infeasible.
However, continual learning methods primarily rely on supervised annotations, while the application in UAD is limited due to the absence of supervision. Current UAD methods train separate models for different classes sequentially, leading to catastrophic forgetting and a heavy computational burden. To address this issue, we introduce a novel Unsupervised Continual Anomaly Detection framework called UCAD, which equips the UAD with continual learning capability through contrastively-learned prompts. In the proposed UCAD, we design a Continual Prompting Module (CPM) by utilizing a concise key-prompt-knowledge memory bank to guide task-invariant 'anomaly' model predictions using task-specific 'normal' knowledge. Moreover, Structure-based Contrastive Learning (SCL) is designed with the Segment Anything Model (SAM) to improve prompt learning and anomaly segmentation results. Specifically, by treating SAM's masks as structure, we draw features within the same mask closer and push others apart for general feature representations. We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation, demonstrating that our method is significantly better than anomaly detection methods, even with rehearsal training. The code will be available at https://github.com/shirowalker/UCAD. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Cross-Domain Image Retrieval via Prototypical Optimal Transport b/data/2024/aaai/Unsupervised Cross-Domain Image Retrieval via Prototypical Optimal Transport new file mode 100644 index 0000000000..65e8c34b65 --- /dev/null +++ b/data/2024/aaai/Unsupervised Cross-Domain Image Retrieval via Prototypical Optimal Transport @@ -0,0 +1 @@ +Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images sharing the same category across diverse domains without relying on labeled data. Prior approaches have typically decomposed the UCIR problem into two distinct tasks: intra-domain representation learning and cross-domain feature alignment. However, these segregated strategies overlook the potential synergies between these tasks. This paper introduces ProtoOT, a novel Optimal Transport formulation explicitly tailored for UCIR, which integrates intra-domain feature representation learning and cross-domain alignment into a unified framework. ProtoOT leverages the strengths of the K-means clustering method to effectively manage distribution imbalances inherent in UCIR. By utilizing K-means for generating initial prototypes and approximating class marginal distributions, we modify the constraints in Optimal Transport accordingly, significantly enhancing its performance in UCIR scenarios. Furthermore, we incorporate contrastive learning into the ProtoOT framework to further improve representation learning. This encourages local semantic consistency among features with similar semantics, while also explicitly enforcing separation between features and unmatched prototypes, thereby enhancing global discriminativeness. ProtoOT surpasses existing state-of-the-art methods by a notable margin across benchmark datasets. Notably, on DomainNet, ProtoOT achieves an average P@200 enhancement of 24.44%, and on Office-Home, it demonstrates a P@15 improvement of 12.12%. Code is available at https://github.com/HCVLAB/ProtoOT. 
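As a rough illustration of how K-means prototypes and non-uniform class marginals can enter an entropic optimal-transport assignment of the kind sketched in the ProtoOT abstract above, here is a minimal numpy sketch; the cosine cost, the marginal estimate, and the Sinkhorn parameters are assumptions for illustration, not ProtoOT's actual formulation.

```python
# Illustrative sketch (assumptions, not the ProtoOT implementation):
# assign unlabeled features to K-means-style prototypes via entropic
# optimal transport, with the prototype-side marginal taken from rough
# cluster-size estimates instead of a uniform distribution.
import numpy as np

def sinkhorn(cost, r, c, eps=0.1, n_iter=200):
    # cost: (n, k) pairwise costs; r: (n,) source marginal; c: (k,) target marginal
    K = np.exp(-cost / eps)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan; rows give soft assignments

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos = feats[rng.choice(500, 10, replace=False)]        # stand-in for K-means centroids

cost = 1.0 - feats @ protos.T                             # cosine distance in [0, 2]
sizes = np.bincount(cost.argmin(axis=1), minlength=10) + 1
c = sizes / sizes.sum()                                   # imbalance-aware target marginal
r = np.full(500, 1.0 / 500)

plan = sinkhorn(cost, r, c)
print(plan.shape, np.bincount(plan.argmax(axis=1), minlength=10))
```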
\ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Domain Adaptative Temporal Sentence Localization with Mutual Information Maximization b/data/2024/aaai/Unsupervised Domain Adaptative Temporal Sentence Localization with Mutual Information Maximization new file mode 100644 index 0000000000..67404bd7b9 --- /dev/null +++ b/data/2024/aaai/Unsupervised Domain Adaptative Temporal Sentence Localization with Mutual Information Maximization @@ -0,0 +1 @@ +Temporal sentence localization (TSL) aims to localize a target segment in a video according to a given sentence query. Though prior works have made notable progress on this task, they rely heavily on abundant yet expensive manual annotations for training. Moreover, these trained data-dependent models usually cannot generalize well to unseen scenarios because of the inherent domain shift. To address this issue, in this paper, we target another more practical but challenging setting: unsupervised domain adaptative temporal sentence localization (UDA-TSL), which explores whether the localization knowledge can be transferred from a fully-annotated data domain (source domain) to a new unannotated data domain (target domain). Particularly, we propose an effective and novel baseline for UDA-TSL to bridge the multi-modal gap across different domains and learn the potential correspondence between the video-query pairs in the target domain. We first develop separate modality-specific domain adaptation modules to smoothly balance the minimization of the domain shifts in cross-dataset video and query domains. Then, to fully exploit the semantic correspondence of both modalities in the target domain for unsupervised localization, we devise a mutual information learning module to adaptively align the video-query pairs which are more likely to be relevant in the target domain, leading to more truly aligned target pairs and ensuring the discriminability of target features. In this way, our model can learn domain-invariant and semantic-aligned cross-modal representations. Three sets of migration experiments show that our model achieves competitive performance compared to existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Extractive Summarization with Learnable Length Control Strategies b/data/2024/aaai/Unsupervised Extractive Summarization with Learnable Length Control Strategies new file mode 100644 index 0000000000..7541edfbf6 --- /dev/null +++ b/data/2024/aaai/Unsupervised Extractive Summarization with Learnable Length Control Strategies @@ -0,0 +1 @@ +Unsupervised extractive summarization is an important technique in information extraction and retrieval. Compared with supervised methods, it does not require high-quality human-labelled summaries for training and thus can be easily applied to documents of different types, domains, or languages. Most existing unsupervised methods, including TextRank and PACSUM, rely on graph-based ranking of sentence centrality. However, this scorer cannot be directly applied in end-to-end training, and a position-related prior assumption is often needed to achieve good summaries. In addition, less attention has been paid to length-controllable extractors, where users can decide to summarize texts under a particular length constraint. This paper introduces an unsupervised extractive summarization model based on a siamese network, for which we develop a trainable bidirectional prediction objective between the selected summary and the original document.
Different from centrality-based ranking methods, our extractive scorer can be trained in an end-to-end manner without requiring any positional assumption. In addition, we introduce a differentiable length-control module that approximates a 0-1 knapsack solver for end-to-end length-controllable extraction. Experiments show that our unsupervised method largely outperforms the centrality-based baseline using the same sentence encoder. In terms of length control ability, via our trainable knapsack module, the performance consistently outperforms the strong baseline without utilizing end-to-end training. Human evaluation further confirms that our method performs best among the baselines in terms of relevance and consistency. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Gene-Cell Collective Representation Learning with Optimal Transport b/data/2024/aaai/Unsupervised Gene-Cell Collective Representation Learning with Optimal Transport new file mode 100644 index 0000000000..519c1e80e0 --- /dev/null +++ b/data/2024/aaai/Unsupervised Gene-Cell Collective Representation Learning with Optimal Transport @@ -0,0 +1 @@ +Cell type identification plays a vital role in single-cell RNA sequencing (scRNA-seq) data analysis. Although many deep embedded methods to cluster scRNA-seq data have been proposed, they still fail to elucidate the intrinsic properties of cells and genes. Here, we present a novel end-to-end deep graph clustering model for single-cell transcriptomics data based on unsupervised Gene-Cell Collective representation learning and Optimal Transport (scGCOT), which integrates both cell and gene correlations. Specifically, scGCOT learns the latent embedding of cells and genes simultaneously and reconstructs the cell graph, the gene graph, and the gene expression count matrix. A zero-inflated negative binomial (ZINB) model is estimated via the reconstructed count matrix to capture the essential properties of scRNA-seq data. By leveraging the optimal transport-based joint representation alignment, scGCOT learns the clustering process and the latent representations through a mutually supervised self-optimization strategy. Extensive experiments with 14 competing methods on 15 real scRNA-seq datasets demonstrate the competitive edges of scGCOT. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Group Re-identification via Adaptive Clustering-Driven Progressive Learning b/data/2024/aaai/Unsupervised Group Re-identification via Adaptive Clustering-Driven Progressive Learning new file mode 100644 index 0000000000..8bf44b6585 --- /dev/null +++ b/data/2024/aaai/Unsupervised Group Re-identification via Adaptive Clustering-Driven Progressive Learning @@ -0,0 +1 @@ +Group re-identification (G-ReID) aims to correctly associate groups with the same members captured by different cameras. However, supervised approaches for this task often suffer from the high cost of cross-camera sample labeling. Unsupervised methods based on clustering can avoid sample labeling, but the problem of member variations often makes clustering unstable, leading to incorrect pseudo-labels. To address these challenges, we propose an adaptive clustering-driven progressive learning approach (ACPL), which consists of a group adaptive clustering (GAC) module and a global dynamic prototype update (GDPU) module. Specifically, GAC defines a quasi-distance between groups, thus fully capitalizing on both individual-level and holistic information within groups.
In the case of great uncertainty in intra-group members, GAC effectively minimizes the impact of non-discriminative features and reduces the noise in the model's pseudo-labels. Additionally, our GDPU devises a dynamic weight to update the prototypes and effectively mine the hard samples with complex member variations, which improves the model's robustness. Extensive experiments conducted on four popular G-ReID datasets demonstrate that our method not only achieves state-of-the-art performance on unsupervised G-ReID but also performs comparably to several fully supervised approaches. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Layer-Wise Score Aggregation for Textual OOD Detection b/data/2024/aaai/Unsupervised Layer-Wise Score Aggregation for Textual OOD Detection new file mode 100644 index 0000000000..c5e0c3b91c --- /dev/null +++ b/data/2024/aaai/Unsupervised Layer-Wise Score Aggregation for Textual OOD Detection @@ -0,0 +1 @@ +Out-of-distribution (OOD) detection is a rapidly growing field due to new robustness and security requirements driven by an increased number of AI-based systems. Existing OOD textual detectors often rely on anomaly scores (\textit{e.g.}, Mahalanobis distance) computed on the embedding output of the last layer of the encoder. In this work, we observe that OOD detection performance varies greatly depending on the task and layer output. More importantly, we show that the usual choice (the last layer) is rarely the best one for OOD detection and that far better results can be achieved, provided that an oracle selects the best layer. We propose a data-driven, unsupervised method to leverage this observation to combine layer-wise anomaly scores. In addition, we extend classical textual OOD benchmarks by including classification tasks with a more significant number of classes (up to 150), which reflects more realistic settings. On this augmented benchmark, we show that the proposed post-aggregation methods achieve robust and consistent results comparable to using the best layer according to an oracle while removing manual feature selection altogether. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Neighborhood Propagation Kernel Layers for Semi-supervised Node Classification b/data/2024/aaai/Unsupervised Neighborhood Propagation Kernel Layers for Semi-supervised Node Classification new file mode 100644 index 0000000000..c3faa9f8c2 --- /dev/null +++ b/data/2024/aaai/Unsupervised Neighborhood Propagation Kernel Layers for Semi-supervised Node Classification @@ -0,0 +1 @@ +We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. The method is built of two main types of blocks: (i) We introduce unsupervised kernel machine layers propagating the node features in a one-hop neighborhood, using implicit node feature mappings. (ii) We specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. We derive an effective initialization scheme and efficient end-to-end training algorithm in the dual variables for the full architecture. The main idea underlying GCKM is that, because of the unsupervised core, the final model can achieve higher performance in semi-supervised node classification when few labels are available for training. Experimental results demonstrate the effectiveness of the proposed framework. 
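To illustrate the layer-wise anomaly-score aggregation idea from the textual OOD-detection abstract above, a toy numpy sketch follows; the per-layer Mahalanobis score and the z-normalized averaging are stand-in assumptions, not the paper's data-driven aggregator.

```python
# Toy sketch (assumptions, not the paper's data-driven aggregator):
# compute a Mahalanobis anomaly score per encoder layer, z-normalise each
# layer's scores against unlabeled in-domain data, and average across layers.
import numpy as np

def mahalanobis_scores(ref_feats, feats):
    mu = ref_feats.mean(axis=0)
    cov = np.cov(ref_feats, rowvar=False) + 1e-3 * np.eye(ref_feats.shape[1])
    prec = np.linalg.inv(cov)
    diff = feats - mu
    return np.einsum("nd,de,ne->n", diff, prec, diff)

def aggregate_layers(per_layer_ref, per_layer_test):
    combined = []
    for ref, test in zip(per_layer_ref, per_layer_test):
        s_ref = mahalanobis_scores(ref, ref)     # reference distribution of scores
        s_test = mahalanobis_scores(ref, test)
        combined.append((s_test - s_ref.mean()) / (s_ref.std() + 1e-8))
    return np.mean(combined, axis=0)             # higher value => more likely OOD

rng = np.random.default_rng(0)
ref_layers = [rng.normal(size=(1000, 16)) for _ in range(4)]          # in-domain embeddings
test_layers = [rng.normal(loc=1.5, size=(50, 16)) for _ in range(4)]  # shifted inputs
print(aggregate_layers(ref_layers, test_layers)[:5])
```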
\ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Object Interaction Learning with Counterfactual Dynamics Models b/data/2024/aaai/Unsupervised Object Interaction Learning with Counterfactual Dynamics Models new file mode 100644 index 0000000000..55d7fe36dc --- /dev/null +++ b/data/2024/aaai/Unsupervised Object Interaction Learning with Counterfactual Dynamics Models @@ -0,0 +1 @@ +We present COIL (Counterfactual Object Interaction Learning), a novel way of learning object-interaction skills in entity-centric environments. The goal is to learn primitive behaviors that can induce interactions without external reward or any supervision. Existing skill discovery methods are limited to locomotion, simple navigation tasks, or single-object manipulation tasks, mostly not inducing interaction between objects. Unlike a monolithic representation usually used in prior skill learning methods, we propose to use a structured goal representation that can query and scope which objects to interact with, which can serve as a basis for solving more complex downstream tasks. We design a novel counterfactual intrinsic reward through the use of either a forward model or successor features that can learn an interaction skill between a pair of objects given as a goal. Through experiments on continuous control environments such as Magnetic Block and 2.5-D Stacking Box, we demonstrate that an agent can learn object interaction behaviors (e.g., attaching or stacking one block to another) without any external rewards or domain-specific knowledge. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Pan-Sharpening via Mutually Guided Detail Restoration b/data/2024/aaai/Unsupervised Pan-Sharpening via Mutually Guided Detail Restoration new file mode 100644 index 0000000000..2c5e30a541 --- /dev/null +++ b/data/2024/aaai/Unsupervised Pan-Sharpening via Mutually Guided Detail Restoration @@ -0,0 +1 @@ +Pan-sharpening is a task that aims to super-resolve the low-resolution multispectral (LRMS) image with the guidance of a corresponding high-resolution panchromatic (PAN) image. The key challenge in pan-sharpening is to accurately model the relationship between the MS and PAN images. While supervised deep learning methods are commonly employed to address this task, the unavailability of ground truth severely limits their effectiveness. In this paper, we propose a mutually guided detail restoration method for unsupervised pan-sharpening. Specifically, we treat pan-sharpening as a blind image deblurring task, in which the blur kernel can be estimated by a CNN. Constrained by the blur kernel, the pan-sharpened image retains spectral information consistent with the LRMS image. Once the pan-sharpened image is obtained, the PAN image is blurred using a pre-defined blur operator. The pan-sharpened image, in turn, is used to guide the detail restoration of the blurred PAN image. By leveraging the mutual guidance between MS and PAN images, the pan-sharpening network can implicitly learn the spatial relationship between the two modalities. Extensive experiments show that the proposed method significantly outperforms existing unsupervised pan-sharpening methods.
\ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Training Sequence Design: Efficient and Generalizable Agent Training b/data/2024/aaai/Unsupervised Training Sequence Design: Efficient and Generalizable Agent Training new file mode 100644 index 0000000000..cb536f72c2 --- /dev/null +++ b/data/2024/aaai/Unsupervised Training Sequence Design: Efficient and Generalizable Agent Training @@ -0,0 +1 @@ +To train generalizable Reinforcement Learning (RL) agents, researchers recently proposed the Unsupervised Environment Design (UED) framework, in which a teacher agent creates a very large number of training environments and a student agent trains on the experiences in these environments to be robust against unseen testing scenarios. For example, to train a student to master the “stepping over stumps” task, the teacher will create numerous training environments with varying stump heights and shapes. In this paper, we argue that UED neglects training efficiency and its need for a very large number of environments (henceforth referred to as infinite horizon training) makes it less suitable for training robots and non-expert humans. In real-world applications where either creating new training scenarios is expensive or training efficiency is of critical importance, we want to maximize both the learning efficiency and learning outcome of the student. To achieve efficient finite horizon training, we propose a novel Markov Decision Process (MDP) formulation for the teacher agent, referred to as Unsupervised Training Sequence Design (UTSD). Specifically, we encode salient information from the student policy (e.g., behaviors and learning progress) into the teacher's state space, enabling the teacher to closely track the student's learning progress and consequently discover the optimal training sequences with finite lengths. Additionally, we explore the teacher's efficient adaptation to unseen students at test time by employing a context-based meta-learning approach, which leverages the teacher's past experiences with various students. Finally, we empirically demonstrate our teacher's capability to design efficient and effective training sequences for students with varying capabilities. \ No newline at end of file diff --git a/data/2024/aaai/Unveiling Details in the Dark: Simultaneous Brightening and Zooming for Low-Light Image Enhancement b/data/2024/aaai/Unveiling Details in the Dark: Simultaneous Brightening and Zooming for Low-Light Image Enhancement new file mode 100644 index 0000000000..dc6956b120 --- /dev/null +++ b/data/2024/aaai/Unveiling Details in the Dark: Simultaneous Brightening and Zooming for Low-Light Image Enhancement @@ -0,0 +1 @@ +Existing super-resolution methods exhibit limitations when applied to nighttime scenes, primarily due to their lack of adaptation to low dynamic range and noise-heavy dark-light images. In response, this research introduces an innovative customized framework to simultaneously Brighten and Zoom in low-resolution images captured in low-light conditions, dubbed BrZoNet. The core method begins by feeding low-light, low-resolution images and their corresponding ground truths into the Retinex-induced siamese decoupling network. This process yields distinct reflectance maps and illuminance maps, guided by supervision from the ground truth’s decomposition maps. Subsequently, these reflectance and illuminance maps transition into an intricate super-resolution sub-network.
This sub-network employs a meticulously designed cross-layer content-aware interactor - Illumination-aware Interaction Unit(IaIU), elegantly endowed with a gating mechanism. The IaIU facilitates meaningful feature interaction between illuminance and reflectance features while effectively reducing unwanted noise. An intricate super-resolution cage is also constructed to comprehensively integrate information, ultimately resulting in the generation of high-resolution images featuring intricate details. Thorough and diverse experiments validate the superiority of the proposed BrZoNet, surpassing contemporary cutting-edge technologies by proficiently augmenting brightness and intricately recovering complex details, showcasing advancements of 7.1% in PSNR, 2.4% in SSIM, and an impressive 36.8% in LPIPS metrics. \ No newline at end of file diff --git a/data/2024/aaai/Unveiling Implicit Deceptive Patterns in Multi-Modal Fake News via Neuro-Symbolic Reasoning b/data/2024/aaai/Unveiling Implicit Deceptive Patterns in Multi-Modal Fake News via Neuro-Symbolic Reasoning new file mode 100644 index 0000000000..a16a0379b5 --- /dev/null +++ b/data/2024/aaai/Unveiling Implicit Deceptive Patterns in Multi-Modal Fake News via Neuro-Symbolic Reasoning @@ -0,0 +1 @@ +In the current Internet landscape, the rampant spread of fake news, particularly in the form of multi-modal content, poses a great social threat. While automatic multi-modal fake news detection methods have shown promising results, the lack of explainability remains a significant challenge. Existing approaches provide superficial explainability by displaying learned important components or views from well-trained networks, but they often fail to uncover the implicit deceptive patterns that reveal how fake news is fabricated. To address this limitation, we begin by predefining three typical deceptive patterns, namely image manipulation, cross-modal inconsistency, and image repurposing, which shed light on the mechanisms underlying fake news fabrication. Then, we propose a novel Neuro-Symbolic Latent Model called NSLM, that not only derives accurate judgments on the veracity of news but also uncovers the implicit deceptive patterns as explanations. Specifically, the existence of each deceptive pattern is expressed as a two-valued learnable latent variable, which is acquired through amortized variational inference and weak supervision based on symbolic logic rules. Additionally, we devise pseudo-siamese networks to capture distinct deceptive patterns effectively. Experimental results on two real-world datasets demonstrate that our NSLM achieves the best performance in fake news detection while providing insightful explanations of deceptive patterns. \ No newline at end of file diff --git a/data/2024/aaai/Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning b/data/2024/aaai/Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning new file mode 100644 index 0000000000..83d1452f1b --- /dev/null +++ b/data/2024/aaai/Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning @@ -0,0 +1 @@ +Toddlers evolve from free exploration with sparse feedback to exploiting prior experiences for goal-directed learning with denser rewards. 
Drawing inspiration from this Toddler-Inspired Reward Transition, we set out to explore the implications of varying reward transitions when incorporated into Reinforcement Learning (RL) tasks. Central to our inquiry is the transition from sparse to potential-based dense rewards, which share optimal strategies regardless of reward changes. Through various experiments, including those in egocentric navigation and robotic arm manipulation tasks, we found that proper reward transitions significantly influence sample efficiency and success rates. Of particular note is the efficacy of the toddler-inspired Sparse-to-Dense (S2D) transition. Beyond these performance metrics, using Cross-Density Visualizer technique, we observed that transitions, especially the S2D, smooth the policy loss landscape, promoting wide minima that enhance generalization in RL models. \ No newline at end of file diff --git a/data/2024/aaai/Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability b/data/2024/aaai/Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability new file mode 100644 index 0000000000..1aea471009 --- /dev/null +++ b/data/2024/aaai/Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability @@ -0,0 +1 @@ +Automatic Essay Scoring (AES) is a well-established educational pursuit that employs machine learning to evaluate student-authored essays. While much effort has been made in this area, current research primarily focuses on either (i) boosting the predictive accuracy of an AES model for a specific prompt (i.e., developing prompt-specific models), which often heavily relies on the use of the labeled data from the same target prompt; or (ii) assessing the applicability of AES models developed on non-target prompts to the intended target prompt (i.e., developing the AES models in a cross-prompt setting). Given the inherent bias in machine learning and its potential impact on marginalized groups, it is imperative to investigate whether such bias exists in current AES methods and, if identified, how it intervenes with an AES model's accuracy and generalizability. Thus, our study aimed to uncover the intricate relationship between an AES model's accuracy, fairness, and generalizability, contributing practical insights for developing effective AES models in real-world education. To this end, we meticulously selected nine prominent AES methods and evaluated their performance using seven distinct metrics on an open-sourced dataset, which contains over 25,000 essays and various demographic information about students such as gender, English language learner status, and economic status. Through extensive evaluations, we demonstrated that: (1) prompt-specific models tend to outperform their cross-prompt counterparts in terms of predictive accuracy; (2) prompt-specific models frequently exhibit a greater bias towards students of different economic statuses compared to cross-prompt models; (3) in the pursuit of generalizability, traditional machine learning models (e.g., SVM) coupled with carefully engineered features hold greater potential for achieving both high accuracy and fairness than complex neural network models. 
\ No newline at end of file diff --git a/data/2024/aaai/Upper Bounding Barlow Twins: A Novel Filter for Multi-Relational Clustering b/data/2024/aaai/Upper Bounding Barlow Twins: A Novel Filter for Multi-Relational Clustering new file mode 100644 index 0000000000..4e12754218 --- /dev/null +++ b/data/2024/aaai/Upper Bounding Barlow Twins: A Novel Filter for Multi-Relational Clustering @@ -0,0 +1 @@ +Multi-relational clustering is a challenging task due to the fact that diverse semantic information conveyed in multi-layer graphs is difficult to extract and fuse. Recent methods integrate topology structure and node attribute information through graph filtering. However, they often use a low-pass filter without fully considering the correlation among multiple graphs. To overcome this drawback, we propose to learn a graph filter motivated by the theoretical analysis of Barlow Twins. We find that input with a negative semi-definite inner product provides a lower bound for Barlow Twins loss, which prevents it from reaching a better solution. We thus learn a filter that yields an upper bound for Barlow Twins. Afterward, we design a simple clustering architecture and demonstrate its state-of-the-art performance on four benchmark datasets. The source code is available at https://github.com/XweiQ/BTGF. \ No newline at end of file diff --git a/data/2024/aaai/Urban Region Embedding via Multi-View Contrastive Prediction b/data/2024/aaai/Urban Region Embedding via Multi-View Contrastive Prediction new file mode 100644 index 0000000000..e1cfc96750 --- /dev/null +++ b/data/2024/aaai/Urban Region Embedding via Multi-View Contrastive Prediction @@ -0,0 +1 @@ +Recently, learning urban region representations utilizing multi-modal data (information views) has become increasingly popular, for deep understanding of the distributions of various socioeconomic features in cities. However, previous methods usually blend multi-view information in a posteriors stage, falling short in learning coherent and consistent representations across different views. In this paper, we form a new pipeline to learn consistent representations across varying views, and propose the multi-view Contrastive Prediction model for urban Region embedding (ReCP), which leverages the multiple information views from point-of-interest (POI) and human mobility data. Specifically, ReCP comprises two major modules, namely an intra-view learning module utilizing contrastive learning and feature reconstruction to capture the unique information from each single view, and inter-view learning module that perceives the consistency between the two views using a contrastive prediction learning scheme. We conduct thorough experiments on two downstream tasks to assess the proposed model, i.e., land use clustering and region popularity prediction. The experimental results demonstrate that our model outperforms state-of-the-art baseline methods significantly in urban region representation learning. 
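As a hedged sketch of the kind of cross-view contrastive objective described in the ReCP abstract above (the InfoNCE form, temperature, and synthetic features are assumptions for illustration, not the paper's implementation):

```python
# Hedged sketch (assumptions, not the ReCP implementation): an InfoNCE-style
# cross-view term that pulls the two views of the same region together and
# pushes apart views of different regions.
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    # z1, z2: (n_regions, d) embeddings of the same regions from two views
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                  # diagonal entries are positives
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
poi_view = rng.normal(size=(180, 64))                           # e.g. POI-based features
mobility_view = poi_view + 0.1 * rng.normal(size=(180, 64))     # correlated mobility view
print(info_nce(poi_view, mobility_view))
```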
\ No newline at end of file diff --git a/data/2024/aaai/Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health b/data/2024/aaai/Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health new file mode 100644 index 0000000000..be3d1f3f08 --- /dev/null +++ b/data/2024/aaai/Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health @@ -0,0 +1 @@ +Digital mental health (DMH) interventions, such as text-message-based lessons and activities, offer immense potential for accessible mental health support. While these interventions can be effective, real-world experimental testing can further enhance their design and impact. Adaptive experimentation, utilizing algorithms like Thompson Sampling for (contextual) multi-armed bandit (MAB) problems, can lead to continuous improvement and personalization. However, it remains unclear when these algorithms can simultaneously increase user experience rewards and facilitate appropriate data collection for social-behavioral scientists to analyze with sufficient statistical confidence. Although a growing body of research addresses the practical and statistical aspects of MAB and other adaptive algorithms, further exploration is needed to assess their impact across diverse real-world contexts. This paper presents a software system developed over two years that allows text-messaging intervention components to be adapted using bandit and other algorithms while collecting data for side-by-side comparison with traditional uniform random non-adaptive experiments. We evaluate the system by deploying a text-message-based DMH intervention to 1100 users, recruited through a large mental health non-profit organization, and share the path forward for deploying this system at scale. This system not only enables applications in mental health but could also serve as a model testbed for adaptive experimentation algorithms in other domains. \ No newline at end of file diff --git a/data/2024/aaai/Using Artificial Populations to Study Psychological Phenomena in Neural Models b/data/2024/aaai/Using Artificial Populations to Study Psychological Phenomena in Neural Models new file mode 100644 index 0000000000..4239f77e70 --- /dev/null +++ b/data/2024/aaai/Using Artificial Populations to Study Psychological Phenomena in Neural Models @@ -0,0 +1 @@ +The recent proliferation of research into transformer based natural language processing has led to a number of studies which attempt to detect the presence of human-like cognitive behavior in the models. We contend that, as is true of human psychology, the investigation of cognitive behavior in language models must be conducted in an appropriate population of an appropriate size for the results to be meaningful. We leverage work in uncertainty estimation in a novel approach to efficiently construct experimental populations. The resultant tool, PopulationLM, has been made open source. We provide theoretical grounding in the uncertainty estimation literature and motivation from current cognitive work regarding language models. We discuss the methodological lessons from other scientific communities and attempt to demonstrate their application to two artificial population studies. Through population based experimentation we find that language models exhibit behavior consistent with typicality effects among categories highly represented in training. However, we find that language models don't tend to exhibit structural priming effects. 
Generally, our results show that single models tend to over estimate the presence of cognitive behaviors in neural models. \ No newline at end of file diff --git a/data/2024/aaai/Using Clustering to Strengthen Decision Diagram Bounds for Discrete Optimization b/data/2024/aaai/Using Clustering to Strengthen Decision Diagram Bounds for Discrete Optimization new file mode 100644 index 0000000000..adfa5ced9e --- /dev/null +++ b/data/2024/aaai/Using Clustering to Strengthen Decision Diagram Bounds for Discrete Optimization @@ -0,0 +1 @@ +Offering a generic approach to obtaining both upper and lower bounds, decision diagrams (DDs) are becoming an increasingly important tool for solving discrete optimization problems. In particular, they provide a powerful and often complementary alternative to other well-known generic bounding mechanisms such as the LP relaxation. A standard approach to employ DDs for discrete optimization is to formulate the problem as a Dynamic Program and use that formulation to compile a DD top-down in a layer-by-layer fashion. To limit the size of the resulting DD and to obtain bounds, one typically imposes a maximum width for each layer which is then enforced by either merging nodes (resulting in a so-called relaxed DD that provides a dual bound) or by dropping nodes (resulting in a so-called restricted DD that provides a primal bound). The quality of the DD bounds obtained from this top-down compilation process heavily depends on the heuristics used for the selection of the nodes to merge or drop. While it is sometimes possible to engineer problem-specific heuristics for this selection problem, the most generic approach relies on sorting the layer’s nodes based on objective function information. In this paper, we propose a generic and problem-agnostic approach that relies on clustering nodes based on the state information associated with each node. In a set of computational experiments with different knapsack and scheduling problems, we show that our approach generally outperforms the classical generic approach, and often achieves drastically better bounds both with respect to the size of the DD and the time used for compiling the DD. \ No newline at end of file diff --git a/data/2024/aaai/Using Reinforcement Learning to Iteratively Construct Road Networks from Satellite Images and GPS Data b/data/2024/aaai/Using Reinforcement Learning to Iteratively Construct Road Networks from Satellite Images and GPS Data new file mode 100644 index 0000000000..c0e26a269b --- /dev/null +++ b/data/2024/aaai/Using Reinforcement Learning to Iteratively Construct Road Networks from Satellite Images and GPS Data @@ -0,0 +1 @@ +Constructing road networks manually is a time consuming and labor-intensive process. This paper proposes a new method to iteratively construct road networks using reinforcement learning from a combined tensor-based representation of satellite image and GPS trajectory data. \ No newline at end of file diff --git a/data/2024/aaai/Using Stratified Sampling to Improve LIME Image Explanations b/data/2024/aaai/Using Stratified Sampling to Improve LIME Image Explanations new file mode 100644 index 0000000000..85f67b04de --- /dev/null +++ b/data/2024/aaai/Using Stratified Sampling to Improve LIME Image Explanations @@ -0,0 +1,5 @@ +We investigate the use of a stratified sampling approach for LIME Image, a popular model-agnostic explainable AI method for computer vision tasks, in order to reduce the artifacts generated by typical Monte Carlo sampling. 
+Such artifacts are due to the undersampling of the dependent variable in the synthetic neighborhood around the image being explained, which may result in inadequate explanations due to the impossibility of fitting a linear regressor on the sampled data. +We then highlight a connection with the Shapley theory, where similar arguments about undersampling and sample relevance were suggested in the past. +We derive all the formulas and adjustment factors required for an unbiased stratified sampling estimator. +Experiments show the efficacy of the proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Using Symmetries to Lift Satisfiability Checking b/data/2024/aaai/Using Symmetries to Lift Satisfiability Checking new file mode 100644 index 0000000000..1f964809e6 --- /dev/null +++ b/data/2024/aaai/Using Symmetries to Lift Satisfiability Checking @@ -0,0 +1,4 @@ +We analyze how symmetries can be used to compress structures (also known as interpretations) onto a smaller domain without loss of information. This analysis suggests the possibility to solve satisfiability problems in the compressed domain for better performance. Thus, we propose a 2-step novel method: (i) the sentence to be satisfied is automatically translated into an equisatisfiable sentence over a ``lifted'' vocabulary that allows domain compression; (ii) satisfiability of the lifted sentence is checked by growing the (initially unknown) compressed domain until a satisfying structure is found. +The key issue is to ensure that this satisfying structure can always be expanded into an uncompressed structure that satisfies the original sentence to be satisfied. + +We present an adequate translation for sentences in typed first-order logic extended with aggregates. Our experimental evaluation shows large speedups for generative configuration problems. The method also has applications in the verification of software operating on complex data structures. Our results justify further research in automatic translation of sentences for symmetry reduction. \ No newline at end of file diff --git a/data/2024/aaai/V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models b/data/2024/aaai/V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models new file mode 100644 index 0000000000..ddc4bd0267 --- /dev/null +++ b/data/2024/aaai/V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models @@ -0,0 +1 @@ +Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. 
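As a rough illustration of how such a CLIP-CLAP domain gap could be probed (this is an assumed protocol, not the one used in the V2A-Mapper paper), one might compare paired CLIP image embeddings and CLAP audio embeddings with cosine similarity. The arrays below are assumed to be precomputed and projected to a shared dimensionality.

import numpy as np

# Assumed precomputed, paired embeddings: clip_img (N, D) from a CLIP image
# encoder and clap_aud (N, D) from a CLAP audio encoder, same D after projection.
def mean_paired_cosine(clip_img: np.ndarray, clap_aud: np.ndarray) -> float:
    a = clip_img / np.linalg.norm(clip_img, axis=1, keepdims=True)
    b = clap_aud / np.linalg.norm(clap_aud, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

# A low mean similarity for matched pairs (relative to mismatched pairs)
# would indicate a large gap between the two latent spaces.
rng = np.random.default_rng(0)
clip_img, clap_aud = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
print(mean_paired_cosine(clip_img, clap_aud))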
Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively. Supplementary materials such as audio samples are provided at our demo website: https://v2a-mapper.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/V2Meow: Meowing to the Visual Beat via Video-to-Music Generation b/data/2024/aaai/V2Meow: Meowing to the Visual Beat via Video-to-Music Generation new file mode 100644 index 0000000000..35b0f93492 --- /dev/null +++ b/data/2024/aaai/V2Meow: Meowing to the Visual Beat via Video-to-Music Generation @@ -0,0 +1 @@ +Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow. \ No newline at end of file diff --git a/data/2024/aaai/VITA: 'Carefully Chosen and Weighted Less' Is Better in Medication Recommendation b/data/2024/aaai/VITA: 'Carefully Chosen and Weighted Less' Is Better in Medication Recommendation new file mode 100644 index 0000000000..2afca6b705 --- /dev/null +++ b/data/2024/aaai/VITA: 'Carefully Chosen and Weighted Less' Is Better in Medication Recommendation @@ -0,0 +1 @@ +We address the medication recommendation problem, which aims to recommend effective medications for a patient's current visit by utilizing information (e.g., diagnoses and procedures) given at the patient's current and past visits. 
While there exist a number of recommender systems designed for this problem, we point out that they are challenged in accurately capturing the relation (spec., the degree of relevance) between the current and each of the past visits for the patient when obtaining her current health status, which is the basis for recommending medications. To address this limitation, we propose a novel medication recommendation framework, named VITA, based on the following two novel ideas: (1) relevant-Visit selectIon; (2) Target-aware Attention. Through extensive experiments using real-world datasets, we demonstrate the superiority of VITA (spec., up to 5.67% higher accuracy, in terms of Jaccard, than the best competitor) and the effectiveness of its two core ideas. The code is available at https://github.com/jhheo0123/VITA. \ No newline at end of file diff --git a/data/2024/aaai/VIXEN: Visual Text Comparison Network for Image Difference Captioning b/data/2024/aaai/VIXEN: Visual Text Comparison Network for Image Difference Captioning new file mode 100644 index 0000000000..f8f014228c --- /dev/null +++ b/data/2024/aaai/VIXEN: Visual Text Comparison Network for Image Difference Captioning @@ -0,0 +1 @@ +We present VIXEN - a technique that succinctly summarizes in text the visual differences between a pair of images in order to highlight any content manipulation present. Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model. We address the challenge of low volume of training data and lack of manipulation variety in existing image difference captioning (IDC) datasets by training on synthetically manipulated images from the recent InstructPix2Pix dataset generated via the prompt-to-prompt editing framework. We augment this dataset with change summaries produced via GPT-3. We show that VIXEN produces state-of-the-art, comprehensible difference captions for diverse image contents and edit types, offering a potential mitigation against misinformation disseminated via manipulated image content. Code and data are available at http://github.com/alexblck/vixen \ No newline at end of file diff --git a/data/2024/aaai/VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding b/data/2024/aaai/VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding new file mode 100644 index 0000000000..45b2c55c8d --- /dev/null +++ b/data/2024/aaai/VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding @@ -0,0 +1 @@ +Vision and language foundation models (VLMs) have showcased impressive capabilities in 2D scene understanding. However, their latent potential in elevating the understanding of 3D autonomous driving scenes remains untapped. In this paper, we propose VLM2Scene, which exploits the potential of VLMs to enhance 3D self-supervised representation learning through our proposed image-text-LiDAR contrastive learning strategy. Specifically, in the realm of autonomous driving scenes, the inherent sparsity of LiDAR point clouds poses a notable challenge for point-level contrastive learning methods. These methods often grapple with limitations tied to a restricted receptive field and the presence of noisy points. To tackle this challenge, our approach emphasizes region-level learning, leveraging regional masks without semantics derived from the vision foundation model.
This approach capitalizes on valuable contextual information to enhance the learning of point cloud representations. First, we introduce Region Caption Prompts to generate fine-grained language descriptions for the corresponding regions, utilizing the language foundation model. These region prompts then facilitate the establishment of positive and negative text-point pairs within the contrastive loss framework. Second, we propose a Region Semantic Concordance Regularization, which involves a semantic-filtered region learning and a region semantic assignment strategy. The former aims to filter the false negative samples based on the semantic distance, and the latter mitigates potential inaccuracies in pixel semantics, thereby enhancing overall semantic consistency. Extensive experiments on representative autonomous driving datasets demonstrate that our self-supervised method significantly outperforms other counterparts. Codes are available at https://github.com/gbliao/VLM2Scene. \ No newline at end of file diff --git a/data/2024/aaai/VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation b/data/2024/aaai/VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation new file mode 100644 index 0000000000..19f87bf765 --- /dev/null +++ b/data/2024/aaai/VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation @@ -0,0 +1 @@ +Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions. The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data. To address these issues, we propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos in multiple cities in the U.S. augmented with automatically generated navigation instructions and actions to improve outdoor VLN performance. VLN-Video combines the best of intuitive classical approaches and modern deep learning techniques, using template infilling to generate grounded non-repetitive navigation instructions, combined with an image rotation similarity based navigation action predictor to obtain VLN style data from driving videos for pretraining deep learning VLN models. We pre-train the model on the Touchdown dataset and our video-augmented dataset created from driving videos with three proxy tasks: Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction, so as to learn temporally-aware and visually-aligned instruction representations. The learned instruction representation is adapted to the state-of-the-art navigation agent when fine-tuning on the Touchdown dataset. Empirical results demonstrate that VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate, achieving a new state-of-the-art on the Touchdown dataset. \ No newline at end of file diff --git a/data/2024/aaai/VPDETR: End-to-End Vanishing Point DEtection TRansformers b/data/2024/aaai/VPDETR: End-to-End Vanishing Point DEtection TRansformers new file mode 100644 index 0000000000..f056675dcc --- /dev/null +++ b/data/2024/aaai/VPDETR: End-to-End Vanishing Point DEtection TRansformers @@ -0,0 +1 @@ +In the field of vanishing point detection, previous works commonly relied on extracting and clustering straight lines or classifying candidate points as vanishing points. 
This paper proposes a novel end-to-end framework, called VPDETR (Vanishing Point DEtection TRansformer), that views vanishing point detection as a set prediction problem, applicable to both Manhattan and non-Manhattan world datasets. By using the positional embedding of anchor points as queries in Transformer decoders and dynamically updating them layer by layer, our method is able to directly input images and output their vanishing points without the need for explicit straight line extraction and candidate points sampling. Additionally, we introduce an orthogonal loss and a cross-prediction loss to improve accuracy on the Manhattan world datasets. Experimental results demonstrate that VPDETR achieves competitive performance compared to state-of-the-art methods, without requiring post-processing. \ No newline at end of file diff --git a/data/2024/aaai/VQ-FONT: Few-Shot Font Generation with Structure-Aware Enhancement and Quantization b/data/2024/aaai/VQ-FONT: Few-Shot Font Generation with Structure-Aware Enhancement and Quantization new file mode 100644 index 0000000000..f5cff4d9a7 --- /dev/null +++ b/data/2024/aaai/VQ-FONT: Few-Shot Font Generation with Structure-Aware Enhancement and Quantization @@ -0,0 +1 @@ +Few-shot font generation is challenging, as it needs to capture the fine-grained stroke styles from a limited set of reference glyphs, and then transfer to other characters, which are expected to have similar styles. However, due to the diversity and complexity of Chinese font styles, the synthesized glyphs of existing methods usually exhibit visible artifacts, such as missing details and distorted strokes. In this paper, we propose a VQGAN-based framework (i.e., VQ-Font) to enhance glyph fidelity through token prior refinement and structure-aware enhancement. Specifically, we pre-train a VQGAN to encapsulate the font token prior within a codebook. Subsequently, VQ-Font refines the synthesized glyphs with the codebook to eliminate the domain gap between synthesized and real-world strokes. Furthermore, our VQ-Font leverages the inherent design of Chinese characters, where structure components such as radicals and character components are combined in specific arrangements, to recalibrate fine-grained styles based on references. This process improves the matching and fusion of styles at the structure level. Both modules collaborate to enhance the fidelity of the generated fonts. Experiments on a collected font dataset show that our VQ-Font outperforms the competing methods both quantitatively and qualitatively, especially in generating challenging styles. Our code is available at https://github.com/Yaomingshuai/VQ-Font. \ No newline at end of file diff --git a/data/2024/aaai/VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models b/data/2024/aaai/VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models new file mode 100644 index 0000000000..eacf9bc3ac --- /dev/null +++ b/data/2024/aaai/VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models @@ -0,0 +1,2 @@ +Visual Question Answering (VQA) is a fundamental task in the computer vision and natural language processing fields. Although the “pre-training & finetuning” learning paradigm significantly improves the VQA performance, the adversarial robustness of such a learning paradigm has not been explored.
In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack the target VQA models. Correspondingly, we propose a novel VQATTACK model, which can iteratively generate both image and text perturbations with the designed modules: the large language model (LLM)-enhanced image attack and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes the latent representation-based loss to generate feature-level image perturbations. Then it incorporates an LLM to further enhance the image perturbations by optimizing the designed masked answer anti-recovery loss. The cross-modal joint attack module will be triggered at a specific iteration, which updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and word synonym-based substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQATTACK in the transferable attack setting, compared with state-of-the-art baselines. This work reveals
+a significant blind spot in the “pre-training & fine-tuning” paradigm on VQA tasks. The source code can be found at https://github.com/ericyinyzy/VQAttack. \ No newline at end of file diff --git a/data/2024/aaai/VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook b/data/2024/aaai/VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook new file mode 100644 index 0000000000..a207f11eba --- /dev/null +++ b/data/2024/aaai/VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook @@ -0,0 +1,3 @@ +Night photography often struggles with challenges like low light and blurring, stemming from dark environments and prolonged exposures. Current methods either disregard priors and directly fit end-to-end networks, leading to inconsistent illumination, or rely on unreliable handcrafted priors to constrain the network, thereby introducing greater error into the final result. We believe in the strength of data-driven high-quality priors and strive to offer a reliable and consistent prior, circumventing the restrictions of manual priors.
+In this paper, we propose Clearer Night Image Restoration with Vector-Quantized Codebook (VQCNIR) to achieve remarkable and consistent restoration outcomes on real-world and synthetic benchmarks. To ensure the faithful restoration of details and illumination, we propose the incorporation of two essential modules: the Adaptive Illumination Enhancement Module (AIEM) and the Deformable Bi-directional Cross-Attention (DBCA) module. The AIEM leverages the inter-channel correlation of features to dynamically maintain illumination consistency between degraded features and high-quality codebook features. Meanwhile, the DBCA module effectively integrates texture and structural information through bi-directional cross-attention and deformable convolution, resulting in enhanced fine-grained detail and structural fidelity across parallel decoders.
+Extensive experiments validate the remarkable benefits of VQCNIR in enhancing image quality under low-light conditions, showcasing its state-of-the-art performance on both synthetic and real-world datasets. The code is available at https://github.com/AlexZou14/VQCNIR.
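As background for the vector-quantized codebook that VQCNIR builds on, here is a minimal, generic VQ-VAE-style quantizer sketch (nearest-codebook-entry lookup with a straight-through gradient). The codebook size and feature dimension are illustrative assumptions, and this is not the authors' implementation.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Generic nearest-neighbour codebook lookup with a straight-through estimator."""
    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (N, dim) continuous encoder features.
        dists = torch.cdist(z, self.codebook.weight)   # (N, num_codes) distances
        idx = dists.argmin(dim=1)                       # nearest code per feature
        z_q = self.codebook(idx)                        # quantized features
        # Straight-through: forward pass uses z_q, gradients flow back to z.
        return z + (z_q - z).detach()

# Illustrative usage on random features.
vq = VectorQuantizer()
print(vq(torch.randn(4, 256)).shape)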
\ No newline at end of file diff --git a/data/2024/aaai/VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression b/data/2024/aaai/VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression new file mode 100644 index 0000000000..edca879af7 --- /dev/null +++ b/data/2024/aaai/VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression @@ -0,0 +1 @@ +In content-based video retrieval (CBVR), dealing with large-scale collections, efficiency is as important as accuracy; thus, several video-level feature-based studies have actively been conducted. Nevertheless, owing to the severe difficulty of embedding a lengthy and untrimmed video into a single feature, these studies have been insufficient for accurate retrieval compared to frame-level feature-based studies. In this paper, we show that appropriate suppression of irrelevant frames can provide insight into the current obstacles of the video-level approaches. Furthermore, we propose a Video-to-Video Suppression network (VVS) as a solution. VVS is an end-to-end framework that consists of an easy distractor elimination stage to identify which frames to remove and a suppression weight generation stage to determine the extent to suppress the remaining frames. This structure is intended to effectively describe an untrimmed video with varying content and meaningless information. Its efficacy is proved via extensive experiments, and we show that our approach is not only state-of-the-art in video-level approaches but also has a fast inference time despite possessing retrieval capabilities close to those of frame-level approaches. Code is available at https://github.com/sejong-rcv/VVS \ No newline at end of file diff --git a/data/2024/aaai/Validation, Robustness, and Accuracy of Perturbation-Based Sensitivity Analysis Methods for Time-Series Deep Learning Models b/data/2024/aaai/Validation, Robustness, and Accuracy of Perturbation-Based Sensitivity Analysis Methods for Time-Series Deep Learning Models new file mode 100644 index 0000000000..98612fa01b --- /dev/null +++ b/data/2024/aaai/Validation, Robustness, and Accuracy of Perturbation-Based Sensitivity Analysis Methods for Time-Series Deep Learning Models @@ -0,0 +1 @@ +This work undertakes studies to evaluate Interpretability Methods for Time Series Deep Learning. Sensitivity analysis assesses how input changes affect the output, constituting a key component of interpretation. Among the post-hoc interpretation methods such as back-propagation, perturbation, and approximation, my work will investigate perturbation-based sensitivity Analysis methods on modern Transformer models to benchmark their performances. Specifically, my work intends to answer three research questions: 1) Do different sensitivity analysis methods yield comparable outputs and attribute importance rankings? 2) Using the same sensitivity analysis method, do different Deep Learning models impact the output of the sensitivity analysis? 3) How well do the results from sensitivity analysis methods align with the ground truth? \ No newline at end of file diff --git a/data/2024/aaai/Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties b/data/2024/aaai/Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties new file mode 100644 index 0000000000..a91324512f --- /dev/null +++ b/data/2024/aaai/Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties @@ -0,0 +1,5 @@ +Human values are crucial to human decision-making. 
Value pluralism is the view that multiple correct values may be held in tension with one another (e.g., when considering lying to a friend to protect their feelings, how does one balance honesty with friendship?). As statistical learners, AI systems fit to averages by default, washing out these potentially irreducible value conflicts. To improve AI systems to better reflect value pluralism, the first-order challenge is to explore the extent to which AI systems can model pluralistic human values, rights, and duties as well as their interaction.
+
+We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations. ValuePrism’s contextualized values are generated by GPT-4 and deemed high-quality by human annotators 91% of the time. We conduct a large-scale study with annotators across diverse social and demographic backgrounds to try to understand whose values are represented.
+
+With ValuePrism, we build Value Kaleidoscope (or Kaleido), an open, light-weight, and structured language-based multi-task model that generates, explains, and assesses the relevance and valence (i.e., support or oppose) of human values, rights, and duties within a specific context. Humans prefer the sets of values output by our system over the teacher GPT-4, finding them more accurate and with broader coverage. In addition, we demonstrate that Kaleido can help explain variability in human decision-making by outputting contrasting values. Finally, we show that Kaleido’s representations transfer to other philosophical frameworks and datasets, confirming the benefit of an explicit, modular, and interpretable approach to value pluralism. We hope that our work will serve as a step to making more explicit the implicit values behind human decision-making and to steering AI systems to make decisions that are more in accordance with them. \ No newline at end of file diff --git a/data/2024/aaai/Value at Adversarial Risk: A Graph Defense Strategy against Cost-Aware Attacks b/data/2024/aaai/Value at Adversarial Risk: A Graph Defense Strategy against Cost-Aware Attacks new file mode 100644 index 0000000000..9103a07ad8 --- /dev/null +++ b/data/2024/aaai/Value at Adversarial Risk: A Graph Defense Strategy against Cost-Aware Attacks @@ -0,0 +1 @@ +Deep learning methods on graph data have achieved remarkable efficacy across a variety of real-world applications, such as social network analysis and transaction risk detection. Nevertheless, recent studies have illuminated a concerning fact: even the most expressive Graph Neural Networks (GNNs) are vulnerable to graph adversarial attacks. While several methods have been proposed to enhance the robustness of GNN models against adversarial attacks, few have focused on a simple yet realistic approach: valuing the adversarial risks and focusing safeguards at the node level. This empowers defenders to allocate a heightened security level to vulnerable nodes and a lower one to robust nodes. With this new perspective, we propose a novel graph defense strategy, RisKeeper, such that the adversarial risk can be directly kept in the input graph. We start by valuing the adversarial risk, introducing a cost-aware projected gradient descent attack that takes into account both cost avoidance and compliance with cost budgets. Subsequently, we present a learnable approach to ascertain the ideal security level for each individual node by solving a bi-level optimization problem.
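To ground the projected gradient descent (PGD) machinery that the RisKeeper abstract above builds its cost-aware attack on, here is a minimal, generic PGD sketch on continuous inputs with an L-infinity budget. The cost-aware projection onto discrete graph edits described in the paper is more involved and is not shown; all names and hyperparameters here are illustrative assumptions.

import torch

def pgd_attack(model, x, y, loss_fn, eps=0.05, alpha=0.01, steps=10):
    """Generic L-infinity PGD: ascend the loss, then project back into the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()          # gradient ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project into the budget
    return x_adv.detach()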
Through extensive experiments on four real-world datasets, we demonstrate that our method achieves superior performance surpassing state-of-the-art methods. Our in-depth case studies provide further insights into vulnerable and robust structural patterns, serving as inspiration for practitioners to exercise heightened vigilance. \ No newline at end of file diff --git a/data/2024/aaai/Variable Importance in High-Dimensional Settings Requires Grouping b/data/2024/aaai/Variable Importance in High-Dimensional Settings Requires Grouping new file mode 100644 index 0000000000..cdedfe9d02 --- /dev/null +++ b/data/2024/aaai/Variable Importance in High-Dimensional Settings Requires Grouping @@ -0,0 +1 @@ +Explaining the decision process of machine learning algorithms is nowadays crucial for both model’s performance enhancement and human comprehension. This can be achieved by assessing the variable importance of single variables, even for high-capacity non-linear methods, e.g. Deep Neural Networks (DNNs). While only removal-based approaches, such as Permutation Importance (PI), can bring statistical validity, they return misleading results when variables are correlated. Conditional Permutation Importance (CPI) bypasses PI’s limitations in such cases. However, in high-dimensional settings, where high correlations between the variables cancel their conditional importance, the use of CPI as well as other methods leads to unreliable results, besides prohibitive computation costs. Grouping variables statistically via clustering or some prior knowledge gains some power back and leads to better interpretations. In this work, we introduce BCPI (Block-Based Conditional Permutation Importance), a new generic framework for variable importance computation with statistical guarantees handling both single and group cases. Furthermore, as handling groups with high cardinality (such as a set of observations of a given modality) are both time-consuming and resource-intensive, we also introduce a new stacking approach extending the DNN architecture with sub-linear layers adapted to the group structure. We show that the ensuing approach extended with stacking controls the type-I error even with highly-correlated groups and shows top accuracy across benchmarks. Furthermore, we perform a real-world data analysis in a large-scale medical dataset where we aim to show the consistency between our results and the literature for a biomarker prediction. \ No newline at end of file diff --git a/data/2024/aaai/Variance-Insensitive and Target-Preserving Mask Refinement for Interactive Image Segmentation b/data/2024/aaai/Variance-Insensitive and Target-Preserving Mask Refinement for Interactive Image Segmentation new file mode 100644 index 0000000000..1d785cd244 --- /dev/null +++ b/data/2024/aaai/Variance-Insensitive and Target-Preserving Mask Refinement for Interactive Image Segmentation @@ -0,0 +1 @@ +Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing. However, fully extracting the target mask with limited user inputs remains challenging. We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement to enhance segmentation quality with fewer user inputs. Regarding the last segmentation result as the initial mask, an iterative refinement process is commonly employed to continually enhance the initial mask. Nevertheless, conventional techniques suffer from sensitivity to the variance in the initial mask. 
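For the Permutation Importance (PI) baseline named in the "Variable Importance in High-Dimensional Settings Requires Grouping" abstract above, here is a minimal sketch of single-variable permutation importance (illustrative only; it does not implement the proposed BCPI or its grouped, conditional variant).

import numpy as np

def permutation_importance(model, X, y, score_fn, n_repeats=5, seed=0):
    """Drop in score when each column is shuffled; a larger drop means a more important feature."""
    rng = np.random.default_rng(seed)
    base = score_fn(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between feature j and y
            drops.append(base - score_fn(y, model.predict(Xp)))
        importances[j] = np.mean(drops)
    return importances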
To circumvent this problem, our proposed method incorporates a mask matching algorithm for ensuring consistent inferences from different types of initial masks. We also introduce a target-aware zooming algorithm to preserve object information during downsampling, balancing efficiency and accuracy. Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation. \ No newline at end of file diff --git a/data/2024/aaai/Variational Hybrid-Attention Framework for Multi-Label Few-Shot Aspect Category Detection b/data/2024/aaai/Variational Hybrid-Attention Framework for Multi-Label Few-Shot Aspect Category Detection new file mode 100644 index 0000000000..2fbe86142c --- /dev/null +++ b/data/2024/aaai/Variational Hybrid-Attention Framework for Multi-Label Few-Shot Aspect Category Detection @@ -0,0 +1 @@ +Multi-label few-shot aspect category detection (FS-ACD) is a challenging sentiment analysis task, which aims to learn a multi-label learning paradigm with limited training data. The difficulty of this task is how to use limited data to generalize effective discriminative representations for different categories. Nowadays, all advanced FS-ACD works utilize the prototypical network to learn label prototypes to represent different aspects. However, such point-based estimation methods are inherently noise-susceptible and bias-vulnerable. To this end, this paper proposes a novel Variational Hybrid-Attention Framework (VHAF) for the FS-ACD task. Specifically, to alleviate the data noise, we adopt a hybrid-attention mechanism to generate more discriminative aspect-specific embeddings. Then, based on these embeddings, we introduce the variational distribution inference to obtain the aspect-specific distribution as a more robust aspect representation, which can eliminate the scarce data bias for better inference. Moreover, we further leverage an adaptive threshold estimation to help VHAF better identify multiple relevant aspects. Extensive experiments on three datasets demonstrate the effectiveness of our VHAF over other state-of-the-art methods. Code is available at https://github.com/chengzju/VHAF. \ No newline at end of file diff --git a/data/2024/aaai/Vector Field Oriented Diffusion Model for Crystal Material Generation b/data/2024/aaai/Vector Field Oriented Diffusion Model for Crystal Material Generation new file mode 100644 index 0000000000..3db63d791b --- /dev/null +++ b/data/2024/aaai/Vector Field Oriented Diffusion Model for Crystal Material Generation @@ -0,0 +1 @@ +Discovering crystal structures with specific chemical properties has become an increasingly important focus in material science. However, current models are limited in their ability to generate new crystal lattices, as they only consider atomic positions or chemical composition. To address this issue, we propose a probabilistic diffusion model that utilizes a geometrically equivariant GNN to consider atomic positions and crystal lattices jointly. To evaluate the effectiveness of our model, we introduce a new generation metric inspired by Frechet Inception Distance, but based on GNN energy prediction rather than InceptionV3 used in computer vision. In addition to commonly used metrics like validity, which assesses the plausibility of a structure, this new metric offers a more comprehensive evaluation of our model's capabilities. Our experiments on existing benchmarks show the significance of our diffusion model. 
We also show that our method can effectively learn meaningful representations. \ No newline at end of file diff --git a/data/2024/aaai/VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch b/data/2024/aaai/VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch new file mode 100644 index 0000000000..7cb8ce8afc --- /dev/null +++ b/data/2024/aaai/VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch @@ -0,0 +1 @@ +AI's widespread integration has led to neural network (NN) deployment on edge and similar limited-resource platforms for safety-critical scenarios. Yet, NN's fragility raises concerns about reliable inference. Moreover, constrained platforms demand compact networks. This study introduces VeriCompress, a tool that automates the search and training of compressed models with robustness guarantees. These models are well-suited for safety-critical applications and adhere to predefined architecture and size limitations, making them deployable on resource-restricted platforms. The method trains models 2-3 times faster than the state-of-the-art approaches, surpassing them by average accuracy and robustness gains of 15.1 and 9.8 percentage points, respectively. When deployed on a resource-restricted generic platform, these models require 5-8 times less memory and 2-4 times less inference time than models used in verified robustness literature. Our comprehensive evaluation across various model architectures and datasets, including MNIST, CIFAR, SVHN, and a relevant pedestrian detection dataset, showcases VeriCompress's capacity to identify compressed verified robust models with reduced computation overhead compared to current standards. This underscores its potential as a valuable tool for end users, such as developers of safety-critical applications on edge or Internet of Things platforms, empowering them to create suitable models for safety-critical, resource-constrained platforms in their respective domains. \ No newline at end of file diff --git a/data/2024/aaai/ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization b/data/2024/aaai/ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization new file mode 100644 index 0000000000..30e33363c7 --- /dev/null +++ b/data/2024/aaai/ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization @@ -0,0 +1 @@ +Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance in many downstream tasks. Since adapting contrastive video-text pre-training methods like CLIP to video tasks is limited by their high cost and scale, recent approaches focus on efficiently transferring the image-based CLIP to the video domain. A major finding is that fine-tuning the pre-trained model to achieve strong fully supervised performance leads to poor zero-shot, few-shot, and base-to-novel generalization. Instead, freezing the backbone network to maintain generalization ability weakens fully supervised performance. Moreover, no single prompt-tuning branch consistently performs optimally. In this work, we propose a multimodal prompt learning scheme that balances supervised and generalized performance.
Our prompting approach contains three sections: 1) Independent prompts on both the vision and text branches to learn the language and visual contexts. 2) Inter-modal prompt mapping to ensure mutual synergy. 3) Reducing the discrepancy between the hand-crafted prompt (a video of a person doing [CLS]) and the learnable prompt, to alleviate forgetting of essential video scenarios. Extensive validation in fully supervised, zero-shot, few-shot, and base-to-novel generalization settings for video recognition indicates that the proposed approach achieves competitive performance with lower computational cost. \ No newline at end of file diff --git a/data/2024/aaai/ViSTec: Video Modeling for Sports Technique Recognition and Tactical Analysis b/data/2024/aaai/ViSTec: Video Modeling for Sports Technique Recognition and Tactical Analysis new file mode 100644 index 0000000000..2f8d6c32b9 --- /dev/null +++ b/data/2024/aaai/ViSTec: Video Modeling for Sports Technique Recognition and Tactical Analysis @@ -0,0 +1 @@ +The immense popularity of racket sports has fueled substantial demand for tactical analysis with broadcast videos. However, existing manual methods require laborious annotation, and recent attempts leveraging video perception models are limited to low-level annotations like ball trajectories, overlooking tactics that necessitate an understanding of stroke techniques. State-of-the-art action segmentation models also struggle with technique recognition due to frequent occlusions and motion-induced blurring in racket sports videos. To address these challenges, we propose ViSTec, a Video-based Sports Technique recognition model inspired by human cognition that synergizes sparse visual data with rich contextual insights. Our approach integrates a graph to explicitly model strategic knowledge in stroke sequences and enhance technique recognition with contextual inductive bias. A two-stage action perception model is jointly trained to align with the contextual knowledge in the graph. Experiments demonstrate that our method outperforms existing models by a significant margin. Case studies with experts from the Chinese national table tennis team validate our model's capacity to automate analysis for technical actions and tactical strategies. More details are available at: https://ViSTec2024.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/ViT-Calibrator: Decision Stream Calibration for Vision Transformer b/data/2024/aaai/ViT-Calibrator: Decision Stream Calibration for Vision Transformer new file mode 100644 index 0000000000..de52f1ef9e --- /dev/null +++ b/data/2024/aaai/ViT-Calibrator: Decision Stream Calibration for Vision Transformer @@ -0,0 +1 @@ +A surge of interest has emerged in utilizing Transformers in diverse vision tasks owing to their formidable performance. However, existing approaches primarily focus on optimizing internal model architecture designs that often entail significant trial and error with high burdens. In this work, we propose a new paradigm dubbed Decision Stream Calibration that boosts the performance of general Vision Transformers. To achieve this, we shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
Upon further analysis, it was discovered that 1) the final decision is associated with tokens of foreground targets, while token features of foreground target will be transmitted into the next layer as much as possible, and the useless token features of background area will be eliminated gradually in the forward propagation. 2) Each category is solely associated with specific sparse dimensions in the tokens. Based on the discoveries mentioned above, we designed a two-stage calibration scheme, namely ViT-Calibrator, including token propagation calibration stage and dimension propagation calibration stage. Extensive experiments on commonly used datasets show that the proposed approach can achieve promising results. \ No newline at end of file diff --git a/data/2024/aaai/ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining b/data/2024/aaai/ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining new file mode 100644 index 0000000000..da4ecba5e8 --- /dev/null +++ b/data/2024/aaai/ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining @@ -0,0 +1 @@ +Scene text removal (STR) aims at replacing text strokes in natural scenes with visually coherent backgrounds. Recent STR approaches rely on iterative refinements or explicit text masks, resulting in high complexity and sensitivity to the accuracy of text localization. Moreover, most existing STR methods adopt convolutional architectures while the potential of vision Transformers (ViTs) remains largely unexplored. In this paper, we propose a simple-yet-effective ViT-based text eraser, dubbed ViTEraser. Following a concise encoder-decoder framework, ViTEraser can easily incorporate various ViTs to enhance long-range modeling. Specifically, the encoder hierarchically maps the input image into the hidden space through ViT blocks and patch embedding layers, while the decoder gradually upsamples the hidden features to the text-erased image with ViT blocks and patch splitting layers. As ViTEraser implicitly integrates text localization and inpainting, we propose a novel end-to-end pretraining method, termed SegMIM, which focuses the encoder and decoder on the text box segmentation and masked image modeling tasks, respectively. Experimental results demonstrate that ViTEraser with SegMIM achieves state-of-the-art performance on STR by a substantial margin and exhibits strong generalization ability when extended to other tasks, e.g., tampered scene text detection. Furthermore, we comprehensively explore the architecture, pretraining, and scalability of the ViT-based encoder-decoder for STR, which provides deep insights into the application of ViT to the STR field. Code is available at https://github.com/shannanyinxiang/ViTEraser. \ No newline at end of file diff --git a/data/2024/aaai/ViTree: Single-Path Neural Tree for Step-Wise Interpretable Fine-Grained Visual Categorization b/data/2024/aaai/ViTree: Single-Path Neural Tree for Step-Wise Interpretable Fine-Grained Visual Categorization new file mode 100644 index 0000000000..1c317bb0a4 --- /dev/null +++ b/data/2024/aaai/ViTree: Single-Path Neural Tree for Step-Wise Interpretable Fine-Grained Visual Categorization @@ -0,0 +1 @@ +As computer vision continues to advance and finds widespread applications across various domains, the need for interpretability in deep learning models becomes paramount. 
Existing methods often resort to post-hoc techniques or prototypes to explain the decision-making process, which can be indirect and lack intrinsic illustration. In this research, we introduce ViTree, a novel approach for fine-grained visual categorization that combines the popular vision transformer as a feature extraction backbone with neural decision trees. By traversing the tree paths, ViTree effectively selects patches from transformer-processed features to highlight informative local regions, thereby refining representations in a step-wise manner. Unlike previous tree-based models that rely on soft distributions or ensembles of paths, ViTree selects a single tree path, offering a clearer and simpler decision-making process. This patch and path selectivity enhances the interpretability of ViTree, enabling better insights into the model's inner workings. Remarkably, extensive experimentation validates that this streamlined approach surpasses various strong competitors and achieves state-of-the-art performance while maintaining exceptional interpretability, as validated from multiple perspectives. Code can be found at https://github.com/SJTU-DeepVisionLab/ViTree. \ No newline at end of file diff --git a/data/2024/aaai/Video Event Extraction with Multi-View Interaction Knowledge Distillation b/data/2024/aaai/Video Event Extraction with Multi-View Interaction Knowledge Distillation new file mode 100644 index 0000000000..52f4683ed6 --- /dev/null +++ b/data/2024/aaai/Video Event Extraction with Multi-View Interaction Knowledge Distillation @@ -0,0 +1 @@ +Video event extraction (VEE) aims to extract key events and generate the event arguments for their semantic roles from the video. Although promising results have been achieved by existing methods, they still lack an elaborate learning strategy to adequately consider: (1) inter-object interaction, which reflects the relation between objects; (2) inter-modality interaction, which aligns the features from the text and video modalities. In this paper, we propose a Multi-view Interaction with knowledge Distillation (MID) framework to solve the above problems with the Knowledge Distillation (KD) mechanism. Specifically, we propose the self-Relational KD (self-RKD) to enhance the inter-object interaction, where the relation between objects is measured by a distance metric, and the high-level relational knowledge from the deeper layer is taken as the guidance for boosting the shallow layer in the video encoder. Meanwhile, to improve the inter-modality interaction, the Layer-to-layer KD (LKD) is proposed, which integrates additional cross-modal supervisions (i.e., the results of cross-attention) with the textual supervising signal for training each transformer decoder layer. Extensive experiments show that without any additional parameters, MID achieves state-of-the-art performance compared to other strong methods in VEE. \ No newline at end of file diff --git a/data/2024/aaai/Video Frame Prediction from a Single Image and Events b/data/2024/aaai/Video Frame Prediction from a Single Image and Events new file mode 100644 index 0000000000..2dc85a55d4 --- /dev/null +++ b/data/2024/aaai/Video Frame Prediction from a Single Image and Events @@ -0,0 +1 @@ +Recently, the task of Video Frame Prediction (VFP), which predicts future video frames from previous ones through extrapolation, has made remarkable progress.
However, the performance of existing VFP methods is still far from satisfactory due to the fixed framerate video used: 1) they have difficulties in handling complex dynamic scenes; 2) they cannot predict future frames with flexible prediction time intervals. The event cameras can record the intensity changes asynchronously with a very high temporal resolution, which provides rich dynamic information about the observed scenes. In this paper, we propose to predict video frames from a single image and the following events, which can not only handle complex dynamic scenes but also predict future frames with flexible prediction time intervals. First, we introduce a symmetrical cross-modal attention augmentation module to enhance the complementary information between images and events. Second, we propose to jointly achieve optical flow estimation and frame generation by combining the motion information of events and the semantic information of the image, then inpainting the holes produced by forward warping to obtain an ideal prediction frame. Based on these, we propose a lightweight pyramidal coarse-to-fine model that can predict a 720P frame within 25 ms. Extensive experiments show that our proposed model significantly outperforms the state-of-the-art frame-based and event-based VFP methods and has the fastest runtime. Code is available at https://npucvr.github.io/VFPSIE/. \ No newline at end of file diff --git a/data/2024/aaai/Video-Context Aligned Transformer for Video Question Answering b/data/2024/aaai/Video-Context Aligned Transformer for Video Question Answering new file mode 100644 index 0000000000..e3296fcf05 --- /dev/null +++ b/data/2024/aaai/Video-Context Aligned Transformer for Video Question Answering @@ -0,0 +1 @@ +Video question answering involves understanding video content to generate accurate answers to questions. Recent studies have successfully modeled video features and achieved diverse multimodal interaction, yielding impressive outcomes. However, they have overlooked the fact that the video contains richer instances and events beyond the scope of the stated question. Extremely imbalanced alignment of information from both sides leads to significant instability in reasoning. To address this concern, we propose the Video-Context Aligned Transformer (V-CAT), which leverages the context to achieve semantic and content alignment between video and question. Specifically, the video and text are encoded into a shared semantic space initially. We apply contrastive learning to global video token and context token to enhance the semantic alignment. Then, the pooled context feature is utilized to obtain corresponding visual content. Finally, the answer is decoded by integrating the refined video and question features. We evaluate the effectiveness of V-CAT on MSVD-QA and MSRVTT-QA dataset, both achieving state-of-the-art performance. Extended experiments further analyze and demonstrate the effectiveness of each proposed module. \ No newline at end of file diff --git a/data/2024/aaai/Virtual Action Actor-Critic Framework for Exploration (Student Abstract) b/data/2024/aaai/Virtual Action Actor-Critic Framework for Exploration (Student Abstract) new file mode 100644 index 0000000000..59feb7011c --- /dev/null +++ b/data/2024/aaai/Virtual Action Actor-Critic Framework for Exploration (Student Abstract) @@ -0,0 +1 @@ +Efficient exploration for an agent is challenging in reinforcement learning (RL). 
In this paper, a novel actor-critic framework, namely virtual action actor-critic (VAAC), is proposed to address the challenge of efficient exploration in RL. This work is inspired by humans' ability to imagine the potential outcomes of their actions without actually taking them. In order to emulate this ability, VAAC introduces a new actor called the virtual actor (VA), alongside the conventional actor-critic framework. Unlike the conventional actor, the VA takes the virtual action to anticipate the next state without interacting with the environment. With the virtual policy following a Gaussian distribution, the VA is trained to maximize the anticipated novelty of the subsequent state resulting from a virtual action. If no next state resulting from the available actions exhibits high anticipated novelty, training the VA leads to an increase in the virtual policy entropy. Hence, high virtual policy entropy indicates that there is no room for exploration. The proposed VAAC aims to maximize a modified Q function, which combines cumulative rewards and the negative sum of virtual policy entropy. Experimental results show that the VAAC improves the exploration performance compared to existing algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Virtual Try-On: Real-Time Interactive Hybrid Network with High-Fidelity b/data/2024/aaai/Virtual Try-On: Real-Time Interactive Hybrid Network with High-Fidelity new file mode 100644 index 0000000000..72d0527d28 --- /dev/null +++ b/data/2024/aaai/Virtual Try-On: Real-Time Interactive Hybrid Network with High-Fidelity @@ -0,0 +1 @@ +A significant upsurge in the fashion e-commerce industry in recent years has brought considerable attention to image-based virtual fitting. This image-based technology allows users to try on clothes virtually without physically touching them. However, the current techniques have notable limitations in terms of real-world scenarios, noisy results, partial clothing categories and computational cost, thus limiting real-world applications. To address these critical limitations, we propose a hybrid interactive network that allows actual users to interact with the system to try on clothes virtually. The network is composed of state-of-the-art keypoint extraction, appearance flow alteration, and warping modules. The proposed network facilitates real-time application with high-quality, noise-free results, a variety of clothing categories, and efficient computational cost. \ No newline at end of file diff --git a/data/2024/aaai/Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting b/data/2024/aaai/Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting new file mode 100644 index 0000000000..8fbaa87b1c --- /dev/null +++ b/data/2024/aaai/Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting @@ -0,0 +1 @@ +Class-agnostic counting (CAC) aims to count objects of interest from a query image given few exemplars. This task is typically addressed by extracting the features of the query image and exemplars separately and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention.
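To make the "matching inside self-attention" idea concrete, here is a rough sketch of the general mechanism (an illustration only, not CACViT itself): query-image patch tokens and exemplar tokens are concatenated into one sequence, and the image-to-exemplar block of the attention matrix directly serves as the matching score. Shapes and names are assumptions.

import torch
import torch.nn.functional as F

def joint_attention_similarity(img_tokens, ex_tokens, w_q, w_k):
    """img_tokens: (P, D) patch tokens; ex_tokens: (E, D) exemplar tokens."""
    x = torch.cat([img_tokens, ex_tokens], dim=0)          # (P + E, D) joint sequence
    q, k = x @ w_q, x @ w_k                                 # shared projections
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (P + E, P + E) attention
    p = img_tokens.shape[0]
    # The image-to-exemplar block of the attention map acts as the matching score.
    return attn[:p, p:]                                      # (P, E) similarities

# Illustrative shapes: 196 patch tokens, 3 exemplar tokens, dim 384.
sim = joint_attention_similarity(torch.randn(196, 384), torch.randn(3, 384),
                                 torch.randn(384, 384), torch.randn(384, 384))
print(sim.shape)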
We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate for the loss of scale and order-of-magnitude information due to resizing and normalization in plain ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. Code will be available. \ No newline at end of file diff --git a/data/2024/aaai/Vision-Language Models for Robot Success Detection b/data/2024/aaai/Vision-Language Models for Robot Success Detection new file mode 100644 index 0000000000..20bfd19435 --- /dev/null +++ b/data/2024/aaai/Vision-Language Models for Robot Success Detection @@ -0,0 +1 @@ +In this work, we use Vision-Language Models (VLMs) as a binary success detector given a robot observation and task description, formulated as a Visual Question Answering (VQA) problem. We fine-tune the open-source MiniGPT-4 VLM to detect success on robot trajectories from the Berkeley Bridge and Berkeley AUTOLab UR5 datasets. We find that while a handful of test distribution trajectories can train an accurate detector, transferring this learning between different environments is challenging due to distribution shift. In addition, while our VLM is robust to language variations, it is less robust to visual variations. In the future, more powerful VLMs such as Gemini and GPT-4 have the potential to be more accurate and robust success detectors, and success detectors can provide a sparse binary reward to improve existing policies. \ No newline at end of file diff --git a/data/2024/aaai/Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding b/data/2024/aaai/Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding new file mode 100644 index 0000000000..d0f57cb43f --- /dev/null +++ b/data/2024/aaai/Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding @@ -0,0 +1 @@ +In recent years, vision language pre-training frameworks have made significant progress in natural language processing and computer vision, achieving remarkable performance improvements on various downstream tasks. However, when extended to point cloud data, existing works mainly focus on building task-specific models, and fail to extract universal 3D vision-language embeddings that generalize well. We carefully investigate three common tasks in semantic 3D scene understanding, and derive key insights into the development of a pre-training model. Motivated by these observations, we propose a vision-language pre-training framework, 3DVLP (3D vision-language pre-training with object contrastive learning), which transfers flexibly on 3D vision-language downstream tasks. 3DVLP takes visual grounding as the proxy task and introduces an Object-level IoU-guided Detection (OID) loss to obtain high-quality proposals in the scene. Moreover, we design an Object-level Cross-Contrastive alignment (OCC) task and an Object-level Self-Contrastive learning (OSC) task to align the objects with descriptions and distinguish different objects in the scene, respectively.
Extensive experiments verify the excellent performance of 3DVLP on three 3D vision-language tasks, reflecting its superiority in semantic 3D scene understanding. Code is available at https://github.com/iridescentttt/3DVLP. \ No newline at end of file diff --git a/data/2024/aaai/Visual Abstract Reasoning in Computational Imagery b/data/2024/aaai/Visual Abstract Reasoning in Computational Imagery new file mode 100644 index 0000000000..901b1fb7ff --- /dev/null +++ b/data/2024/aaai/Visual Abstract Reasoning in Computational Imagery @@ -0,0 +1 @@ +Despite current AI’s human-like behavior, super efficiency, and unbelievable ability to handle complex games, we still complain that it shows no sign of creativity, originality, or novelty outside its training set, and that it fails to develop new insights into old experience or establish understanding of new experience. In short, it generates content from its training set, but does not invent content. A fundamental reason for this is that current AI is incapable of abstraction and reasoning in an abstract, generalizable, and systematic way. Think, for instance, of what AI systems we can build if we have a base system that can answer this simple question—when two things are the same. Instead of studying these high-level questions, I put my thesis in the context of visual abstract reasoning (VAR), a task widely used in human intelligence tests. A classical example of this task is Raven’s Progressive Matrices (RPM, see Figure 1), a family of intelligence tests that was designed to measure eductive ability, i.e., the ability to make meaning out of confusion and generate high-level, usually nonverbal, schemata which make it easy to handle complexity. A similar concept to eductive ability is fluid intelligence, or the ability to discriminate and perceive complex relationships when no recourse to answers is stored in memory. Whether eductive ability or fluid intelligence, RPM points to the qualities that have been lacking in AI. To explore these qualities in AI, I propose the following research questions. \ No newline at end of file diff --git a/data/2024/aaai/Visual Adversarial Examples Jailbreak Aligned Large Language Models b/data/2024/aaai/Visual Adversarial Examples Jailbreak Aligned Large Language Models new file mode 100644 index 0000000000..388b795a93 --- /dev/null +++ b/data/2024/aaai/Visual Adversarial Examples Jailbreak Aligned Large Language Models @@ -0,0 +1,3 @@ +Warning: this paper contains data, prompts, and model outputs that are offensive in nature. + +Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. 
Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a `few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models. \ No newline at end of file diff --git a/data/2024/aaai/Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning b/data/2024/aaai/Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning new file mode 100644 index 0000000000..df222be3b6 --- /dev/null +++ b/data/2024/aaai/Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning @@ -0,0 +1 @@ +Knowledge-based visual reasoning remains a daunting task since it not only requires machines to interpret the concepts and relationships from visual scenes but also associate them with external world knowledge to conduct a chain of reasoning on open-world questions. Previous works, however, treat visual perception and language-based reasoning as two independent modules, failing to attend to both modules throughout all stages of reasoning. To this end, we propose Visual Chain-of-thought Prompting (VCTP) for knowledge-based reasoning, which involves the interaction between visual content and natural language in an iterative step-by-step reasoning manner. VCTP contains three stages, see, think, and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to attend to key visual concepts from natural language questions adaptively. It then transforms key visual context into text context for prompting with a visual captioning model, and adopts the LLM to generate the answer. The confirm stage further uses the LLM to generate the supporting rationale to the answer, which is then passed through a cross-modality classifier to verify that it’s consistent with the visual context. We iterate through the think-confirm stages to ensure the verified rationale is consistent with the answer. We conduct experiments on a range of knowledge-based visual reasoning datasets. We found our VCTP enjoys several benefits, 1). it achieves better performance than the previous few-shot learning baselines; 2). it enjoys the total transparency and trustworthiness of the whole reasoning process by providing rationales for each reasoning step; 3). it is computation-efficient compared with other fine-tuning baselines. 
Our code is available at https://github.com/UMass-Foundation-Model/VisualCoT.git \ No newline at end of file diff --git a/data/2024/aaai/Visual Hallucination Elevates Speech Recognition b/data/2024/aaai/Visual Hallucination Elevates Speech Recognition new file mode 100644 index 0000000000..b826440a2d --- /dev/null +++ b/data/2024/aaai/Visual Hallucination Elevates Speech Recognition @@ -0,0 +1,16 @@ +Due to the detrimental impact of noise on the conventional audio speech recognition (ASR) task, audio-visual speech recognition~(AVSR) has been proposed by incorporating both audio and visual video signals. Although existing methods have demonstrated that the aligned visual input of lip movements can enhance the robustness of AVSR systems against noise, the paired videos are not always +available during inference, leading to the problem of +the missing visual modality, which restricts their practicality in real-world scenarios. + +To tackle this problem, we propose a Discrete Feature based Visual Generative Model (DFVGM) which exploits semantic correspondences between the audio and visual modalities +during training, generating +visual hallucinations in lieu of +real videos during inference. To achieve that, the +primary challenge is to generate the visual hallucination +given the noisy audio while preserving semantic correspondences with the clean speech. To +tackle this challenge, we +start with training the audio encoder in the Audio-Only (AO) setting, which generates continuous semantic features closely associated with the linguistic information. Simultaneously, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to +discretize the continuous audio and visual feature spaces. The discretization step +allows DFVGM to capture high-level semantic structures that are more resilient to noise and generate +visual hallucinations with high quality. +To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a remarkable 53% relative reduction (30.5%->12.9%) in Word Error Rate (WER) on average compared to the current state-of-the-art Audio-Only (AO) baselines while maintaining comparable results (< 5% difference) under the Audio-Visual (AV) setting even without video as input. \ No newline at end of file diff --git a/data/2024/aaai/Visual Instruction Tuning with Polite Flamingo b/data/2024/aaai/Visual Instruction Tuning with Polite Flamingo new file mode 100644 index 0000000000..5a60cb0ce7 --- /dev/null +++ b/data/2024/aaai/Visual Instruction Tuning with Polite Flamingo @@ -0,0 +1 @@ +Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large Language Models (LLMs) using an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet, during this process, a side effect, which we termed as the "multi-modal alignment tax", surfaces. This side effect negatively impacts the model's ability to format responses appropriately - for instance, its "politeness" - due to the overly succinct and unformatted nature of raw annotations, resulting in reduced human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. 
Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for response rewriting. After rigorous filtering, we generate the PF-1M dataset and further validate its value by fine-tuning a multi-modal LLM with it. Combined with novel methodologies including U-shaped multi-stage tuning and multi-turn augmentation, the resulting model, Clever Flamingo, demonstrates its advantages in both multi-modal understanding and response politeness according to automated and human evaluations. Code and dataset are available at https://github.com/ChenDelong1999/polite-flamingo \ No newline at end of file diff --git a/data/2024/aaai/Visual Language - Let the Product Say What You Want b/data/2024/aaai/Visual Language - Let the Product Say What You Want new file mode 100644 index 0000000000..1dd5090af9 --- /dev/null +++ b/data/2024/aaai/Visual Language - Let the Product Say What You Want @@ -0,0 +1,2 @@ +Visual Language is a multitasking online system focusing on e-commerce, which involves generating accurate product descriptions for sellers and providing a convenient product retrieval service for customers. To achieve this goal, the system adopts image description technology and multi-modal retrieval technology. +By utilizing cross-modal generation techniques, we can help sellers upload products rapidly and customers retrieve them rapidly, which improves the experience of both sellers and customers. \ No newline at end of file diff --git a/data/2024/aaai/Visual Redundancy Removal for Composite Images: A Benchmark Dataset and a Multi-Visual-Effects Driven Incremental Method b/data/2024/aaai/Visual Redundancy Removal for Composite Images: A Benchmark Dataset and a Multi-Visual-Effects Driven Incremental Method new file mode 100644 index 0000000000..1b35ea0251 --- /dev/null +++ b/data/2024/aaai/Visual Redundancy Removal for Composite Images: A Benchmark Dataset and a Multi-Visual-Effects Driven Incremental Method @@ -0,0 +1 @@ +Composite images (CIs) typically combine various elements from different scenes, views, and styles, making them a very important information carrier in the era of mixed media such as virtual reality, mixed reality, and the metaverse. However, the complexity of CI content presents a significant challenge for subsequent visual perception modeling and compression. In addition, the lack of benchmark CI databases also hinders the use of recent advanced data-driven methods. To address these challenges, we first establish one of the earliest visual redundancy prediction (VRP) databases for CIs. Moreover, we propose a multi-visual effect (MVE)-driven incremental learning method that combines the strengths of hand-crafted and data-driven approaches to achieve more accurate VRP modeling. Specifically, we design special incremental rules to learn the visual knowledge flow of MVE. To effectively capture the associated features of MVE, we further develop a three-stage incremental learning approach for VRP based on an encoder-decoder network. Extensive experimental results validate the superiority of the proposed method in terms of subjective, objective, and compression experiments.
\ No newline at end of file diff --git a/data/2024/aaai/Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection b/data/2024/aaai/Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection new file mode 100644 index 0000000000..aab3b2e18d --- /dev/null +++ b/data/2024/aaai/Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection @@ -0,0 +1 @@ +Efficient representation of point clouds is fundamental for LiDAR-based 3D object detection. While recent grid-based detectors often encode point clouds into either voxels or pillars, the distinctions between these approaches remain underexplored. In this paper, we quantify the differences between the current encoding paradigms and highlight the limited vertical learning within them. To tackle these limitations, we propose a hybrid detection framework named Voxel-Pillar Fusion (VPF), which synergistically combines the unique strengths of both voxels and pillars. To be concrete, we first develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively, and then introduce the Sparse Fusion Layer (SFL), facilitating bidirectional interaction between sparse voxel and pillar features. Our computationally efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors. Leveraging this powerful yet straightforward representation, VPF delivers competitive performance, achieving real-time inference speeds on the nuScenes and Waymo Open Dataset. \ No newline at end of file diff --git a/data/2024/aaai/W2P: Switching from Weak Supervision to Partial Supervision for Semantic Segmentation b/data/2024/aaai/W2P: Switching from Weak Supervision to Partial Supervision for Semantic Segmentation new file mode 100644 index 0000000000..19493fcf33 --- /dev/null +++ b/data/2024/aaai/W2P: Switching from Weak Supervision to Partial Supervision for Semantic Segmentation @@ -0,0 +1 @@ +Current weakly-supervised semantic segmentation (WSSS) techniques concentrate on enhancing class activation maps (CAMs) with image-level annotations. Yet, the emphasis on producing these pseudo-labels often overshadows the pivotal role of training the segmentation model itself. This paper underscores the significant influence of noisy pseudo-labels on segmentation network performance, particularly in boundary regions. To address the above issues, we introduce a novel paradigm: Weak to Partial Supervision (W2P). At its core, W2P categorizes the pseudo-labels from WSSS into two unique supervisions: trustworthy clean labels and uncertain noisy labels. Next, our proposed partially-supervised framework adeptly employs these clean labels to rectify the noisy ones, thereby promoting the continuous enhancement of the segmentation model. To further optimize boundary segmentation, we incorporate a noise detection mechanism that specifically preserves boundary regions while eliminating noise. During the noise refinement phase, we adopt a boundary-conscious noise correction technique to extract comprehensive boundaries from noisy areas. Furthermore, we devise a boundary generation approach that assists in predicting intricate boundary zones. Evaluations on the PASCAL VOC 2012 and MS COCO 2014 datasets confirm our method's impressive segmentation capabilities across various pseudo-labels.
\ No newline at end of file diff --git a/data/2024/aaai/Wasserstein Differential Privacy b/data/2024/aaai/Wasserstein Differential Privacy new file mode 100644 index 0000000000..5084a2b290 --- /dev/null +++ b/data/2024/aaai/Wasserstein Differential Privacy @@ -0,0 +1,2 @@ +Differential privacy (DP) has achieved remarkable results in the field of privacy-preserving machine learning. However, existing DP frameworks do not satisfy all the conditions for becoming metrics, which prevents them from deriving better basic private properties and leads to exaggerated values on privacy budgets. We propose Wasserstein differential privacy (WDP), an alternative DP framework to measure the risk of privacy leakage, which satisfies the properties of symmetry and triangle inequality. We show and prove that WDP has 13 excellent properties, which can be theoretical supports for the better performance of WDP than other DP frameworks. +In addition, we derive a general privacy accounting method called Wasserstein accountant, which enables WDP to be applied in stochastic gradient descent (SGD) scenarios containing subsampling. Experiments on basic mechanisms, compositions and deep learning show that the privacy budgets obtained by Wasserstein accountant are relatively stable and less influenced by order. Moreover, the overestimation on privacy budgets can be effectively alleviated. The code is available at https://github.com/Hifipsysta/WDP. \ No newline at end of file diff --git a/data/2024/aaai/Watch Your Head: Assembling Projection Heads to Save the Reliability of Federated Models b/data/2024/aaai/Watch Your Head: Assembling Projection Heads to Save the Reliability of Federated Models new file mode 100644 index 0000000000..838c49291f --- /dev/null +++ b/data/2024/aaai/Watch Your Head: Assembling Projection Heads to Save the Reliability of Federated Models @@ -0,0 +1 @@ +Federated learning encounters substantial challenges with heterogeneous data, leading to performance degradation and convergence issues. While considerable progress has been achieved in mitigating such an impact, the reliability aspect of federated models has been largely disregarded. In this study, we conduct extensive experiments to investigate the reliability of both generic and personalized federated models. Our exploration uncovers a significant finding: federated models exhibit unreliability when faced with heterogeneous data, demonstrating poor calibration on in-distribution test data and low uncertainty levels on out-of-distribution data. This unreliability is primarily attributed to the presence of biased projection heads, which introduce miscalibration into the federated models. Inspired by this observation, we propose the "Assembled Projection Heads" (APH) method for enhancing the reliability of federated models. By treating the existing projection head parameters as priors, APH randomly samples multiple initialized parameters of projection heads from the prior and further performs targeted fine-tuning on locally available data under varying learning rates. Such a head ensemble introduces parameter diversity into the deterministic model, eliminating the bias and producing reliable predictions via head averaging. We evaluate the effectiveness of the proposed APH method across three prominent federated benchmarks. Experimental results validate the efficacy of APH in model calibration and uncertainty estimation. 
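The Wasserstein Differential Privacy abstract above replaces the usual divergence-based privacy loss with a Wasserstein distance between a mechanism's output distributions on adjacent datasets. The sketch below only illustrates the underlying 1-Wasserstein metric for two discrete 1-D output distributions; it is not the paper's WDP definition or its Wasserstein accountant, and the example distributions are made up.

```python
import numpy as np

def wasserstein_1d(p, q, support):
    """1-Wasserstein distance between two pmfs on a common sorted 1-D support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))   # |F_P(x_i) - F_Q(x_i)|
    widths = np.diff(support)                        # spacing between support points
    return float(np.sum(cdf_gap[:-1] * widths))

# Hypothetical output distributions of a private mechanism on adjacent datasets.
support = np.arange(5)
p = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
q = np.array([0.05, 0.15, 0.40, 0.25, 0.15])
print(wasserstein_1d(p, q, support))   # a small distance suggests a small privacy loss
```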
Notably, APH can be seamlessly integrated into various federated approaches but only requires less than 30% additional computation cost for 100x inferences within large models. \ No newline at end of file diff --git a/data/2024/aaai/Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy b/data/2024/aaai/Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy new file mode 100644 index 0000000000..92b8681663 --- /dev/null +++ b/data/2024/aaai/Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy @@ -0,0 +1 @@ +To mitigate potential risks associated with language models (LMs), recent AI detection research proposes incorporating watermarks into machine-generated text through random vocabulary restrictions and utilizing this information for detection. In this paper, we show that watermarking algorithms designed for LMs cannot be seamlessly applied to conditional text generation (CTG) tasks without a notable decline in downstream task performance. To address this issue, we introduce a simple yet effective semantic-aware watermarking algorithm that considers the characteristics of conditional text generation with the input context. Compared to the baseline watermarks, our proposed watermark yields significant improvements in both automatic and human evaluations across various text generation models, including BART and Flan-T5, for CTG tasks such as summarization and data-to-text generation. Meanwhile, it maintains detection ability with higher z-scores but lower AUC scores, suggesting the presence of a detection paradox that poses additional challenges for watermarking CTG. \ No newline at end of file diff --git a/data/2024/aaai/WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting b/data/2024/aaai/WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting new file mode 100644 index 0000000000..27e841dc17 --- /dev/null +++ b/data/2024/aaai/WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting @@ -0,0 +1 @@ +Video inpainting aims to fill in the missing regions of the video frames with plausible content. Benefiting from the outstanding long-range modeling capacity, the transformer-based models have achieved unprecedented performance regarding inpainting quality. Essentially, coherent contents from all the frames along both spatial and temporal dimensions are concerned by a patch-wise attention module, and then the missing contents are generated based on the attention-weighted summation. In this way, attention retrieval accuracy has become the main bottleneck to improve the video inpainting performance, where the factors affecting attention calculation should be explored to maximize the advantages of transformer. Towards this end, in this paper, we theoretically certificate that noise is the culprit that entangles the process of attention calculation. Meanwhile, we propose a novel wavelet transformer network with noise robustness for video inpainting, named WaveFormer. Unlike existing transformer-based methods that utilize the whole embeddings to calculate the attention, our WaveFormer first separates the noise existing in the embedding into high-frequency components by introducing the Discrete Wavelet Transform (DWT), and then adopts clean low-frequency components to calculate the attention. 
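WaveFormer, as described in the abstract above, separates the noisy high-frequency part of each embedding with a Discrete Wavelet Transform and computes attention from the clean low-frequency components. Below is a toy sketch of that idea using a single-level Haar DWT along the feature dimension; the single-head attention and the shapes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def haar_lowpass(x):
    """Single-level Haar DWT along the last axis; keep only approximation coefficients."""
    return (x[..., 0::2] + x[..., 1::2]) / np.sqrt(2.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lowpass_attention(tokens, values):
    """Attention weights from denoised (low-frequency) tokens, applied to the full values."""
    q = k = haar_lowpass(tokens)                       # (N, D/2) clean components
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (N, N) attention map
    return attn @ values                               # aggregate the original values

tokens = np.random.randn(16, 64)    # 16 patch embeddings, 64-dim, with noise
out = lowpass_attention(tokens, tokens)
```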
In this way, the impact of noise on attention computation can be greatly mitigated and the missing content regarding different frequencies can be generated by sharing the calculated attention. Extensive experiments validate the superior performance of our method over state-of-the-art baselines both qualitatively and quantitatively. \ No newline at end of file diff --git a/data/2024/aaai/WaveNet: Tackling Non-stationary Graph Signals via Graph Spectral Wavelets b/data/2024/aaai/WaveNet: Tackling Non-stationary Graph Signals via Graph Spectral Wavelets new file mode 100644 index 0000000000..e5e0c1c8c0 --- /dev/null +++ b/data/2024/aaai/WaveNet: Tackling Non-stationary Graph Signals via Graph Spectral Wavelets @@ -0,0 +1 @@ +In the existing spectral GNNs, polynomial-based methods occupy the mainstream in designing a filter through the Laplacian matrix. However, polynomial combinations factored by the Laplacian matrix naturally have limitations in message passing (e.g., over-smoothing). Furthermore, most existing spectral GNNs are based on polynomial bases, which struggle to capture the high-frequency parts of the graph spectral signal. Additionally, we also find that even increasing the polynomial order does not change this situation, which means polynomial-based models have a natural deficiency when facing high-frequency signals. To tackle these problems, we propose WaveNet, which aims to effectively capture the high-frequency part of the graph spectral signal from the perspective of wavelet bases through reconstructing the message propagation matrix. We utilize Multi-Resolution Analysis (MRA) to model this question, and our proposed method can reconstruct arbitrary filters theoretically. We also conduct node classification experiments on real-world graph benchmarks and achieve superior performance on most datasets. Our code is available at https://github.com/Bufordyang/WaveNet \ No newline at end of file diff --git a/data/2024/aaai/Wavelet Dynamic Selection Network for Inertial Sensor Signal Enhancement b/data/2024/aaai/Wavelet Dynamic Selection Network for Inertial Sensor Signal Enhancement new file mode 100644 index 0000000000..0758606ecb --- /dev/null +++ b/data/2024/aaai/Wavelet Dynamic Selection Network for Inertial Sensor Signal Enhancement @@ -0,0 +1 @@ +As attitude and motion sensing components, inertial sensors are widely used in various portable devices, covering consumer electronics, sports health, aerospace, etc. But the severe intrinsic errors of inertial sensors heavily restrain their function implementation, especially the advanced functionality, including motion trajectory recovery and motion semantic recognition, which attracts considerable attention. As a mainstream signal processing method, wavelet is hailed as the mathematical microscope of signal due to the plentiful and diverse wavelet basis functions. However, complicated noise types and application scenarios of inertial sensors make selecting wavelet basis perplexing. To this end, we propose a wavelet dynamic selection network (WDSNet), which intelligently selects the appropriate wavelet basis for variable inertial signals. In addition, existing deep learning architectures excel at extracting features from input data but neglect to learn the characteristics of target categories, which is essential to enhance the category awareness capability, thereby improving the selection of wavelet basis. 
Therefore, we propose a category representation mechanism (CRM), which enables the network to extract and represent category features without increasing trainable parameters. Furthermore, CRM transforms the common fully connected network into category representations, which provide closer supervision to the feature extractor than the far and trivial one-hot classification labels. We call this process of imposing interpretability on a network and using it to supervise the feature extractor the feature supervision mechanism, and its effectiveness is demonstrated experimentally and theoretically in this paper. The enhanced inertial signal can perform impracticable tasks with regard to the original signal, such as trajectory reconstruction. Both quantitative and visual results show that WDSNet outperforms the existing methods. Remarkably, WDSNet, as a weakly-supervised method, achieves the state-of-the-art performance of all the compared fully-supervised methods. \ No newline at end of file diff --git a/data/2024/aaai/Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations b/data/2024/aaai/Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations new file mode 100644 index 0000000000..147ae2033e --- /dev/null +++ b/data/2024/aaai/Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations @@ -0,0 +1 @@ +Spatiotemporal predictive learning is a paradigm that empowers models to learn spatial and temporal patterns by predicting future frames from past frames in an unsupervised manner. This method typically uses recurrent units to capture long-term dependencies, but these units often come with high computational costs and limited performance in real-world scenes. This paper presents an innovative Wavelet-based SpatioTemporal (WaST) framework, which extracts and adaptively controls both low and high-frequency components at image and feature levels via 3D discrete wavelet transform for faster processing while maintaining high-quality predictions. We propose a Time-Frequency Aware Translator uniquely crafted to efficiently learn short- and long-range spatiotemporal information by individually modeling spatial frequency and temporal variations. Meanwhile, we design a wavelet-domain High-Frequency Focal Loss that effectively supervises high-frequency variations. Extensive experiments across various real-world scenarios, such as driving scene prediction, traffic flow prediction, human motion capture, and weather forecasting, demonstrate that our proposed WaST achieves state-of-the-art performance over various spatiotemporal prediction methods. \ No newline at end of file diff --git a/data/2024/aaai/Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning b/data/2024/aaai/Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning new file mode 100644 index 0000000000..2c40dfbbc5 --- /dev/null +++ b/data/2024/aaai/Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning @@ -0,0 +1 @@ +We propose a generalized method for boosting the generalization ability of pre-trained vision-language models (VLMs) while fine-tuning on downstream few-shot tasks. 
The idea is realized by exploiting out-of-distribution (OOD) detection to predict whether a sample belongs to a base distribution or a novel distribution and then using the score generated by a dedicated competition based scoring function to fuse the zero-shot and few-shot classifier. The fused classifier is dynamic, which will bias towards the zero-shot classifier if a sample is more likely from the distribution pre-trained on, leading to improved base-to-novel generalization ability. Our method is performed only in test stage, which is applicable to boost existing methods without time-consuming re-training. Extensive experiments show that even weak distribution detectors can still improve VLMs' generalization ability. Specifically, with the help of OOD detectors, the harmonic mean of CoOp and ProGrad increase by 2.6 and 1.5 percentage points over 11 recognition datasets in the base-to-novel setting. \ No newline at end of file diff --git a/data/2024/aaai/WeakPCSOD: Overcoming the Bias of Box Annotations for Weakly Supervised Point Cloud Salient Object Detection b/data/2024/aaai/WeakPCSOD: Overcoming the Bias of Box Annotations for Weakly Supervised Point Cloud Salient Object Detection new file mode 100644 index 0000000000..6ce7cf60ac --- /dev/null +++ b/data/2024/aaai/WeakPCSOD: Overcoming the Bias of Box Annotations for Weakly Supervised Point Cloud Salient Object Detection @@ -0,0 +1 @@ +Point cloud salient object detection (PCSOD) is a newly proposed task in 3D dense segmentation. However, the acquisition of accurate 3D dense annotations comes at a high cost, severely limiting the progress of PCSOD. To address this issue, we propose the first weakly supervised PCSOD (named WeakPCSOD) model, which relies solely on cheap 3D bounding box annotations. In WeakPCSOD, we extract noise-free supervision from coarse 3D bounding boxes while mitigating shape biases inherent in box annotations. To achieve this, we introduce a novel mask-to-box (M2B) transformation and a color consistency (CC) loss. The M2B transformation, from a shape perspective, disentangles predictions from labels, enabling the extraction of noiseless supervision from labels while preserving object shapes independently of the box bias. From an appearance perspective, we further introduce the CC loss to provide dense supervision, which mitigates the non-unique predictions stemming from weak supervision and substantially reduces prediction variability. Furthermore, we employ a self-training (ST) strategy to enhance performance by utilizing high-confidence pseudo labels. Notably, the M2B transformation, CC loss, and ST strategy are seamlessly integrated into any model and incur no computational costs for inference. Extensive experiments demonstrate the effectiveness of our WeakPCSOD model, even comparable to fully supervised models utilizing dense annotations. \ No newline at end of file diff --git a/data/2024/aaai/Weakly Supervised Few-Shot Object Detection with DETR b/data/2024/aaai/Weakly Supervised Few-Shot Object Detection with DETR new file mode 100644 index 0000000000..9cad3dd9ff --- /dev/null +++ b/data/2024/aaai/Weakly Supervised Few-Shot Object Detection with DETR @@ -0,0 +1 @@ +In recent years, Few-shot Object Detection (FSOD) has become an increasingly important research topic in computer vision. However, existing FSOD methods require strong annotations including category labels and bounding boxes, and their performance is heavily dependent on the quality of box annotations. 
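The 'Weak Distribution Detectors' abstract above fuses a zero-shot and a few-shot classifier with an out-of-distribution detector's score, so that samples judged to come from the pre-training (base) distribution lean on the zero-shot head. The sketch below is one hedged reading of that fusion; the convex combination and the hand-picked scores are assumptions, not the paper's competition-based scoring function.

```python
import numpy as np

def fuse_logits(zero_shot_logits, few_shot_logits, base_score):
    """Convex fusion: base_score in [0, 1] is the detector's belief that the
    sample comes from the base (pre-training) distribution."""
    return base_score * zero_shot_logits + (1.0 - base_score) * few_shot_logits

# Hypothetical per-class logits from the two classifiers.
zs = np.array([2.0, 0.5, -1.0])   # zero-shot classifier logits
fs = np.array([0.2, 1.8, 0.1])    # few-shot (fine-tuned) classifier logits
for s in (0.9, 0.1):              # likely-base vs. likely-novel sample
    print(s, fuse_logits(zs, fs, s).argmax())   # biases toward the zero-shot head when s is high
```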
Unfortunately, acquiring strong annotations is both expensive and time-consuming. This inspires the study of weakly supervised FSOD (WS-FSOD in short), which realizes FSOD with only image-level annotations, i.e., category labels. In this paper, we propose a new and effective weakly supervised FSOD method named WFS-DETR. By a well-designed pretraining process, WFS-DETR first acquires general object localization and integrity judgment capabilities on large-scale pretraining data. Then, it introduces object integrity into multiple-instance learning to solve the common local optimum problem by comprehensively exploiting both semantic and visual information. Finally, with simple fine-tuning, it transfers the knowledge learned from the base classes to the novel classes, which enables accurate detection of novel objects. Benefiting from this ``pretraining-refinement'' mechanism, WFS-DETR can achieve good generalization on different datasets. Extensive experiments also show that the proposed method clearly outperforms the existing counterparts in the WS-FSOD task. \ No newline at end of file diff --git a/data/2024/aaai/Weakly Supervised Multimodal Affordance Grounding for Egocentric Images b/data/2024/aaai/Weakly Supervised Multimodal Affordance Grounding for Egocentric Images new file mode 100644 index 0000000000..d4210d33e6 --- /dev/null +++ b/data/2024/aaai/Weakly Supervised Multimodal Affordance Grounding for Egocentric Images @@ -0,0 +1,3 @@ +To enhance the interaction between intelligent systems and the environment, locating the affordance regions of objects is crucial. These regions correspond to specific areas that provide distinct functionalities. Humans often acquire the ability to identify these regions through action demonstrations and verbal instructions. In this paper, we present a novel multimodal framework that extracts affordance knowledge from exocentric images, which depict human-object interactions, as well as from accompanying textual descriptions that describe the performed actions. The extracted knowledge is then transferred to egocentric images. +To achieve this goal, we propose the HOI-Transfer Module, which utilizes local perception to disentangle individual actions within exocentric images. This module effectively captures localized features and correlations between actions, leading to valuable affordance knowledge. Additionally, we introduce the Pixel-Text Fusion Module, which fuses affordance knowledge by identifying regions in egocentric images that bear resemblances to the textual features defining affordances. +We employ a Weakly Supervised Multimodal Affordance (WSMA) learning approach, utilizing image-level labels for training. Through extensive experiments, we demonstrate the superiority of our proposed method in terms of evaluation metrics and visual results when compared to existing affordance grounding models. Furthermore, ablation experiments confirm the effectiveness of our approach. Code: https://github.com/xulingjing88/WSMA. \ No newline at end of file diff --git a/data/2024/aaai/Weakly Supervised Semantic Segmentation for Driving Scenes b/data/2024/aaai/Weakly Supervised Semantic Segmentation for Driving Scenes new file mode 100644 index 0000000000..74fd29b07b --- /dev/null +++ b/data/2024/aaai/Weakly Supervised Semantic Segmentation for Driving Scenes @@ -0,0 +1 @@ +State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS) using image-level labels exhibit severe performance degradation on driving scene datasets such as Cityscapes.
To address this challenge, we develop a new WSSS framework tailored to driving scene datasets. Based on extensive analysis of dataset characteristics, we employ Contrastive Language-Image Pre-training (CLIP) as our baseline to obtain pseudo-masks. However, CLIP introduces two key challenges: (1) pseudo-masks from CLIP fall short in representing small object classes, and (2) these masks contain notable noise. We propose solutions for each issue as follows. (1) We devise Global-Local View Training that seamlessly incorporates small-scale patches during model training, thereby enhancing the model's capability to handle small-sized yet critical objects in driving scenes (e.g., traffic lights). (2) We introduce Consistency-Aware Region Balancing (CARB), a novel technique that discerns reliable and noisy regions through evaluating the consistency between CLIP masks and segmentation predictions. It prioritizes reliable pixels over noisy pixels via adaptive loss weighting. Notably, the proposed method achieves 51.8% mIoU on the Cityscapes test dataset, showcasing its potential as a strong WSSS baseline on driving scene datasets. Experimental results on CamVid and WildDash2 demonstrate the effectiveness of our method across diverse datasets, even with small-scale datasets or visually challenging conditions. The code is available at https://github.com/k0u-id/CARB. \ No newline at end of file diff --git a/data/2024/aaai/Weakly-Supervised Mirror Detection via Scribble Annotations b/data/2024/aaai/Weakly-Supervised Mirror Detection via Scribble Annotations new file mode 100644 index 0000000000..1ed96a4b87 --- /dev/null +++ b/data/2024/aaai/Weakly-Supervised Mirror Detection via Scribble Annotations @@ -0,0 +1 @@ +Mirror detection is of great significance for avoiding false recognition of reflected objects in computer vision tasks. Existing mirror detection frameworks usually follow a supervised setting, which relies heavily on high-quality labels and suffers from poor generalization. To resolve this, we instead propose the first weakly-supervised mirror detection framework and also provide the first scribble-based mirror dataset. Specifically, we relabel 10,158 images, most of which have a labeled pixel ratio of less than 0.01 and take only about 8 seconds to label. Considering that mirror regions usually show great scale variation and are often irregular and occluded, leading to incomplete detection or over-detection, we propose a local-global feature enhancement (LGFE) module to fully capture the context and details. Moreover, it is difficult to obtain basic mirror structure using scribble annotation, and the distinction between foreground (mirror) and background (non-mirror) features is not emphasized, owing to mirror reflections. Therefore, we propose a foreground-aware mask attention (FAMA), integrating mirror edges and semantic features to complete mirror regions and suppressing the influence of backgrounds. Finally, to improve the robustness of the network, we propose a prototype contrast loss (PCL) to learn more general foreground features across images. Extensive experiments show that our network outperforms relevant state-of-the-art weakly supervised methods, and even some fully supervised methods. The dataset and codes are available at https://github.com/winter-flow/WSMD.
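Consistency-Aware Region Balancing (CARB), from the driving-scene WSSS abstract above, weights the segmentation loss according to whether each pixel's CLIP pseudo-label agrees with the model's own prediction. The sketch below shows a minimal per-pixel weighting of that kind; the binary agreement test and the fixed down-weighting factor are simplifying assumptions rather than the paper's adaptive scheme.

```python
import numpy as np

def carb_weights(clip_mask, pred_labels, noisy_weight=0.1):
    """Pixels where the CLIP pseudo-label and the model prediction agree are treated
    as reliable (weight 1.0); disagreeing pixels are treated as noisy and down-weighted."""
    reliable = (clip_mask == pred_labels)
    return np.where(reliable, 1.0, noisy_weight)

# Toy 4x4 label maps with 3 classes; multiply the weights into the per-pixel CE loss.
clip_mask = np.random.randint(0, 3, (4, 4))
pred = np.random.randint(0, 3, (4, 4))
weights = carb_weights(clip_mask, pred)
```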
\ No newline at end of file diff --git a/data/2024/aaai/Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature b/data/2024/aaai/Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature new file mode 100644 index 0000000000..af66ad24e5 --- /dev/null +++ b/data/2024/aaai/Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature @@ -0,0 +1 @@ +Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously by taking only video-level labels as the supervision. Pseudo label generation is a promising strategy to solve the challenging problem, but the current methods ignore the natural temporal structure of the video that can provide rich information to assist such a generation process. In this paper, we propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature. First, we design a saliency inference module that exploits the variation relationship between temporal neighbor snippets to discover salient snippet-features, which can reflect the significant dynamic change in the video. Secondly, we introduce a boundary refinement module that enhances salient snippet-features through the information interaction unit. Then, a discrimination enhancement module is introduced to enhance the discriminative nature of snippet-features. Finally, we adopt the refined snippet-features to produce high-fidelity pseudo labels, which could be used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet v1.3, demonstrate our proposed method achieves significant improvements compared to the state-of-the-art methods. Our source code is available at https://github.com/wuli55555/ISSF. \ No newline at end of file diff --git a/data/2024/aaai/WebVLN: Vision-and-Language Navigation on Websites b/data/2024/aaai/WebVLN: Vision-and-Language Navigation on Websites new file mode 100644 index 0000000000..d0cce8a3d1 --- /dev/null +++ b/data/2024/aaai/WebVLN: Vision-and-Language Navigation on Websites @@ -0,0 +1 @@ +Vision-and-Language Navigation (VLN) task aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task that only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content like HTML, which could not be seen on the rendered web pages yet contain rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called Website-aware VLN Network (WebVLN-Net), which is built upon the foundation of state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. 
We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN. \ No newline at end of file diff --git a/data/2024/aaai/WeditGAN: Few-Shot Image Generation via Latent Space Relocation b/data/2024/aaai/WeditGAN: Few-Shot Image Generation via Latent Space Relocation new file mode 100644 index 0000000000..0645af4b64 --- /dev/null +++ b/data/2024/aaai/WeditGAN: Few-Shot Image Generation via Latent Space Relocation @@ -0,0 +1 @@ +In few-shot image generation, directly training GAN models on just a handful of images faces the risk of overfitting. A popular solution is to transfer the models pretrained on large source domains to small target ones. In this work, we introduce WeditGAN, which realizes model transfer by editing the intermediate latent codes w in StyleGANs with learned constant offsets (delta w), discovering and constructing target latent spaces via simply relocating the distribution of source latent spaces. The established one-to-one mapping between latent spaces naturally prevents mode collapse and overfitting. Besides, we also propose variants of WeditGAN to further enhance the relocation process by regularizing the direction or finetuning the intensity of delta w. Experiments on a collection of widely used source/target datasets demonstrate the capability of WeditGAN in generating realistic and diverse images, which is simple yet highly effective in the research area of few-shot image generation. Codes are available at https://github.com/Ldhlwh/WeditGAN. \ No newline at end of file diff --git a/data/2024/aaai/Weisfeiler and Lehman Go Paths: Learning Topological Features via Path Complexes b/data/2024/aaai/Weisfeiler and Lehman Go Paths: Learning Topological Features via Path Complexes new file mode 100644 index 0000000000..2652a73b85 --- /dev/null +++ b/data/2024/aaai/Weisfeiler and Lehman Go Paths: Learning Topological Features via Path Complexes @@ -0,0 +1 @@ +Graph Neural Networks (GNNs), despite achieving remarkable performance across different tasks, are theoretically bounded by the 1-Weisfeiler-Lehman test, resulting in limitations in terms of graph expressivity. Even though prior works on topological higher-order GNNs overcome that boundary, these models often depend on assumptions about sub-structures of graphs. Specifically, topological GNNs leverage the prevalence of cliques, cycles, and rings to enhance the message-passing procedure. Our study presents a novel perspective by focusing on simple paths within graphs during the topological message-passing process, thus liberating the model from restrictive inductive biases. We prove that by lifting graphs to path complexes, our model can generalize the existing works on topology while inheriting several theoretical results on simplicial complexes and regular cell complexes. Without making prior assumptions about graph sub-structures, our method outperforms earlier works in other topological domains and achieves state-of-the-art results on various benchmarks.
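WeditGAN, per its abstract above, transfers a StyleGAN generator to a small target domain by adding a learned constant offset delta w to the intermediate latent codes w, relocating the whole source latent distribution. The few lines below sketch that relocation on stand-in latents; StyleGAN itself and the training of the offset are omitted, and the dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w_source = rng.normal(size=(1000, 512))       # stand-in for StyleGAN w codes
delta_w = rng.normal(scale=0.5, size=(512,))  # learned constant offset (fixed here)

w_target = w_source + delta_w                 # relocate the whole latent distribution

# The mapping is one-to-one, so the spread of the codes is preserved exactly,
# which is the intuition behind avoiding mode collapse in the abstract:
print(np.allclose(w_source.std(axis=0), w_target.std(axis=0)))  # True
```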
\ No newline at end of file diff --git a/data/2024/aaai/Welfare Maximization in Perpetual Voting (Student Abstract) b/data/2024/aaai/Welfare Maximization in Perpetual Voting (Student Abstract) new file mode 100644 index 0000000000..f9ae7daa29 --- /dev/null +++ b/data/2024/aaai/Welfare Maximization in Perpetual Voting (Student Abstract) @@ -0,0 +1,2 @@ +We study the computational problems associated with maximizing various welfare objectives—namely utilitarian welfare, egalitarian welfare, and Nash welfare—in perpetual voting, a sequential collective decision-making framework. Prior work looks into notions of fairness over time and studies extensions of single-round voting rules to the multi-round setting. +We show that while a utilitarian-welfare maximizing outcome can be computed efficiently, an outcome that maximizes egalitarian or Nash welfare is computationally intractable, even in the case of two candidates. We complement this by showing that maximizing egalitarian welfare is fixed-parameter tractable in the number of agents, and maximizing egalitarian or Nash welfare is W[2]-hard and slicewise polynomial in the number of timesteps. We also provide an approximation algorithm for maximizing egalitarian welfare and study strategyproofness with respect to these welfare objectives. Finally, we show that a simple greedy algorithm can achieve approximate proportionality in this setting. \ No newline at end of file diff --git a/data/2024/aaai/Well, Now We Know! Unveiling Sarcasm: Initiating and Exploring Multimodal Conversations with Reasoning b/data/2024/aaai/Well, Now We Know! Unveiling Sarcasm: Initiating and Exploring Multimodal Conversations with Reasoning new file mode 100644 index 0000000000..c62f65157f --- /dev/null +++ b/data/2024/aaai/Well, Now We Know! Unveiling Sarcasm: Initiating and Exploring Multimodal Conversations with Reasoning @@ -0,0 +1,3 @@ +Sarcasm is a widespread linguistic phenomenon that poses a considerable challenge to explain due to its subjective nature, absence of contextual cues, and rooted personal +perspectives. Even though the identification of sarcasm has been extensively studied in dialogue analysis, merely detecting sarcasm falls short of enabling conversational systems to genuinely comprehend the underlying meaning of a conversation and generate fitting responses. It is imperative to not only detect sarcasm but also pinpoint its origination and the rationale behind the sarcastic expressions to capture its authentic essence. In this paper, we delve into the discourse structure of conversations infused with sarcasm and introduce a novel task - Sarcasm Initiation and Reasoning in Conversations (SIRC). Embedded in a multimodal environment and +involving a combination of both English and code-mixed interactions, the objective of the task is to discern the trigger or starting point of sarcasm. Additionally, the task involves producing a natural language explanation that rationalizes the satirical dialogues. To this end, we introduce the Sarcasm Initiation and Reasoning Dataset (SIRD) to facilitate our task and provide sarcasm initiation annotations and reasoning. We develop a comprehensive model named Sarcasm Initiation and Reasoning Generation (SIRG), which is designed to encompass textual, audio, and visual representations. To achieve this, we introduce a unique shared fusion method that employs cross-attention mechanisms to seamlessly integrate these diverse modalities.
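For the perpetual-voting abstract above, a utilitarian-welfare maximizing outcome can be computed round by round: with approval (0/1) utilities, picking a candidate approved by the most agents in each timestep maximizes the total welfare. A small sketch under that approval-utility assumption follows; the data and tie-breaking are arbitrary.

```python
def utilitarian_outcome(approvals_per_round):
    """approvals_per_round: list of dicts mapping candidate -> set of approving agents.
    Choosing the most-approved candidate each round maximizes total (utilitarian) welfare
    when each agent gets utility 1 per round in which an approved candidate wins."""
    outcome, welfare = [], 0
    for approvals in approvals_per_round:
        winner = max(approvals, key=lambda c: len(approvals[c]))
        outcome.append(winner)
        welfare += len(approvals[winner])
    return outcome, welfare

rounds = [
    {"a": {1, 2, 3}, "b": {4}},
    {"a": {1}, "b": {2, 3, 4}},
]
print(utilitarian_outcome(rounds))   # (['a', 'b'], 6)
```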
Our experimental outcomes, conducted on the SIRC dataset, demonstrate that our proposed framework establishes a new benchmark for both sarcasm initiation and its reasoning generation in the context of multimodal conversations. The code and dataset can be accessed from https://www.iitp.ac.in/~ai-nlp-ml/resources.html#sarcasm-explain and https://github.com/GussailRaat/SIRG-Sarcasm-Initiation-and-Reasoning-Generation. \ No newline at end of file diff --git a/data/2024/aaai/Well-Written Knowledge Graphs: Most Effective RDF Syntaxes for Triple Linearization in End-to-End Extraction of Relations from Texts (Student Abstract) b/data/2024/aaai/Well-Written Knowledge Graphs: Most Effective RDF Syntaxes for Triple Linearization in End-to-End Extraction of Relations from Texts (Student Abstract) new file mode 100644 index 0000000000..d1ebbdbfc6 --- /dev/null +++ b/data/2024/aaai/Well-Written Knowledge Graphs: Most Effective RDF Syntaxes for Triple Linearization in End-to-End Extraction of Relations from Texts (Student Abstract) @@ -0,0 +1 @@ +Seq-to-seq generative models have recently gained attention for solving the relation extraction task. By approaching this problem as an end-to-end task, they surpassed encoder-based-only models. Little research has investigated the effects of the output syntaxes on the training process of these models. Moreover, a limited number of approaches were proposed for extracting ready-to-load knowledge graphs following the RDF standard. In this paper, we consider that a set of triples can be linearized in many different ways, and we evaluate the combined effect of the size of the language models and different RDF syntaxes on the task of relation extraction from Wikipedia abstracts. \ No newline at end of file diff --git a/data/2024/aaai/What Are the Rules? Discovering Constraints from Data b/data/2024/aaai/What Are the Rules? Discovering Constraints from Data new file mode 100644 index 0000000000..0baa3730f6 --- /dev/null +++ b/data/2024/aaai/What Are the Rules? Discovering Constraints from Data @@ -0,0 +1,2 @@ +Constraint programming and AI planning are powerful tools for solving assignment, optimization, and scheduling problems. They require, however, the rarely available combination of domain knowledge and mathematical modeling expertise. Learning constraints from exemplary solutions can close this gap and alleviate the effort of modeling. Existing approaches either require extensive user interaction, need exemplary invalid solutions that must be generated by experts at great expense, or show high noise-sensitivity. +We aim to find constraints from potentially noisy solutions, without the need for user interaction. To this end, we formalize the problem in terms of the Minimum Description Length (MDL) principle, by which we select the model with the best lossless compression of the data. Solving the problem involves model counting, which is #P-hard to approximate. We therefore propose the greedy URPILS algorithm to find high-quality constraints in practice. Extensive experiments on constraint programming and AI planning benchmark data show URPILS not only finds more accurate and succinct constraints, but also is more robust to noise, and has lower sample complexity than the state of the art. \ No newline at end of file diff --git a/data/2024/aaai/What Do Hebbian Learners Learn? Reduction Axioms for Iterated Hebbian Learning b/data/2024/aaai/What Do Hebbian Learners Learn? 
Reduction Axioms for Iterated Hebbian Learning new file mode 100644 index 0000000000..2185eff572 --- /dev/null +++ b/data/2024/aaai/What Do Hebbian Learners Learn? Reduction Axioms for Iterated Hebbian Learning @@ -0,0 +1 @@ +This paper is a contribution to neural network semantics, a foundational framework for neuro-symbolic AI. The key insight of this theory is that logical operators can be mapped to operators on neural network states. In this paper, we do this for a neural network learning operator. We map a dynamic operator [φ] to iterated Hebbian learning, a simple learning policy that updates a neural network by repeatedly applying Hebb's learning rule until the net reaches a fixed-point. Our main result is that we can "translate away" [φ]-formulas via reduction axioms. This means that completeness for the logic of iterated Hebbian learning follows from completeness of the base logic. These reduction axioms also provide (1) a human-interpretable description of iterated Hebbian learning as a kind of plausibility upgrade, and (2) an approach to building neural networks with guarantees on what they can learn. \ No newline at end of file diff --git a/data/2024/aaai/What Does a Query Answer Tell You? Informativeness of Query Answers for Knowledge Bases b/data/2024/aaai/What Does a Query Answer Tell You? Informativeness of Query Answers for Knowledge Bases new file mode 100644 index 0000000000..4458102fb7 --- /dev/null +++ b/data/2024/aaai/What Does a Query Answer Tell You? Informativeness of Query Answers for Knowledge Bases @@ -0,0 +1 @@ +Query answering for Knowledge Bases (KBs) amounts to extracting information from the various models of a KB, and presenting the user with an object that represents such information. In the vast majority of cases, this object consists of those tuples of constants that satisfy the query expression either in every model (certain answers) or in some model (possible answers). However, similarly to the case of incomplete databases, both these forms of answers are a lossy representation of all the knowledge inferable from the query and the queried KB. In this paper, we illustrate a formal framework to characterize the information that query answers for KBs are able to represent. As a first application of the framework, we study the informativeness of current query answering approaches, including the recently introduced partial answers. We then define a novel notion of answers, allowing repetition of variables across answer tuples. We show that these answers are capable of representing a meaningful form of information, and we also study their data complexity properties. \ No newline at end of file diff --git a/data/2024/aaai/What Effects the Generalization in Visual Reinforcement Learning: Policy Consistency with Truncated Return Prediction b/data/2024/aaai/What Effects the Generalization in Visual Reinforcement Learning: Policy Consistency with Truncated Return Prediction new file mode 100644 index 0000000000..34f373f1d5 --- /dev/null +++ b/data/2024/aaai/What Effects the Generalization in Visual Reinforcement Learning: Policy Consistency with Truncated Return Prediction @@ -0,0 +1 @@ +In visual Reinforcement Learning (RL), the challenge of generalization to new environments is paramount. This study pioneers a theoretical analysis of visual RL generalization, establishing an upper bound on the generalization objective, encompassing policy divergence and Bellman error components. 
Motivated by this analysis, we propose maintaining the cross-domain consistency for each policy in the policy space, which can reduce the divergence of the learned policy during the test. In practice, we introduce the Truncated Return Prediction (TRP) task, promoting cross-domain policy consistency by predicting truncated returns of historical trajectories. Moreover, we also propose a Transformer-based predictor for this auxiliary task. Extensive experiments on DeepMind Control Suite and Robotic Manipulation tasks demonstrate that TRP achieves state-of-the-art generalization performance. We further demonstrate that TRP outperforms previous methods in terms of sample efficiency during training. \ No newline at end of file diff --git a/data/2024/aaai/What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception b/data/2024/aaai/What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception new file mode 100644 index 0000000000..b71d2cb467 --- /dev/null +++ b/data/2024/aaai/What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception @@ -0,0 +1 @@ +Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources. This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative view (i.e., post-collaboration feature) and its underlying relationship to individual views (i.e., pre-collaboration features), which were treated as an opaque procedure by most existing works. We propose a novel framework named CMiMC (Contrastive Mutual Information Maximization for Collaborative Perception) for intermediate collaboration. The core philosophy of CMiMC is to preserve discriminative information of individual views in the collaborative view by maximizing mutual information between pre- and post-collaboration features while enhancing the efficacy of collaborative views by minimizing the loss function of downstream tasks. In particular, we define multi-view mutual information (MVMI) for intermediate collaboration that evaluates correlations between collaborative views and individual views on both global and local scales. We establish CMiMNet based on multi-view contrastive learning to realize estimation and maximization of MVMI, which assists the training of a collaborative encoder for voxel-level feature fusion. We evaluate CMiMC on V2X-Sim 1.0, and it improves the SOTA average precision by 3.08% and 4.44% at 0.5 and 0.7 IoU (Intersection-over-Union) thresholds, respectively. In addition, CMiMC can reduce communication volume to 1/32 while achieving performance comparable to SOTA. Code and Appendix are released at https://github.com/77SWF/CMiMC. \ No newline at end of file diff --git a/data/2024/aaai/What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation b/data/2024/aaai/What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation new file mode 100644 index 0000000000..5c824614d5 --- /dev/null +++ b/data/2024/aaai/What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation @@ -0,0 +1,2 @@ +Quantization has emerged as a promising technique for improving the memory and computational efficiency of large language models (LLMs). 
Though the trade-off between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this approach ``the lens of perturbation". Using this lens, we conduct experiments with various artificial perturbations to explore their impact on LLM performance. Our findings reveal several connections between the properties of perturbations and LLM performance, providing insights into the failure cases of uniform quantization and suggesting potential solutions to improve the robustness of LLM quantization. +To demonstrate the significance of our findings, we implement a simple non-uniform quantization approach based on our insights. Our experiments show that this approach achieves minimal performance degradation on both 4-bit weight quantization and 8-bit quantization for weights and activations. These results validate the correctness of our approach and highlight its potential to improve the efficiency of LLMs without sacrificing performance. \ No newline at end of file diff --git a/data/2024/aaai/What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection b/data/2024/aaai/What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection new file mode 100644 index 0000000000..b6e895b919 --- /dev/null +++ b/data/2024/aaai/What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection @@ -0,0 +1 @@ +The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the emergent effective approaches is continual learning. In this paper, we propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection. The fundamental concept underlying RWM involves categorizing all classes into two groups: those with compact feature distributions across tasks, such as genuine audio, and those with more spread-out distributions, like various types of fake audio. These distinctions are quantified by means of the in-class cosine distance, which subsequently serves as the basis for RWM to introduce a trainable gradient modification direction for distinct data types. Experimental evaluations against mainstream continual learning methods reveal the superiority of RWM in terms of knowledge acquisition and mitigating forgetting in audio deepfake detection. Furthermore, RWM's applicability extends beyond audio deepfake detection, demonstrating its potential significance in diverse machine learning domains such as image recognition. 
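As a rough illustration of the in-class cosine-distance statistic described in the audio-deepfake (RWM) abstract above, the sketch below computes the average pairwise cosine distance among one class's embeddings and uses it to split classes into compact and spread-out groups. The feature source, the threshold, and the grouping rule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def mean_in_class_cosine_distance(features: torch.Tensor) -> float:
    """Average pairwise cosine distance among embeddings of one class.

    features: (N, D) tensor of embeddings sharing the same label. A small value
    suggests a compact feature distribution; a large value a spread-out one.
    """
    n = features.size(0)
    if n < 2:
        return 0.0
    z = F.normalize(features, dim=-1)            # unit-norm embeddings
    sims = z @ z.T                               # (N, N) pairwise cosine similarities
    off_diag = sims.sum() - sims.diag().sum()    # drop self-similarities
    mean_sim = off_diag / (n * (n - 1))
    return float(1.0 - mean_sim)                 # cosine distance = 1 - similarity

def split_classes(class_features: dict, threshold: float = 0.3):
    """Hypothetical grouping rule: compact classes (e.g., genuine audio) vs.
    spread-out classes (e.g., fake audio); each group would then receive its own
    gradient modification direction during continual updates."""
    compact, spread = [], []
    for label, feats in class_features.items():
        (compact if mean_in_class_cosine_distance(feats) < threshold else spread).append(label)
    return compact, spread
```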
\ No newline at end of file diff --git a/data/2024/aaai/When CEGAR Meets Regression: A Love Story in Optimal Classical Planning b/data/2024/aaai/When CEGAR Meets Regression: A Love Story in Optimal Classical Planning new file mode 100644 index 0000000000..bae8e99bc5 --- /dev/null +++ b/data/2024/aaai/When CEGAR Meets Regression: A Love Story in Optimal Classical Planning @@ -0,0 +1,3 @@ +Counterexample-Guided Abstraction Refinement (CEGAR) is a prominent technique to generate Cartesian abstractions for guiding search in cost-optimal planning. The core idea is to iteratively refine the abstraction by finding a flaw in the current optimal abstract plan. All existing approaches find these flaws by executing the abstract plan using progression in the original state space. + +Instead, we propose to do backward refinements by using regression from the goals. This results in a new type of flaw that can identify invalid plan suffixes. The resulting abstractions are less focused on the initial state, but more informative on average, significantly improving the performance of current CEGAR-based techniques. Furthermore, they can be combined with forward refinements in several bidirectional strategies that provide the benefits of both methods. \ No newline at end of file diff --git a/data/2024/aaai/When Causal Inference Meets Graph Machine Learning b/data/2024/aaai/When Causal Inference Meets Graph Machine Learning new file mode 100644 index 0000000000..4abfd51af2 --- /dev/null +++ b/data/2024/aaai/When Causal Inference Meets Graph Machine Learning @@ -0,0 +1 @@ +Graphs (i.e., networks) are ubiquitous in daily life, as they can effectively model a plethora of real-world systems with connected units, such as social networks and biological networks. Recent years have witnessed rapid development in graph-based machine learning (GML) in various high-impact domains. Currently, the mainstream GML methods are based on statistical learning, e.g., utilizing the statistical correlations between node features, graph structure, and labels for node classification. However, statistical learning has been widely criticized for only capturing the superficial relations between variables in the data system, which consequently leads to a lack of trustworthiness in real-world applications. Therefore, it is crucial to understand the causality in the data system and the learning process. Causal inference is the discipline that investigates the causality inside a system, for example, to identify and estimate the causal effect of a certain treatment (e.g., wearing a face mask) on an important outcome (e.g., COVID-19 infection). Involving the concepts and philosophy of causal inference in ML methods is often considered significant for human-level intelligence and can serve as the foundation of artificial intelligence (AI). However, most traditional causal inference studies rely on strong assumptions, and focus on independent and identically distributed (i.i.d.) data, while causal inference on graphs is faced with many barriers. Therefore, we aim to bridge the gap between causal inference and GML. \ No newline at end of file diff --git a/data/2024/aaai/When Do Program-of-Thought Works for Reasoning? b/data/2024/aaai/When Do Program-of-Thought Works for Reasoning? 
@@ -0,0 +1 @@ +As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and what directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses. \ No newline at end of file diff --git a/data/2024/aaai/When Model Meets New Normals: Test-Time Adaptation for Unsupervised Time-Series Anomaly Detection b/data/2024/aaai/When Model Meets New Normals: Test-Time Adaptation for Unsupervised Time-Series Anomaly Detection new file mode 100644 index 0000000000..e4db1c2870 --- /dev/null +++ b/data/2024/aaai/When Model Meets New Normals: Test-Time Adaptation for Unsupervised Time-Series Anomaly Detection @@ -0,0 +1 @@ +Time-series anomaly detection deals with the problem of detecting anomalous timesteps by learning normality from the sequence of observations. However, the concept of normality evolves over time, leading to a "new normal problem", where the distribution of normality can be changed due to the distribution shifts between training and test data. This paper highlights the prevalence of the new normal problem in unsupervised time-series anomaly detection studies. To tackle this issue, we propose a simple yet effective test-time adaptation strategy based on trend estimation and a self-supervised approach to learning new normalities during inference. Extensive experiments on real-world benchmarks demonstrate that incorporating the proposed strategy into the anomaly detector consistently improves the model's performances compared to the existing baselines, leading to robustness to the distribution shifts. \ No newline at end of file diff --git a/data/2024/aaai/When Sparse Graph Representation Learning Falls into Domain Shift: Data Augmentation for Cross-Domain Graph Meta-Learning (Student Abstract) b/data/2024/aaai/When Sparse Graph Representation Learning Falls into Domain Shift: Data Augmentation for Cross-Domain Graph Meta-Learning (Student Abstract) new file mode 100644 index 0000000000..70bd2d0a5c --- /dev/null +++ b/data/2024/aaai/When Sparse Graph Representation Learning Falls into Domain Shift: Data Augmentation for Cross-Domain Graph Meta-Learning (Student Abstract) @@ -0,0 +1 @@ +Cross-domain Graph Meta-learning (CGML) has shown its promise, where meta-knowledge is extracted from few-shot graph data in multiple relevant but distinct domains. 
However, several recent efforts assume that target data are available, which is commonly not the case in practice. In this paper, we devise a novel Cross-domain Data Augmentation for Graph Meta-Learning (CDA-GML), which incorporates the advantages of CGML and Data Augmentation and simultaneously addresses the intractable shortcomings of label sparsity, domain shift, and the absence of target data. Specifically, our method simulates instance-level and task-level domain shift to alleviate the cross-domain generalization issue in conventional graph meta-learning. Experiments show that our method outperforms the existing state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/When Your AI Becomes a Target: AI Security Incidents and Best Practices b/data/2024/aaai/When Your AI Becomes a Target: AI Security Incidents and Best Practices new file mode 100644 index 0000000000..2c4c4c3c82 --- /dev/null +++ b/data/2024/aaai/When Your AI Becomes a Target: AI Security Incidents and Best Practices @@ -0,0 +1,3 @@ +In contrast to vast academic efforts to study AI security, few real-world reports of AI security incidents exist. Released incidents prevent a thorough investigation of the attackers' motives, as crucial information about the company and AI application is missing. As a consequence, it often remains unknown how to avoid incidents. +We tackle this gap and combine previous reports with freshly collected incidents into a small database of 32 AI security incidents. We analyze the attackers' target and goal, influencing factors, causes, and mitigations. Many incidents stem from non-compliance with best practices in security and privacy-enhancing technologies. +In the case of direct AI attacks, access control may provide some mitigation, but there is little scientific work on best practices. Our paper is thus a call for action to address these gaps. \ No newline at end of file diff --git a/data/2024/aaai/When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks b/data/2024/aaai/When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks new file mode 100644 index 0000000000..64b4f17e2d --- /dev/null +++ b/data/2024/aaai/When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks @@ -0,0 +1 @@ +Neural growth is the process of growing a small neural network to a large network and has been utilized to accelerate the training of deep neural networks. One crucial aspect of neural growth is determining the optimal growth timing. However, few studies investigate this systematically. Our study reveals that neural growth inherently exhibits a regularization effect, whose intensity is influenced by the chosen policy for growth timing. While this regularization effect may mitigate the overfitting risk of the model, it may lead to a notable accuracy drop when the model underfits. Yet, current approaches have not addressed this issue due to their lack of consideration of the regularization effect from neural growth. Motivated by these findings, we propose an under-/overfitting risk-aware growth timing policy, which automatically adjusts the growth timing informed by the level of potential under/overfitting risks to address both risks. Comprehensive experiments conducted using CIFAR-10/100 and ImageNet datasets show that the proposed policy achieves accuracy improvements of up to 1.3% in models prone to underfitting while achieving similar accuracies in models suffering from overfitting compared to the existing methods. 
\ No newline at end of file diff --git a/data/2024/aaai/When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming b/data/2024/aaai/When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming new file mode 100644 index 0000000000..7955ad4c5e --- /dev/null +++ b/data/2024/aaai/When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming @@ -0,0 +1 @@ +AI powered code-recommendation systems, such as Copilot and CodeWhisperer, provide code suggestions inside a programmer's environment (e.g., an IDE) with the aim of improving productivity. We pursue mechanisms for leveraging signals about programmers' acceptance and rejection of code suggestions to guide recommendations. We harness data drawn from interactions with GitHub Copilot, a system used by millions of programmers, to develop interventions that can save time for programmers. We introduce a utility-theoretic framework to drive decisions about suggestions to display versus withhold. The approach, conditional suggestion display from human feedback (CDHF), relies on a cascade of models that provide the likelihood that recommended code will be accepted. These likelihoods are used to selectively hide suggestions, reducing both latency and programmer verification time. Using data from 535 programmers, we perform a retrospective evaluation of CDHF and show that we can avoid displaying a significant fraction of suggestions that would have been rejected. We further demonstrate the importance of incorporating the programmer's latent unobserved state in decisions about when to display suggestions through an ablation study. Finally, we showcase how using suggestion acceptance as a reward signal for guiding the display of suggestions can lead to suggestions of reduced quality, indicating an unexpected pitfall. \ No newline at end of file diff --git a/data/2024/aaai/Where and How to Attack? A Causality-Inspired Recipe for Generating Counterfactual Adversarial Examples b/data/2024/aaai/Where and How to Attack? A Causality-Inspired Recipe for Generating Counterfactual Adversarial Examples new file mode 100644 index 0000000000..a156c4dc78 --- /dev/null +++ b/data/2024/aaai/Where and How to Attack? A Causality-Inspired Recipe for Generating Counterfactual Adversarial Examples @@ -0,0 +1 @@ +Deep neural networks (DNNs) have been demonstrated to be vulnerable to well-crafted adversarial examples, which are generated through either well-conceived L_p-norm restricted or unrestricted attacks. Nevertheless, the majority of those approaches assume that adversaries can modify any features as they wish, and neglect the causal generating process of the data, which is unreasonable and unpractical. For instance, a modification in income would inevitably impact features like the debt-to-income ratio within a banking system. By considering the underappreciated causal generating process, first, we pinpoint the source of the vulnerability of DNNs via the lens of causality, then give theoretical results to answer where to attack. Second, considering the consequences of the attack interventions on the current state of the examples to generate more realistic adversarial examples, we propose CADE, a framework that can generate Counterfactual ADversarial Examples to answer how to attack. The empirical results demonstrate CADE's effectiveness, as evidenced by its competitive performance across diverse attack scenarios, including white-box, transfer-based, and random intervention attacks. 
\ No newline at end of file diff --git a/data/2024/aaai/Which Is More Effective in Label Noise Cleaning, Correction or Filtering? b/data/2024/aaai/Which Is More Effective in Label Noise Cleaning, Correction or Filtering? new file mode 100644 index 0000000000..ecb690fda1 --- /dev/null +++ b/data/2024/aaai/Which Is More Effective in Label Noise Cleaning, Correction or Filtering? @@ -0,0 +1 @@ +Most noise cleaning methods adopt either the correction or the filtering mode to build robust models. However, their effectiveness, applicability, and hyper-parameter insensitivity have not been carefully studied. We compare the two cleaning modes via a rebuilt error bound in noisy environments. At the dataset level, Theorem 5 implies that correction is more effective than filtering when the cleaned datasets have close noise rates. At the sample level, Theorem 6 indicates that confident label noises (large noise probabilities) are more suitable to be corrected, and unconfident noises (medium noise probabilities) should be filtered. Besides, an imperfect hyper-parameter may have fewer negative impacts on filtering than on correction. Unlike existing methods with a single cleaning mode, the proposed Fusion cleaning framework of Correction and Filtering (FCF) combines the advantages of different modes to deal with diverse suspicious labels. Experimental results demonstrate that our FCF method can achieve state-of-the-art performance on benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Who Knows the Answer? Finding the Best Model and Prompt for Each Query Using Confidence-Based Search b/data/2024/aaai/Who Knows the Answer? Finding the Best Model and Prompt for Each Query Using Confidence-Based Search new file mode 100644 index 0000000000..1c0a552e4c --- /dev/null +++ b/data/2024/aaai/Who Knows the Answer? Finding the Best Model and Prompt for Each Query Using Confidence-Based Search @@ -0,0 +1 @@ +There are increasingly many large language models (LLMs) available to the public. While these LLMs have exhibited impressive abilities on a variety of tasks, any individual LLM in particular may do well on some tasks and worse on others. Additionally, the performance of these models is heavily dependent on the choice of prompt template used. For instance, they exhibit sensitivity to the few-shot examples chosen or brittleness to the wording of instructions. Moreover, a prompt template that makes a model perform well for one input may not be the optimal template for another input. This necessitates an approach for adaptively selecting LLM and prompt template pairs for each input. Recent work has shown that the accuracy of an LLM's responses is correlated with the LLM's confidence in the response. Thus, a natural choice for selecting which model and prompt template to use is to select the pair that is most confident in its response. However, existing confidence metrics are expensive to calculate, necessitating multiple calls to each LLM and prompt pair. We thus propose an approach to predict the confidence of each pair using an auxiliary regression model that is inexpensive to run. Using this auxiliary model, we select the LLM and prompt template with the highest predicted confidence for a given input. Results on a range of benchmark datasets show that our confidence-based instance-level prompt search method consistently improves the performance of LLMs. 
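The selection rule described in the "Who Knows the Answer?" abstract above reduces to an argmax over predicted confidences; a minimal sketch follows, assuming a cheap featurizer and an sklearn-style auxiliary regressor (`featurize`, `confidence_regressor`, and `call_llm` are placeholders, not the paper's code).

```python
from itertools import product

def select_pair(query, llms, templates, confidence_regressor, featurize):
    """Pick the (LLM, prompt template) pair whose predicted confidence is highest."""
    candidates = list(product(llms, templates))
    feats = [featurize(query, model, template) for model, template in candidates]  # no LLM calls
    scores = confidence_regressor.predict(feats)  # inexpensive auxiliary model, one batch
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]

# Usage sketch: only the selected pair triggers an actual LLM call.
# model, template = select_pair(q, ["llm-a", "llm-b"], [t1, t2], regressor, featurize)
# answer = call_llm(model, template.format(query=q))   # hypothetical call
```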
\ No newline at end of file diff --git a/data/2024/aaai/WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia b/data/2024/aaai/WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia new file mode 100644 index 0000000000..a1645c6b2a --- /dev/null +++ b/data/2024/aaai/WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia @@ -0,0 +1 @@ +Wikipedia can be edited by anyone and thus contains various quality sentences. Therefore, Wikipedia includes some poor-quality edits, which are often marked up by other editors. While editors' reviews enhance the credibility of Wikipedia, it is hard to check all edited text. Assisting in this process is very important, but a large and comprehensive dataset for studying it does not currently exist. Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia. Each sentence is extracted from the entire revision history of English Wikipedia, and the target quality labels were carefully investigated and selected. WikiSQE has about 3.4 M sentences with 153 quality labels. In the experiment with automatic classification using competitive machine learning models, sentences that had problems with citation, syntax/semantics, or propositions were found to be more difficult to detect. In addition, by performing human annotation, we found that the model we developed performed better than the crowdsourced workers. WikiSQE is expected to be a valuable resource for other tasks in NLP. \ No newline at end of file diff --git a/data/2024/aaai/Winnie: Task-Oriented Dialog System with Structure-Aware Contrastive Learning and Enhanced Policy Planning b/data/2024/aaai/Winnie: Task-Oriented Dialog System with Structure-Aware Contrastive Learning and Enhanced Policy Planning new file mode 100644 index 0000000000..743b96087b --- /dev/null +++ b/data/2024/aaai/Winnie: Task-Oriented Dialog System with Structure-Aware Contrastive Learning and Enhanced Policy Planning @@ -0,0 +1 @@ +Pre-trained encoder-decoder models are widely applied in Task-Oriented Dialog (TOD) systems on the session level, mainly focusing on modeling the dialog semantic information. Dialogs imply structural information indicating the interaction among user utterances, belief states, database search results, system acts and responses, which is also crucial for TOD systems. In addition, for the system acts, additional pre-training and datasets are considered to improve their accuracies, undoubtedly introducing a burden. Therefore, a novel end-to-end TOD system named Winnie is proposed in this paper to improve the TOD performance. First, to make full use of the intrinsic structural information, supervised contrastive learning is adopted to narrow the gap in the representation space between text representations of the same category and enlarge the overall continuous representation margin between text representations of different categories in dialog context. Then, a system act classification task is introduced for policy optimization during fine-tuning. Empirical results show that Winnie substantially improves the performance of the TOD system. By introducing the supervised contrastive and system act classification losses, Winnie achieves state-of-the-art results on benchmark datasets, including MultiWOZ2.2, In-Car, and Camrest676. Their end-to-end combined scores are improved by 3.2, 1.9, and 1.1 points, respectively. 
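The Winnie abstract above relies on a supervised contrastive objective that pulls text representations of the same category together and pushes different categories apart. A generic supervised contrastive loss of that form is sketched below; treating it as Winnie's exact loss is an assumption.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor, labels: torch.Tensor, temperature: float = 0.07):
    """embeddings: (N, D) text representations; labels: (N,) category ids."""
    z = F.normalize(embeddings, dim=-1)
    logits = z @ z.T / temperature                            # (N, N) scaled similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float("-inf"))     # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)           # avoid -inf * 0 on the diagonal
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_count  # average over positives
    return loss.mean()
```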
\ No newline at end of file diff --git a/data/2024/aaai/Working Memory Capacity of ChatGPT: An Empirical Study b/data/2024/aaai/Working Memory Capacity of ChatGPT: An Empirical Study new file mode 100644 index 0000000000..2c11479d62 --- /dev/null +++ b/data/2024/aaai/Working Memory Capacity of ChatGPT: An Empirical Study @@ -0,0 +1 @@ +Working memory is a critical aspect of both human intelligence and artificial intelligence, serving as a workspace for the temporary storage and manipulation of information. In this paper, we systematically assess the working memory capacity of ChatGPT, a large language model developed by OpenAI, by examining its performance in verbal and spatial n-back tasks under various conditions. Our experiments reveal that ChatGPT has a working memory capacity limit strikingly similar to that of humans. Furthermore, we investigate the impact of different instruction strategies on ChatGPT's performance and observe that the fundamental patterns of a capacity limit persist. From our empirical findings, we propose that n-back tasks may serve as tools for benchmarking the working memory capacity of large language models and hold potential for informing future efforts aimed at enhancing AI working memory. \ No newline at end of file diff --git a/data/2024/aaai/Worst-Case VCG Redistribution Mechanism Design Based on the Lottery Ticket Hypothesis b/data/2024/aaai/Worst-Case VCG Redistribution Mechanism Design Based on the Lottery Ticket Hypothesis new file mode 100644 index 0000000000..2c84b4a1a6 --- /dev/null +++ b/data/2024/aaai/Worst-Case VCG Redistribution Mechanism Design Based on the Lottery Ticket Hypothesis @@ -0,0 +1,7 @@ +We study worst-case VCG redistribution mechanism design for the public project problem. The mechanism design task comes down to designing a payment function that maximizes the worst-case allocative efficiency ratio. + +We use a multilayer perceptron (MLP) with ReLU activation to model the payment function and use mixed integer programming (MIP) to solve for the worst-case type profiles that maximally violate the mechanism design constraints. We collect these worst-case type profiles and use them as training samples to train toward better worst-case mechanisms. + +In practice, we require a tiny neural network structure for the above approach to scale. The Lottery Ticket Hypothesis states that a large network is likely to contain a "winning ticket" -- a much smaller subnetwork that "won the initialization lottery", which makes its training particularly effective. Motivated by this hypothesis, we train a large network and prune it into a tiny subnetwork. We run MIP-based worst-case training on the drawn subnetwork and evaluate the resulting mechanism's worst-case performance. If the subnetwork does not achieve good worst-case performance, then we record the type profiles that cause the current draw to be bad. To draw again, we restore the large network to its initial weights and prune using recorded type profiles from earlier draws, therefore avoiding drawing the same ticket twice. We expect to eventually encounter a tiny subnetwork that leads to effective training for our worst-case mechanism design task. Lastly, a by-product of multiple ticket draws is an ensemble of mechanisms with different worst cases, which improves the worst-case performance further. + +Using our approach, we find previously unknown optimal mechanisms for up to 5 agents. Our results confirm the tightness of existing theoretical upper bounds. 
For up to 20 agents, we derive significantly improved worst-case mechanisms, surpassing a long list of existing manual results. \ No newline at end of file diff --git a/data/2024/aaai/Would You Like Your Data to Be Trained? A User Controllable Recommendation Framework b/data/2024/aaai/Would You Like Your Data to Be Trained? A User Controllable Recommendation Framework new file mode 100644 index 0000000000..7ded268737 --- /dev/null +++ b/data/2024/aaai/Would You Like Your Data to Be Trained? A User Controllable Recommendation Framework @@ -0,0 +1 @@ +Recommender systems have a significant impact on various real-world applications, shaping people's daily lives and enhancing productivity. Traditional recommender models aim to collect extensive user information to accurately estimate user preferences. However, in practical scenarios, users may not want all their behaviors to be included in the model training process. This paper introduces a novel recommendation paradigm that allows users to indicate their ``willingness'' regarding which data should contribute to model training. The models are then optimized to maximize utility, which considers the trade-off between recommendation performance and respecting user preferences. The recommendation problem is formulated as a multiplayer game, with each user acting as a player and using a selection vector to indicate their willingness to include specific interacted items in training. To efficiently solve this game, an influence function-based model is proposed to approximate recommendation performances for different actions without re-optimizing the model. Furthermore, an enhanced model leveraging multiple anchor actions for the influence function is introduced to improve performance approximation accuracy. The convergence rate of the algorithm is theoretically analyzed, and the advantages of incorporating multiple anchor actions are demonstrated. Extensive experiments on both simulated and real-world datasets validate the effectiveness of the proposed models in balancing recommendation quality and user willingness. To promote this research direction, we have released our project at https://paitesanshi.github.io/IFRQE/. \ No newline at end of file diff --git a/data/2024/aaai/X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks b/data/2024/aaai/X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks new file mode 100644 index 0000000000..ab9fe0c3ca --- /dev/null +++ b/data/2024/aaai/X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks @@ -0,0 +1 @@ +Referring 3D instance segmentation is a challenging task aimed at accurately segmenting a target instance within a 3D scene based on a given referring expression. However, previous methods have overlooked the distinct roles played by different words in referring expressions. Additionally, they have failed to incorporate the positional relationship within referring expressions with the spatial correlations in 3D scenes. To alleviate these issues, we present a novel model called X-RefSeg3D, which constructs a cross-modal graph for the input 3D scene and unites textual and spatial relationships for reasoning via graph neural networks. Our approach begins by capturing object-specific text features, which are then fused with the instance features to construct a comprehensive cross-modal scene graph. 
Subsequently, we integrate the obtained cross-modal features into graph neural networks, leveraging the K-nearest algorithm to derive explicit instructions from expressions and factual relationships in scenes. This enables the effective capture of higher-order relationships among instances, thereby enhancing feature fusion and facilitating reasoning. Finally, the refined feature undergoes a matching module to compute the ultimate matching score. Experimental results on ScanRefer demonstrate the effectiveness of our method, surpassing previous approaches by a substantial margin of +3.67% in terms of mIOU. \ No newline at end of file diff --git a/data/2024/aaai/X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer b/data/2024/aaai/X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer new file mode 100644 index 0000000000..633d9c8cc2 --- /dev/null +++ b/data/2024/aaai/X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer @@ -0,0 +1 @@ +The field of 4D point cloud understanding is rapidly developing with the goal of analyzing dynamic 3D point cloud sequences. However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point cloud poses a difficulty in aligning temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of an 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). The GIT combines visual texture and temporal correlation features to offer rich semantics and dynamics for better point cloud representation. During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation and semantic segmentation. The results achieve 1st places, i.e., 85.3% (+7.9%) accuracy and 47.3% (+5.0%) mIoU for 4D action segmentation and semantic segmentation, on the HOI4D challenge, outperforming previous state-of-the-art by a large margin. We release the code at https://github.com/jinglinglingling/X4D. \ No newline at end of file diff --git a/data/2024/aaai/XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning b/data/2024/aaai/XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning new file mode 100644 index 0000000000..9df18c70aa --- /dev/null +++ b/data/2024/aaai/XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning @@ -0,0 +1,2 @@ +We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo objectives. First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams. 
Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle domain discrepancy between audio and visual modalities, enabling effective cross-modal knowledge distillation. +Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by 8% to 14% on UCF101, HMDB51, and Kinetics400. Additionally, XKD improves multimodal action classification by 5.5% on Kinetics-Sound. XKD shows state-of-the-art performance in sound classification on ESC50, achieving top-1 accuracy of 96.5%. \ No newline at end of file diff --git a/data/2024/aaai/Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation b/data/2024/aaai/Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation new file mode 100644 index 0000000000..e5dcf04306 --- /dev/null +++ b/data/2024/aaai/Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation @@ -0,0 +1 @@ +New Natural Language Processing (NLP) benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions across 516 diverse disciplines drawn from 13 different subjects, and is accompanied by Xiezhi-Specialty with 14,041 questions and Xiezhi-Interdiscipline with 10,746 questions. We conduct an evaluation of 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. All the evaluation code and data are open-sourced at https://github.com/MikeGu721/XiezhiBenchmark \ No newline at end of file diff --git a/data/2024/aaai/YTCommentQA: Video Question Answerability in Instructional Videos b/data/2024/aaai/YTCommentQA: Video Question Answerability in Instructional Videos new file mode 100644 index 0000000000..a1ddfad88d --- /dev/null +++ b/data/2024/aaai/YTCommentQA: Video Question Answerability in Instructional Videos @@ -0,0 +1 @@ +Instructional videos provide detailed how-to guides for various tasks, with viewers often posing questions regarding the content. Addressing these questions is vital for comprehending the content, yet receiving immediate answers is difficult. While numerous computational models have been developed for Video Question Answering (Video QA) tasks, they are primarily trained on questions generated based on video content, aiming to produce answers from within the content. However, in real-world situations, users may pose questions that go beyond the video's informational boundaries, highlighting the necessity to determine if a video can provide the answer. Discerning whether a question can be answered by video content is challenging due to the multi-modal nature of videos, where visual and verbal information are intertwined. To bridge this gap, we present the YTCommentQA dataset, which contains naturally-generated questions from YouTube, categorized by their answerability and required modality to answer -- visual, script, or both. 
Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning. The dataset is available at https://github.com/lgresearch/YTCommentQA. \ No newline at end of file diff --git a/data/2024/aaai/You Only Read Once: Constituency-Oriented Relational Graph Convolutional Network for Multi-Aspect Multi-Sentiment Classification b/data/2024/aaai/You Only Read Once: Constituency-Oriented Relational Graph Convolutional Network for Multi-Aspect Multi-Sentiment Classification new file mode 100644 index 0000000000..df95d5cbd9 --- /dev/null +++ b/data/2024/aaai/You Only Read Once: Constituency-Oriented Relational Graph Convolutional Network for Multi-Aspect Multi-Sentiment Classification @@ -0,0 +1 @@ +Most of the existing aspect-based sentiment analysis (ABSA) models only predict the sentiment polarity of a single aspect at a time, focusing primarily on enhancing the representation of this single aspect based on the other contexts or aspects. This one-to-one paradigm ignores the fact that multi-aspect, multi-sentiment sentences contain not only distinct specific descriptions for distinct specific aspects, but also shared global context information for multiple aspects. To fully consider these issues, we propose a one-to-many ABSA framework, called You Only Read Once (YORO), that can simultaneously model representations of all aspects based on their specific descriptions and better fuse their relationships using globally shared contextual information in the sentence. Predicting the sentiment polarity of multiple aspects simultaneously is beneficial to improving the efficacy of calculation and prediction. Extensive experiments are conducted on three public datasets (MAMS, Rest14, and Lap14). Experimental results demonstrate the effectiveness of YORO in handling multi-aspect, multi-sentiment scenarios and highlight the promise of one-to-many ABSA in balancing efficiency and accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Your Career Path Matters in Person-Job Fit b/data/2024/aaai/Your Career Path Matters in Person-Job Fit new file mode 100644 index 0000000000..6be113f3b5 --- /dev/null +++ b/data/2024/aaai/Your Career Path Matters in Person-Job Fit @@ -0,0 +1 @@ +We are again confronted with one of the most vexing aspects of the advancement of technology: automation and AI technology cause the devaluation of human labor, resulting in unemployment. With this background, automatic person-job fit systems are promising solutions to promote the employment rate. The purpose of person-job fit is to calculate a matching score between the job seeker's resume and the job posting, determining whether the job seeker is suitable for the position. In this paper, we propose a new approach to person-job fit that characterizes the hidden preference derived from the job seeker's career path. We categorize and utilize three types of preferences in the career path: consistency, likeness, and continuity. We prove that understanding the career path enables us to provide more appropriate career suggestions to job seekers. To demonstrate the practical value of our proposed model, we conduct extensive experiments on real-world data extracted from an online recruitment platform and then present detailed cases to show how the career path matters in person-job fit. 
\ No newline at end of file diff --git a/data/2024/aaai/Your Prompt Is My Command: On Assessing the Human-Centred Generality of Multimodal Models (Abstract Reprint) b/data/2024/aaai/Your Prompt Is My Command: On Assessing the Human-Centred Generality of Multimodal Models (Abstract Reprint) new file mode 100644 index 0000000000..1396aac0e5 --- /dev/null +++ b/data/2024/aaai/Your Prompt Is My Command: On Assessing the Human-Centred Generality of Multimodal Models (Abstract Reprint) @@ -0,0 +1 @@ +Even with obvious deficiencies, large prompt-commanded multimodal models are proving to be flexible cognitive tools representing an unprecedented generality. But the directness, diversity, and degree of user interaction create a distinctive “human-centred generality” (HCG), rather than a fully autonomous one. HCG implies that —for a specific user— a system is only as general as it is effective for the user’s relevant tasks and their prevalent ways of prompting. A human-centred evaluation of general-purpose AI systems therefore needs to reflect the personal nature of interaction, tasks and cognition. We argue that the best way to understand these systems is as highly-coupled cognitive extenders, and to analyse the bidirectional cognitive adaptations between them and humans. In this paper, we give a formulation of HCG, as well as a high-level overview of the elements and trade-offs involved in the prompting process. We end the paper by outlining some essential research questions and suggestions for improving evaluation practices, which we envision as characteristic for the evaluation of general artificial intelligence in the future. \ No newline at end of file diff --git a/data/2024/aaai/ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization b/data/2024/aaai/ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization new file mode 100644 index 0000000000..6e59a37ffd --- /dev/null +++ b/data/2024/aaai/ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization @@ -0,0 +1 @@ +Lowering the memory requirement in full-parameter training on large models has become a hot research area. MeZO fine-tunes the large language models (LLMs) by just forward passes in a zeroth-order SGD optimizer (ZO-SGD), demonstrating excellent performance with the same GPU memory usage as inference. However, the simulated perturbation stochastic approximation for gradient estimate in MeZO leads to severe oscillations and incurs a substantial time overhead. Moreover, without momentum regularization, MeZO shows severe over-fitting problems. Lastly, the perturbation-irrelevant momentum on ZO-SGD does not improve the convergence rate. This study proposes ZO-AdaMU to resolve the above problems by adapting the simulated perturbation with momentum in its stochastic approximation. Unlike existing adaptive momentum methods, we relocate momentum on simulated perturbation in stochastic gradient approximation. Our convergence analysis and experiments prove this is a better way to improve convergence stability and rate in ZO-SGD. Extensive experiments demonstrate that ZO-AdaMU yields better generalization for LLMs fine-tuning across various NLP tasks than MeZO and its momentum variants. 
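For the ZO-AdaMU abstract above, the underlying zeroth-order update can be illustrated with a two-point simulated-perturbation estimate that needs only forward passes; blending a momentum term into the perturbation itself mirrors the abstract's idea in a much-simplified form. Names, hyperparameters, and the exact smoothing scheme are assumptions, not the published algorithm.

```python
import torch

def zo_step(params, loss_fn, perturb_momentum, lr=1e-6, eps=1e-3, beta=0.9):
    """One zeroth-order update estimated from two forward passes only.

    params: list of tensors updated in place; perturb_momentum: running perturbation
    state with the same shapes (momentum applied to the simulated perturbation).
    """
    with torch.no_grad():
        zs = []
        for p, m in zip(params, perturb_momentum):
            z = beta * m + (1.0 - beta) * torch.randn_like(p)  # momentum-smoothed perturbation
            m.copy_(z)
            zs.append(z)

        for p, z in zip(params, zs):            # theta + eps * z
            p.add_(eps * z)
        loss_plus = float(loss_fn())
        for p, z in zip(params, zs):            # theta - eps * z
            p.sub_(2.0 * eps * z)
        loss_minus = float(loss_fn())
        for p, z in zip(params, zs):            # restore theta
            p.add_(eps * z)

        grad_scale = (loss_plus - loss_minus) / (2.0 * eps)   # projected gradient estimate
        for p, z in zip(params, zs):
            p.sub_(lr * grad_scale * z)
```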
\ No newline at end of file diff --git a/data/2024/aaai/ZOOM: Learning Video Mirror Detection with Extremely-Weak Supervision b/data/2024/aaai/ZOOM: Learning Video Mirror Detection with Extremely-Weak Supervision new file mode 100644 index 0000000000..6120f8a829 --- /dev/null +++ b/data/2024/aaai/ZOOM: Learning Video Mirror Detection with Extremely-Weak Supervision @@ -0,0 +1 @@ +Mirror detection is an active research topic in computer vision. However, all existing mirror detectors learn mirror representations from large-scale pixel-wise datasets, which are tedious and expensive to obtain. Although weakly-supervised learning has been widely explored in related topics, we note that popular weak supervision signals (e.g., bounding boxes, scribbles, points) still require some efforts from the user to locate the target objects, with a strong assumption that the images to annotate always contain the target objects. Such an assumption may result in the over-segmentation of mirrors. Our key idea of this work is that the existence of mirrors over a time period may serve as a weak supervision to train a mirror detector, for two reasons. First, if a network can predict the existence of mirrors, it can essentially locate the mirrors. Second, we observe that the reflected contents of a mirror tend to be similar to those in adjacent frames, but exhibit considerable contrast to regions in far-away frames (e.g., non-mirror frames). To this end, in this paper, we propose ZOOM, the first method to learn robust mirror representations from extremely-weak annotations of per-frame ZerO-One Mirror indicators in videos. The key insight of ZOOM is to model the similarity and contrast (between mirror and non-mirror regions) in temporal variations to locate and segment the mirrors. To this end, we propose a novel fusion strategy to leverage temporal consistency information for mirror localization, and a novel temporal similarity-contrast modeling module for mirror segmentation. We construct a new video mirror dataset for training and evaluation. Experimental results under new and standard metrics show that ZOOM performs favorably against existing fully-supervised mirror detection methods. \ No newline at end of file diff --git a/data/2024/aaai/Zero-1-to-3: Domain-Level Zero-Shot Cognitive Diagnosis via One Batch of Early-Bird Students towards Three Diagnostic Objectives b/data/2024/aaai/Zero-1-to-3: Domain-Level Zero-Shot Cognitive Diagnosis via One Batch of Early-Bird Students towards Three Diagnostic Objectives new file mode 100644 index 0000000000..6bc8a23c57 --- /dev/null +++ b/data/2024/aaai/Zero-1-to-3: Domain-Level Zero-Shot Cognitive Diagnosis via One Batch of Early-Bird Students towards Three Diagnostic Objectives @@ -0,0 +1 @@ +Cognitive diagnosis seeks to estimate the cognitive states of students by exploring their logged practice quiz data. It plays a pivotal role in personalized learning guidance within intelligent education systems. In this paper, we focus on an important, practical, yet often underexplored task: domain-level zero-shot cognitive diagnosis (DZCD), which arises due to the absence of student practice logs in newly launched domains. Recent cross-domain diagnostic models have been demonstrated to be a promising strategy for DZCD. These methods primarily focus on how to transfer student states across domains. However, they might inadvertently incorporate non-transferable information into student representations, thereby limiting the efficacy of knowledge transfer. 
To tackle this, we propose Zero-1-to-3, a domain-level zero-shot cognitive diagnosis framework via one batch of early-bird students towards three diagnostic objectives. Our approach initiates with pre-training a diagnosis model with dual regularizers, which decouples student states into domain-shared and domain-specific parts. The shared cognitive signals can be transferred to the target domain, enriching the cognitive priors for the new domain, which ensures the cognitive state propagation objective. Subsequently, we devise a strategy to generate simulated practice logs for cold-start students through analyzing the behavioral patterns from early-bird students, fulfilling the domain-adaption goal. Consequently, we refine the cognitive states of cold-start students as diagnostic outcomes via virtual data, aligning with the diagnosis-oriented goal. Finally, extensive experiments on six real-world datasets highlight the efficacy of our model for DZCD and its practical application in question recommendation. The code is publicly available at https://github.com/bigdata-ustc/Zero-1-to-3. \ No newline at end of file diff --git a/data/2024/aaai/Zero-Shot Aerial Object Detection with Visual Description Regularization b/data/2024/aaai/Zero-Shot Aerial Object Detection with Visual Description Regularization new file mode 100644 index 0000000000..eefc61de00 --- /dev/null +++ b/data/2024/aaai/Zero-Shot Aerial Object Detection with Visual Description Regularization @@ -0,0 +1,4 @@ +Existing object detection models are mainly trained on large-scale labeled datasets. However, annotating data for novel aerial object classes is expensive since it is time-consuming and may require expert knowledge. Thus, it is desirable to study label-efficient object detection methods on aerial images. In this work, we propose a zero-shot method for aerial object detection named visual Description Regularization, or DescReg. +Concretely, we identify the weak semantic-visual correlation of the aerial objects and aim to address the challenge with prior descriptions of their visual appearance. Instead of directly encoding the descriptions into class embedding space which suffers from the representation gap problem, we propose to infuse the prior inter-class visual similarity conveyed in the descriptions into the embedding learning. The infusion process is accomplished with a newly designed similarity-aware triplet loss which incorporates structured regularization on the representation space. We conduct extensive experiments with three challenging aerial object detection datasets, including DIOR, xView, and DOTA. The results demonstrate that DescReg significantly outperforms the state-of-the-art ZSD methods with complex projection designs and generative frameworks, e.g., DescReg outperforms +best reported ZSD method on DIOR by 4.5 mAP on unseen classes and 8.1 in HM. We further show the generalizability of DescReg by integrating it into generative ZSD methods as well as varying the detection architecture. +Codes will be released at https://github.com/zq-zang/DescReg. 
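The DescReg abstract above infuses prior inter-class visual similarity into class-embedding learning through a similarity-aware triplet loss; one plausible reading is a triplet margin that shrinks for visually similar negative classes, sketched below under that assumption (the margin schedule and similarity source are illustrative, not the authors' exact design).

```python
import torch
import torch.nn.functional as F

def similarity_aware_triplet_loss(class_emb, anchor, pos, neg, desc_sim, base_margin=0.5):
    """Structured regularization on the class-embedding space.

    class_emb: (C, D) learned class embeddings; anchor/pos/neg: (B,) class-index tensors;
    desc_sim: (C, C) prior similarity in [0, 1] derived from visual descriptions.
    Visually similar negatives get a smaller required margin than dissimilar ones.
    """
    z = F.normalize(class_emb, dim=-1)
    d_pos = 1.0 - (z[anchor] * z[pos]).sum(-1)             # cosine distance to positive class
    d_neg = 1.0 - (z[anchor] * z[neg]).sum(-1)             # cosine distance to negative class
    margin = base_margin * (1.0 - desc_sim[anchor, neg])   # shrink margin for similar pairs
    return F.relu(d_pos - d_neg + margin).mean()
```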
\ No newline at end of file diff --git a/data/2024/aaai/Zero-Shot Task Adaptation with Relevant Feature Information b/data/2024/aaai/Zero-Shot Task Adaptation with Relevant Feature Information new file mode 100644 index 0000000000..03bf466e53 --- /dev/null +++ b/data/2024/aaai/Zero-Shot Task Adaptation with Relevant Feature Information @@ -0,0 +1 @@ +We propose a method to learn prediction models such as classifiers for unseen target tasks where labeled and unlabeled data are absent but a few relevant input features for solving the tasks are given. Although machine learning requires data for training, data are often difficult to collect in practice. On the other hand, for many applications, a few relevant features would be more easily obtained. Although zero-shot learning or zero-shot domain adaptation use external knowledge to adapt to unseen classes or tasks without data, relevant features have not been used in existing studies. The proposed method improves the generalization performance on the target tasks, where there are no data but a few relevant features are given, by meta-learning from labeled data in related tasks. In the meta-learning phase, it is essential to simulate test phases on target tasks where prediction model learning is required without data. To this end, our neural network-based prediction model is meta-learned such that it correctly responds to perturbations of the relevant features on randomly generated synthetic data. By this modeling, the prediction model can explicitly learn the discriminability of the relevant features without real target data. When unlabeled training data are available in the target tasks, the proposed method can incorporate such data to boost the performance in a unified framework. Our experiments demonstrate that the proposed method outperforms various existing methods with four real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Zero-Sum Games between Mean-Field Teams: Reachability-Based Analysis under Mean-Field Sharing b/data/2024/aaai/Zero-Sum Games between Mean-Field Teams: Reachability-Based Analysis under Mean-Field Sharing new file mode 100644 index 0000000000..b623b9e77c --- /dev/null +++ b/data/2024/aaai/Zero-Sum Games between Mean-Field Teams: Reachability-Based Analysis under Mean-Field Sharing @@ -0,0 +1 @@ +This work studies the behaviors of two large-population teams competing in a discrete environment. The team-level interactions are modeled as a zero-sum game while the agent dynamics within each team is formulated as a collaborative mean-field team problem. Drawing inspiration from the mean-field literature, we first approximate the large-population team game with its infinite-population limit. Subsequently, we construct a fictitious centralized system and transform the infinite-population game to an equivalent zero-sum game between two coordinators. Via a novel reachability analysis, we study the optimality of coordination strategies, which induce decentralized strategies under the original information structure. The optimality of the resulting strategies is established in the original finite-population game, and the theoretical guarantees are verified by numerical examples. 
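To give a concrete flavor of the infinite-population (mean-field) approximation used in the abstract above, the toy Python sketch below propagates a team's mean field, i.e., the distribution of agent states, under a shared policy on a finite state space. It is a generic illustration under simplifying assumptions (finite states and actions, a known mean-field-dependent transition kernel), not the paper's reachability-based analysis; all names are hypothetical.

```python
import numpy as np

def mean_field_step(mean_field: np.ndarray,
                    policy: np.ndarray,
                    transition) -> np.ndarray:
    """One step of infinite-population team dynamics on a finite state space.

    mean_field: (S,) distribution over agent states within one team.
    policy:     (S, A) probability of each action given an agent's state.
    transition: callable (s, a, mean_field) -> (S,) next-state distribution,
                so each agent's dynamics may depend on the team's mean field.
    """
    num_states, num_actions = policy.shape
    next_mf = np.zeros(num_states)
    for s in range(num_states):
        for a in range(num_actions):
            next_mf += mean_field[s] * policy[s, a] * transition(s, a, mean_field)
    return next_mf

# Toy usage: two states, two actions, congestion-style dynamics.
if __name__ == "__main__":
    policy = np.array([[0.7, 0.3], [0.4, 0.6]])

    def transition(s, a, mf):
        stay = 0.5 + 0.4 * mf[s] * (a == 0)   # staying is likelier in crowded states
        out = np.full(2, 1.0 - stay)          # remaining mass moves to the other state
        out[s] = stay
        return out

    mf = np.array([0.5, 0.5])
    for _ in range(3):
        mf = mean_field_step(mf, policy, transition)
    print(mf)  # approximate team state distribution after three steps
```

In the finite-population game, a deterministic recursion of this form stands in for the random empirical state distribution of a large team, which is the kind of simplification that makes a coordinator-level analysis tractable.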
\ No newline at end of file diff --git a/data/2024/aaai/Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue b/data/2024/aaai/Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue new file mode 100644 index 0000000000..aea73b78d9 --- /dev/null +++ b/data/2024/aaai/Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue @@ -0,0 +1 @@ +Recent advances in Large Language Models (LLMs) have achieved remarkable breakthroughs in understanding and responding to user intents. However, their performance in specialized domains such as Chinese medicine still lags behind that in general use cases. Existing efforts to incorporate Chinese medicine into LLMs rely on Supervised Fine-Tuning (SFT) with single-turn and distilled dialogue data. These models lack the ability for doctor-like proactive inquiry and multi-turn comprehension and cannot align responses with experts' intentions. In this work, we introduce Zhongjing, the first Chinese medical LLaMA-based LLM that implements an entire training pipeline from continuous pre-training, SFT, to Reinforcement Learning from Human Feedback (RLHF). Additionally, we construct a Chinese multi-turn medical dialogue dataset of 70,000 authentic doctor-patient dialogues, CMtMedQA, which significantly enhances the model's capability for complex dialogue and proactive inquiry initiation. We also define a refined annotation rule and evaluation criteria given the unique characteristics of the biomedical domain. Extensive experimental results show that Zhongjing outperforms baselines in various capacities and matches the performance of ChatGPT in some abilities, despite having 100x fewer parameters. Ablation studies also demonstrate the contributions of each component: pre-training enhances medical knowledge, and RLHF further improves instruction-following ability and safety. Our code, datasets, and models are available at https://github.com/SupritYoung/Zhongjing. \ No newline at end of file diff --git a/data/2024/aaai/eTag: Class-Incremental Learning via Embedding Distillation and Task-Oriented Generation b/data/2024/aaai/eTag: Class-Incremental Learning via Embedding Distillation and Task-Oriented Generation new file mode 100644 index 0000000000..687211eeeb --- /dev/null +++ b/data/2024/aaai/eTag: Class-Incremental Learning via Embedding Distillation and Task-Oriented Generation @@ -0,0 +1 @@ +Class incremental learning (CIL) aims to solve the notorious forgetting problem, which refers to the fact that once the network is updated on a new task, its performance on previously-learned tasks degenerates catastrophically. Most successful CIL methods store exemplars (samples of learned tasks) to train a feature extractor incrementally, or store prototypes (features of learned tasks) to estimate the incremental feature distribution. However, storing exemplars raises data privacy concerns, while fixed prototypes may not remain consistent with the incremental feature distribution, hindering the exploration of real-world CIL applications. In this paper, we propose a data-free CIL method with embedding distillation and Task-oriented generation (eTag), which requires neither exemplar nor prototype. Embedding distillation prevents the feature extractor from forgetting by distilling the outputs from the networks' intermediate blocks.
Task-oriented generation enables a lightweight generator to produce dynamic features, fitting the needs of the top incremental classifier. Experimental results confirm that the proposed eTag considerably outperforms state-of-the-art methods on several benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/iDet3D: Towards Efficient Interactive Object Detection for LiDAR Point Clouds b/data/2024/aaai/iDet3D: Towards Efficient Interactive Object Detection for LiDAR Point Clouds new file mode 100644 index 0000000000..5afa2ec12a --- /dev/null +++ b/data/2024/aaai/iDet3D: Towards Efficient Interactive Object Detection for LiDAR Point Clouds @@ -0,0 +1 @@ +Accurately annotating multiple 3D objects in LiDAR scenes is laborious and challenging. While a few previous studies have attempted to leverage semi-automatic methods for cost-effective bounding box annotation, such methods have limitations in efficiently handling numerous multi-class objects. To effectively accelerate 3D annotation pipelines, we propose iDet3D, an efficient interactive 3D object detector. Supporting a user-friendly 2D interface, which can ease the cognitive burden of exploring 3D space to provide click interactions, iDet3D enables users to annotate all the objects in each scene with minimal interactions. Taking the sparse nature of 3D point clouds into account, we design a negative click simulation (NCS) to improve accuracy by reducing false-positive predictions. In addition, iDet3D incorporates two click propagation techniques to take full advantage of user interactions: (1) dense click guidance (DCG) for keeping user-provided information throughout the network and (2) spatial click propagation (SCP) for detecting other instances of the same class based on the user-specified objects. Through extensive experiments, we show that our method can construct precise annotations in a few clicks, demonstrating its practicality as an efficient annotation tool for 3D object detection. \ No newline at end of file diff --git a/data/2024/aaai/iTrendRNN: An Interpretable Trend-Aware RNN for Meteorological Spatiotemporal Prediction b/data/2024/aaai/iTrendRNN: An Interpretable Trend-Aware RNN for Meteorological Spatiotemporal Prediction new file mode 100644 index 0000000000..844c7fc1ed --- /dev/null +++ b/data/2024/aaai/iTrendRNN: An Interpretable Trend-Aware RNN for Meteorological Spatiotemporal Prediction @@ -0,0 +1 @@ +Accurate prediction of meteorological elements, such as temperature and relative humidity, is important to human livelihood, early warning of extreme weather, and urban governance. Recently, neural network-based methods have shown impressive performance in this field. However, most of them are overcomplicated and impenetrable. In this paper, we propose a straightforward and interpretable differential framework, where the key lies in explicitly estimating the evolutionary trends. Specifically, three types of trends are exploited. (1) The proximity trend simply uses the most recent changes. It works well for approximately linear evolution. (2) The sequential trend explores the global information, aiming to capture the nonlinear dynamics. Here, we develop an attention-based trend unit to help memorize long-term features. (3) The flow trend is motivated by the nature of evolution, i.e., heat or substance flows from one region to another. Here, we design a flow-aware attention unit. It reflects the interactions by performing spatial attention over flow maps.
Finally, we develop a trend fusion module to adaptively fuse the above three trends. Extensive experiments on two datasets demonstrate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/icsPLMs: Exploring Pre-trained Language Models in Intelligent Customer Service (Student Abstract) b/data/2024/aaai/icsPLMs: Exploring Pre-trained Language Models in Intelligent Customer Service (Student Abstract) new file mode 100644 index 0000000000..5171735e5a --- /dev/null +++ b/data/2024/aaai/icsPLMs: Exploring Pre-trained Language Models in Intelligent Customer Service (Student Abstract) @@ -0,0 +1 @@ +Pre-trained language models have shown high performance on text processing in intelligent customer service platforms. However, these models do not leverage domain-specific information. In this paper, we propose icsPLMs, which are optimized for intelligent customer service at both the word and sentence levels. Our experimental results show that using targeted strategies can further improve the performance of pre-trained language models in this field. \ No newline at end of file diff --git a/data/2024/aaai/msLPCC: A Multimodal-Driven Scalable Framework for Deep LiDAR Point Cloud Compression b/data/2024/aaai/msLPCC: A Multimodal-Driven Scalable Framework for Deep LiDAR Point Cloud Compression new file mode 100644 index 0000000000..09c453e9b9 --- /dev/null +++ b/data/2024/aaai/msLPCC: A Multimodal-Driven Scalable Framework for Deep LiDAR Point Cloud Compression @@ -0,0 +1 @@ +LiDAR sensors are widely used in autonomous driving, and the growing storage and transmission demands have made LiDAR point cloud compression (LPCC) a hot research topic. To address the challenges posed by the large scale and uneven distribution (spatial and categorical) of LiDAR point data, this paper presents a new multimodal-driven scalable LPCC framework. For the large-scale challenge, we decouple the original LiDAR data into multi-layer point subsets, and compress and transmit each layer separately, so as to meet the reconstruction quality requirements of different scenarios. For the uneven-distribution challenge, we extract, align, and fuse heterologous feature representations, including point modality with position information, depth modality with spatial distance information, and segmentation modality with category information. Extensive experimental results on the benchmark SemanticKITTI database validate that our method outperforms 14 recent representative LPCC methods. \ No newline at end of file diff --git a/data/2024/aaai/p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models b/data/2024/aaai/p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models new file mode 100644 index 0000000000..8685cafab9 --- /dev/null +++ b/data/2024/aaai/p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models @@ -0,0 +1,4 @@ +Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks. In light of the rapidly increasing size of pre-trained VLMs, parameter-efficient transfer learning (PETL) has garnered attention as a viable alternative to full fine-tuning. One such approach is the adapter, which introduces a few trainable parameters into the pre-trained models while preserving the original parameters during adaptation.
+In this paper, we present a novel modeling framework that recasts adapter tuning after attention as a graph message passing process on attention graphs, where the projected query and value features and attention matrix constitute the node features and the graph adjacency matrix, respectively. Within this framework, tuning adapters in VLMs necessitates handling heterophilic graphs, owing to the disparity between the projected query and value spaces. +To address this challenge, we propose a new adapter architecture, p-adapter, which employs p-Laplacian message passing in Graph Neural Networks (GNNs). Specifically, the attention weights are re-normalized based on the features, and the features are then aggregated using the calibrated attention matrix, enabling the dynamic exploitation of information with varying frequencies in the heterophilic attention graphs. +We conduct extensive experiments on different pre-trained VLMs and multi-modal tasks, including visual question answering, visual entailment, and image captioning. The experimental results validate our method's significant superiority over other PETL methods. Our code is available at https://github.com/wuhy68/p-Adapter/. \ No newline at end of file diff --git a/data/2024/aaai/patchDPCC: A Patchwise Deep Compression Framework for Dynamic Point Clouds b/data/2024/aaai/patchDPCC: A Patchwise Deep Compression Framework for Dynamic Point Clouds new file mode 100644 index 0000000000..46ec6e3baf --- /dev/null +++ b/data/2024/aaai/patchDPCC: A Patchwise Deep Compression Framework for Dynamic Point Clouds @@ -0,0 +1 @@ +When compressing point clouds, point-based deep learning models operate on points in a continuous space, which offers a chance to minimize the geometric fidelity loss introduced by voxelization in preprocessing. However, these methods hardly scale to inputs with arbitrary numbers of points. Furthermore, the point cloud frames are individually compressed, ignoring the conventional wisdom of leveraging inter-frame similarity. In this work, we propose a patchwise compression framework called patchDPCC, which consists of a patch group generation module and a point-based compression model. Algorithms are developed to generate patches from different frames representing the same object, and more importantly, these patches are regulated to have the same number of points. We also incorporate a feature transfer module in the compression model, which refines the feature quality by exploiting the inter-frame similarity. Our model generates point-wise features for entropy coding, which guarantees the reconstruction speed. The evaluation on the MPEG 8i dataset shows that our method improves the compression ratio by 47.01% and 85.22% when compared to PCGCv2 and V-PCC with the same reconstruction quality, which is 9% and 16% better than what D-DPCC achieves. Our method also achieves the fastest decoding speed among the learning-based compression models. \ No newline at end of file diff --git a/data/2024/aaai/s-ID: Causal Effect Identification in a Sub-population b/data/2024/aaai/s-ID: Causal Effect Identification in a Sub-population new file mode 100644 index 0000000000..a8f1f85c3c --- /dev/null +++ b/data/2024/aaai/s-ID: Causal Effect Identification in a Sub-population @@ -0,0 +1 @@ +Causal inference in a sub-population involves identifying the causal effect of an intervention on a specific subgroup, which is distinguished from the whole population through the influence of systematic biases in the sampling process.
However, ignoring the subtleties introduced by sub-populations can either lead to erroneous inference or limit the applicability of existing methods. We introduce and advocate for a causal inference problem in sub-populations (henceforth called s-ID), in which we merely have access to observational data of the targeted sub-population (as opposed to the entire population). Existing inference problems in sub-populations operate on the premise that the given data distributions originate from the entire population, and thus cannot tackle the s-ID problem. To address this gap, we provide necessary and sufficient conditions that must hold in the causal graph for a causal effect in a sub-population to be identifiable from the observational distribution of that sub-population. Given these conditions, we present a sound and complete algorithm for the s-ID problem. \ No newline at end of file diff --git a/data/2024/aaai/z-SignFedAvg: A Unified Stochastic Sign-Based Compression for Federated Learning b/data/2024/aaai/z-SignFedAvg: A Unified Stochastic Sign-Based Compression for Federated Learning new file mode 100644 index 0000000000..e99b25d049 --- /dev/null +++ b/data/2024/aaai/z-SignFedAvg: A Unified Stochastic Sign-Based Compression for Federated Learning @@ -0,0 +1 @@ +Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm but suffers from high communication cost when training large-scale machine learning models. Sign-based methods, such as SignSGD, have been proposed as a biased gradient compression technique for reducing the communication cost. However, sign-based algorithms could diverge under heterogeneous data, which thus motivated the development of advanced techniques, such as the error-feedback method and stochastic sign-based compression, to fix this issue. Nevertheless, these methods still suffer from slower convergence rates, and none of them allows multiple local SGD updates like FedAvg. In this paper, we propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression, which not only allows one to flexibly control the bias-variance tradeoff for the compressed gradient, but also provides a unified viewpoint to existing stochastic sign-based methods. More importantly, the proposed scheme enables the development of the very first sign-based FedAvg algorithm (z-SignFedAvg) to accelerate the convergence. Theoretically, we show that z-SignFedAvg achieves a faster convergence rate than existing sign-based methods and, under uniformly distributed noise, can enjoy the same convergence rate as its uncompressed counterpart. Extensive experiments are conducted to demonstrate that z-SignFedAvg can achieve competitive empirical performance on real datasets and outperforms existing schemes. \ No newline at end of file diff --git "a/data/2024/aaai/\317\200-Light: Programmatic Interpretable Reinforcement Learning for Resource-Limited Traffic Signal Control" "b/data/2024/aaai/\317\200-Light: Programmatic Interpretable Reinforcement Learning for Resource-Limited Traffic Signal Control" new file mode 100644 index 0000000000..210048f301 --- /dev/null +++ "b/data/2024/aaai/\317\200-Light: Programmatic Interpretable Reinforcement Learning for Resource-Limited Traffic Signal Control" @@ -0,0 +1 @@ +The recent advancements in Deep Reinforcement Learning (DRL) have significantly enhanced the performance of adaptive Traffic Signal Control (TSC).
However, DRL policies are typically represented by neural networks, which are over-parameterized black-box models. As a result, the learned policies often lack interpretability and cannot be deployed directly on real-world edge hardware due to resource constraints. In addition, DRL methods often exhibit limited generalization performance, struggling to generalize the learned policy to other geographical regions. These factors limit the practical application of learning-based approaches. To address these issues, we suggest the use of an inherently interpretable program for representing the control policy. We present a new approach, Programmatic Interpretable reinforcement learning for traffic signal control (π-Light), designed to autonomously discover non-differentiable programs. Specifically, we define a Domain Specific Language (DSL) and transformation rules for constructing programs, and utilize Monte Carlo Tree Search (MCTS) to find the optimal program in a discrete space. Extensive experiments demonstrate that our method consistently outperforms baseline approaches. Moreover, π-Light exhibits superior generalization capabilities compared to DRL, enabling training and evaluation across intersections from different cities. Finally, we analyze how the learned program policies can be directly deployed on edge devices with extremely limited resources. \ No newline at end of file diff --git a/data/2024/iclr/3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining b/data/2024/iclr/3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining new file mode 100644 index 0000000000..dbf7e4e6b3 --- /dev/null +++ b/data/2024/iclr/3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining @@ -0,0 +1 @@ +Masked autoencoders (MAE) have recently been introduced to 3D self-supervised pretraining for point clouds due to their great success in NLP and computer vision. Unlike MAEs used in the image domain, where the pretext task is to restore features at the masked pixels, such as colors, the existing 3D MAE works reconstruct the missing geometry only, i.e., the location of the masked points. In contrast to previous studies, we advocate that point location recovery is inessential and restoring intrinsic point features is much superior. To this end, we propose to ignore point position reconstruction and recover high-order features at masked points including surface normals and surface variations, through a novel attention-based decoder which is independent of the encoder design. We validate the effectiveness of our pretext task and decoder design using different encoder structures for 3D training and demonstrate the advantages of our pretrained networks on various point cloud analysis tasks. \ No newline at end of file diff --git a/data/2024/iclr/3D-Aware Hypothesis & Verification for Generalizable Relative Object Pose Estimation b/data/2024/iclr/3D-Aware Hypothesis & Verification for Generalizable Relative Object Pose Estimation new file mode 100644 index 0000000000..b3627d5a25 --- /dev/null +++ b/data/2024/iclr/3D-Aware Hypothesis & Verification for Generalizable Relative Object Pose Estimation @@ -0,0 +1 @@ +Prior methods that tackle the problem of generalizable object pose estimation rely heavily on having dense views of the unseen object. By contrast, we address the scenario where only a single reference view of the object is available.
Our goal then is to estimate the relative object pose between this reference view and a query image that depicts the object in a different pose. In this scenario, robust generalization is imperative due to the presence of unseen objects during testing and the large-scale object pose variation between the reference and the query. To this end, we present a new hypothesis-and-verification framework, in which we generate and evaluate multiple pose hypotheses, ultimately selecting the most reliable one as the relative object pose. To measure reliability, we introduce a 3D-aware verification that explicitly applies 3D transformations to the 3D object representations learned from the two input images. Our comprehensive experiments on the Objaverse, LINEMOD, and CO3D datasets evidence the superior accuracy of our approach in relative pose estimation and its robustness in large-scale pose variations, when dealing with unseen objects. \ No newline at end of file diff --git a/data/2024/iclr/A 2-Dimensional State Space Layer for Spatial Inductive Bias b/data/2024/iclr/A 2-Dimensional State Space Layer for Spatial Inductive Bias new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Benchmark Study on Calibration b/data/2024/iclr/A Benchmark Study on Calibration new file mode 100644 index 0000000000..c65e46c3ca --- /dev/null +++ b/data/2024/iclr/A Benchmark Study on Calibration @@ -0,0 +1 @@ +Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through the use of specific loss functions, data preprocessing and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different datasets? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS. 
The project page can be found at https://www.taolinwei.com/calibration-study \ No newline at end of file diff --git a/data/2024/iclr/A Benchmark for Learning to Translate a New Language from One Grammar Book b/data/2024/iclr/A Benchmark for Learning to Translate a New Language from One Grammar Book new file mode 100644 index 0000000000..5d0f47e1ef --- /dev/null +++ b/data/2024/iclr/A Benchmark for Learning to Translate a New Language from One Grammar Book @@ -0,0 +1 @@ +Large language models (LLMs) can perform impressive feats with in-context learning or lightweight finetuning. It is natural to wonder how well these models adapt to genuinely new tasks, but how does one find tasks that are unseen in internet-scale training sets? We turn to a field that is explicitly motivated and bottlenecked by a scarcity of web data: low-resource languages. In this paper, we introduce MTOB (Machine Translation from One Book), a benchmark for learning to translate between English and Kalamang -- a language with less than 200 speakers and therefore virtually no presence on the web -- using several hundred pages of field linguistics reference materials. This task framing is novel in that it asks a model to learn a language from a single human-readable book of grammar explanations, rather than a large mined corpus of in-domain data, more akin to L2 learning than L1 acquisition. We demonstrate that baselines using current LLMs are promising but fall short of human performance, achieving 44.7 chrF on Kalamang to English translation and 45.8 chrF on English to Kalamang translation, compared to 51.6 and 57.0 chrF by a human who learned Kalamang from the same reference materials. We hope that MTOB will help measure LLM capabilities along a new dimension, and that the methods developed to solve it could help expand access to language technology for underserved communities by leveraging qualitatively different kinds of data than traditional machine translation. \ No newline at end of file diff --git a/data/2024/iclr/A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning b/data/2024/iclr/A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning new file mode 100644 index 0000000000..4d2d6ef34a --- /dev/null +++ b/data/2024/iclr/A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning @@ -0,0 +1 @@ +We investigate learning the equilibria in non-stationary multi-agent systems and address the challenges that differentiate multi-agent learning from single-agent learning. Specifically, we focus on games with bandit feedback, where testing an equilibrium can result in substantial regret even when the gap to be tested is small, and the existence of multiple optimal solutions (equilibria) in stationary games poses extra challenges. To overcome these obstacles, we propose a versatile black-box approach applicable to a broad spectrum of problems, such as general-sum games, potential games, and Markov games, when equipped with appropriate learning and testing oracles for stationary environments. Our algorithms can achieve $\widetilde{O}\left(\Delta^{1/4}T^{3/4}\right)$ regret when the degree of nonstationarity, as measured by total variation $\Delta$, is known, and $\widetilde{O}\left(\Delta^{1/5}T^{4/5}\right)$ regret when $\Delta$ is unknown, where $T$ is the number of rounds. Meanwhile, our algorithm inherits the favorable dependence on number of agents from the oracles. 
As a side contribution that may be independent of interest, we show how to test for various types of equilibria by a black-box reduction to single-agent learning, which includes Nash equilibria, correlated equilibria, and coarse correlated equilibria. \ No newline at end of file diff --git a/data/2024/iclr/A Branching Decoder for Set Generation b/data/2024/iclr/A Branching Decoder for Set Generation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Characterization Theorem for Equivariant Networks with Point-wise Activations b/data/2024/iclr/A Characterization Theorem for Equivariant Networks with Point-wise Activations new file mode 100644 index 0000000000..264b6dc5db --- /dev/null +++ b/data/2024/iclr/A Characterization Theorem for Equivariant Networks with Point-wise Activations @@ -0,0 +1 @@ +Equivariant neural networks have shown improved performance, expressiveness and sample complexity on symmetrical domains. But for some specific symmetries, representations, and choice of coordinates, the most common point-wise activations, such as ReLU, are not equivariant, hence they cannot be employed in the design of equivariant neural networks. The theorem we present in this paper describes all possible combinations of finite-dimensional representations, choice of coordinates and point-wise activations to obtain an exactly equivariant layer, generalizing and strengthening existing characterizations. Notable cases of practical relevance are discussed as corollaries. Indeed, we prove that rotation-equivariant networks can only be invariant, as it happens for any network which is equivariant with respect to connected compact groups. Then, we discuss implications of our findings when applied to important instances of exactly equivariant networks. First, we completely characterize permutation equivariant networks such as Invariant Graph Networks with point-wise nonlinearities and their geometric counterparts, highlighting a plethora of models whose expressive power and performance are still unknown. Second, we show that feature spaces of disentangled steerable convolutional neural networks are trivial representations. \ No newline at end of file diff --git a/data/2024/iclr/A Cognitive Model for Learning Abstract Relational Structures from Memory-based Decision-Making Tasks b/data/2024/iclr/A Cognitive Model for Learning Abstract Relational Structures from Memory-based Decision-Making Tasks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Data-Driven Measure of Relative Uncertainty for Misclassification Detection b/data/2024/iclr/A Data-Driven Measure of Relative Uncertainty for Misclassification Detection new file mode 100644 index 0000000000..d083a21d03 --- /dev/null +++ b/data/2024/iclr/A Data-Driven Measure of Relative Uncertainty for Misclassification Detection @@ -0,0 +1 @@ +Misclassification detection is an important problem in machine learning, as it allows for the identification of instances where the model's predictions are unreliable. However, conventional uncertainty measures such as Shannon entropy do not provide an effective way to infer the real uncertainty associated with the model's predictions. In this paper, we introduce a novel data-driven measure of uncertainty relative to an observer for misclassification detection. By learning patterns in the distribution of soft-predictions, our uncertainty measure can identify misclassified samples based on the predicted class probabilities. 
Interestingly, according to the proposed measure, soft-predictions corresponding to misclassified instances can carry a large amount of uncertainty, even though they may have low Shannon entropy. We demonstrate empirical improvements over multiple image classification tasks, outperforming state-of-the-art misclassification detection methods. \ No newline at end of file diff --git a/data/2024/iclr/A Differentially Private Clustering Algorithm for Well-Clustered Graphs b/data/2024/iclr/A Differentially Private Clustering Algorithm for Well-Clustered Graphs new file mode 100644 index 0000000000..43461c20d4 --- /dev/null +++ b/data/2024/iclr/A Differentially Private Clustering Algorithm for Well-Clustered Graphs @@ -0,0 +1 @@ +We study differentially private (DP) algorithms for recovering clusters in well-clustered graphs, which are graphs whose vertex set can be partitioned into a small number of sets, each inducing a subgraph of high inner conductance and small outer conductance. Such graphs have widespread application as a benchmark in the theoretical analysis of spectral clustering. We provide an efficient ($\epsilon$,$\delta$)-DP algorithm tailored specifically for such graphs. Our algorithm draws inspiration from the recent work of Chen et al., who developed DP algorithms for recovery of stochastic block models in cases where the graph comprises exactly two nearly-balanced clusters. Our algorithm works for well-clustered graphs with $k$ nearly-balanced clusters, and the misclassification ratio almost matches the one of the best-known non-private algorithms. We conduct experimental evaluations on datasets with known ground truth clusters to substantiate the prowess of our algorithm. We also show that any (pure) $\epsilon$-DP algorithm would result in substantial error. \ No newline at end of file diff --git a/data/2024/iclr/A Discretization Framework for Robust Contextual Stochastic Optimization b/data/2024/iclr/A Discretization Framework for Robust Contextual Stochastic Optimization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Dynamical View of the Question of Why b/data/2024/iclr/A Dynamical View of the Question of Why new file mode 100644 index 0000000000..71d3a6c308 --- /dev/null +++ b/data/2024/iclr/A Dynamical View of the Question of Why @@ -0,0 +1 @@ +We address causal reasoning in multivariate time series data generated by stochastic processes. Existing approaches are largely restricted to static settings, ignoring the continuity and emission of variations across time. In contrast, we propose a learning paradigm that directly establishes causation between events in the course of time. We present two key lemmas to compute causal contributions and frame them as reinforcement learning problems. Our approach offers formal and computational tools for uncovering and quantifying causal relationships in diffusion processes, subsuming various important settings such as discrete-time Markov decision processes. Finally, in fairly intricate experiments and through sheer learning, our framework reveals and quantifies causal links, which otherwise seem inexplicable. 
\ No newline at end of file diff --git a/data/2024/iclr/A Fast and Provable Algorithm for Sparse Phase Retrieval b/data/2024/iclr/A Fast and Provable Algorithm for Sparse Phase Retrieval new file mode 100644 index 0000000000..3cfd710a27 --- /dev/null +++ b/data/2024/iclr/A Fast and Provable Algorithm for Sparse Phase Retrieval @@ -0,0 +1 @@ +We study the sparse phase retrieval problem, which seeks to recover a sparse signal from a limited set of magnitude-only measurements. In contrast to prevalent sparse phase retrieval algorithms that primarily use first-order methods, we propose an innovative second-order algorithm that employs a Newton-type method with hard thresholding. This algorithm overcomes the linear convergence limitations of first-order methods while preserving their hallmark per-iteration computational efficiency. We provide theoretical guarantees that our algorithm converges to the $s$-sparse ground truth signal $\mathbf{x}^{\natural} \in \mathbb{R}^n$ (up to a global sign) at a quadratic convergence rate after at most $O(\log (\Vert\mathbf{x}^{\natural} \Vert /x_{\min}^{\natural}))$ iterations, using $\Omega(s^2\log n)$ Gaussian random samples. Numerical experiments show that our algorithm achieves a significantly faster convergence rate than state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/iclr/A Flexible Generative Model for Heterogeneous Tabular EHR with Missing Modality b/data/2024/iclr/A Flexible Generative Model for Heterogeneous Tabular EHR with Missing Modality new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Foundation Model for Error Correction Codes b/data/2024/iclr/A Foundation Model for Error Correction Codes new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Framework for Inference Inspired by Human Memory Mechanisms b/data/2024/iclr/A Framework for Inference Inspired by Human Memory Mechanisms new file mode 100644 index 0000000000..d2a44422d9 --- /dev/null +++ b/data/2024/iclr/A Framework for Inference Inspired by Human Memory Mechanisms @@ -0,0 +1 @@ +How humans and machines make sense of current inputs for relation reasoning and question-answering, while putting the perceived information into the context of our past memories, has been a challenging conundrum in cognitive science and artificial intelligence. Inspired by the human brain's memory system and cognitive architectures, we propose a PMI framework that consists of perception, memory and inference components. Notably, the memory module comprises working and long-term memory, with the latter endowed with a higher-order structure to retain extensive and complex relational knowledge and experience. Through a differentiable competitive write access, current perceptions update working memory, which is later merged with long-term memory via outer product associations, reducing information conflicts and averting memory overflow. In the inference module, relevant information is retrieved from two separate memory origins and associatively integrated to attain a more comprehensive and precise interpretation of current perceptions. We apply our PMI framework to improve prevailing Transformer and CNN models on question-answering tasks like the bAbI-20k and Sort-of-CLEVR datasets, as well as on detecting equilateral triangles, language modeling and image classification tasks, and in each case, our PMI enhancements consistently and significantly outperform their original counterparts.
Visualization analyses reveal that relational memory consolidation, along with the interaction and integration of information from diverse memory sources, contributes substantially to the model's effectiveness on inference tasks. \ No newline at end of file diff --git a/data/2024/iclr/A General Framework for User-Guided Bayesian Optimization b/data/2024/iclr/A General Framework for User-Guided Bayesian Optimization new file mode 100644 index 0000000000..0b8ab8b8d4 --- /dev/null +++ b/data/2024/iclr/A General Framework for User-Guided Bayesian Optimization @@ -0,0 +1 @@ +The optimization of expensive-to-evaluate black-box functions is prevalent in various scientific disciplines. Bayesian optimization is an automatic, general and sample-efficient method to solve these problems with minimal knowledge of the underlying function dynamics. However, the ability of Bayesian optimization to incorporate prior knowledge or beliefs about the function at hand in order to accelerate the optimization is limited, which reduces its appeal for knowledgeable practitioners with tight budgets. To allow domain experts to customize the optimization routine, we propose ColaBO, the first Bayesian-principled framework for incorporating prior beliefs beyond the typical kernel structure, such as the likely location of the optimizer or the optimal value. The generality of ColaBO makes it applicable across different Monte Carlo acquisition functions and types of user beliefs. We empirically demonstrate ColaBO's ability to substantially accelerate optimization when the prior information is accurate, and to retain approximately default performance when it is misleading. \ No newline at end of file diff --git a/data/2024/iclr/A Good Learner can Teach Better: Teacher-Student Collaborative Knowledge Distillation b/data/2024/iclr/A Good Learner can Teach Better: Teacher-Student Collaborative Knowledge Distillation new file mode 100644 index 0000000000..59d4dd12bd --- /dev/null +++ b/data/2024/iclr/A Good Learner can Teach Better: Teacher-Student Collaborative Knowledge Distillation @@ -0,0 +1 @@ +Knowledge distillation (KD) is a technique used to transfer knowledge from a larger “teacher” model into a smaller “student” model. Recent advancements in meta-learning-based knowledge distillation (MetaKD) emphasize that the fine-tuning of teacher models should be aware of the student’s need to achieve better knowledge distillation. However, existing MetaKD methods often lack incentives for the teacher model to improve itself. In this study, we introduce MPDistil, a meta-policy distillation technique that utilizes novel optimization strategies to foster both collaboration and competition during the fine-tuning of the teacher model in the meta-learning step. Additionally, we propose a curriculum learning framework for the student model in a competitive setup, in which the student model aims to outperform the teacher model by self-training on various tasks. Exhaustive experiments on the SuperGLUE and GLUE benchmarks demonstrate the efficacy of MPDistil compared to 20 conventional KD and advanced MetaKD baselines, showing significant performance enhancements in the student model – e.g., a distilled 6-layer BERT model outperforms a 12-layer BERT model on five out of six SuperGLUE tasks. Furthermore, MPDistil, when applied to a large language teacher model (DeBERTa-v2-xxlarge), significantly narrows the performance gap of its smaller student counterpart (DeBERTa-12) by just 4.6% on SuperGLUE.
We further demonstrate how higher rewards and customized training curricula strengthen the student model and enhance generalizability. \ No newline at end of file diff --git a/data/2024/iclr/A Graph is Worth 1-bit Spikes: When Graph Contrastive Learning Meets Spiking Neural Networks b/data/2024/iclr/A Graph is Worth 1-bit Spikes: When Graph Contrastive Learning Meets Spiking Neural Networks new file mode 100644 index 0000000000..3104e90151 --- /dev/null +++ b/data/2024/iclr/A Graph is Worth 1-bit Spikes: When Graph Contrastive Learning Meets Spiking Neural Networks @@ -0,0 +1 @@ +While contrastive self-supervised learning has become the de-facto learning paradigm for graph neural networks, the pursuit of higher task accuracy requires a larger hidden dimensionality to learn informative and discriminative full-precision representations, raising concerns about computation, memory footprint, and energy consumption burden (largely overlooked) for real-world applications. This work explores a promising direction for graph contrastive learning (GCL) with spiking neural networks (SNNs), which leverage sparse and binary characteristics to learn more biologically plausible and compact representations. We propose SpikeGCL, a novel GCL framework to learn binarized 1-bit representations for graphs, making balanced trade-offs between efficiency and performance. We provide theoretical guarantees to demonstrate that SpikeGCL has comparable expressiveness with its full-precision counterparts. Experimental results demonstrate that, with nearly 32x representation storage compression, SpikeGCL is either comparable to or outperforms many fancy state-of-the-art supervised and self-supervised methods across several graph benchmarks. \ No newline at end of file diff --git a/data/2024/iclr/A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation b/data/2024/iclr/A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation new file mode 100644 index 0000000000..cad1f745f0 --- /dev/null +++ b/data/2024/iclr/A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation @@ -0,0 +1 @@ +Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity. Recent research has focused on developing efficient fine-tuning methods, such as prompt learning and adapter, to enhance CLIP's performance in downstream tasks. However, these methods still require additional training time and computational resources, which is undesirable for devices with limited resources. In this paper, we revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP. Typically, GDA assumes that features of each class follow Gaussian distributions with identical covariance. By leveraging Bayes' formula, the classifier can be expressed in terms of the class means and covariance, which can be estimated from the data without the need for training. To integrate knowledge from both visual and textual modalities, we ensemble it with the original zero-shot classifier within CLIP. Extensive results on 17 datasets validate that our method surpasses or achieves comparable results with state-of-the-art methods on few-shot classification, imbalanced learning, and out-of-distribution generalization. In addition, we extend our method to base-to-new generalization and unsupervised learning, once again demonstrating its superiority over competing approaches. Our code is publicly available at \url{https://github.com/mrflogs/ICLR24}. 
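The training-free recipe in the abstract above is simple enough to sketch in a few lines. The NumPy snippet below is a hedged illustration, not the authors' released code: it estimates per-class means and a shared, shrinkage-regularized covariance from few-shot CLIP image features, converts them into a linear GDA head via Bayes' formula, and adds its logits to the usual zero-shot text logits. The shrinkage constant, ensemble weight, temperature, and function names are assumptions made for illustration.

```python
import numpy as np

def gda_head(feats: np.ndarray, labels: np.ndarray, num_classes: int,
             shrinkage: float = 0.1):
    """Training-free Gaussian Discriminant Analysis head (illustrative sketch).

    feats:  (N, D) image features from the frozen vision encoder (assumed
            L2-normalized, with at least one example per class).
    labels: (N,) integer class labels in [0, num_classes).
    Returns W (C, D) and b (C,) such that logits = feats @ W.T + b.
    """
    d = feats.shape[1]
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = feats - means[labels]                    # remove class means
    cov = centered.T @ centered / feats.shape[0]        # shared covariance estimate
    cov += shrinkage * np.eye(d)                        # shrinkage regularization
    precision = np.linalg.inv(cov)
    priors = np.bincount(labels, minlength=num_classes) / labels.shape[0]
    W = means @ precision                               # (C, D)
    b = np.log(priors + 1e-12) - 0.5 * np.einsum("cd,cd->c", W, means)
    return W, b

def ensembled_logits(test_feats, text_feats, W, b, alpha=1.0, temperature=100.0):
    """Sum CLIP's zero-shot logits with the GDA logits (weighting is a guess)."""
    zero_shot = temperature * test_feats @ text_feats.T  # cosine-style logits
    gda = test_feats @ W.T + b
    return zero_shot + alpha * gda
```

Because both heads are closed-form, adapting to a new few-shot task reduces to a handful of matrix operations, which is the main appeal of this style of baseline.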
\ No newline at end of file diff --git a/data/2024/iclr/A Hierarchical Bayesian Model for Few-Shot Meta Learning b/data/2024/iclr/A Hierarchical Bayesian Model for Few-Shot Meta Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Lie Group Approach to Riemannian Batch Normalization b/data/2024/iclr/A Lie Group Approach to Riemannian Batch Normalization new file mode 100644 index 0000000000..700085e11a --- /dev/null +++ b/data/2024/iclr/A Lie Group Approach to Riemannian Batch Normalization @@ -0,0 +1 @@ +Manifold-valued measurements exist in numerous applications within computer vision and machine learning. Recent studies have extended Deep Neural Networks (DNNs) to manifolds, and concomitantly, normalization techniques have also been adapted to several manifolds, referred to as Riemannian normalization. Nonetheless, most of the existing Riemannian normalization methods have been derived in an ad hoc manner and only apply to specific manifolds. This paper establishes a unified framework for Riemannian Batch Normalization (RBN) techniques on Lie groups. Our framework offers the theoretical guarantee of controlling both the Riemannian mean and variance. Empirically, we focus on Symmetric Positive Definite (SPD) manifolds, which possess three distinct types of Lie group structures. Using the deformation concept, we generalize the existing Lie groups on SPD manifolds into three families of parameterized Lie groups. Specific normalization layers induced by these Lie groups are then proposed for SPD neural networks. We demonstrate the effectiveness of our approach through three sets of experiments: radar recognition, human action recognition, and electroencephalography (EEG) classification. The code is available at https://github.com/GitZH-Chen/LieBN.git. \ No newline at end of file diff --git a/data/2024/iclr/A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging b/data/2024/iclr/A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging new file mode 100644 index 0000000000..c904b8c184 --- /dev/null +++ b/data/2024/iclr/A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging @@ -0,0 +1 @@ +In federated learning (FL), clients usually have diverse participation statistics that are unknown a priori, which can significantly harm the performance of FL if not handled properly. Existing works aiming at addressing this problem are usually based on global variance reduction, which requires a substantial amount of additional memory in a multiplicative factor equal to the total number of clients. An important open problem is to find a lightweight method for FL in the presence of clients with unknown participation rates. In this paper, we address this problem by adapting the aggregation weights in federated averaging (FedAvg) based on the participation history of each client. We first show that, with heterogeneous participation statistics, FedAvg with non-optimal aggregation weights can diverge from the optimal solution of the original FL objective, indicating the need of finding optimal aggregation weights. However, it is difficult to compute the optimal weights when the participation statistics are unknown. To address this problem, we present a new algorithm called FedAU, which improves FedAvg by adaptively weighting the client updates based on online estimates of the optimal weights without knowing the statistics of client participation. 
We provide a theoretical convergence analysis of FedAU using a novel methodology to connect the estimation error and convergence. Our theoretical results reveal important and interesting insights, while showing that FedAU converges to an optimal solution of the original objective and has desirable properties such as linear speedup. Our experimental results also verify the advantage of FedAU over baseline methods with various participation patterns. \ No newline at end of file diff --git a/data/2024/iclr/A Linear Algebraic Framework for Counterfactual Generation b/data/2024/iclr/A Linear Algebraic Framework for Counterfactual Generation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Multi-Level Framework for Accelerating Training Transformer Models b/data/2024/iclr/A Multi-Level Framework for Accelerating Training Transformer Models new file mode 100644 index 0000000000..f36999a4ca --- /dev/null +++ b/data/2024/iclr/A Multi-Level Framework for Accelerating Training Transformer Models @@ -0,0 +1 @@ +The fast-growing capabilities of large-scale deep learning models, such as Bert, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing power, which incurs exponentially increasing energy cost and carbon dioxide emissions. It is thus critical to develop efficient training solutions to reduce the training costs. Motivated by a set of key observations of inter- and intra-layer similarities among feature maps and attentions that can be identified from typical training processes, we propose a multi-level framework for training acceleration. Specifically, the framework is based on three basic operators, Coalescing, De-coalescing and Interpolation, which can be orchestrated to build a multi-level training framework. The framework consists of a V-cycle training process, which progressively down- and up-scales the model size and projects the parameters between adjacent levels of models via coalescing and de-coalescing. The key idea is that a smaller model can be trained quickly to convergence, and its trained parameters then provide high-quality intermediate solutions for the larger network at the next level. The interpolation operator is designed to break the symmetry of neurons incurred by de-coalescing for better convergence performance. Our experiments on transformer-based language models (e.g. Bert, GPT) as well as a vision model (e.g. DeiT) show that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.
To accommodate the case where some labelled data are available at the clients, we extend our SimCLR variant to the federated semi-supervised setting. We see that a supervised SimCLR objective can be obtained with two changes: a) the contrastive loss is computed between datapoints that share the same label and b) we require an additional auxiliary head that predicts the correct labels from either of the two views. Along with the proposed SimCLR extensions, we also study how different sources of non-i.i.d.-ness can impact the performance of federated unsupervised learning through global mutual information maximization; we find that a global objective is beneficial for some sources of non-i.i.d.-ness but can be detrimental for others. We empirically evaluate our proposed extensions in various tasks to validate our claims and furthermore demonstrate that our proposed modifications generalize to other pretraining methods. \ No newline at end of file diff --git a/data/2024/iclr/A Neural Framework for Generalized Causal Sensitivity Analysis b/data/2024/iclr/A Neural Framework for Generalized Causal Sensitivity Analysis new file mode 100644 index 0000000000..bda680e10a --- /dev/null +++ b/data/2024/iclr/A Neural Framework for Generalized Causal Sensitivity Analysis @@ -0,0 +1 @@ +Unobserved confounding is common in many applications, making causal inference from observational data challenging. As a remedy, causal sensitivity analysis is an important tool to draw causal conclusions under unobserved confounding with mathematical guarantees. In this paper, we propose NeuralCSA, a neural framework for generalized causal sensitivity analysis. Unlike previous work, our framework is compatible with (i) a large class of sensitivity models, including the marginal sensitivity model, f-sensitivity models, and Rosenbaum's sensitivity model; (ii) different treatment types (i.e., binary and continuous); and (iii) different causal queries, including (conditional) average treatment effects and simultaneous effects on multiple outcomes. The generality of NeuralCSA is achieved by learning a latent distribution shift that corresponds to a treatment intervention using two conditional normalizing flows. We provide theoretical guarantees that NeuralCSA is able to infer valid bounds on the causal query of interest and also demonstrate this empirically using both simulated and real-world data. \ No newline at end of file diff --git a/data/2024/iclr/A Newborn Embodied Turing Test for Comparing Object Segmentation Across Animals and Machines b/data/2024/iclr/A Newborn Embodied Turing Test for Comparing Object Segmentation Across Animals and Machines new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models b/data/2024/iclr/A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models new file mode 100644 index 0000000000..b4e445417d --- /dev/null +++ b/data/2024/iclr/A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models @@ -0,0 +1 @@ +Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. However, these advances have not been reflected in the translation task, especially for models with moderate sizes (i.e., 7B or 13B parameters), which still lag behind conventional supervised encoder-decoder translation models.
Previous studies have attempted to improve the translation capabilities of these moderate LLMs, but their gains have been limited. In this study, we propose a novel fine-tuning approach for LLMs that is specifically designed for the translation task, eliminating the need for the abundant parallel data that traditional translation models usually depend on. Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data followed by subsequent fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed through this strategy as Advanced Language Model-based trAnslator (ALMA). Based on LLaMA-2 as our underlying model, our results show that the model can achieve an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance across 10 translation directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test datasets. The performance is significantly better than all prior work and even superior to the NLLB-54B model and GPT-3.5-text-davinci-003, with only 7B or 13B parameters. This method establishes the foundation for a novel training paradigm in machine translation. \ No newline at end of file diff --git a/data/2024/iclr/A Plug-and-Play Image Registration Network b/data/2024/iclr/A Plug-and-Play Image Registration Network new file mode 100644 index 0000000000..50975ef848 --- /dev/null +++ b/data/2024/iclr/A Plug-and-Play Image Registration Network @@ -0,0 +1 @@ +Deformable image registration (DIR) is an active research topic in biomedical imaging. There is a growing interest in developing DIR methods based on deep learning (DL). A traditional DL approach to DIR is based on training a convolutional neural network (CNN) to estimate the registration field between two input images. While conceptually simple, this approach comes with a limitation that it exclusively relies on a pre-trained CNN without explicitly enforcing fidelity between the registered image and the reference. We present plug-and-play image registration network (PIRATE) as a new DIR method that addresses this issue by integrating an explicit data-fidelity penalty and a CNN prior. PIRATE pre-trains a CNN denoiser on the registration field and"plugs"it into an iterative method as a regularizer. We additionally present PIRATE+ that fine-tunes the CNN prior in PIRATE using deep equilibrium models (DEQ). PIRATE+ interprets the fixed-point iteration of PIRATE as a network with effectively infinite layers and then trains the resulting network end-to-end, enabling it to learn more task-specific information and boosting its performance. Our numerical results on OASIS and CANDI datasets show that our methods achieve state-of-the-art performance on DIR. \ No newline at end of file diff --git "a/data/2024/iclr/A Poincar\303\251 Inequality and Consistency Results for Signal Sampling on Large Graphs" "b/data/2024/iclr/A Poincar\303\251 Inequality and Consistency Results for Signal Sampling on Large Graphs" new file mode 100644 index 0000000000..1ee6bf4d8e --- /dev/null +++ "b/data/2024/iclr/A Poincar\303\251 Inequality and Consistency Results for Signal Sampling on Large Graphs" @@ -0,0 +1 @@ +Large-scale graph machine learning is challenging as the complexity of learning models scales with the graph size. Subsampling the graph is a viable alternative, but sampling on graphs is nontrivial as graphs are non-Euclidean. 
Existing graph sampling techniques require not only computing the spectra of large matrices but also repeating these computations when the graph changes, e.g., grows. In this paper, we introduce a signal sampling theory for a type of graph limit -- the graphon. We prove a Poincar\'e inequality for graphon signals and show that complements of node subsets satisfying this inequality are unique sampling sets for Paley-Wiener spaces of graphon signals. Exploiting connections with spectral clustering and Gaussian elimination, we prove that such sampling sets are consistent in the sense that unique sampling sets on a convergent graph sequence converge to unique sampling sets on the graphon. We then propose a related graphon signal sampling algorithm for large graphs, and demonstrate its good empirical performance on graph machine learning tasks. \ No newline at end of file diff --git a/data/2024/iclr/A Policy Gradient Method for Confounded POMDPs b/data/2024/iclr/A Policy Gradient Method for Confounded POMDPs new file mode 100644 index 0000000000..32f35bdc35 --- /dev/null +++ b/data/2024/iclr/A Policy Gradient Method for Confounded POMDPs @@ -0,0 +1 @@ +In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data. The identification enables us to solve a sequence of conditional moment restrictions and adopt the min-max learning procedure with general function approximation for estimating the policy gradient. We then provide a finite-sample non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class in terms of the sample size, length of horizon, concentratability coefficient and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimation in the gradient ascent algorithm, we show the global convergence of the proposed algorithm in finding the history-dependent optimal policy under some technical conditions. To the best of our knowledge, this is the first work studying the policy gradient method for POMDPs under the offline setting. \ No newline at end of file diff --git a/data/2024/iclr/A Precise Characterization of SGD Stability Using Loss Surface Geometry b/data/2024/iclr/A Precise Characterization of SGD Stability Using Loss Surface Geometry new file mode 100644 index 0000000000..a4637da8a8 --- /dev/null +++ b/data/2024/iclr/A Precise Characterization of SGD Stability Using Loss Surface Geometry @@ -0,0 +1 @@ +Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical successes but relatively limited theoretical understanding. Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it instigates. Several studies have investigated the linear stability property of SGD in the vicinity of a stationary point as a predictive proxy for sharpness and generalization error in overparameterized neural networks (Wu et al., 2022; Jastrzebski et al., 2019; Cohen et al., 2021). In this paper, we delve deeper into the relationship between linear stability and sharpness. 
More specifically, we meticulously delineate the necessary and sufficient conditions for linear stability, contingent on hyperparameters of SGD and the sharpness at the optimum. Towards this end, we introduce a novel coherence measure of the loss Hessian that encapsulates pertinent geometric properties of the loss function that are relevant to the linear stability of SGD. It enables us to provide a simplified sufficient condition for identifying linear instability at an optimum. Notably, compared to previous works, our analysis relies on significantly milder assumptions and is applicable for a broader class of loss functions than known before, encompassing not only mean-squared error but also cross-entropy loss. \ No newline at end of file diff --git a/data/2024/iclr/A Primal-Dual Approach to Solving Variational Inequalities with General Constraints b/data/2024/iclr/A Primal-Dual Approach to Solving Variational Inequalities with General Constraints new file mode 100644 index 0000000000..263631c66d --- /dev/null +++ b/data/2024/iclr/A Primal-Dual Approach to Solving Variational Inequalities with General Constraints @@ -0,0 +1 @@ +Yang et al. (2023) recently showed how to use first-order gradient methods to solve general variational inequalities (VIs) under a limiting assumption that analytic solutions of specific subproblems are available. In this paper, we circumvent this assumption via a warm-starting technique where we solve subproblems approximately and initialize variables with the approximate solution found at the previous iteration. We prove the convergence of this method and show that the gap function of the last iterate of the method decreases at a rate of $O(\frac{1}{\sqrt{K}})$ when the operator is $L$-Lipschitz and monotone. In numerical experiments, we show that this technique can converge much faster than its exact counterpart. Furthermore, for the cases when the inequality constraints are simple, we introduce an alternative variant of ACVI and establish its convergence under the same conditions. Finally, we relax the smoothness assumptions in Yang et al., yielding, to our knowledge, the first convergence result for VIs with general constraints that does not rely on the assumption that the operator is $L$-Lipschitz. \ No newline at end of file diff --git a/data/2024/iclr/A Probabilistic Framework for Modular Continual Learning b/data/2024/iclr/A Probabilistic Framework for Modular Continual Learning new file mode 100644 index 0000000000..291c847966 --- /dev/null +++ b/data/2024/iclr/A Probabilistic Framework for Modular Continual Learning @@ -0,0 +1 @@ +Modular approaches that use a different composition of modules for each problem are a promising direction in continual learning (CL). However, searching through the large, discrete space of module compositions is challenging, especially because evaluating a composition's performance requires a round of neural network training. We address this challenge through a modular CL framework, PICLE, that uses a probabilistic model to cheaply compute the fitness of each composition, allowing PICLE to achieve both perceptual, few-shot and latent transfer. The model combines prior knowledge about good module compositions with dataset-specific information. We evaluate PICLE using two benchmark suites designed to assess different desiderata of CL techniques. 
Comparing to a wide range of approaches, we show that PICLE is the first modular CL algorithm to achieve perceptual, few-shot and latent transfer while scaling well to large search spaces, outperforming previous state-of-the-art modular CL approaches on long problem sequences. \ No newline at end of file diff --git a/data/2024/iclr/A Quadratic Synchronization Rule for Distributed Deep Learning b/data/2024/iclr/A Quadratic Synchronization Rule for Distributed Deep Learning new file mode 100644 index 0000000000..8ba6826f1b --- /dev/null +++ b/data/2024/iclr/A Quadratic Synchronization Rule for Distributed Deep Learning @@ -0,0 +1 @@ +In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for $H$ steps without synchronizing with others, hence reducing communication frequency. While $H$ has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper $H$ value can lead to generalization improvement. Yet, selecting a proper $H$ is elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared with the standard data parallel training, QSR enables Local AdamW on ViT-B to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves $1.16\%$ or $0.84\%$ higher top-1 validation accuracy. \ No newline at end of file diff --git a/data/2024/iclr/A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis b/data/2024/iclr/A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis new file mode 100644 index 0000000000..d68426edea --- /dev/null +++ b/data/2024/iclr/A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis @@ -0,0 +1 @@ +Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. 
We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation. \ No newline at end of file diff --git a/data/2024/iclr/A Recipe for Improved Certifiable Robustness b/data/2024/iclr/A Recipe for Improved Certifiable Robustness new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Restoration Network as an Implicit Prior b/data/2024/iclr/A Restoration Network as an Implicit Prior new file mode 100644 index 0000000000..d3e3a47866 --- /dev/null +++ b/data/2024/iclr/A Restoration Network as an Implicit Prior @@ -0,0 +1 @@ +Image denoisers have been shown to be powerful priors for solving inverse problems in imaging. In this work, we introduce a generalization of these methods that allows any image restoration network to be used as an implicit prior. The proposed method uses priors specified by deep neural networks pre-trained as general restoration operators. The method provides a principled approach for adapting state-of-the-art restoration models for other inverse problems. Our theoretical result analyzes its convergence to a stationary point of a global functional associated with the restoration operator. Numerical results show that the method using a super-resolution prior achieves state-of-the-art performance both quantitatively and qualitatively. Overall, this work offers a step forward for solving inverse problems by enabling the use of powerful pre-trained restoration models as priors. \ No newline at end of file diff --git a/data/2024/iclr/A Semantic Invariant Robust Watermark for Large Language Models b/data/2024/iclr/A Semantic Invariant Robust Watermark for Large Language Models new file mode 100644 index 0000000000..b1e2b71076 --- /dev/null +++ b/data/2024/iclr/A Semantic Invariant Robust Watermark for Large Language Models @@ -0,0 +1 @@ +Watermark algorithms for large language models (LLMs) have achieved extremely high accuracy in detecting text generated by LLMs. Such algorithms typically involve adding extra watermark logits to the LLM's logits at each generation step. However, prior algorithms face a trade-off between attack robustness and security robustness. This is because the watermark logits for a token are determined by a certain number of preceding tokens; a small number leads to low security robustness, while a large number results in insufficient attack robustness. In this work, we propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness. The watermark logits in our work are determined by the semantics of all preceding tokens. Specifically, we utilize another embedding LLM to generate semantic embeddings for all preceding tokens, and then these semantic embeddings are transformed into the watermark logits through our trained watermark model. Subsequent analyses and experiments demonstrated the attack robustness of our method in semantically invariant settings: synonym substitution and text paraphrasing settings. Finally, we also show that our watermark possesses adequate security robustness. Our code and data are available at \href{https://github.com/THU-BPM/Robust_Watermark}{https://github.com/THU-BPM/Robust\_Watermark}. 
Additionally, our algorithm could also be accessed through MarkLLM \citep{pan2024markllm} \footnote{https://github.com/THU-BPM/MarkLLM}. \ No newline at end of file diff --git a/data/2024/iclr/A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis b/data/2024/iclr/A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis new file mode 100644 index 0000000000..0fbd8922b6 --- /dev/null +++ b/data/2024/iclr/A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis @@ -0,0 +1 @@ +We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully connected layer to incorporate class information to make predictions, we investigate a proactive approach, asking each class to search for itself in an image. We realize this idea via a Transformer encoder-decoder inspired by DEtection TRansformer (DETR). We learn "class-specific" queries (one for each class) as input to the decoder, enabling each class to localize its patterns in an image via cross-attention. We name our approach INterpretable TRansformer (INTR), which is fairly easy to implement and exhibits several compelling properties. We show that INTR intrinsically encourages each class to attend distinctively; the cross-attention weights thus provide a faithful interpretation of the prediction. Interestingly, via "multi-head" cross-attention, INTR could identify different "attributes" of a class, making it particularly suitable for fine-grained classification and analysis, which we demonstrate on eight datasets. Our code and pre-trained models are publicly accessible at the Imageomics Institute GitHub site: https://github.com/Imageomics/INTR. \ No newline at end of file diff --git a/data/2024/iclr/A Simple Romance Between Multi-Exit Vision Transformer and Token Reduction b/data/2024/iclr/A Simple Romance Between Multi-Exit Vision Transformer and Token Reduction new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Simple and Effective Pruning Approach for Large Language Models b/data/2024/iclr/A Simple and Effective Pruning Approach for Large Language Models new file mode 100644 index 0000000000..c15a0e63b6 --- /dev/null +++ b/data/2024/iclr/A Simple and Effective Pruning Approach for Large Language Models @@ -0,0 +1 @@ +As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. 
Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight updates. Code is available at https://github.com/locuslab/wanda. \ No newline at end of file diff --git a/data/2024/iclr/A Simple and Scalable Representation for Graph Generation b/data/2024/iclr/A Simple and Scalable Representation for Graph Generation new file mode 100644 index 0000000000..88f577f4d8 --- /dev/null +++ b/data/2024/iclr/A Simple and Scalable Representation for Graph Generation @@ -0,0 +1 @@ +Recently, there has been a surge of interest in employing neural networks for graph generation, a fundamental statistical learning problem with critical applications like molecule design and community analysis. However, most approaches encounter significant limitations when generating large-scale graphs. This is due to their requirement to output the full adjacency matrices whose size grows quadratically with the number of nodes. In response to this challenge, we introduce a new, simple, and scalable graph representation named gap encoded edge list (GEEL) that has a small representation size that aligns with the number of edges. In addition, GEEL significantly reduces the vocabulary size by incorporating the gap encoding and bandwidth restriction schemes. GEEL can be autoregressively generated with the incorporation of node positional encoding, and we further extend GEEL to deal with attributed graphs by designing a new grammar. Our findings reveal that the adoption of this compact representation not only enhances scalability but also bolsters performance by simplifying the graph generation process. We conduct a comprehensive evaluation across ten non-attributed and two molecular graph generation tasks, demonstrating the effectiveness of GEEL. \ No newline at end of file diff --git a/data/2024/iclr/A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks b/data/2024/iclr/A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks new file mode 100644 index 0000000000..10d7e7483c --- /dev/null +++ b/data/2024/iclr/A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks @@ -0,0 +1 @@ +Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience. Training such models, however, is quite inefficient and unstable. In this work, we show how simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one, and has theoretical guarantees in terms of convergence. The proposed algorithm, which we call incremental predictive coding (iPC), is also more biologically plausible than the original one, as it is fully automatic. In an extensive set of experiments, we show that iPC consistently performs better than the original formulation on a large number of benchmarks for image classification, as well as for the training of both conditional and masked language models, in terms of test accuracy, efficiency, and convergence with respect to a large set of hyperparameters. 
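For the pruning criterion described in "A Simple and Effective Pruning Approach for Large Language Models" above, a minimal sketch follows: weight importance is taken as weight magnitude times the corresponding input-activation norm, and the lowest-scoring weights are dropped within each output row. The array shapes, the calibration tensor, and the 50% sparsity ratio are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def wanda_prune_mask(W, X, sparsity=0.5):
    """Sketch of the described criterion: score = |weight| * input-feature norm,
    with the lowest-scoring weights dropped independently within each output row.
    W: (out_features, in_features) weights of one linear layer.
    X: (num_tokens, in_features) calibration activations feeding that layer."""
    feat_norm = np.linalg.norm(X, axis=0)              # ||X_j||_2 per input feature
    score = np.abs(W) * feat_norm[None, :]             # per-output comparison groups
    k = int(W.shape[1] * sparsity)                     # weights to drop in each row
    if k == 0:
        return np.ones_like(W, dtype=bool)
    cutoff = np.partition(score, k - 1, axis=1)[:, k - 1:k]
    return score > cutoff                              # apply as W * mask; no retraining

# toy usage with random weights and calibration activations
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(32, 16))
print(wanda_prune_mask(W, X).mean())                   # roughly half the weights kept
```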
\ No newline at end of file diff --git a/data/2024/iclr/A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data b/data/2024/iclr/A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data new file mode 100644 index 0000000000..fbc604fa98 --- /dev/null +++ b/data/2024/iclr/A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data @@ -0,0 +1 @@ +Variational Autoencoders (VAEs) have gained significant popularity among researchers as a powerful tool for understanding unknown distributions based on limited samples. This popularity stems partly from their impressive performance and partly from their ability to provide meaningful feature representations in the latent space. Wasserstein Autoencoders (WAEs), a variant of VAEs, aim to not only improve model efficiency but also interpretability. However, there has been limited focus on analyzing their statistical guarantees. The matter is further complicated by the fact that the data distributions to which WAEs are applied - such as natural images - are often presumed to possess an underlying low-dimensional structure within a high-dimensional feature space, which current theory does not adequately account for, rendering known bounds inefficient. To bridge the gap between the theory and practice of WAEs, in this paper, we show that WAEs can learn the data distributions when the network architectures are properly chosen. We show that the convergence rates of the expected excess risk in the number of samples for WAEs are independent of the high feature dimension, instead relying only on the intrinsic dimension of the data distribution. \ No newline at end of file diff --git a/data/2024/iclr/A Study of Bayesian Neural Network Surrogates for Bayesian Optimization b/data/2024/iclr/A Study of Bayesian Neural Network Surrogates for Bayesian Optimization new file mode 100644 index 0000000000..34e7a23050 --- /dev/null +++ b/data/2024/iclr/A Study of Bayesian Neural Network Surrogates for Bayesian Optimization @@ -0,0 +1 @@ +Bayesian optimization is a highly efficient approach to optimizing objective functions which are expensive to query. These objectives are typically represented by Gaussian process (GP) surrogate models which are easy to optimize and support exact inference. While standard GP surrogates have been well-established in Bayesian optimization, Bayesian neural networks (BNNs) have recently become practical function approximators, with many benefits over standard GPs such as the ability to naturally handle non-stationarity and learn representations for high-dimensional data. In this paper, we study BNNs as alternatives to standard GP surrogates for optimization. We consider a variety of approximate inference procedures for finite-width BNNs, including high-quality Hamiltonian Monte Carlo, low-cost stochastic MCMC, and heuristics such as deep ensembles. We also consider infinite-width BNNs, linearized Laplace approximations, and partially stochastic models such as deep kernel learning. We evaluate this collection of surrogate models on diverse problems with varying dimensionality, number of objectives, non-stationarity, and discrete and continuous inputs. 
We find: (i) the ranking of methods is highly problem dependent, suggesting the need for tailored inductive biases; (ii) HMC is the most successful approximate inference procedure for fully stochastic BNNs; (iii) full stochasticity may be unnecessary as deep kernel learning is relatively competitive; (iv) deep ensembles perform relatively poorly; (v) infinite-width BNNs are particularly promising, especially in high dimensions. \ No newline at end of file diff --git a/data/2024/iclr/A Sublinear Adversarial Training Algorithm b/data/2024/iclr/A Sublinear Adversarial Training Algorithm new file mode 100644 index 0000000000..b8040107b3 --- /dev/null +++ b/data/2024/iclr/A Sublinear Adversarial Training Algorithm @@ -0,0 +1 @@ +Adversarial training is a widely used strategy for making neural networks resistant to adversarial perturbations. For a neural network of width $m$ and $n$ input training data points in $d$ dimensions, it takes $\Omega(mnd)$ time cost per training iteration for the forward and backward computation. In this paper we analyze the convergence guarantee of the adversarial training procedure on a two-layer neural network with shifted ReLU activation, and show that only $o(m)$ neurons will be activated for each input data point per iteration. Furthermore, we develop an algorithm for adversarial training with time cost $o(m n d)$ per iteration by applying a half-space reporting data structure. \ No newline at end of file diff --git a/data/2024/iclr/A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors b/data/2024/iclr/A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors new file mode 100644 index 0000000000..e920bb4224 --- /dev/null +++ b/data/2024/iclr/A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors @@ -0,0 +1 @@ +The distribution of the weights of modern deep neural networks (DNNs) - crucial for uncertainty quantification and robustness - is an eminently complex object due to its extremely high dimensionality. This paper proposes one of the first large-scale explorations of the posterior distribution of deep Bayesian Neural Networks (BNNs), expanding its study to real-world vision tasks and architectures. Specifically, we investigate the optimal approach for approximating the posterior, analyze the connection between posterior quality and uncertainty quantification, delve into the impact of modes on the posterior, and explore methods for visualizing the posterior. Moreover, we uncover weight-space symmetries as a critical aspect for understanding the posterior. To this end, we develop an in-depth assessment of the impact of both permutation and scaling symmetries that tend to obfuscate the Bayesian posterior. While the first type of transformation is known for duplicating modes, we explore the relationship between the latter and L2 regularization, challenging previous misconceptions. Finally, to help the community improve our understanding of the Bayesian posterior, we will shortly release the first large-scale checkpoint dataset, including thousands of real-world models and our code. 
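The permutation and scaling symmetries discussed in "A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors" above can be checked numerically on a one-hidden-layer ReLU network; the snippet below illustrates these standard weight-space identities and is not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

def mlp(W1, b1, W2, b2, x):
    # one hidden ReLU layer: f(x) = W2 relu(W1 x + b1) + b2
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

# permutation symmetry: relabelling hidden units leaves the function unchanged
P = np.eye(5)[rng.permutation(5)]
out_perm = mlp(P @ W1, P @ b1, W2 @ P.T, b2, x)

# scaling symmetry of ReLU: scale a unit's incoming weights up, outgoing weights down
alpha = np.abs(rng.normal(size=5)) + 0.1
out_scale = mlp(alpha[:, None] * W1, alpha * b1, W2 / alpha[None, :], b2, x)

ref = mlp(W1, b1, W2, b2, x)
print(np.allclose(out_perm, ref), np.allclose(out_scale, ref))  # True True
```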
\ No newline at end of file diff --git a/data/2024/iclr/A Topological Perspective on Demystifying GNN-Based Link Prediction Performance b/data/2024/iclr/A Topological Perspective on Demystifying GNN-Based Link Prediction Performance new file mode 100644 index 0000000000..2bfef231e9 --- /dev/null +++ b/data/2024/iclr/A Topological Perspective on Demystifying GNN-Based Link Prediction Performance @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have shown great promise in learning node embeddings for link prediction (LP). While numerous studies aim to improve the overall LP performance of GNNs, none have explored its varying performance across different nodes and its underlying reasons. To this end, we aim to demystify which nodes will perform better from the perspective of their local topology. Despite the widespread belief that low-degree nodes exhibit poorer LP performance, our empirical findings provide nuances to this viewpoint and prompt us to propose a better metric, Topological Concentration (TC), based on the intersection of the local subgraph of each node with the ones of its neighbors. We empirically demonstrate that TC has a higher correlation with LP performance than other node-level topological metrics like degree and subgraph density, offering a better way to identify low-performing nodes than using cold-start. With TC, we discover a novel topological distribution shift issue in which newly joined neighbors of a node tend to become less interactive with that node's existing neighbors, compromising the generalizability of node embeddings for LP at testing time. To make the computation of TC scalable, we further propose Approximated Topological Concentration (ATC) and theoretically/empirically justify its efficacy in approximating TC and reducing the computation complexity. Given the positive correlation between node TC and its LP performance, we explore the potential of boosting LP performance via enhancing TC by re-weighting edges in the message-passing and discuss its effectiveness and limitations. Our code is publicly available at https://github.com/YuWVandy/Topo_LP_GNN. \ No newline at end of file diff --git a/data/2024/iclr/A Unified Framework for Bayesian Optimization under Contextual Uncertainty b/data/2024/iclr/A Unified Framework for Bayesian Optimization under Contextual Uncertainty new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Unified Sampling Framework for Solver Searching of Diffusion Probabilistic Models b/data/2024/iclr/A Unified Sampling Framework for Solver Searching of Diffusion Probabilistic Models new file mode 100644 index 0000000000..adbfceea35 --- /dev/null +++ b/data/2024/iclr/A Unified Sampling Framework for Solver Searching of Diffusion Probabilistic Models @@ -0,0 +1 @@ +Recent years have witnessed the rapid progress and broad application of diffusion probabilistic models (DPMs). Sampling from DPMs can be viewed as solving an ordinary differential equation (ODE). Despite the promising performance, the generation of DPMs usually consumes much time due to the large number of function evaluations (NFE). Though recent works have accelerated the sampling to around 20 steps with high-order solvers, the sample quality with fewer than 10 NFE can still be improved. In this paper, we propose a unified sampling framework (USF) to study the optional strategies for the solver. 
Under this framework, we further reveal that taking different solving strategies at different timesteps may help further decrease the truncation error, and a carefully designed \emph{solver schedule} has the potential to improve the sample quality by a large margin. Therefore, we propose a new sampling framework based on the exponential integral formulation that allows free choices of solver strategy at each step and design specific decisions for the framework. Moreover, we propose $S^3$, a predictor-based search method that automatically optimizes the solver schedule to get a better time-quality trade-off of sampling. We demonstrate that $S^3$ can find outstanding solver schedules which outperform the state-of-the-art sampling methods on CIFAR-10, CelebA, ImageNet, and LSUN-Bedroom datasets. Specifically, we achieve 2.69 FID with 10 NFE and 6.86 FID with 5 NFE on CIFAR-10 dataset, outperforming the SOTA method significantly. We further apply $S^3$ to Stable-Diffusion model and get an acceleration ratio of 2$\times$, showing the feasibility of sampling in very few steps without retraining the neural network. \ No newline at end of file diff --git a/data/2024/iclr/A Unified and General Framework for Continual Learning b/data/2024/iclr/A Unified and General Framework for Continual Learning new file mode 100644 index 0000000000..6646162139 --- /dev/null +++ b/data/2024/iclr/A Unified and General Framework for Continual Learning @@ -0,0 +1 @@ +Continual Learning (CL) focuses on learning from dynamic and changing data distributions while retaining previously acquired knowledge. Various methods have been developed to address the challenge of catastrophic forgetting, including regularization-based, Bayesian-based, and memory-replay-based techniques. However, these methods lack a unified framework and common terminology for describing their approaches. This research aims to bridge this gap by introducing a comprehensive and overarching framework that encompasses and reconciles these existing methodologies. Notably, this new framework is capable of encompassing established CL approaches as special instances within a unified and general optimization objective. An intriguing finding is that despite their diverse origins, these methods share common mathematical structures. This observation highlights the compatibility of these seemingly distinct techniques, revealing their interconnectedness through a shared underlying optimization objective. Moreover, the proposed general framework introduces an innovative concept called refresh learning, specifically designed to enhance the CL performance. This novel approach draws inspiration from neuroscience, where the human brain often sheds outdated information to improve the retention of crucial knowledge and facilitate the acquisition of new information. In essence, refresh learning operates by initially unlearning current data and subsequently relearning it. It serves as a versatile plug-in that seamlessly integrates with existing CL methods, offering an adaptable and effective enhancement to the learning process. Extensive experiments on CL benchmarks and theoretical analysis demonstrate the effectiveness of the proposed refresh learning. Code is available at \url{https://github.com/joey-wang123/CL-refresh-learning}. 
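One plausible reading of the unlearn-then-relearn "refresh learning" step described in "A Unified and General Framework for Continual Learning" above is a brief gradient-ascent move on the current batch followed by an ordinary descent step. The sketch below illustrates that reading on a toy quadratic loss; it is an assumption about the mechanism, not the paper's actual update rule.

```python
import numpy as np

def refresh_step(w, grad_fn, lr=0.1, unlearn_lr=0.05):
    """Assumed reading of 'refresh learning': first take a small gradient-ASCENT
    step on the current batch (unlearn), then a normal gradient-DESCENT step on
    the same batch (relearn)."""
    w = w + unlearn_lr * grad_fn(w)   # unlearn: move against the current-batch fit
    w = w - lr * grad_fn(w)           # relearn: standard descent step
    return w

# toy quadratic loss L(w) = 0.5 * ||w - target||^2 standing in for the current batch
target = np.array([1.0, -2.0])
grad_fn = lambda w: w - target
w = np.zeros(2)
for _ in range(50):
    w = refresh_step(w, grad_fn)
print(w)   # moves toward target despite the interleaved unlearning step
```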
\ No newline at end of file diff --git a/data/2024/iclr/A Variational Framework for Estimating Continuous Treatment Effects with Measurement Error b/data/2024/iclr/A Variational Framework for Estimating Continuous Treatment Effects with Measurement Error new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Variational Perspective on Solving Inverse Problems with Diffusion Models b/data/2024/iclr/A Variational Perspective on Solving Inverse Problems with Diffusion Models new file mode 100644 index 0000000000..ca02fce0ef --- /dev/null +++ b/data/2024/iclr/A Variational Perspective on Solving Inverse Problems with Diffusion Models @@ -0,0 +1 @@ +Diffusion models have emerged as a key pillar of foundation models in visual domains. One of their critical applications is to universally solve different downstream inverse tasks via a single diffusion prior without re-training for each task. Most inverse tasks can be formulated as inferring a posterior distribution over data (e.g., a full image) given a measurement (e.g., a masked image). This is however challenging in diffusion models since the nonlinear and iterative nature of the diffusion process renders the posterior intractable. To cope with this challenge, we propose a variational approach that by design seeks to approximate the true posterior distribution. We show that our approach naturally leads to regularization by denoising diffusion process (RED-Diff) where denoisers at different timesteps concurrently impose different structural constraints over the image. To gauge the contribution of denoisers from different timesteps, we propose a weighting mechanism based on signal-to-noise-ratio (SNR). Our approach provides a new variational perspective for solving inverse problems with diffusion models, allowing us to formulate sampling as stochastic optimization, where one can simply apply off-the-shelf solvers with lightweight iterates. Our experiments for image restoration tasks such as inpainting and superresolution demonstrate the strengths of our method compared with state-of-the-art sampling-based diffusion models. \ No newline at end of file diff --git a/data/2024/iclr/A Versatile Causal Discovery Framework to Allow Causally-Related Hidden Variables b/data/2024/iclr/A Versatile Causal Discovery Framework to Allow Causally-Related Hidden Variables new file mode 100644 index 0000000000..49783aa096 --- /dev/null +++ b/data/2024/iclr/A Versatile Causal Discovery Framework to Allow Causally-Related Hidden Variables @@ -0,0 +1 @@ +Most existing causal discovery methods rely on the assumption of no latent confounders, limiting their applicability in solving real-life problems. In this paper, we introduce a novel, versatile framework for causal discovery that accommodates the presence of causally-related hidden variables almost everywhere in the causal network (for instance, they can be effects of observed variables), based on rank information of covariance matrix over observed variables. We start by investigating the efficacy of rank in comparison to conditional independence and, theoretically, establish necessary and sufficient conditions for the identifiability of certain latent structural patterns. Furthermore, we develop a Rank-based Latent Causal Discovery algorithm, RLCD, that can efficiently locate hidden variables, determine their cardinalities, and discover the entire causal structure over both measured and hidden ones. 
We also show that, under certain graphical conditions, RLCD correctly identifies the Markov Equivalence Class of the whole latent causal graph asymptotically. Experimental results on both synthetic and real-world personality data sets demonstrate the efficacy of the proposed approach in finite-sample cases. \ No newline at end of file diff --git a/data/2024/iclr/A differentiable brain simulator bridging brain simulation and brain-inspired computing b/data/2024/iclr/A differentiable brain simulator bridging brain simulation and brain-inspired computing new file mode 100644 index 0000000000..936e751bf2 --- /dev/null +++ b/data/2024/iclr/A differentiable brain simulator bridging brain simulation and brain-inspired computing @@ -0,0 +1 @@ +Brain simulation builds dynamical models to mimic the structure and functions of the brain, while brain-inspired computing (BIC) develops intelligent systems by learning from the structure and functions of the brain. The two fields are intertwined and should share a common programming framework to facilitate each other's development. However, none of the existing software in the fields can achieve this goal, because traditional brain simulators lack differentiability for training, while existing deep learning (DL) frameworks fail to capture the biophysical realism and complexity of brain dynamics. In this paper, we introduce BrainPy, a differentiable brain simulator developed using JAX and XLA, with the aim of bridging the gap between brain simulation and BIC. BrainPy expands upon the functionalities of JAX, a powerful AI framework, by introducing complete capabilities for flexible, efficient, and scalable brain simulation. It offers a range of sparse and event-driven operators for efficient and scalable brain simulation, an abstraction for managing the intricacies of synaptic computations, a modular and flexible interface for constructing multi-scale brain models, and an object-oriented just-in-time compilation approach to handle the memory-intensive nature of brain dynamics. We showcase the efficiency and scalability of BrainPy on benchmark tasks, highlight its differentiable simulation for biologically plausible spiking models, and discuss its potential to support research at the intersection of brain simulation and BIC. \ No newline at end of file diff --git a/data/2024/iclr/A path-norm toolkit for modern networks: consequences, promises and challenges b/data/2024/iclr/A path-norm toolkit for modern networks: consequences, promises and challenges new file mode 100644 index 0000000000..311ce85dae --- /dev/null +++ b/data/2024/iclr/A path-norm toolkit for modern networks: consequences, promises and challenges @@ -0,0 +1 @@ +This work introduces the first toolkit around path-norms that fully encompasses general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on layered fully-connected networks compared to the product of operator norms, another complexity measure most commonly used. 
The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet. \ No newline at end of file diff --git a/data/2024/iclr/A representation-learning game for classes of prediction tasks b/data/2024/iclr/A representation-learning game for classes of prediction tasks new file mode 100644 index 0000000000..3607799a8f --- /dev/null +++ b/data/2024/iclr/A representation-learning game for classes of prediction tasks @@ -0,0 +1 @@ +We propose a game-based formulation for learning dimensionality-reducing representations of feature vectors, when only prior knowledge on future prediction tasks is available. In this game, the first player chooses a representation, and then the second player adversarially chooses a prediction task from a given class, representing the prior knowledge. The first player aims to minimize, and the second player to maximize, the regret: the minimal prediction loss using the representation, compared to the same loss using the original features. For the canonical setting in which the representation, the response to predict and the predictors are all linear functions, and under the mean squared error loss function, we derive the theoretically optimal representation in pure strategies, which shows the effectiveness of the prior knowledge, and the optimal regret in mixed strategies, which shows the usefulness of randomizing the representation. For general representations and loss functions, we propose an efficient algorithm to optimize a randomized representation. The algorithm only requires the gradients of the loss function, and is based on incrementally adding a representation rule to a mixture of such rules. \ No newline at end of file diff --git a/data/2024/iclr/A robust differential Neural ODE Optimizer b/data/2024/iclr/A robust differential Neural ODE Optimizer new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A unique M-pattern for micro-expression spotting in long videos b/data/2024/iclr/A unique M-pattern for micro-expression spotting in long videos new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/ACRF: Compressing Explicit Neural Radiance Fields via Attribute Compression b/data/2024/iclr/ACRF: Compressing Explicit Neural Radiance Fields via Attribute Compression new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process b/data/2024/iclr/ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process new file mode 100644 index 0000000000..1d2cc4217e --- /dev/null +++ b/data/2024/iclr/ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process @@ -0,0 +1 @@ +Image recognition and generation have long been developed independently of each other. With the recent trend towards general-purpose representation learning, the development of general representations for both recognition and generation tasks is also promoted. However, preliminary attempts mainly focus on generation performance, but are still inferior on recognition tasks. These methods are modeled in the vector-quantized (VQ) space, whereas leading recognition methods use pixels as inputs. 
Our key insights are twofold: (1) pixels as inputs are crucial for recognition tasks; (2) VQ tokens as reconstruction targets are beneficial for generation tasks. These observations motivate us to propose an Alternating Denoising Diffusion Process (ADDP) that integrates these two spaces within a single representation learning framework. In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels. The diffusion process gradually masks out a portion of VQ tokens to construct the training samples. The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks. Extensive experiments show that our method achieves competitive performance on unconditional generation, ImageNet classification, COCO detection, and ADE20k segmentation. Importantly, our method represents the first successful development of general representations applicable to both generation and dense recognition tasks. Code is released at \url{https://github.com/ChangyaoTian/ADDP}. \ No newline at end of file diff --git a/data/2024/iclr/ADOPD: A Large-Scale Document Page Decomposition Dataset b/data/2024/iclr/ADOPD: A Large-Scale Document Page Decomposition Dataset new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation b/data/2024/iclr/AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation new file mode 100644 index 0000000000..fc6dacbc62 --- /dev/null +++ b/data/2024/iclr/AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation @@ -0,0 +1 @@ +During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries as well as between them and the 3D scene through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state-of-the-art. Moreover, we also verify its practicality in real-world setups with real user studies. 
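The alternating loop in the ADDP abstract above (decode pixels from the current VQ tokens, then predict new VQ tokens from the decoded pixels, while gradually unmasking) can be sketched structurally as follows; the codebook, nearest-neighbour token predictor, and masking schedule are toy stand-ins rather than the paper's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))                      # 16 codes, 4-dim "pixels"

def decode_pixels(tokens):                               # placeholder pixel decoder
    return codebook[tokens]

def predict_tokens(pixels, mask_ratio):                  # placeholder token predictor
    # nearest codebook entry per position, with a fraction of positions re-masked
    dists = ((pixels[:, None, :] - codebook[None]) ** 2).sum(-1)
    tokens = np.argmin(dists, axis=1)
    n_mask = int(len(tokens) * mask_ratio)
    tokens[rng.choice(len(tokens), n_mask, replace=False)] = rng.integers(0, 16, n_mask)
    return tokens

tokens = rng.integers(0, 16, size=32)                    # start from random VQ tokens
for mask_ratio in np.linspace(0.9, 0.0, 10):             # gradually unmask over steps
    pixels = decode_pixels(tokens)                       # tokens -> pixels
    tokens = predict_tokens(pixels, mask_ratio)          # pixels -> refreshed tokens
print(tokens[:8])
```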
\ No newline at end of file diff --git a/data/2024/iclr/ALAM: Averaged Low-Precision Activation for Memory-Efficient Training of Transformer Models b/data/2024/iclr/ALAM: Averaged Low-Precision Activation for Memory-Efficient Training of Transformer Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents b/data/2024/iclr/AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents new file mode 100644 index 0000000000..195f03a7f4 --- /dev/null +++ b/data/2024/iclr/AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents @@ -0,0 +1 @@ +We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses sequence models to tackle the challenges of generalization, long-term memory, and meta-learning. Recent works have shown that off-policy learning can make in-context RL with recurrent policies viable. Nonetheless, these approaches require extensive tuning and limit scalability by creating key bottlenecks in agents' memory capacity, planning horizon, and model size. AMAGO revisits and redesigns the off-policy in-context approach to successfully train long-sequence Transformers over entire rollouts in parallel with end-to-end RL. Our agent is scalable and applicable to a wide range of problems, and we demonstrate its strong performance empirically in meta-RL and long-term memory domains. AMAGO's focus on sparse rewards and off-policy data also allows in-context learning to extend to goal-conditioned problems with challenging exploration. When combined with a multi-goal hindsight relabeling scheme, AMAGO can solve a previously difficult category of open-world domains, where agents complete many possible instructions in procedurally generated environments. \ No newline at end of file diff --git a/data/2024/iclr/ARGS: Alignment as Reward-Guided Search b/data/2024/iclr/ARGS: Alignment as Reward-Guided Search new file mode 100644 index 0000000000..f22f8542a9 --- /dev/null +++ b/data/2024/iclr/ARGS: Alignment as Reward-Guided Search @@ -0,0 +1 @@ +Aligning large language models with human objectives is paramount, yet common approaches including RLHF suffer from unstable and resource-intensive training. In response to this challenge, we introduce ARGS, Alignment as Reward-Guided Search, a novel framework that integrates alignment into the decoding process, eliminating the need for expensive RL training. By adjusting the model's probabilistic predictions using a reward signal, ARGS generates texts with semantic diversity while being aligned with human preferences, offering a promising and flexible solution for aligning language models. Notably, ARGS demonstrates consistent enhancements in average reward compared to baselines across diverse alignment tasks and various model dimensions. For example, under the same greedy-based decoding strategy, our method improves the average reward by 19.56% relative to the baseline and secures a preference or tie score of 64.33% in GPT-4 evaluation. We believe that our framework, emphasizing decoding-time alignment, paves the way for more responsive language models in the future. Code is publicly available at: \url{https://github.com/deeplearning-wisc/args}. 
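The decoding-time alignment in the ARGS abstract above can be pictured as greedy decoding in which each candidate token is scored by the language model's log-probability plus a weighted reward of the resulting continuation. The toy vocabulary, stand-in language model, and reward function below are illustrative assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]

def lm_logprobs(prefix):
    # toy stand-in for an LM's next-token log-probabilities (ignores the prefix)
    logits = rng.normal(size=len(vocab))
    return logits - np.log(np.exp(logits).sum())

def reward(text):
    # toy stand-in for a learned reward model
    return text.count("cat") - 0.5 * text.count("dog")

def reward_guided_greedy_decode(prompt, steps=5, w=1.0):
    """Greedy decoding where each candidate token is scored by
    LM log-probability plus a weighted reward of the continuation."""
    out = prompt
    for _ in range(steps):
        lp = lm_logprobs(out)
        scores = [lp[i] + w * reward(out + " " + tok) for i, tok in enumerate(vocab)]
        out = out + " " + vocab[int(np.argmax(scores))]
    return out

print(reward_guided_greedy_decode("the"))
```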
\ No newline at end of file diff --git a/data/2024/iclr/ARM: Refining Multivariate Forecasting with Adaptive Temporal-Contextual Learning b/data/2024/iclr/ARM: Refining Multivariate Forecasting with Adaptive Temporal-Contextual Learning new file mode 100644 index 0000000000..b408082d90 --- /dev/null +++ b/data/2024/iclr/ARM: Refining Multivariate Forecasting with Adaptive Temporal-Contextual Learning @@ -0,0 +1 @@ +Long-term time series forecasting (LTSF) is important for various domains but is confronted by challenges in handling the complex temporal-contextual relationships. With multivariate input models underperforming some recent univariate counterparts, we posit that the issue lies in the inefficiency of existing multivariate LTSF Transformers to model series-wise relationships: the characteristic differences between series are often captured incorrectly. To address this, we introduce ARM: a multivariate temporal-contextual adaptive learning method, which is an enhanced architecture specifically designed for multivariate LTSF modelling. ARM employs Adaptive Univariate Effect Learning (AUEL), a Random Dropping (RD) training strategy, and Multi-kernel Local Smoothing (MKLS) to better handle individual series temporal patterns and correctly learn inter-series dependencies. ARM demonstrates superior performance on multiple benchmarks without significantly increasing computational costs compared to the vanilla Transformer, thereby advancing the state-of-the-art in LTSF. ARM is also generally applicable to other LTSF architectures beyond the vanilla Transformer. \ No newline at end of file diff --git a/data/2024/iclr/ASID: Active Exploration for System Identification in Robotic Manipulation b/data/2024/iclr/ASID: Active Exploration for System Identification in Robotic Manipulation new file mode 100644 index 0000000000..dba63fa22c --- /dev/null +++ b/data/2024/iclr/ASID: Active Exploration for System Identification in Robotic Manipulation @@ -0,0 +1 @@ +Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, such methods can be sample inefficient, making them impractical in many real-world domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer. 
Project website at https://weirdlabuw.github.io/asid \ No newline at end of file diff --git a/data/2024/iclr/ASMR: Activation-Sharing Multi-Resolution Coordinate Networks for Efficient Inference b/data/2024/iclr/ASMR: Activation-Sharing Multi-Resolution Coordinate Networks for Efficient Inference new file mode 100644 index 0000000000..489692ad90 --- /dev/null +++ b/data/2024/iclr/ASMR: Activation-Sharing Multi-Resolution Coordinate Networks for Efficient Inference @@ -0,0 +1 @@ +A coordinate network, or implicit neural representation (INR), is a fast-emerging method for encoding natural signals (such as images and videos) with the benefits of a compact neural representation. While numerous methods have been proposed to increase the encoding capabilities of an INR, an often overlooked aspect is the inference efficiency, usually measured in multiply-accumulate (MAC) count. This is particularly critical in use cases where inference throughput is greatly limited by hardware constraints. To this end, we propose the Activation-Sharing Multi-Resolution (ASMR) coordinate network that combines multi-resolution coordinate decomposition with hierarchical modulations. Specifically, an ASMR model enables the sharing of activations across grids of the data. This largely decouples its inference cost from its depth, which is directly correlated to its reconstruction capability, and renders a near O(1) inference complexity irrespective of the number of layers. Experiments show that ASMR can reduce the MAC of a vanilla SIREN model by up to 500x while achieving an even higher reconstruction quality than its SIREN baseline. \ No newline at end of file diff --git a/data/2024/iclr/AUC-CL: A Batchsize-Robust Framework for Self-Supervised Contrastive Representation Learning b/data/2024/iclr/AUC-CL: A Batchsize-Robust Framework for Self-Supervised Contrastive Representation Learning new file mode 100644 index 0000000000..71c0f1f81c --- /dev/null +++ b/data/2024/iclr/AUC-CL: A Batchsize-Robust Framework for Self-Supervised Contrastive Representation Learning @@ -0,0 +1 @@ +Self-supervised learning through contrastive representations is an emergent and promising avenue, aiming at alleviating the reliance on labeled data. Recent research in the field also demonstrates its viability for several downstream tasks, henceforth leading to works that implement the contrastive principle through innovative loss functions and methods. However, despite achieving impressive progress, most methods depend on prohibitively large batch sizes and compute requirements for good performance. In this work, we propose AUC-Contrastive Learning, a new approach to contrastive learning that demonstrates robust and competitive performance in compute-limited regimes. We propose to incorporate the contrastive objective within the AUC-maximization framework, by noting that the AUC metric is maximized upon enhancing the probability of the network’s binary prediction difference between positive and negative samples, which inspires adequate embedding space arrangements in representation learning. Unlike standard contrastive methods, when performing stochastic optimization, our method maintains unbiased stochastic gradients and thus is more robust to batch sizes as opposed to standard stochastic optimization problems. Remarkably, our method, with a batch size of 256, outperforms several state-of-the-art methods that may need much larger batch sizes (e.g., 4096), on ImageNet and other standard datasets.
Experiments on transfer learning and few-shot learning tasks also demonstrate the downstream viability of our method. Code is available at AUC-CL \ No newline at end of file diff --git a/data/2024/iclr/AUGCAL: Improving Sim2Real Adaptation by Uncertainty Calibration on Augmented Synthetic Images b/data/2024/iclr/AUGCAL: Improving Sim2Real Adaptation by Uncertainty Calibration on Augmented Synthetic Images new file mode 100644 index 0000000000..81d2a72c17 --- /dev/null +++ b/data/2024/iclr/AUGCAL: Improving Sim2Real Adaptation by Uncertainty Calibration on Augmented Synthetic Images @@ -0,0 +1 @@ +Synthetic data (SIM) drawn from simulators have emerged as a popular alternative for training models where acquiring annotated real-world images is difficult. However, transferring models trained on synthetic images to real-world applications can be challenging due to appearance disparities. A commonly employed solution to counter this SIM2REAL gap is unsupervised domain adaptation, where models are trained using labeled SIM data and unlabeled REAL data. Mispredictions made by such SIM2REAL adapted models are often associated with miscalibration - stemming from overconfident predictions on real data. In this paper, we introduce AUGCAL, a simple training-time patch for unsupervised adaptation that improves SIM2REAL adapted models by - (1) reducing overall miscalibration, (2) reducing overconfidence in incorrect predictions and (3) improving confidence score reliability by better guiding misclassification detection - all while retaining or improving SIM2REAL performance. Given a base SIM2REAL adaptation algorithm, at training time, AUGCAL involves replacing vanilla SIM images with strongly augmented views (AUG intervention) and additionally optimizing for a training time calibration loss on augmented SIM predictions (CAL intervention). We motivate AUGCAL using a brief analytical justification of how to reduce miscalibration on unlabeled REAL data. Through our experiments, we empirically show the efficacy of AUGCAL across multiple adaptation methods, backbones, tasks and shifts. \ No newline at end of file diff --git a/data/2024/iclr/Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers b/data/2024/iclr/Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers new file mode 100644 index 0000000000..777f9a3ff2 --- /dev/null +++ b/data/2024/iclr/Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers @@ -0,0 +1 @@ +An extension of Transformers is proposed that enables explicit relational reasoning through a novel module called the Abstractor. At the core of the Abstractor is a variant of attention called relational cross-attention. The approach is motivated by an architectural inductive bias for relational learning that disentangles relational information from object-level features. This enables explicit relational reasoning, supporting abstraction and generalization from limited data. The Abstractor is first evaluated on simple discriminative relational tasks and compared to existing relational architectures. Next, the Abstractor is evaluated on purely relational sequence-to-sequence tasks, where dramatic improvements are seen in sample efficiency compared to standard Transformers. 
Finally, Abstractors are evaluated on a collection of tasks based on mathematical problem solving, where consistent improvements in performance and sample efficiency are observed. \ No newline at end of file diff --git a/data/2024/iclr/Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise b/data/2024/iclr/Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise new file mode 100644 index 0000000000..590814960a --- /dev/null +++ b/data/2024/iclr/Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise @@ -0,0 +1 @@ +Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. In contrast to its empirical popularity, the understanding of its theoretical properties is still quite limited, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that the heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings, there is no rigorous theoretical analysis. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with a step decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving a near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors from the statistical minimax rate. This means SGD with heavy-ball momentum is useful in large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice. \ No newline at end of file diff --git a/data/2024/iclr/Accelerated Sampling with Stacked Restricted Boltzmann Machines b/data/2024/iclr/Accelerated Sampling with Stacked Restricted Boltzmann Machines new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Accelerating Data Generation for Neural Operators via Krylov Subspace Recycling b/data/2024/iclr/Accelerating Data Generation for Neural Operators via Krylov Subspace Recycling new file mode 100644 index 0000000000..06ad5b8161 --- /dev/null +++ b/data/2024/iclr/Accelerating Data Generation for Neural Operators via Krylov Subspace Recycling @@ -0,0 +1 @@ +Learning neural operators for solving partial differential equations (PDEs) has attracted great attention due to its high inference efficiency. However, training such operators requires generating a substantial amount of labeled data, i.e., PDE problems together with their solutions. The data generation process is exceptionally time-consuming, as it involves solving numerous systems of linear equations to obtain numerical solutions to the PDEs. Many existing methods solve these systems independently without considering their inherent similarities, resulting in extremely redundant computations. To tackle this problem, we propose a novel method, namely Sorting Krylov Recycling (SKR), to boost the efficiency of solving these systems, thus significantly accelerating data generation for neural operator training.
To the best of our knowledge, SKR is the first attempt to address the time-consuming nature of data generation for learning neural operators. The workhorse of SKR is Krylov subspace recycling, a powerful technique for solving a series of interrelated systems by leveraging their inherent similarities. Specifically, SKR employs a sorting algorithm to arrange these systems in a sequence, where adjacent systems exhibit high similarities. Then it equips a solver with Krylov subspace recycling to solve the systems sequentially instead of independently, thus effectively enhancing the solving efficiency. Both theoretical analysis and extensive experiments demonstrate that SKR can significantly accelerate neural operator data generation, achieving a remarkable speedup of up to 13.9 times. \ No newline at end of file diff --git a/data/2024/iclr/Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks b/data/2024/iclr/Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks new file mode 100644 index 0000000000..67cdaf5834 --- /dev/null +++ b/data/2024/iclr/Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks @@ -0,0 +1 @@ +We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random-walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard linear Markovian token by one which follows a nonlinear Markov chain - namely the Self-Repellent Random Walk (SRRW). Defined for any given 'base' Markov chain, the SRRW, parameterized by a positive scalar {\alpha}, is less likely to transition to states that were highly visited in the past, thus the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves O(1/{\alpha}) decrease in the asymptotic variance for sampling. We propose the use of a 'generalized' version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem, deriving the explicit form of the resulting asymptotic covariance matrix corresponding to iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate O(1/{\alpha}^2) - the performance benefit of using SRRW is thereby amplified in the stochastic optimization context. Empirical results support our theoretical findings. \ No newline at end of file diff --git a/data/2024/iclr/Accelerating Sinkhorn algorithm with sparse Newton iterations b/data/2024/iclr/Accelerating Sinkhorn algorithm with sparse Newton iterations new file mode 100644 index 0000000000..ddce0bb73f --- /dev/null +++ b/data/2024/iclr/Accelerating Sinkhorn algorithm with sparse Newton iterations @@ -0,0 +1 @@ +Computing the optimal transport distance between statistical distributions is a fundamental task in machine learning. One remarkable recent advancement is entropic regularization and the Sinkhorn algorithm, which utilizes only matrix scaling and guarantees an approximated solution with near-linear runtime.
Despite the success of the Sinkhorn algorithm, its runtime may still be slow due to the potentially large number of iterations needed for convergence. To achieve possibly super-exponential convergence, we present Sinkhorn-Newton-Sparse (SNS), an extension to the Sinkhorn algorithm, by introducing early stopping for the matrix scaling steps and a second stage featuring a Newton-type subroutine. Adopting the variational viewpoint that the Sinkhorn algorithm maximizes a concave Lyapunov potential, we offer the insight that the Hessian matrix of the potential function is approximately sparse. Sparsification of the Hessian results in a fast $O(n^2)$ per-iteration complexity, the same as the Sinkhorn algorithm. In terms of total iteration count, we observe that the SNS algorithm converges orders of magnitude faster across a wide range of practical cases, including optimal transportation between empirical distributions and calculating the Wasserstein $W_1, W_2$ distance of discretized densities. The empirical performance is corroborated by a rigorous bound on the approximate sparsity of the Hessian matrix. \ No newline at end of file diff --git a/data/2024/iclr/Accurate Forgetting for Heterogeneous Federated Continual Learning b/data/2024/iclr/Accurate Forgetting for Heterogeneous Federated Continual Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models b/data/2024/iclr/Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models new file mode 100644 index 0000000000..2b767a223b --- /dev/null +++ b/data/2024/iclr/Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models @@ -0,0 +1 @@ +Given a pretrained encoder-based language model, how can we accurately compress it without retraining? Retraining-free structured pruning algorithms are crucial in pretrained language model compression due to their significantly reduced pruning cost and capability to prune large language models. However, existing retraining-free algorithms encounter severe accuracy degradation, as they fail to handle pruning errors, especially at high compression rates. In this paper, we propose K-prune (Knowledge-preserving pruning), an accurate retraining-free structured pruning algorithm for pretrained encoder-based language models. K-prune focuses on preserving the useful knowledge of the pretrained model to minimize pruning errors through a carefully designed iterative pruning process composed of knowledge measurement, knowledge-preserving mask search, and knowledge-preserving weight-tuning. As a result, K-prune shows significant accuracy improvements up to 58.02%p higher F1 score compared to existing retraining-free pruning algorithms under a high compression rate of 80% on the SQuAD benchmark without any retraining process. \ No newline at end of file diff --git a/data/2024/iclr/Accurate and Scalable Estimation of Epistemic Uncertainty for Graph Neural Networks b/data/2024/iclr/Accurate and Scalable Estimation of Epistemic Uncertainty for Graph Neural Networks new file mode 100644 index 0000000000..85e3b8a09c --- /dev/null +++ b/data/2024/iclr/Accurate and Scalable Estimation of Epistemic Uncertainty for Graph Neural Networks @@ -0,0 +1 @@ +While graph neural networks (GNNs) are widely used for node and graph representation learning tasks, the reliability of GNN uncertainty estimates under distribution shifts remains relatively under-explored. 
Indeed, while post-hoc calibration strategies can be used to improve in-distribution calibration, they need not also improve calibration under distribution shift. However, techniques which produce GNNs with better intrinsic uncertainty estimates are particularly valuable, as they can always be combined with post-hoc strategies later. Therefore, in this work, we propose G-$\Delta$UQ, a novel training framework designed to improve intrinsic GNN uncertainty estimates. Our framework adapts the principle of stochastic data centering to graph data through novel graph anchoring strategies, and is able to support partially stochastic GNNs. While the prevalent wisdom is that fully stochastic networks are necessary to obtain reliable estimates, we find that the functional diversity induced by our anchoring strategies when sampling hypotheses renders this unnecessary and allows us to support G-$\Delta$UQ on pretrained models. Indeed, through extensive evaluation under covariate, concept and graph size shifts, we show that G-$\Delta$UQ leads to better calibrated GNNs for node and graph classification. Further, it also improves performance on the uncertainty-based tasks of out-of-distribution detection and generalization gap estimation. Overall, our work provides insights into uncertainty estimation for GNNs, and demonstrates the utility of G-$\Delta$UQ in obtaining reliable estimates. \ No newline at end of file diff --git a/data/2024/iclr/Achieving Fairness in Multi-Agent MDP Using Reinforcement Learning b/data/2024/iclr/Achieving Fairness in Multi-Agent MDP Using Reinforcement Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Achieving Human Parity in Content-Grounded Datasets Generation b/data/2024/iclr/Achieving Human Parity in Content-Grounded Datasets Generation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Achieving Sample and Computational Efficient Reinforcement Learning by Action Space Reduction via Grouping b/data/2024/iclr/Achieving Sample and Computational Efficient Reinforcement Learning by Action Space Reduction via Grouping new file mode 100644 index 0000000000..676114ec61 --- /dev/null +++ b/data/2024/iclr/Achieving Sample and Computational Efficient Reinforcement Learning by Action Space Reduction via Grouping @@ -0,0 +1 @@ +Reinforcement learning often needs to deal with the exponential growth of states and actions when exploring optimal control in high-dimensional spaces (often known as the curse of dimensionality). In this work, we address this issue by learning the inherent structure of action-wise similar MDPs to appropriately balance the performance degradation versus sample/computational complexity. In particular, we partition the action spaces into multiple groups based on the similarity in transition distribution and reward function, and build a linear decomposition model to capture the difference between the intra-group transition kernel and the intra-group rewards. Both our theoretical analysis and experiments reveal a \emph{surprising and counter-intuitive result}: while a more refined grouping strategy can reduce the approximation error caused by treating actions in the same group as identical, it also leads to increased estimation error when the size of samples or the computation resources is limited. This finding highlights the grouping strategy as a new degree of freedom that can be optimized to minimize the overall performance loss.
To address this issue, we formulate a general optimization problem for determining the optimal grouping strategy, which strikes a balance between performance loss and sample/computational complexity. We further propose a computationally efficient method for selecting a nearly-optimal grouping strategy, which maintains its computational complexity independent of the size of the action space. \ No newline at end of file diff --git a/data/2024/iclr/Active Retrosynthetic Planning Aware of Route Quality b/data/2024/iclr/Active Retrosynthetic Planning Aware of Route Quality new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Active Test-Time Adaptation: Theoretical Analyses and An Algorithm b/data/2024/iclr/Active Test-Time Adaptation: Theoretical Analyses and An Algorithm new file mode 100644 index 0000000000..6743295317 --- /dev/null +++ b/data/2024/iclr/Active Test-Time Adaptation: Theoretical Analyses and An Algorithm @@ -0,0 +1 @@ +Test-time adaptation (TTA) addresses distribution shifts for streaming test data in unsupervised settings. Currently, most TTA methods can only deal with minor shifts and rely heavily on heuristic and empirical studies. To advance TTA under domain shifts, we propose the novel problem setting of active test-time adaptation (ATTA) that integrates active learning within the fully TTA setting. We provide a learning theory analysis, demonstrating that incorporating limited labeled test instances enhances overall performances across test domains with a theoretical guarantee. We also present a sample entropy balancing for implementing ATTA while avoiding catastrophic forgetting (CF). We introduce a simple yet effective ATTA algorithm, known as SimATTA, using real-time sample selection techniques. Extensive experimental results confirm consistency with our theoretical analyses and show that the proposed ATTA method yields substantial performance improvements over TTA methods while maintaining efficiency and shares similar effectiveness to the more demanding active domain adaptation (ADA) methods. Our code is available at https://github.com/divelab/ATTA \ No newline at end of file diff --git a/data/2024/iclr/AdaMerging: Adaptive Model Merging for Multi-Task Learning b/data/2024/iclr/AdaMerging: Adaptive Model Merging for Multi-Task Learning new file mode 100644 index 0000000000..9b7a8928b7 --- /dev/null +++ b/data/2024/iclr/AdaMerging: Adaptive Model Merging for Multi-Task Learning @@ -0,0 +1 @@ +Multi-task learning (MTL) aims to empower a model to tackle multiple tasks simultaneously. A recent development known as task arithmetic has revealed that several models, each fine-tuned for distinct tasks, can be directly merged into a single model to execute MTL without necessitating a retraining process using the initial training data. Nevertheless, this direct addition of models often leads to a significant deterioration in the overall performance of the merged model. This decline occurs due to potential conflicts and intricate correlations among the multiple tasks. Consequently, the challenge emerges of how to merge pre-trained models more effectively without using their original training data. This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging). This approach aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data. Specifically, our AdaMerging method operates as an automatic, unsupervised task arithmetic scheme. 
It leverages entropy minimization on unlabeled test samples from the multi-task setup as a surrogate objective function to iteratively refine the merging coefficients of the multiple models. Our experimental findings across eight tasks demonstrate the efficacy of the AdaMerging scheme we put forth. Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11\% improvement in performance. Notably, AdaMerging also exhibits superior generalization capabilities when applied to unseen downstream tasks. Furthermore, it displays a significantly enhanced robustness to data distribution shifts that may occur during the testing phase. \ No newline at end of file diff --git a/data/2024/iclr/Adapting Large Language Models via Reading Comprehension b/data/2024/iclr/Adapting Large Language Models via Reading Comprehension new file mode 100644 index 0000000000..f1b6363342 --- /dev/null +++ b/data/2024/iclr/Adapting Large Language Models via Reading Comprehension @@ -0,0 +1 @@ +We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension—practice after reading improves the ability to answer questions based on the learned knowledge—we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance \ No newline at end of file diff --git a/data/2024/iclr/Adapting to Distribution Shift by Visual Domain Prompt Generation b/data/2024/iclr/Adapting to Distribution Shift by Visual Domain Prompt Generation new file mode 100644 index 0000000000..496d3f72ed --- /dev/null +++ b/data/2024/iclr/Adapting to Distribution Shift by Visual Domain Prompt Generation @@ -0,0 +1 @@ +In this paper, we aim to adapt a model at test time using a few unlabeled samples to address distribution shifts. To tackle the challenges of extracting domain knowledge from a limited amount of data, it is crucial to utilize correlated information from pre-trained backbones and source domains. Previous studies fail to utilize recent foundation models with strong out-of-distribution generalization. Additionally, domain-centric designs are not favored in their works. Furthermore, they treat the process of modelling source domains and the process of learning to adapt as independent, disjoint training stages. In this work, we propose an approach on top of the pre-computed features of the foundation model. Specifically, we build a knowledge bank to learn the transferable knowledge from source domains. Conditioned on few-shot target data, we introduce a domain prompt generator to condense the knowledge bank into a domain-specific prompt. The domain prompt then directs the visual features towards a particular domain via a guidance module. Moreover, we propose a domain-aware contrastive loss and employ meta-learning to facilitate domain knowledge extraction. Extensive experiments are conducted to validate the domain knowledge extraction. The proposed method outperforms previous work on 5 large-scale benchmarks including WILDS and DomainNet.
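The knowledge-bank-to-prompt pipeline described in the abstract above can be illustrated with a small, hypothetical sketch: few-shot target features attend over a bank of source-domain vectors to produce a single domain prompt, which then shifts the visual features. The shapes, the softmax-attention condensation, and the convex-combination guidance are simplifying assumptions rather than the paper's architecture.

import numpy as np

rng = np.random.default_rng(0)
d = 16                                         # feature dimension (assumed)
K = 8                                          # number of knowledge-bank entries (assumed)
knowledge_bank = rng.normal(size=(K, d))       # stand-in for learned transferable knowledge

def generate_domain_prompt(few_shot_feats):
    # Condense the bank into one domain-specific prompt, conditioned on few-shot target features.
    query = few_shot_feats.mean(axis=0)                      # summarize the target domain
    attn = knowledge_bank @ query / np.sqrt(d)               # similarity to each bank entry
    weights = np.exp(attn - attn.max()); weights /= weights.sum()
    return weights @ knowledge_bank                          # weighted combination = prompt

def guide_features(feats, prompt, alpha=0.5):
    # Toy guidance module: shift visual features toward the domain prompt.
    return (1 - alpha) * feats + alpha * prompt

target_few_shot = rng.normal(size=(4, d))      # features of a few unlabeled target samples
prompt = generate_domain_prompt(target_few_shot)
adapted = guide_features(rng.normal(size=(32, d)), prompt)
print(prompt.shape, adapted.shape)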
\ No newline at end of file diff --git a/data/2024/iclr/Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts b/data/2024/iclr/Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts new file mode 100644 index 0000000000..943961d50c --- /dev/null +++ b/data/2024/iclr/Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts @@ -0,0 +1 @@ +By providing external information to large language models (LLMs), tool augmentation (including retrieval augmentation) has emerged as a promising solution for addressing the limitations of LLMs' static parametric memory. However, how receptive are LLMs to such external evidence, especially when the evidence conflicts with their parametric memory? We present the first comprehensive and controlled investigation into the behavior of LLMs when encountering knowledge conflicts. We propose a systematic framework to elicit high-quality parametric memory from LLMs and construct the corresponding counter-memory, which enables us to conduct a series of controlled experiments. Our investigation reveals seemingly contradicting behaviors of LLMs. On the one hand, different from prior wisdom, we find that LLMs can be highly receptive to external evidence even when that conflicts with their parametric memory, given that the external evidence is coherent and convincing. On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information that is consistent with their parametric memory, despite being presented with conflicting evidence at the same time. These results pose important implications that are worth careful consideration for the further development and deployment of tool- and retrieval-augmented LLMs. Resources are available at https://github.com/OSU-NLP-Group/LLM-Knowledge-Conflict. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Federated Learning with Auto-Tuned Clients b/data/2024/iclr/Adaptive Federated Learning with Auto-Tuned Clients new file mode 100644 index 0000000000..24508690d5 --- /dev/null +++ b/data/2024/iclr/Adaptive Federated Learning with Auto-Tuned Clients @@ -0,0 +1 @@ +Federated learning (FL) is a distributed machine learning framework where the global model of a central server is trained via multiple collaborative steps by participating clients without sharing their data. While being a flexible framework, where the distribution of local data, participation rate, and computing power of each client can greatly vary, such flexibility gives rise to many new challenges, especially in the hyperparameter tuning on the client side. We propose $\Delta$-SGD, a simple step size rule for SGD that enables each client to use its own step size by adapting to the local smoothness of the function each client is optimizing. We provide theoretical and empirical results where the benefit of the client adaptivity is shown in various FL scenarios. 
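As a rough illustration of the client-side auto-tuning idea in the $\Delta$-SGD abstract above, the sketch below adapts a step size from an estimate of local smoothness obtained from successive iterates and gradients. The specific estimate ||g_t - g_{t-1}|| / ||x_t - x_{t-1}||, the bounded growth factor, and the toy quadratic objective are illustrative assumptions, not the paper's exact rule.

import numpy as np

# Toy smooth objective f(x) = 0.5 * x^T A x, standing in for one client's local loss.
A = np.diag(np.linspace(0.1, 5.0, 10))

def grad(x):
    return A @ x

rng = np.random.default_rng(1)
x = rng.normal(size=10)
x_prev, g_prev = x, grad(x)
eta, theta = 1e-3, 1.0          # initial step size and growth state (assumed values)

for t in range(200):
    g = grad(x)
    dx_norm = np.linalg.norm(x - x_prev)
    dg_norm = np.linalg.norm(g - g_prev)
    if dx_norm > 1e-12 and dg_norm > 1e-12:
        # Local smoothness estimate L_hat = dg_norm / dx_norm; the step is capped by
        # 1 / (2 * L_hat) while its growth per round stays bounded.
        eta_new = min(np.sqrt(1.0 + theta) * eta, dx_norm / (2.0 * dg_norm))
        theta, eta = eta_new / eta, eta_new
    x_prev, g_prev = x, g
    x = x - eta * g

print("final loss:", 0.5 * x @ A @ x, "final step size:", round(eta, 4))

Each client could run such a rule on its own objective, which removes the need to hand-tune a single learning rate shared across heterogeneous clients.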
\ No newline at end of file diff --git a/data/2024/iclr/Adaptive Instrument Design for Indirect Experiments b/data/2024/iclr/Adaptive Instrument Design for Indirect Experiments new file mode 100644 index 0000000000..e9e5ea3824 --- /dev/null +++ b/data/2024/iclr/Adaptive Instrument Design for Indirect Experiments @@ -0,0 +1 @@ +Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for direct experiments, in this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Rational Activations to Boost Deep Reinforcement Learning b/data/2024/iclr/Adaptive Rational Activations to Boost Deep Reinforcement Learning new file mode 100644 index 0000000000..7727d61299 --- /dev/null +++ b/data/2024/iclr/Adaptive Rational Activations to Boost Deep Reinforcement Learning @@ -0,0 +1 @@ +Latest insights from biology show that intelligence not only emerges from the connections between neurons but that individual neurons shoulder more computational responsibility than previously anticipated. This perspective should be critical in the context of constantly changing distinct reinforcement learning environments, yet current approaches still primarily employ static activation functions. In this work, we motivate why rationals are suitable for adaptable activation functions and why their inclusion into neural networks is crucial. Inspired by recurrence in residual networks, we derive a condition under which rational units are closed under residual connections and formulate a naturally regularised version: the recurrent-rational. We demonstrate that equipping popular algorithms with (recurrent-)rational activations leads to consistent improvements on Atari games, especially turning simple DQN into a solid approach, competitive to DDQN and Rainbow. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Regret for Bandits Made Possible: Two Queries Suffice b/data/2024/iclr/Adaptive Regret for Bandits Made Possible: Two Queries Suffice new file mode 100644 index 0000000000..831f2dd2e2 --- /dev/null +++ b/data/2024/iclr/Adaptive Regret for Bandits Made Possible: Two Queries Suffice @@ -0,0 +1 @@ +Fast changing states or volatile environments pose a significant challenge to online optimization, which needs to perform rapid adaptation under limited observation. 
In this paper, we give query and regret optimal bandit algorithms under the strict notion of strongly adaptive regret, which measures the maximum regret over any contiguous interval $I$. Due to its worst-case nature, there is an almost-linear $\Omega(|I|^{1-\epsilon})$ regret lower bound, when only one query per round is allowed [Daniely et al., ICML 2015]. Surprisingly, with just two queries per round, we give the Strongly Adaptive Bandit Learner (StABL) that achieves $\tilde{O}(\sqrt{n|I|})$ adaptive regret for multi-armed bandits with $n$ arms. The bound is tight and cannot be improved in general. Our algorithm leverages a multiplicative update scheme of varying stepsizes and a carefully chosen observation distribution to control the variance. Furthermore, we extend our results and provide optimal algorithms in the bandit convex optimization setting. Finally, we empirically demonstrate the superior performance of our algorithms under volatile environments and for downstream tasks, such as algorithm selection for hyperparameter optimization. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation b/data/2024/iclr/Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation new file mode 100644 index 0000000000..8dc30d92e4 --- /dev/null +++ b/data/2024/iclr/Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation @@ -0,0 +1 @@ +Representation rank is an important concept for understanding the role of Neural Networks (NNs) in Deep Reinforcement Learning (DRL), which measures the expressive capacity of value networks. Existing studies focus on unboundedly maximizing this rank; nevertheless, that approach would introduce overly complex models in the learning, thus undermining performance. Hence, fine-tuning representation rank presents a challenging and crucial optimization problem. To address this issue, we find a guiding principle for adaptive control of the representation rank. We employ the Bellman equation as a theoretical foundation and derive an upper bound on the cosine similarity between the representations of consecutive state-action pairs in value networks. We then leverage this upper bound to propose a novel regularizer, namely BEllman Equation-based automatic rank Regularizer (BEER). This regularizer adaptively regularizes the representation rank, thus improving the DRL agent's performance. We first validate the effectiveness of automatic control of rank on illustrative experiments. Then, we scale up BEER to complex continuous control tasks by combining it with the deterministic policy gradient method. Among 12 challenging DeepMind control tasks, BEER outperforms the baselines by a large margin. Besides, BEER demonstrates significant advantages in Q-value approximation. Our code is available at https://github.com/sweetice/BEER-ICLR2024.
Existing approaches perform k-NN search with CE by approximating the CE similarity with a vector embedding space fit either with dual-encoders (DE) or CUR matrix factorization. DE-based retrieve-and-rerank approaches suffer from poor recall on new domains and the retrieval with DE is decoupled from the CE. While CUR-based approaches can be more accurate than the DE-based approach, they require a prohibitively large number of CE calls to compute item embeddings, thus making it impractical for deployment at scale. In this paper, we address these shortcomings with our proposed sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity. We compute item embeddings offline by factorizing a sparse matrix containing query-item CE scores for a set of train queries. Our method produces a high-quality approximation while requiring only a fraction of CE calls as compared to CUR-based methods, and allows for leveraging DE to initialize the embedding space while avoiding compute- and resource-intensive finetuning of DE via distillation. At test time, the item embeddings remain fixed and retrieval occurs over rounds, alternating between a) estimating the test query embedding by minimizing error in approximating CE scores of items retrieved thus far, and b) using the updated test query embedding for retrieving more items. Our k-NN search method improves recall by up to 5% (k=1) and 54% (k=100) over DE-based approaches. Additionally, our indexing approach achieves a speedup of up to 100x over CUR-based and 5x over DE distillation methods, while matching or improving k-NN search recall over baselines. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Self-training Framework for Fine-grained Scene Graph Generation b/data/2024/iclr/Adaptive Self-training Framework for Fine-grained Scene Graph Generation new file mode 100644 index 0000000000..e98ef2de70 --- /dev/null +++ b/data/2024/iclr/Adaptive Self-training Framework for Fine-grained Scene Graph Generation @@ -0,0 +1 @@ +Scene graph generation (SGG) models have suffered from inherent problems regarding the benchmark datasets such as the long-tailed predicate distribution and missing annotation problems. In this work, we aim to alleviate the long-tailed problem of SGG by utilizing unannotated triplets. To this end, we introduce a Self-Training framework for SGG (ST-SGG) that assigns pseudo-labels for unannotated triplets based on which the SGG models are trained. While there has been significant progress in self-training for image recognition, designing a self-training framework for the SGG task is more challenging due to its inherent nature such as the semantic ambiguity and the long-tailed distribution of predicate classes. Hence, we propose a novel pseudo-labeling technique for SGG, called Class-specific Adaptive Thresholding with Momentum (CATM), which is a model-agnostic framework that can be applied to any existing SGG models. Furthermore, we devise a graph structure learner (GSL) that is beneficial when adopting our proposed self-training framework to the state-of-the-art message-passing neural network (MPNN)-based SGG models. Our extensive experiments verify the effectiveness of ST-SGG on various SGG models, particularly in enhancing the performance on fine-grained predicate classes. 
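A toy sketch of class-specific adaptive thresholding with momentum, in the spirit of the CATM pseudo-labelling step described in the ST-SGG abstract above: each predicate class keeps its own confidence threshold, updated by an exponential moving average, and an unannotated instance only receives a pseudo-label when it clears the threshold of its predicted class. The initial thresholds, the momentum value, and the decay branch for classes that rarely clear their threshold are assumptions for illustration, not the paper's exact procedure.

import numpy as np

rng = np.random.default_rng(2)
num_classes = 5
thresholds = np.full(num_classes, 0.5)        # per-predicate-class thresholds (assumed init)
momentum = 0.9

def maybe_pseudo_label(probs):
    # Assign a pseudo-label only if the top class clears its own adaptive threshold.
    c = int(np.argmax(probs))
    conf = probs[c]
    if conf >= thresholds[c]:
        # Raise the bar for classes the model is already confident about (EMA update).
        thresholds[c] = momentum * thresholds[c] + (1 - momentum) * conf
        return c
    # Let thresholds of rarely-accepted (often tail) classes drift down (assumed behaviour).
    thresholds[c] = momentum * thresholds[c] + (1 - momentum) * conf * 0.5
    return None

for _ in range(10):
    logits = rng.normal(size=num_classes)
    probs = np.exp(logits) / np.exp(logits).sum()
    label = maybe_pseudo_label(probs)
    print(label, np.round(thresholds, 3))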
\ No newline at end of file diff --git a/data/2024/iclr/Adaptive Sharpness-Aware Pruning for Robust Sparse Networks b/data/2024/iclr/Adaptive Sharpness-Aware Pruning for Robust Sparse Networks new file mode 100644 index 0000000000..fa85002170 --- /dev/null +++ b/data/2024/iclr/Adaptive Sharpness-Aware Pruning for Robust Sparse Networks @@ -0,0 +1 @@ +Robustness and compactness are two essential attributes of deep learning models that are deployed in the real world. The goals of robustness and compactness may seem to be at odds, since robustness requires generalization across domains, while the process of compression exploits specificity in one domain. We introduce Adaptive Sharpness-Aware Pruning (AdaSAP), which unifies these goals through the lens of network sharpness. The AdaSAP method produces sparse networks that are robust to input variations which are unseen at training time. We achieve this by strategically incorporating weight perturbations in order to optimize the loss landscape. This allows the model to be both primed for pruning and regularized for improved robustness. AdaSAP improves the robust accuracy of pruned models on image classification by up to +6% on ImageNet C and +4% on ImageNet V2, and on object detection by +4% on a corrupted Pascal VOC dataset, over a wide range of compression ratios, pruning criteria, and network architectures, outperforming recent pruning art by large margins. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Stochastic Gradient Algorithm for Black-box Multi-Objective Learning b/data/2024/iclr/Adaptive Stochastic Gradient Algorithm for Black-box Multi-Objective Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Adaptive Window Pruning for Efficient Local Motion Deblurring b/data/2024/iclr/Adaptive Window Pruning for Efficient Local Motion Deblurring new file mode 100644 index 0000000000..e25561a189 --- /dev/null +++ b/data/2024/iclr/Adaptive Window Pruning for Efficient Local Motion Deblurring @@ -0,0 +1 @@ +Local motion blur commonly occurs in real-world photography due to the mixing between moving objects and stationary backgrounds during exposure. Existing image deblurring methods predominantly focus on global deblurring, inadvertently affecting the sharpness of backgrounds in locally blurred images and wasting unnecessary computation on sharp pixels, especially for high-resolution images. This paper aims to adaptively and efficiently restore high-resolution locally blurred images. We propose a local motion deblurring vision Transformer (LMD-ViT) built on adaptive window pruning Transformer blocks (AdaWPT). To focus deblurring on local regions and reduce computation, AdaWPT prunes unnecessary windows, only allowing the active windows to be involved in the deblurring processes. The pruning operation relies on the blurriness confidence predicted by a confidence predictor that is trained end-to-end using a reconstruction loss with Gumbel-Softmax re-parameterization and a pruning loss guided by annotated blur masks. Our method removes local motion blur effectively without distorting sharp regions, demonstrated by its exceptional perceptual and quantitative improvements compared to state-of-the-art methods. In addition, our approach substantially reduces FLOPs by 66% and achieves more than a twofold increase in inference speed compared to Transformer-based deblurring methods. We will make our code and annotated blur masks publicly available. 
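The window-pruning mechanism described in the LMD-ViT abstract above can be sketched as follows: a confidence predictor scores each window's blurriness, a fraction of windows is kept as active, and only those go through the expensive deblurring branch while the rest are passed through unchanged. The variance-based confidence, the hard top-quantile selection (standing in for the Gumbel-Softmax used at training time), and all shapes are illustrative assumptions, not the actual network.

import numpy as np

rng = np.random.default_rng(3)
H = W = 8                                    # the image is split into an 8x8 grid of windows
windows = rng.normal(size=(H * W, 16, 16))   # flattened windows of features/pixels (toy data)

def blur_confidence(win):
    # Stand-in predictor: low local variance is treated as "blurry", purely for illustration.
    return 1.0 / (1.0 + win.var())

def heavy_deblur_block(win):
    # Placeholder for an expensive Transformer block applied only to active windows.
    return win * 1.1

conf = np.array([blur_confidence(w) for w in windows])
active = conf > np.quantile(conf, 0.66)      # prune roughly two thirds of the windows (assumed budget)

out = windows.copy()                         # pruned windows are passed through untouched
if active.any():
    out[active] = np.stack([heavy_deblur_block(w) for w in windows[active]])
print(f"processed {active.sum()} / {len(windows)} windows")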
\ No newline at end of file diff --git a/data/2024/iclr/Adaptive deep spiking neural network with global-local learning via balanced excitatory and inhibitory mechanism b/data/2024/iclr/Adaptive deep spiking neural network with global-local learning via balanced excitatory and inhibitory mechanism new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning b/data/2024/iclr/Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning new file mode 100644 index 0000000000..54a0adc215 --- /dev/null +++ b/data/2024/iclr/Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning @@ -0,0 +1 @@ +Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful units. While many methods address these two issues separately, only a few currently deal with both simultaneously. In this paper, we introduce Utility-based Perturbed Gradient Descent (UPGD) as a novel approach for the continual learning of representations. UPGD combines gradient updates with perturbations, where it applies smaller modifications to more useful units, protecting them from forgetting, and larger modifications to less useful units, rejuvenating their plasticity. We use a challenging streaming learning setup where continual learning problems have hundreds of non-stationarities and unknown task boundaries. We show that many existing methods suffer from at least one of the issues, predominantly manifested by their decreasing accuracy over tasks. On the other hand, UPGD continues to improve performance and surpasses or is competitive with all methods in all problems. Finally, in extended reinforcement learning experiments with PPO, we show that while Adam exhibits a performance drop after initial learning, UPGD avoids it by addressing both continual learning issues. \ No newline at end of file diff --git a/data/2024/iclr/Addressing Signal Delay in Deep Reinforcement Learning b/data/2024/iclr/Addressing Signal Delay in Deep Reinforcement Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models b/data/2024/iclr/AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models new file mode 100644 index 0000000000..c1fe24b885 --- /dev/null +++ b/data/2024/iclr/AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models @@ -0,0 +1 @@ +Existing customization methods require access to multiple reference examples to align pre-trained diffusion probabilistic models (DPMs) with user-provided concepts. This paper aims to address the challenge of DPM customization when the only available supervision is a differentiable metric defined on the generated contents. Since the sampling procedure of DPMs involves recursive calls to the denoising UNet, na\"ive gradient backpropagation requires storing the intermediate states of all iterations, resulting in extremely high memory consumption. To overcome this issue, we propose a novel method AdjointDPM, which first generates new samples from diffusion models by solving the corresponding probability-flow ODEs. 
It then uses the adjoint sensitivity method to backpropagate the gradients of the loss to the models' parameters (including conditioning signals, network weights, and initial noises) by solving another augmented ODE. To reduce numerical errors in both the forward generation and gradient backpropagation processes, we further reparameterize the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. Finally, we demonstrate the effectiveness of AdjointDPM on three interesting tasks: converting visual effects into identification text embeddings, finetuning DPMs for specific types of stylization, and optimizing initial noise to generate adversarial samples for security auditing. \ No newline at end of file diff --git a/data/2024/iclr/Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models b/data/2024/iclr/Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models new file mode 100644 index 0000000000..823cd9e1e7 --- /dev/null +++ b/data/2024/iclr/Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models @@ -0,0 +1 @@ +Recent work has showcased the significant potential of diffusion models in pose-guided person image synthesis. However, owing to the inconsistency in pose between the source and target images, synthesizing an image with a distinct pose, relying exclusively on the source image and target pose information, remains a formidable challenge. This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages. Specifically, in the first stage, we design a simple prior conditional diffusion model that predicts the global features of the target image by mining the global alignment relationship between pose coordinates and image appearance. Then, the second stage establishes a dense correspondence between the source and target images using the global features from the previous stage, and an inpainting conditional diffusion model is proposed to further align and enhance the contextual features, generating a coarse-grained person image. In the third stage, we propose a refining conditional diffusion model to utilize the coarsely generated image from the previous stage as a condition, achieving texture restoration and enhancing fine-detail consistency. The three-stage PCDMs work progressively to generate the final high-quality and high-fidelity synthesized image. Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios. The code and model will be available at https://github.com/tencent-ailab/PCDMs. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs b/data/2024/iclr/Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs new file mode 100644 index 0000000000..53a52b7394 --- /dev/null +++ b/data/2024/iclr/Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs @@ -0,0 +1 @@ +Solving partial differential equations (PDEs) is a central task in scientific computing. Recently, neural network approximation of PDEs has received increasing attention due to its flexible meshless discretization and its potential for high-dimensional problems.
One fundamental numerical difficulty is that random samples in the training set introduce statistical errors into the discretization of loss functional which may become the dominant error in the final approximation, and therefore overshadow the modeling capability of the neural network. In this work, we propose a new minmax formulation to optimize simultaneously the approximate solution, given by a neural network model, and the random samples in the training set, provided by a deep generative model. The key idea is to use a deep generative model to adjust random samples in the training set such that the residual induced by the approximate PDE solution can maintain a smooth profile when it is being minimized. Such an idea is achieved by implicitly embedding the Wasserstein distance between the residual-induced distribution and the uniform distribution into the loss, which is then minimized together with the residual. A nearly uniform residual profile means that its variance is small for any normalized weight function such that the Monte Carlo approximation error of the loss functional is reduced significantly for a certain sample size. The adversarial adaptive sampling (AAS) approach proposed in this work is the first attempt to formulate two essential components, minimizing the residual and seeking the optimal training set, into one minmax objective functional for the neural network approximation of PDEs. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Attacks on Fairness of Graph Neural Networks b/data/2024/iclr/Adversarial Attacks on Fairness of Graph Neural Networks new file mode 100644 index 0000000000..64e11d7ae0 --- /dev/null +++ b/data/2024/iclr/Adversarial Attacks on Fairness of Graph Neural Networks @@ -0,0 +1 @@ +Fairness-aware graph neural networks (GNNs) have gained a surge of attention as they can reduce the bias of predictions on any demographic group (e.g., female) in graph-based applications. Although these methods greatly improve the algorithmic fairness of GNNs, the fairness can be easily corrupted by carefully designed adversarial attacks. In this paper, we investigate the problem of adversarial attacks on fairness of GNNs and propose G-FairAttack, a general framework for attacking various types of fairness-aware GNNs in terms of fairness with an unnoticeable effect on prediction utility. In addition, we propose a fast computation technique to reduce the time complexity of G-FairAttack. The experimental study demonstrates that G-FairAttack successfully corrupts the fairness of different types of GNNs while keeping the attack unnoticeable. Our study on fairness attacks sheds light on potential vulnerabilities in fairness-aware GNNs and guides further research on the robustness of GNNs in terms of fairness. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial AutoMixup b/data/2024/iclr/Adversarial AutoMixup new file mode 100644 index 0000000000..c54b09da84 --- /dev/null +++ b/data/2024/iclr/Adversarial AutoMixup @@ -0,0 +1 @@ +Data mixing augmentation has been widely applied to improve the generalization ability of deep neural networks. Recently, offline data mixing augmentation, e.g. handcrafted and saliency information-based mixup, has been gradually replaced by automatic mixing approaches. Through minimizing two sub-tasks, namely, mixed sample generation and mixup classification in an end-to-end way, AutoMix significantly improves accuracy on image classification tasks. 
However, as the optimization objective is consistent for the two sub-tasks, this approach is prone to generating consistent instead of diverse mixed samples, which results in overfitting for target task training. In this paper, we propose AdAutomixup, an adversarial automatic mixup augmentation approach that generates challenging samples to train a robust classifier for image classification, by alternatively optimizing the classifier and the mixup sample generator. AdAutomixup comprises two modules, a mixed example generator, and a target classifier. The mixed sample generator aims to produce hard mixed examples to challenge the target classifier, while the target classifier's aim is to learn robust features from hard mixed examples to improve generalization. To prevent the collapse of the inherent meanings of images, we further introduce an exponential moving average (EMA) teacher and cosine similarity to train AdAutomixup in an end-to-end way. Extensive experiments on seven image benchmarks consistently prove that our approach outperforms the state of the art in various classification scenarios. The source code is available at https://github.com/JinXins/Adversarial-AutoMixup. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Causal Bayesian Optimization b/data/2024/iclr/Adversarial Causal Bayesian Optimization new file mode 100644 index 0000000000..a7528ee404 --- /dev/null +++ b/data/2024/iclr/Adversarial Causal Bayesian Optimization @@ -0,0 +1 @@ +In Causal Bayesian Optimization (CBO), an agent intervenes on an unknown structural causal model to maximize a downstream reward variable. In this paper, we consider the generalization where other agents or external events also intervene on the system, which is key for enabling adaptiveness to non-stationarities such as weather changes, market forces, or adversaries. We formalize this generalization of CBO as Adversarial Causal Bayesian Optimization (ACBO) and introduce the first algorithm for ACBO with bounded regret: Causal Bayesian Optimization with Multiplicative Weights (CBO-MW). Our approach combines a classical online learning strategy with causal modeling of the rewards. To achieve this, it computes optimistic counterfactual reward estimates by propagating uncertainty through the causal graph. We derive regret bounds for CBO-MW that naturally depend on graph-related quantities. We further propose a scalable implementation for the case of combinatorial interventions and submodular rewards. Empirically, CBO-MW outperforms non-causal and non-adversarial Bayesian optimization methods on synthetic environments and environments based on real-word data. Our experiments include a realistic demonstration of how CBO-MW can be used to learn users' demand patterns in a shared mobility system and reposition vehicles in strategic areas. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Feature Map Pruning for Backdoor b/data/2024/iclr/Adversarial Feature Map Pruning for Backdoor new file mode 100644 index 0000000000..ded090f707 --- /dev/null +++ b/data/2024/iclr/Adversarial Feature Map Pruning for Backdoor @@ -0,0 +1 @@ +Deep neural networks have been widely used in many critical applications, such as autonomous vehicles and medical diagnosis. However, their security is threatened by backdoor attacks, which are achieved by adding artificial patterns to specific training data. 
Existing defense strategies primarily focus on using reverse engineering to reproduce the backdoor trigger generated by attackers and subsequently repair the DNN model by adding the trigger into inputs and fine-tuning the model with ground-truth labels. However, once the trigger generated by the attackers is complex and invisible, the defender cannot reproduce the trigger successfully, and the DNN model will then not be repaired, as the trigger is not effectively removed. In this work, we propose Adversarial Feature Map Pruning for Backdoor (FMP) to mitigate backdoors in the DNN. Unlike existing defense strategies, which focus on reproducing backdoor triggers, FMP attempts to prune backdoor feature maps, which are trained to extract backdoor information from inputs. After pruning these backdoor feature maps, FMP will fine-tune the model with a secure subset of training data. Our experiments demonstrate that, compared to existing defense strategies, FMP can effectively reduce the Attack Success Rate (ASR) even against the most complex and invisible attack triggers (e.g., FMP decreases the ASR to 2.86\% in CIFAR10, which is 19.2\% to 65.41\% lower than baselines). Second, unlike conventional defense methods that tend to exhibit low robust accuracy (RA, that is, the accuracy of the model on poisoned data), FMP achieves a higher RA, indicating its superiority in maintaining model performance while mitigating the effects of backdoor attacks (e.g., FMP obtains 87.40\% RA in CIFAR10). Our code is publicly available at: https://github.com/retsuh-bqw/FMP. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Imitation Learning via Boosting b/data/2024/iclr/Adversarial Imitation Learning via Boosting new file mode 100644 index 0000000000..ee5d9819df --- /dev/null +++ b/data/2024/iclr/Adversarial Imitation Learning via Boosting @@ -0,0 +1 @@ +Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al., 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective is on-policy and DAC's ad-hoc application of off-policy training does not guarantee successful imitation (Kostrikov et al., 2019; 2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this issue by deriving a fully off-policy AIL objective. Instead, in this work, we develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using the entire data collected so far. In the weighted replay buffer, the contribution of the data from older policies is properly discounted with the weight computed based on the boosting framework. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC on both types of environments, demonstrating the benefit of properly weighting replay buffer data for off-policy training. 
On state-based environments, DAC outperforms ValueDICE and IQ-Learn (Garg et al., 2021), achieving competitive performance with as little as one expert trajectory. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive b/data/2024/iclr/Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive new file mode 100644 index 0000000000..c933ef726c --- /dev/null +++ b/data/2024/iclr/Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive @@ -0,0 +1 @@ +Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points). \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Training Should Be Cast as a Non-Zero-Sum Game b/data/2024/iclr/Adversarial Training Should Be Cast as a Non-Zero-Sum Game new file mode 100644 index 0000000000..a9b7ade855 --- /dev/null +++ b/data/2024/iclr/Adversarial Training Should Be Cast as a Non-Zero-Sum Game @@ -0,0 +1 @@ +One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the surrogate-based relaxation commonly used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting. 
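To make the non-zero-sum formulation in the preceding abstract concrete, the following is a minimal PyTorch sketch of a bilevel training step in which the two players optimize different objectives: the inner (attacker) player maximizes a margin-based misclassification objective, while the outer (defender) player minimizes cross-entropy on the resulting perturbations. The toy model, step sizes, and random data below are illustrative assumptions, not the paper's exact algorithm.

import torch
import torch.nn as nn
import torch.nn.functional as F

def attacker_objective(logits, y):
    # Negative margin: the attacker wants the best wrong class to beat the true class.
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    other = logits.masked_fill(F.one_hot(y, logits.size(1)).bool(), float("-inf")).max(1).values
    return (other - true).mean()

def train_step(model, x, y, opt, eps=0.1, alpha=0.02, steps=10):
    # Inner player: maximize the margin objective (not the defender's surrogate loss).
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss_atk = attacker_objective(model(x + delta), y)
        grad, = torch.autograd.grad(loss_atk, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Outer player: minimize classification loss on the perturbed inputs.
    opt.zero_grad()
    loss_def = F.cross_entropy(model(x + delta.detach()), y)
    loss_def.backward()
    opt.step()
    return loss_def.item()

if __name__ == "__main__":
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
    print(train_step(model, x, y, opt))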
\ No newline at end of file diff --git a/data/2024/iclr/Adversarial Training on Purification (AToP): Advancing Both Robustness and Generalization b/data/2024/iclr/Adversarial Training on Purification (AToP): Advancing Both Robustness and Generalization new file mode 100644 index 0000000000..d15781b556 --- /dev/null +++ b/data/2024/iclr/Adversarial Training on Purification (AToP): Advancing Both Robustness and Generalization @@ -0,0 +1 @@ +Deep neural networks are known to be vulnerable to well-designed adversarial attacks. The most successful defense technique based on adversarial training (AT) can achieve optimal robustness against particular attacks but cannot generalize well to unseen attacks. Another effective defense technique based on adversarial purification (AP) can enhance generalization but cannot achieve optimal robustness. Meanwhile, both methods share one common limitation of degraded standard accuracy. To mitigate these issues, we propose a novel pipeline to acquire a robust purifier model, named Adversarial Training on Purification (AToP), which comprises two components: perturbation destruction by random transforms (RT) and fine-tuning (FT) of the purifier model with an adversarial loss. RT is essential to avoid overlearning known attacks, which enables robustness to generalize to unseen attacks, and FT is essential for improving robustness. To evaluate our method in an efficient and scalable way, we conduct extensive experiments on CIFAR-10, CIFAR-100, and ImageNette to demonstrate that our method achieves optimal robustness and exhibits generalization ability against unseen attacks. \ No newline at end of file diff --git a/data/2024/iclr/AffineQuant: Affine Transformation Quantization for Large Language Models b/data/2024/iclr/AffineQuant: Affine Transformation Quantization for Large Language Models new file mode 100644 index 0000000000..8908107f45 --- /dev/null +++ b/data/2024/iclr/AffineQuant: Affine Transformation Quantization for Large Language Models @@ -0,0 +1 @@ +The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in the development of techniques aimed at compressing and accelerating neural networks. Among these techniques, Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its noteworthy compression efficiency and cost-effectiveness in the context of training. Existing PTQ methods for LLMs limit the optimization scope to scaling transformations between pre- and post-quantization weights. In this paper, we advocate for direct optimization using equivalent affine transformations in PTQ (AffineQuant). This approach extends the optimization scope and thus significantly reduces quantization errors. Additionally, by employing the corresponding inverse matrix, we can ensure equivalence between the pre- and post-quantization outputs of PTQ, thereby maintaining its efficiency and generalization capabilities. To ensure the invertibility of the transformation during optimization, we further introduce a gradual mask optimization method. This method initially focuses on optimizing the diagonal elements and gradually extends to the other elements. Such an approach aligns with the Levy-Desplanques theorem, theoretically ensuring invertibility of the transformation. As a result, significant performance improvements are evident across different LLMs on diverse datasets. 
To illustrate, we attain a C4 perplexity of 15.76 (2.26 lower vs 18.02 in OmniQuant) on the LLaMA2-7B model under W4A4 quantization without overhead. On zero-shot tasks, AffineQuant achieves an average accuracy of 58.61 (1.98 higher vs 56.63 in OmniQuant) when using 4/4-bit quantization for LLaMA-30B, setting a new state-of-the-art benchmark for PTQ in LLMs. \ No newline at end of file diff --git a/data/2024/iclr/AgentBench: Evaluating LLMs as Agents b/data/2024/iclr/AgentBench: Evaluating LLMs as Agents new file mode 100644 index 0000000000..82f7bdcc37 --- /dev/null +++ b/data/2024/iclr/AgentBench: Evaluating LLMs as Agents @@ -0,0 +1 @@ +Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant disparity in performance between them and OSS competitors. We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Training on code and high-quality multi-turn alignment data could improve agent performance. Datasets, environments, and an integrated evaluation package for AgentBench are released at \url{https://github.com/THUDM/AgentBench}. \ No newline at end of file diff --git a/data/2024/iclr/AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors b/data/2024/iclr/AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors new file mode 100644 index 0000000000..a468f6347f --- /dev/null +++ b/data/2024/iclr/AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors @@ -0,0 +1 @@ +Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose AgentVerse, a multi-agent framework that can collaboratively and dynamically adjust its composition as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that the AgentVerse framework can effectively deploy multi-agent groups that outperform a single agent. Furthermore, we delve into the emergence of social behaviors among individual agents within a group during collaborative task accomplishment. In view of these behaviors, we discuss some possible strategies to leverage positive ones and mitigate negative ones for improving the collaborative potential of multi-agent groups. Our codes for AgentVerse will soon be released at \url{https://github.com/OpenBMB/AgentVerse}. 
\ No newline at end of file diff --git a/data/2024/iclr/AirPhyNet: Harnessing Physics-Guided Neural Networks for Air Quality Prediction b/data/2024/iclr/AirPhyNet: Harnessing Physics-Guided Neural Networks for Air Quality Prediction new file mode 100644 index 0000000000..2b31fef935 --- /dev/null +++ b/data/2024/iclr/AirPhyNet: Harnessing Physics-Guided Neural Networks for Air Quality Prediction @@ -0,0 +1 @@ +Air quality prediction and modelling play a pivotal role in public health and environmental management, helping individuals and authorities make informed decisions. Although traditional data-driven models have shown promise in this domain, their long-term prediction accuracy can be limited, especially in scenarios with sparse or incomplete data, and they often rely on black-box deep learning structures that lack a solid physical foundation, leading to reduced transparency and interpretability in predictions. To address these limitations, this paper presents a novel approach named Physics-guided Neural Network for Air Quality Prediction (AirPhyNet). Specifically, we leverage two well-established physics principles of air particle movement (diffusion and advection) by representing them as differential equation networks. Then, we utilize a graph structure to integrate physics knowledge into a neural network architecture and exploit latent representations to capture spatio-temporal relationships within the air quality data. Experiments on two real-world benchmark datasets demonstrate that AirPhyNet outperforms state-of-the-art models for different testing scenarios including different lead times (24h, 48h, 72h), sparse data, and sudden change prediction, achieving reductions in prediction error of up to 10%. Moreover, a case study further validates that our model captures underlying physical processes of particle movement and generates accurate predictions with real physical meaning. \ No newline at end of file diff --git a/data/2024/iclr/Algorithms for Caching and MTS with reduced number of predictions b/data/2024/iclr/Algorithms for Caching and MTS with reduced number of predictions new file mode 100644 index 0000000000..31434e3171 --- /dev/null +++ b/data/2024/iclr/Algorithms for Caching and MTS with reduced number of predictions @@ -0,0 +1 @@ +ML-augmented algorithms utilize predictions to achieve performance beyond their worst-case bounds. Producing these predictions might be a costly operation -- this motivated Im et al. '22 to introduce the study of algorithms which use predictions parsimoniously. We design parsimonious algorithms for caching and MTS with action predictions, proposed by Antoniadis et al. '20, focusing on the parameters of consistency (performance with perfect predictions) and smoothness (dependence of their performance on the prediction error). Our algorithm for caching is 1-consistent, robust, and its smoothness deteriorates with the decreasing number of available predictions. We propose an algorithm for general MTS whose consistency and smoothness both scale linearly with the decreasing number of predictions. Without the restriction on the number of available predictions, both algorithms match the earlier guarantees achieved by Antoniadis et al. '20.
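As a rough illustration of what "using predictions parsimoniously" can mean in caching, the following toy Python sketch queries a (costly) next-use predictor only on a fraction of requests and falls back to plain LRU otherwise. The predictor interface, the query budget, and the eviction rule are illustrative assumptions and do not reproduce the 1-consistent algorithm analyzed in the abstract above.

from collections import OrderedDict

class ParsimoniousCache:
    """Toy cache that asks the predictor only every `budget`-th miss and
    otherwise falls back to LRU. Illustrative only, not the paper's algorithm."""

    def __init__(self, capacity, predictor, budget=4):
        self.capacity = capacity
        self.predictor = predictor      # maps request index -> predicted next-use time per page
        self.budget = budget
        self.cache = OrderedDict()      # page -> None, ordered by recency
        self.t = 0

    def access(self, page):
        self.t += 1
        if page in self.cache:
            self.cache.move_to_end(page)
            return True                 # hit
        if len(self.cache) >= self.capacity:
            if self.t % self.budget == 0:
                # Spend one prediction: evict the cached page predicted to be needed latest.
                next_use = self.predictor(self.t)
                victim = max(self.cache, key=lambda p: next_use.get(p, float("inf")))
            else:
                victim = next(iter(self.cache))   # LRU fallback, no prediction spent
            self.cache.pop(victim)
        self.cache[page] = None
        return False                    # miss

# Example with a trivial "oracle" predictor over a fixed request sequence.
requests = [1, 2, 3, 1, 4, 2, 5, 1, 2, 3]

def oracle(t):
    # Earliest future index at which each page is requested again.
    return {p: i for i, p in reversed(list(enumerate(requests[t:], start=t)))}

cache = ParsimoniousCache(capacity=3, predictor=oracle, budget=2)
hits = sum(cache.access(p) for p in requests)
print(f"hits: {hits} / {len(requests)}")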
\ No newline at end of file diff --git a/data/2024/iclr/Alice Benchmarks: Connecting Real World Re-Identification with the Synthetic b/data/2024/iclr/Alice Benchmarks: Connecting Real World Re-Identification with the Synthetic new file mode 100644 index 0000000000..373fe5108e --- /dev/null +++ b/data/2024/iclr/Alice Benchmarks: Connecting Real World Re-Identification with the Synthetic @@ -0,0 +1 @@ +For object re-identification (re-ID), learning from synthetic data has become a promising strategy to cheaply acquire large-scale annotated datasets and effective models, with few privacy concerns. Many interesting research problems arise from this strategy, e.g., how to reduce the domain gap between synthetic source and real-world target. To facilitate developing more new approaches in learning from synthetic data, we introduce the Alice benchmarks, large-scale datasets providing benchmarks as well as evaluation protocols to the research community. Within the Alice benchmarks, two object re-ID tasks are offered: person and vehicle re-ID. We collected and annotated two challenging real-world target datasets: AlicePerson and AliceVehicle, captured under various illuminations, image resolutions, etc. As an important feature of our real target, the clusterability of its training set is not manually guaranteed to make it closer to a real domain adaptation test scenario. Correspondingly, we reuse existing PersonX and VehicleX as synthetic source domains. The primary goal is to train models from synthetic data that can work effectively in the real world. In this paper, we detail the settings of Alice benchmarks, provide an analysis of existing commonly-used domain adaptation methods, and discuss some interesting future directions. An online server has been set up for the community to evaluate methods conveniently and fairly. Datasets and the online server details are available at https://sites.google.com/view/alice-benchmarks. \ No newline at end of file diff --git a/data/2024/iclr/Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework b/data/2024/iclr/Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework new file mode 100644 index 0000000000..c7c7015de2 --- /dev/null +++ b/data/2024/iclr/Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework @@ -0,0 +1 @@ +Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (that yield the ground truth), at the expense of imperfect alignments. This binary differentiation of perfect and imperfect alignments falls short of capturing other essential alignment properties that hold significance in other real-world applications. Here we propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play framework}$ for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC with an additional loss term that prioritizes alignments according to a desired property. Our method does not require any intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation between both perfect and imperfect alignments. 
We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report an improvement of up to 570ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on a scale of data as large as ours. Notably, our method can be implemented using only a few lines of code, and can be extended to other alignment-free loss functions and to domains other than ASR. \ No newline at end of file diff --git a/data/2024/iclr/AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model b/data/2024/iclr/AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model new file mode 100644 index 0000000000..6d992b85bd --- /dev/null +++ b/data/2024/iclr/AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model @@ -0,0 +1 @@ +Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a novel framework that leverages RL from Human Feedback (RLHF) to quantify human preferences, covering abstractness, and utilizes them to guide diffusion planning for zero-shot behavior customizing, covering mutability. AlignDiff can accurately match user-customized behaviors and efficiently switch from one to another. To build the framework, we first establish the multi-perspective human feedback datasets, which contain comparisons for the attributes of diverse behaviors, and then train an attribute strength model to predict quantified relative strengths. After relabeling behavioral datasets with relative strengths, we proceed to train an attribute-conditioned diffusion model, which serves as a planner with the attribute strength model as a director for preference aligning at the inference phase. We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases the promising potential for human-AI collaboration. More visualization videos are released on https://aligndiff.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/Aligning Relational Learning with Lipschitz Fairness b/data/2024/iclr/Aligning Relational Learning with Lipschitz Fairness new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps b/data/2024/iclr/Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps new file mode 100644 index 0000000000..f1cbaf34cc --- /dev/null +++ b/data/2024/iclr/Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps @@ -0,0 +1 @@ +Diffusion Probabilistic Models (DPM) have shown remarkable efficacy in the synthesis of high-quality images. 
However, their inference process characteristically requires numerous iterative steps, potentially hundreds, which could exacerbate the problem of exposure bias due to the training and inference discrepancy. Previous work has attempted to mitigate this issue by perturbing inputs during training, which consequently mandates the retraining of the DPM. In this work, we conduct a systematic study of exposure bias in DPM and, intriguingly, we find that the exposure bias could be alleviated with a novel sampling method that we propose, without retraining the model. We empirically and theoretically show that, during inference, for each backward time step $t$ and corresponding state $\hat{x}_t$, there might exist another time step $t_s$ which exhibits superior coupling with $\hat{x}_t$. Based on this finding, we introduce a sampling method named Time-Shift Sampler. Our framework can be seamlessly integrated into existing sampling algorithms, such as DDPM, DDIM and other high-order solvers, incurring only minimal additional computation. Experimental results show our method brings significant and consistent improvements in FID scores on different datasets and sampling methods. For example, integrating the Time-Shift Sampler into F-PNDM yields an FID of 3.88, a 44.49\% improvement compared to F-PNDM, on CIFAR-10 with 10 sampling steps, which is more performant than the vanilla DDIM with 100 sampling steps. Our code is available at https://github.com/Mingxiao-Li/TS-DPM. \ No newline at end of file diff --git a/data/2024/iclr/AlpaGasus: Training a Better Alpaca with Fewer Data b/data/2024/iclr/AlpaGasus: Training a Better Alpaca with Fewer Data new file mode 100644 index 0000000000..930fc0676f --- /dev/null +++ b/data/2024/iclr/AlpaGasus: Training a Better Alpaca with Fewer Data @@ -0,0 +1 @@ +Large language models (LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human evaluation. Its 13B variant matches $>90\%$ of the performance of its teacher LLM (i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. 
Our project page is available at: https://lichang-chen.github.io/AlpaGasus/ \ No newline at end of file diff --git a/data/2024/iclr/Alt-Text with Context: Improving Accessibility for Images on Twitter b/data/2024/iclr/Alt-Text with Context: Improving Accessibility for Images on Twitter new file mode 100644 index 0000000000..12fcb2aa32 --- /dev/null +++ b/data/2024/iclr/Alt-Text with Context: Improving Accessibility for Images on Twitter @@ -0,0 +1 @@ +In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. More than just a special case of image captioning, alt-text is both more literally descriptive and context-specific. Also critically, images posted to Twitter are often accompanied by user-written text that, despite not necessarily describing the image, may provide useful context that, if properly leveraged, can be informative. We address this task with a multimodal model that conditions on both textual information from the associated social media post as well as visual signal from the image, and demonstrate that the utility of these two information sources stacks. We put forward a new dataset of 371k images paired with alt-text and tweets scraped from Twitter and evaluate on it across a variety of automated metrics as well as human evaluation. We show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work, by more than 2x on BLEU@4. \ No newline at end of file diff --git a/data/2024/iclr/Amortized Network Intervention to Steer the Excitatory Point Processes b/data/2024/iclr/Amortized Network Intervention to Steer the Excitatory Point Processes new file mode 100644 index 0000000000..e17c51d048 --- /dev/null +++ b/data/2024/iclr/Amortized Network Intervention to Steer the Excitatory Point Processes @@ -0,0 +1 @@ +Excitatory point processes (i.e., event flows) occurring over dynamic graphs (i.e., evolving topologies) provide a fine-grained model to capture how discrete events may spread over time and space. How to effectively steer the event flows by modifying the dynamic graph structures presents an interesting problem, motivated by applications ranging from curbing the spread of infectious diseases by strategically locking down cities to mitigating traffic congestion via traffic light optimization. To address the intricacies of planning and overcome the high dimensionality inherent to such decision-making problems, we design an Amortized Network Interventions (ANI) framework, allowing for the pooling of optimal policies from history and other contexts while ensuring a permutation equivalent property. This property enables efficient knowledge transfer and sharing across diverse contexts. Each task is solved by H-step lookahead model-based reinforcement learning, where neural ODEs are introduced to model the dynamics of the excitatory point processes. Instead of simulating rollouts from the dynamics model, we derive an analytical mean-field approximation for the event flows given the dynamics, making online planning more efficiently solvable. We empirically illustrate that this ANI approach substantially enhances policy learning for unseen dynamics and exhibits promising outcomes in steering event flows through network intervention using synthetic and real COVID datasets. 
\ No newline at end of file diff --git a/data/2024/iclr/AmortizedPeriod: Attention-based Amortized Inference for Periodicity Identification b/data/2024/iclr/AmortizedPeriod: Attention-based Amortized Inference for Periodicity Identification new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Amortizing intractable inference in large language models b/data/2024/iclr/Amortizing intractable inference in large language models new file mode 100644 index 0000000000..3b229e63bd --- /dev/null +++ b/data/2024/iclr/Amortizing intractable inference in large language models @@ -0,0 +1 @@ +Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use. \ No newline at end of file diff --git a/data/2024/iclr/An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression b/data/2024/iclr/An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression new file mode 100644 index 0000000000..4f9e79e09a --- /dev/null +++ b/data/2024/iclr/An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression @@ -0,0 +1 @@ +We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an"agnostic"view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or the target is outside the RKHS. We analyze the cost of overfitting under a Gaussian universality ansatz using recently derived (non-rigorous) risk estimates in terms of the task eigenstructure. Our analysis provides a more refined characterization of benign, tempered and catastrophic overfitting (cf. Mallinar et al. 2022). 
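The "cost of overfitting" defined in the preceding abstract, the ratio between the test error of the (near-)interpolating ridgeless solution and that of the optimally-tuned ridge solution, can be estimated numerically. Below is a small NumPy sketch on synthetic noisy data; the RBF kernel, sample sizes, and regularization grid are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_test_mse(Xtr, ytr, Xte, yte, lam):
    # Kernel ridge regression: alpha = (K + lam I)^{-1} y, prediction = K_te alpha.
    K = rbf(Xtr, Xtr)
    alpha = np.linalg.solve(K + lam * np.eye(len(Xtr)), ytr)
    pred = rbf(Xte, Xtr) @ alpha
    return np.mean((pred - yte) ** 2)

# Noisy synthetic target: y = sin(3x) + noise.
n, sigma = 200, 0.5
Xtr = rng.uniform(-1, 1, (n, 1))
ytr = np.sin(3 * Xtr[:, 0]) + sigma * rng.standard_normal(n)
Xte = rng.uniform(-1, 1, (2000, 1))
yte = np.sin(3 * Xte[:, 0]) + sigma * rng.standard_normal(2000)

# Ridgeless (near-interpolating, tiny jitter only for numerical stability) vs optimally tuned.
ridgeless = krr_test_mse(Xtr, ytr, Xte, yte, lam=1e-10)
tuned = min(krr_test_mse(Xtr, ytr, Xte, yte, lam=l) for l in np.logspace(-6, 2, 30))
print("cost of overfitting:", ridgeless / tuned)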
\ No newline at end of file diff --git a/data/2024/iclr/An Analytical Solution to Gauss-Newton Loss for Direct Image Alignment b/data/2024/iclr/An Analytical Solution to Gauss-Newton Loss for Direct Image Alignment new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization b/data/2024/iclr/An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization new file mode 100644 index 0000000000..4adb8d2ae7 --- /dev/null +++ b/data/2024/iclr/An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization @@ -0,0 +1 @@ +Recently, diffusion models have achieved remarkable success in generation tasks, including image and audio generation. However, like other generative models, diffusion models are prone to privacy issues. In this paper, we propose an efficient query-based membership inference attack (MIA), namely the Proximal Initialization Attack (PIA), which utilizes the ground-truth trajectory obtained from $\epsilon$ initialized at $t=0$, together with the predicted point, to infer membership. Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models. Moreover, previous works on the privacy of diffusion models have focused on vision tasks without considering audio tasks. Therefore, we also explore the robustness of diffusion models to MIA in the text-to-speech (TTS) task, which is an audio generation task. To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the TTS task. Experimental results indicate that models with mel-spectrogram (image-like) output are vulnerable to MIA, while models with audio output are relatively robust to MIA. Code is available at \url{https://github.com/kong13661/PIA}. \ No newline at end of file diff --git a/data/2024/iclr/An Efficient Tester-Learner for Halfspaces b/data/2024/iclr/An Efficient Tester-Learner for Halfspaces new file mode 100644 index 0000000000..0f813dbb6f --- /dev/null +++ b/data/2024/iclr/An Efficient Tester-Learner for Halfspaces @@ -0,0 +1 @@ +We give the first efficient algorithm for learning halfspaces in the testable learning model recently defined by Rubinfeld and Vasilyan (2023). In this model, a learner certifies that the accuracy of its output hypothesis is near optimal whenever the training set passes an associated test, and training sets drawn from some target distribution -- e.g., the Gaussian -- must pass the test. This model is more challenging than distribution-specific agnostic or Massart noise models where the learner is allowed to fail arbitrarily if the distributional assumption does not hold. We consider the setting where the target distribution is Gaussian (or more generally any strongly log-concave distribution) in $d$ dimensions and the noise model is either Massart or adversarial (agnostic). For Massart noise, our tester-learner runs in polynomial time and outputs a hypothesis with (information-theoretically optimal) error $\mathsf{opt} + \epsilon$ for any strongly log-concave target distribution. For adversarial noise, our tester-learner obtains error $O(\mathsf{opt}) + \epsilon$ in polynomial time when the target distribution is Gaussian; for strongly log-concave distributions, we obtain $\tilde{O}(\mathsf{opt}) + \epsilon$ in quasipolynomial time. 
Prior work on testable learning ignores the labels in the training set and checks that the empirical moments of the covariates are close to the moments of the base distribution. Here we develop new tests of independent interest that make critical use of the labels and combine them with the moment-matching approach of Gollakota et al. (2023). This enables us to simulate a variant of the algorithm of Diakonikolas et al. (2020) for learning noisy halfspaces using nonconvex SGD but in the testable learning setting. \ No newline at end of file diff --git a/data/2024/iclr/An Emulator for Fine-tuning Large Language Models using Small Language Models b/data/2024/iclr/An Emulator for Fine-tuning Large Language Models using Small Language Models new file mode 100644 index 0000000000..ae6d63f7d7 --- /dev/null +++ b/data/2024/iclr/An Emulator for Fine-tuning Large Language Models using Small Language Models @@ -0,0 +1 @@ +Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a pre-training stage that uses a very large, diverse dataset of text and a fine-tuning (sometimes, 'alignment') stage that uses targeted examples or other specifications of desired behaviors. While it has been hypothesized that knowledge and skills come from pre-training, and fine-tuning mostly filters this knowledge and skillset, this intuition has not been extensively tested. To aid in doing so, we introduce a novel technique for decoupling the knowledge and skills gained in these two stages, enabling a direct answer to the question, "What would happen if we combined the knowledge learned by a large model during pre-training with the knowledge learned by a small model during fine-tuning (or vice versa)?" Using an RL-based framework derived from recent developments in learning from human preferences, we introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates (or 'emulates') the result of pre-training and fine-tuning at different scales. Our experiments with EFT show that scaling up fine-tuning tends to improve helpfulness, while scaling up pre-training tends to improve factuality. Beyond decoupling scale, we show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models, essentially emulating the result of fine-tuning the large pre-trained model. Up-scaling consistently improves helpfulness and factuality of instruction-following models in the Llama, Llama-2, and Falcon families, without additional hyperparameters or training. \ No newline at end of file diff --git a/data/2024/iclr/An Extensible Framework for Open Heterogeneous Collaborative Perception b/data/2024/iclr/An Extensible Framework for Open Heterogeneous Collaborative Perception new file mode 100644 index 0000000000..2c4cf1a747 --- /dev/null +++ b/data/2024/iclr/An Extensible Framework for Open Heterogeneous Collaborative Perception @@ -0,0 +1 @@ +Collaborative perception aims to mitigate the limitations of single-agent perception, such as occlusions, by facilitating data exchange among multiple agents. However, most current works consider a homogeneous scenario where all agents use identical sensors and perception models. 
In reality, heterogeneous agent types may continually emerge and inevitably face a domain gap when collaborating with existing agents. In this paper, we introduce a new open heterogeneous problem: how to accommodate continually emerging new heterogeneous agent types into collaborative perception, while ensuring high perception performance and low integration cost? To address this problem, we propose HEterogeneous ALliance (HEAL), a novel extensible collaborative perception framework. HEAL first establishes a unified feature space with initial agents via a novel multi-scale foreground-aware Pyramid Fusion network. When heterogeneous new agents emerge with previously unseen modalities or models, we align them to the established unified space with an innovative backward alignment. This step only involves individual training on the new agent type, thus presenting extremely low training costs and high extensibility. To enrich agents' data heterogeneity, we bring OPV2V-H, a new large-scale dataset with more diverse sensor types. Extensive experiments on OPV2V-H and DAIR-V2X datasets show that HEAL surpasses SOTA methods in performance while reducing the training parameters by 91.5% when integrating 3 new agent types. We further implement a comprehensive codebase at: https://github.com/yifanlu0227/HEAL \ No newline at end of file diff --git a/data/2024/iclr/An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models b/data/2024/iclr/An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/An Intuitive Multi-Frequency Feature Representation for SO(3)-Equivariant Networks b/data/2024/iclr/An Intuitive Multi-Frequency Feature Representation for SO(3)-Equivariant Networks new file mode 100644 index 0000000000..ef594ab245 --- /dev/null +++ b/data/2024/iclr/An Intuitive Multi-Frequency Feature Representation for SO(3)-Equivariant Networks @@ -0,0 +1 @@ +The usage of 3D vision algorithms, such as shape reconstruction, remains limited because they require inputs to be at a fixed canonical rotation. Recently, a simple equivariant network, Vector Neuron (VN) has been proposed that can be easily used with the state-of-the-art 3D neural network (NN) architectures. However, its performance is limited because it is designed to use only three-dimensional features, which is insufficient to capture the details present in 3D data. In this paper, we introduce an equivariant feature representation for mapping a 3D point to a high-dimensional feature space. Our feature can discern multiple frequencies present in 3D data, which is the key to designing an expressive feature for 3D vision tasks. Our representation can be used as an input to VNs, and the results demonstrate that with our feature representation, VN captures more details, overcoming the limitation raised in its original paper. 
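For readers unfamiliar with Vector Neurons, the following toy NumPy check illustrates the SO(3)-equivariance constraint such features satisfy: a VN-style linear layer mixes a list of 3-D vector features with scalar weights, so rotating the input and rotating the output commute. This only demonstrates the basic three-dimensional case, not the higher-dimensional multi-frequency representation proposed in the abstract above.

import numpy as np

rng = np.random.default_rng(0)

def vn_linear(V, W):
    # V: (N, 3) list of 3-D vector features; W: (M, N) learned mixing weights.
    # The rotation acts on the right, so the layer is SO(3)-equivariant.
    return W @ V

def random_rotation():
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q * np.sign(np.linalg.det(Q))   # ensure det = +1

V = rng.standard_normal((8, 3))
W = rng.standard_normal((16, 8))
R = random_rotation()

# Equivariance check: rotating inputs first, or outputs after, gives the same result.
out_of_rotated_input = vn_linear(V @ R.T, W)
rotated_output = vn_linear(V, W) @ R.T
print(np.allclose(out_of_rotated_input, rotated_output))   # True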
\ No newline at end of file diff --git a/data/2024/iclr/An Investigation of Representation and Allocation Harms in Contrastive Learning b/data/2024/iclr/An Investigation of Representation and Allocation Harms in Contrastive Learning new file mode 100644 index 0000000000..751554a4ce --- /dev/null +++ b/data/2024/iclr/An Investigation of Representation and Allocation Harms in Contrastive Learning @@ -0,0 +1 @@ +The effect of underrepresentation on the performance of minority groups is known to be a serious problem in supervised learning settings; however, it has been underexplored so far in the context of self-supervised learning (SSL). In this paper, we demonstrate that contrastive learning (CL), a popular variant of SSL, tends to collapse representations of minority groups with certain majority groups. We refer to this phenomenon as representation harm and demonstrate it on image and text datasets using the corresponding popular CL methods. Furthermore, our causal mediation analysis of allocation harm on a downstream classification task reveals that representation harm is partly responsible for it, thus emphasizing the importance of studying and mitigating representation harm. Finally, we provide a theoretical explanation for representation harm using a stochastic block model that leads to a representational neural collapse in a contrastive learning setting. \ No newline at end of file diff --git a/data/2024/iclr/An Unforgeable Publicly Verifiable Watermark for Large Language Models b/data/2024/iclr/An Unforgeable Publicly Verifiable Watermark for Large Language Models new file mode 100644 index 0000000000..14e03e3178 --- /dev/null +++ b/data/2024/iclr/An Unforgeable Publicly Verifiable Watermark for Large Language Models @@ -0,0 +1 @@ +Recently, text watermarking algorithms for large language models (LLMs) have been proposed to mitigate the potential harms of text generated by LLMs, including fake news and copyright issues. However, current watermark detection algorithms require the secret key used in the watermark generation process, making them susceptible to security breaches and counterfeiting during public detection. To address this limitation, we propose an unforgeable publicly verifiable watermark algorithm named UPV that uses two different neural networks for watermark generation and detection, instead of using the same key at both stages. Meanwhile, the token embedding parameters are shared between the generation and detection networks, which makes the detection network achieve a high accuracy very efficiently. Experiments demonstrate that our algorithm attains high detection accuracy and computational efficiency through neural networks. Subsequent analysis confirms the high complexity involved in forging the watermark from the detection network. Our code is available at \href{https://github.com/THU-BPM/unforgeable_watermark}{https://github.com/THU-BPM/unforgeable\_watermark}. Additionally, our algorithm could also be accessed through MarkLLM \citep{pan2024markllm} \footnote{https://github.com/THU-BPM/MarkLLM}. 
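The following is a minimal, untrained PyTorch sketch of the architectural idea in the preceding abstract: separate generation and detection networks that share a single token-embedding table, so that detection does not require the secret key used during generation. The layer sizes, mean-pooling, and soft green-list logit bias are illustrative assumptions rather than the paper's exact UPV design.

import torch
import torch.nn as nn

class WatermarkNets(nn.Module):
    """Toy sketch: generation and detection networks sharing one token-embedding table."""

    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # shared parameters
        self.gen_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, vocab_size))
        self.det_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def watermark_bias(self, prefix_ids, delta=2.0):
        # Generation side: bias a "green" subset of the vocabulary given the prefix.
        ctx = self.embed(prefix_ids).mean(dim=0)
        green = torch.sigmoid(self.gen_net(ctx))     # soft green-list in [0, 1]
        return delta * green                         # added to the language model logits

    def detect(self, token_ids):
        # Detection side: score a whole text for the presence of the watermark.
        pooled = self.embed(token_ids).mean(dim=0)
        return torch.sigmoid(self.det_net(pooled))   # watermark probability

nets = WatermarkNets()
prefix = torch.randint(0, 1000, (12,))
lm_logits = torch.randn(1000)
biased_logits = lm_logits + nets.watermark_bias(prefix)   # sample the next token from these
print(nets.detect(torch.randint(0, 1000, (50,))).item())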
\ No newline at end of file diff --git a/data/2024/iclr/An interpretable error correction method for enhancing code-to-code translation b/data/2024/iclr/An interpretable error correction method for enhancing code-to-code translation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/An operator preconditioning perspective on training in physics-informed machine learning b/data/2024/iclr/An operator preconditioning perspective on training in physics-informed machine learning new file mode 100644 index 0000000000..9439f729c2 --- /dev/null +++ b/data/2024/iclr/An operator preconditioning perspective on training in physics-informed machine learning @@ -0,0 +1 @@ +In this paper, we investigate the behavior of gradient descent algorithms in physics-informed machine learning methods like PINNs, which minimize residuals connected to partial differential equations (PDEs). Our key result is that the difficulty in training these models is closely related to the conditioning of a specific differential operator. This operator, in turn, is associated to the Hermitian square of the differential operator of the underlying PDE. If this operator is ill-conditioned, it results in slow or infeasible training. Therefore, preconditioning this operator is crucial. We employ both rigorous mathematical analysis and empirical evaluations to investigate various strategies, explaining how they better condition this critical operator, and consequently improve training. \ No newline at end of file diff --git a/data/2024/iclr/Analysis of Learning a Flow-based Generative Model from Limited Sample Complexity b/data/2024/iclr/Analysis of Learning a Flow-based Generative Model from Limited Sample Complexity new file mode 100644 index 0000000000..7110b8a9f5 --- /dev/null +++ b/data/2024/iclr/Analysis of Learning a Flow-based Generative Model from Limited Sample Complexity @@ -0,0 +1 @@ +We study the problem of training a flow-based generative model, parametrized by a two-layer autoencoder, to sample from a high-dimensional Gaussian mixture. We provide a sharp end-to-end analysis of the problem. First, we provide a tight closed-form characterization of the learnt velocity field, when parametrized by a shallow denoising auto-encoder trained on a finite number $n$ of samples from the target distribution. Building on this analysis, we provide a sharp description of the corresponding generative flow, which pushes the base Gaussian density forward to an approximation of the target density. In particular, we provide closed-form formulae for the distance between the mean of the generated mixture and the mean of the target mixture, which we show decays as $\Theta_n(\frac{1}{n})$. Finally, this rate is shown to be in fact Bayes-optimal. \ No newline at end of file diff --git a/data/2024/iclr/Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps b/data/2024/iclr/Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps new file mode 100644 index 0000000000..6ecac81228 --- /dev/null +++ b/data/2024/iclr/Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps @@ -0,0 +1 @@ +Transformers are ubiquitous in wide tasks. Interpreting their internals is a pivotal goal. Nevertheless, their particular components, feed-forward (FF) blocks, have typically been less analyzed despite their substantial parameter amounts. 
We analyze the input contextualization effects of FF blocks by rendering them in the attention maps as a human-friendly visualization scheme. Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions. In addition, FF and its surrounding components tend to cancel out each other's effects, suggesting potential redundancy in the processing of the Transformer layer. \ No newline at end of file diff --git a/data/2024/iclr/Analyzing and Improving Optimal-Transport-based Adversarial Networks b/data/2024/iclr/Analyzing and Improving Optimal-Transport-based Adversarial Networks new file mode 100644 index 0000000000..9573a13996 --- /dev/null +++ b/data/2024/iclr/Analyzing and Improving Optimal-Transport-based Adversarial Networks @@ -0,0 +1 @@ +Optimal Transport (OT) problem aims to find a transport plan that bridges two distributions while minimizing a given cost function. OT theory has been widely utilized in generative modeling. In the beginning, OT distance has been used as a measure for assessing the distance between data and generated distributions. Recently, OT transport map between data and prior distributions has been utilized as a generative model. These OT-based generative models share a similar adversarial training objective. In this paper, we begin by unifying these OT-based adversarial methods within a single framework. Then, we elucidate the role of each component in training dynamics through a comprehensive analysis of this unified framework. Moreover, we suggest a simple but novel method that improves the previously best-performing OT-based model. Intuitively, our approach conducts a gradual refinement of the generated distribution, progressively aligning it with the data distribution. Our approach achieves a FID score of 2.51 on CIFAR-10 and 5.99 on CelebA-HQ-256, outperforming unified OT-based adversarial approaches. \ No newline at end of file diff --git a/data/2024/iclr/Analyzing and Mitigating Object Hallucination in Large Vision-Language Models b/data/2024/iclr/Analyzing and Mitigating Object Hallucination in Large Vision-Language Models new file mode 100644 index 0000000000..4b92f4e610 --- /dev/null +++ b/data/2024/iclr/Analyzing and Mitigating Object Hallucination in Large Vision-Language Models @@ -0,0 +1 @@ +Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. 
In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE. \ No newline at end of file diff --git a/data/2024/iclr/AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning b/data/2024/iclr/AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning new file mode 100644 index 0000000000..573e79c9a5 --- /dev/null +++ b/data/2024/iclr/AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning @@ -0,0 +1 @@ +With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. Codes and pre-trained weights are available at https://github.com/guoyww/AnimateDiff. \ No newline at end of file diff --git a/data/2024/iclr/Annealing Self-Distillation Rectification Improves Adversarial Training b/data/2024/iclr/Annealing Self-Distillation Rectification Improves Adversarial Training new file mode 100644 index 0000000000..7816901c8f --- /dev/null +++ b/data/2024/iclr/Annealing Self-Distillation Rectification Improves Adversarial Training @@ -0,0 +1 @@ +In standard adversarial training, models are optimized to fit one-hot labels within allowable adversarial perturbation budgets. However, the ignorance of underlying distribution shifts brought by perturbations causes the problem of robust overfitting. To address this issue and enhance adversarial robustness, we analyze the characteristics of robust models and identify that robust models tend to produce smoother and well-calibrated outputs. Based on the observation, we propose a simple yet effective method, Annealing Self-Distillation Rectification (ADR), which generates soft labels as a better guidance mechanism that accurately reflects the distribution shift under attack during adversarial training. By utilizing ADR, we can obtain rectified distributions that significantly improve model robustness without the need for pre-trained models or extensive extra computation. 
Moreover, our method facilitates seamless plug-and-play integration with other adversarial training techniques by replacing the hard labels in their objectives. We demonstrate the efficacy of ADR through extensive experiments and strong performances across datasets. \ No newline at end of file diff --git a/data/2024/iclr/AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection b/data/2024/iclr/AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection new file mode 100644 index 0000000000..e951b1c1d2 --- /dev/null +++ b/data/2024/iclr/AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection @@ -0,0 +1 @@ +Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is a crucial task when training data is not accessible due to various concerns, \eg, data privacy, yet it is challenging since the models need to generalize to anomalies across different domains where the appearance of foreground objects, abnormal regions, and background features, such as defects/tumors on different products/organs, can vary significantly. Recently large pre-trained vision-language models (VLMs), such as CLIP, have demonstrated strong zero-shot recognition ability in various vision tasks, including anomaly detection. However, their ZSAD performance is weak since the VLMs focus more on modeling the class semantics of the foreground objects rather than the abnormality/normality in the images. In this paper we introduce a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across different domains. The key insight of AnomalyCLIP is to learn object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects. This allows our model to focus on the abnormal image regions rather than the object semantics, enabling generalized normality and abnormality recognition on diverse types of objects. Large-scale experiments on 17 real-world anomaly detection datasets show that AnomalyCLIP achieves superior zero-shot performance of detecting and segmenting anomalies in datasets of highly diverse class semantics from various defect inspection and medical imaging domains. Code will be made available at https://github.com/zqhang/AnomalyCLIP. \ No newline at end of file diff --git a/data/2024/iclr/AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? b/data/2024/iclr/AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? new file mode 100644 index 0000000000..53046ff9cb --- /dev/null +++ b/data/2024/iclr/AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? @@ -0,0 +1 @@ +Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal. 
We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives. They can help provide prior knowledge about possible next actions and infer the goal given the observed part of a procedure, respectively. To leverage the LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure by chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all above benchmarks, and can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction via qualitative analysis. Code and model will be released at https://brown-palm.github.io/AntGPT \ No newline at end of file diff --git a/data/2024/iclr/AnyText: Multilingual Visual Text Generation and Editing b/data/2024/iclr/AnyText: Multilingual Visual Text Generation and Editing new file mode 100644 index 0000000000..46f21b8873 --- /dev/null +++ b/data/2024/iclr/AnyText: Multilingual Visual Text Generation and Editing @@ -0,0 +1 @@ +Diffusion-model-based text-to-image generation has achieved impressive results recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it is still possible to give the show away when focusing on the text area in the generated image. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed a text-control diffusion loss and a text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text image dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on the AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality. Our project will be open-sourced at https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.
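The AnyText abstract above trains with a text-control diffusion loss plus a text perceptual loss computed from OCR features. As a rough illustration of that kind of composite objective, here is a minimal PyTorch-style sketch; the tiny OCR encoder, the random tensors, and the weight `lambda_text` are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (illustrative only): a denoising loss combined with an OCR-feature
# "text perceptual" term, loosely in the spirit of the AnyText training objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyOCREncoder(nn.Module):
    """Stand-in for an OCR feature extractor applied to text regions."""
    def __init__(self, dim=16, feat=8):
        super().__init__()
        self.net = nn.Linear(dim, feat)

    def forward(self, text_region):
        return self.net(text_region)

def combined_loss(pred_noise, true_noise, ocr, recon_text_region, gt_text_region, lambda_text=0.01):
    l_diffusion = F.mse_loss(pred_noise, true_noise)                  # denoising objective
    l_text = F.mse_loss(ocr(recon_text_region), ocr(gt_text_region))  # perceptual term on OCR features
    return l_diffusion + lambda_text * l_text

if __name__ == "__main__":
    ocr = TinyOCREncoder()
    pred, true = torch.randn(4, 16), torch.randn(4, 16)
    recon, gt = torch.randn(4, 16), torch.randn(4, 16)
    print(float(combined_loss(pred, true, ocr, recon, gt)))
```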
\ No newline at end of file diff --git a/data/2024/iclr/Approximating Nash Equilibria in Normal-Form Games via Stochastic Optimization b/data/2024/iclr/Approximating Nash Equilibria in Normal-Form Games via Stochastic Optimization new file mode 100644 index 0000000000..e48b69b067 --- /dev/null +++ b/data/2024/iclr/Approximating Nash Equilibria in Normal-Form Games via Stochastic Optimization @@ -0,0 +1 @@ +We propose the first loss function for approximate Nash equilibria of normal-form games that is amenable to unbiased Monte Carlo estimation. This construction allows us to deploy standard non-convex stochastic optimization techniques for approximating Nash equilibria, resulting in novel algorithms with provable guarantees. We complement our theoretical analysis with experiments demonstrating that stochastic gradient descent can outperform previous state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/iclr/ArchLock: Locking DNN Transferability at the Architecture Level with a Zero-Cost Binary Predictor b/data/2024/iclr/ArchLock: Locking DNN Transferability at the Architecture Level with a Zero-Cost Binary Predictor new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Are Bert Family Good Instruction Followers? A Study on Their Potential And Limitations b/data/2024/iclr/Are Bert Family Good Instruction Followers? A Study on Their Potential And Limitations new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Are Human-generated Demonstrations Necessary for In-context Learning? b/data/2024/iclr/Are Human-generated Demonstrations Necessary for In-context Learning? new file mode 100644 index 0000000000..dadbb7d86e --- /dev/null +++ b/data/2024/iclr/Are Human-generated Demonstrations Necessary for In-context Learning? @@ -0,0 +1 @@ +Despite the promising few-shot ability of large language models (LLMs), the standard paradigm of In-context Learning (ICL) suffers from two disadvantages: susceptibility to the selected demonstrations and the intricacy of generating them. In this paper, we raise the fundamental question of whether human-generated demonstrations are necessary for ICL. To answer this question, we propose the self-contemplation prompting strategy (SEC), a paradigm free from human-crafted demonstrations. The key point of SEC is that, instead of using hand-crafted examples as demonstrations in ICL, SEC asks LLMs to first create demonstrations on their own, based on which the final output is generated. SEC is a flexible framework and can be adapted to both vanilla ICL and chain-of-thought (CoT) prompting, but with greater ease, as the manual generation of both examples and rationales can be skipped. Extensive experiments on arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy and achieves comparable results to ICL with hand-crafted demonstrations. This demonstrates that, for many tasks, contemporary LLMs possess a sufficient level of competence to rely exclusively on their own capacity for decision making, removing the need for external training data. Code is available at https://github.com/ruili33/SEC. \ No newline at end of file diff --git a/data/2024/iclr/Are Models Biased on Text without Gender-related Language? b/data/2024/iclr/Are Models Biased on Text without Gender-related Language?
new file mode 100644 index 0000000000..0313db3616 --- /dev/null +++ b/data/2024/iclr/Are Models Biased on Text without Gender-related Language? @@ -0,0 +1 @@ +Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations that are present in the training data. In this paper, we focus on bias where the effect from training data is unclear, and instead address the question: Do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine whether a sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at https://ucinlp.github.io/unstereo-eval. \ No newline at end of file diff --git a/data/2024/iclr/Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? b/data/2024/iclr/Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? new file mode 100644 index 0000000000..a295e19407 --- /dev/null +++ b/data/2024/iclr/Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? @@ -0,0 +1 @@ +Existing analyses of the expressive capacity of Transformer models have required excessively deep layers for data memorization, leading to a discrepancy with the Transformers actually used in practice. This is primarily due to the interpretation of the softmax function as an approximation of the hardmax function. By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence. As a consequence, we show that one-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
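As a small concrete companion to the last abstract above, the following NumPy sketch builds one self-attention layer whose query and key weight matrices are rank-1 and shows that the softmax weighting still mixes the entire input sequence; the dimensions and random weights are arbitrary, and this only illustrates the setting, not the paper's proof.

```python
# Minimal NumPy sketch (illustrative only): single-head self-attention with rank-1 W_Q, W_K.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5, 8, 1                      # sequence length, embedding dim, weight rank
X = rng.normal(size=(n, d))

# Low-rank factorizations W_Q = a_q b_q^T and W_K = a_k b_k^T (rank r = 1).
a_q, b_q = rng.normal(size=(d, r)), rng.normal(size=(d, r))
a_k, b_k = rng.normal(size=(d, r)), rng.normal(size=(d, r))
W_V = rng.normal(size=(d, d))

Q = X @ (a_q @ b_q.T)
K = X @ (a_k @ b_k.T)
V = X @ W_V

scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)   # softmax = Boltzmann-style weighting
out = attn @ V

print(attn.round(3))   # every row spreads attention over the whole sequence
print(out.shape)       # (5, 8): each token's output is a context-dependent mixture
```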
\ No newline at end of file diff --git a/data/2024/iclr/Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition b/data/2024/iclr/Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition new file mode 100644 index 0000000000..6f00eccbd3 --- /dev/null +++ b/data/2024/iclr/Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition @@ -0,0 +1 @@ +The ROC curve is the major tool for assessing not only the performance but also the fairness properties of a similarity scoring function. In order to draw reliable conclusions based on empirical ROC analysis, accurately evaluating the uncertainty level related to statistical versions of the ROC curves of interest is absolutely necessary, especially for applications with considerable societal impact such as Face Recognition. In this article, we prove asymptotic guarantees for empirical ROC curves of similarity functions as well as for by-product metrics useful to assess fairness. We also explain that, because the false acceptance/rejection rates are of the form of U-statistics in the case of similarity scoring, the naive bootstrap approach may jeopardize the assessment procedure. A dedicated recentering technique must be used instead. Beyond the theoretical analysis carried out, various experiments using real face image datasets provide strong empirical evidence of the practical relevance of the methods promoted here, when applied to several ROC-based measures such as popular fairness metrics. \ No newline at end of file diff --git a/data/2024/iclr/Asymptotically Free Sketched Ridge Ensembles: Risks, Cross-Validation, and Tuning b/data/2024/iclr/Asymptotically Free Sketched Ridge Ensembles: Risks, Cross-Validation, and Tuning new file mode 100644 index 0000000000..87749a2b48 --- /dev/null +++ b/data/2024/iclr/Asymptotically Free Sketched Ridge Ensembles: Risks, Cross-Validation, and Tuning @@ -0,0 +1 @@ +We employ random matrix theory to establish consistency of generalized cross validation (GCV) for estimating prediction risks of sketched ridge regression ensembles, enabling efficient and consistent tuning of regularization and sketching parameters. Our results hold for a broad class of asymptotically free sketches under very mild data assumptions. For squared prediction risk, we provide a decomposition into an unsketched equivalent implicit ridge bias and a sketching-based variance, and prove that the risk can be globally optimized by only tuning sketch size in infinite ensembles. For general subquadratic prediction risk functionals, we extend GCV to construct consistent risk estimators, and thereby obtain distributional convergence of the GCV-corrected predictions in Wasserstein-2 metric. This in particular allows construction of prediction intervals with asymptotically correct coverage conditional on the training data. We also propose an"ensemble trick"whereby the risk for unsketched ridge regression can be efficiently estimated via GCV using small sketched ridge ensembles. We empirically validate our theoretical results using both synthetic and real large-scale datasets with practical sketches including CountSketch and subsampled randomized discrete cosine transforms. \ No newline at end of file diff --git a/data/2024/iclr/At Which Training Stage Does Code Data Help LLMs Reasoning? b/data/2024/iclr/At Which Training Stage Does Code Data Help LLMs Reasoning? 
new file mode 100644 index 0000000000..2fcb8d7a4d --- /dev/null +++ b/data/2024/iclr/At Which Training Stage Does Code Data Help LLMs Reasoning? @@ -0,0 +1 @@ +Large Language Models (LLMs) have exhibited remarkable reasoning capabilities and become the foundation of language technologies. Inspired by the great success of code data in training LLMs, we naturally wonder at which training stage introducing code data actually helps LLMs with reasoning. To this end, this paper systematically explores the impact of code data on LLMs at different stages. Concretely, we introduce code data at the pre-training stage, at the instruction-tuning stage, and at both, respectively. Then, the reasoning capability of LLMs is comprehensively and fairly evaluated via six reasoning tasks in five domains. We critically analyze the experimental results and provide conclusions with insights. First, pre-training LLMs with a mixture of code and text can significantly enhance LLMs' general reasoning capability almost without negative transfer on other tasks. Besides, at the instruction-tuning stage, code data endows LLMs with task-specific reasoning capability. Moreover, a dynamic mixing strategy of code and text data helps LLMs learn reasoning capability step by step during training. These insights deepen the understanding of LLMs' reasoning ability with respect to applications such as scientific question answering and legal support. The source code and model parameters are released at https://github.com/yingweima2022/CodeLLM. \ No newline at end of file diff --git a/data/2024/iclr/AttEXplore: Attribution for Explanation with model parameters eXploration b/data/2024/iclr/AttEXplore: Attribution for Explanation with model parameters eXploration new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models b/data/2024/iclr/Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models new file mode 100644 index 0000000000..4756c94aff --- /dev/null +++ b/data/2024/iclr/Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models @@ -0,0 +1 @@ +We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text. We propose modeling factual queries as constraint satisfaction problems and use this framework to investigate how the LLM interacts internally with factual constraints. We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations. We curate a suite of 10 datasets containing over 40,000 prompts to study the task of predicting factual errors with the Llama-2 family across all scales (7B, 13B, 70B). We propose SAT Probe, a method that probes attention patterns, can predict factual errors and fine-grained constraint satisfaction, and allows early error identification. The approach and findings take another step towards using the mechanistic understanding of LLMs to enhance their reliability.
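The SAT Probe abstract above reads off how much attention flows to constraint tokens and uses that signal to predict factual errors. A heavily simplified sketch of that recipe, with synthetic attention maps, hypothetical constraint indices, made-up labels, and a plain logistic-regression probe standing in for the paper's method, could look like this:

```python
# Simplified sketch (not the paper's exact SAT Probe): use the attention mass that the
# final position places on "constraint" tokens as features for a linear error probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

def constraint_attention_features(attn, constraint_idx):
    # attn: (layers, heads, seq, seq) attention weights for one prompt.
    # One feature per (layer, head): attention from the last position to constraint tokens.
    return attn[:, :, -1, constraint_idx].sum(axis=-1).ravel()

rng = np.random.default_rng(0)
L, H, S = 4, 8, 16
prompts = [rng.dirichlet(np.ones(S), size=(L, H, S)) for _ in range(200)]  # synthetic attention maps
constraint_idx = [3, 4, 5]                                                 # hypothetical constraint tokens
X = np.stack([constraint_attention_features(a, constraint_idx) for a in prompts])
y = (X.mean(axis=1) > np.median(X.mean(axis=1))).astype(int)               # synthetic labels for the demo

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe train accuracy:", probe.score(X, y))
```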
\ No newline at end of file diff --git a/data/2024/iclr/Attention-Guided Contrastive Role Representations for Multi-agent Reinforcement Learning b/data/2024/iclr/Attention-Guided Contrastive Role Representations for Multi-agent Reinforcement Learning new file mode 100644 index 0000000000..3dd3284c8c --- /dev/null +++ b/data/2024/iclr/Attention-Guided Contrastive Role Representations for Multi-agent Reinforcement Learning @@ -0,0 +1 @@ +Real-world multi-agent tasks usually involve dynamic team composition with the emergence of roles, which should also be a key to efficient cooperation in multi-agent reinforcement learning (MARL). Drawing inspiration from the correlation between roles and agent's behavior patterns, we propose a novel framework of **A**ttention-guided **CO**ntrastive **R**ole representation learning for **M**ARL (**ACORM**) to promote behavior heterogeneity, knowledge transfer, and skillful coordination across agents. First, we introduce mutual information maximization to formalize role representation learning, derive a contrastive learning objective, and concisely approximate the distribution of negative pairs. Second, we leverage an attention mechanism to prompt the global state to attend to learned role representations in value decomposition, implicitly guiding agent coordination in a skillful role space to yield more expressive credit assignment. Experiments on challenging StarCraft II micromanagement and Google research football tasks demonstrate the state-of-the-art performance of our method and its advantages over existing approaches. Our code is available at [https://github.com/NJU-RL/ACORM](https://github.com/NJU-RL/ACORM). \ No newline at end of file diff --git a/data/2024/iclr/Attention-based Iterative Decomposition for Tensor Product Representation b/data/2024/iclr/Attention-based Iterative Decomposition for Tensor Product Representation new file mode 100644 index 0000000000..ba61f50040 --- /dev/null +++ b/data/2024/iclr/Attention-based Iterative Decomposition for Tensor Product Representation @@ -0,0 +1 @@ +In recent research, Tensor Product Representation (TPR) is applied for the systematic generalization task of deep neural networks by learning the compositional structure of data. However, such prior works show limited performance in discovering and representing the symbolic structure from unseen test data because their decomposition to the structural representations was incomplete. In this work, we propose an Attention-based Iterative Decomposition (AID) module designed to enhance the decomposition operations for the structured representations encoded from the sequential input data with TPR. Our AID can be easily adapted to any TPR-based model and provides enhanced systematic decomposition through a competitive attention mechanism between input features and structured representations. In our experiments, AID shows effectiveness by significantly improving the performance of TPR-based prior works on the series of systematic generalization tasks. Moreover, in the quantitative and qualitative evaluations, AID produces more compositional and well-bound structural representations than other works. 
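The AID abstract above assumes familiarity with Tensor Product Representations. As background only (this is not AID itself), here is a toy NumPy sketch of TPR binding via outer products and unbinding with orthonormal role vectors; the dimensions and random fillers are arbitrary.

```python
# Background sketch (illustrative): bind role/filler pairs into a TPR and unbind a filler.
import numpy as np

rng = np.random.default_rng(0)
d_role, d_fill, n = 6, 4, 3
roles, _ = np.linalg.qr(rng.normal(size=(d_role, n)))   # orthonormal role vectors (columns)
fillers = rng.normal(size=(n, d_fill))

# TPR of the whole structure: sum of outer products role_i (x) filler_i
T = sum(np.outer(roles[:, i], fillers[i]) for i in range(n))

# Unbinding: project with a role vector to retrieve its filler
recovered = roles[:, 1] @ T
print(np.allclose(recovered, fillers[1]))   # True, thanks to orthonormal roles
```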
\ No newline at end of file diff --git a/data/2024/iclr/AuG-KD: Anchor-Based Mixup Generation for Out-of-Domain Knowledge Distillation b/data/2024/iclr/AuG-KD: Anchor-Based Mixup Generation for Out-of-Domain Knowledge Distillation new file mode 100644 index 0000000000..21f4705b43 --- /dev/null +++ b/data/2024/iclr/AuG-KD: Anchor-Based Mixup Generation for Out-of-Domain Knowledge Distillation @@ -0,0 +1 @@ +Due to privacy or patent concerns, a growing number of large models are released without granting access to their training data, making transferring their knowledge inefficient and problematic. In response, Data-Free Knowledge Distillation (DFKD) methods have emerged as direct solutions. However, simply adopting models derived from DFKD for real-world applications suffers from significant performance degradation, due to the discrepancy between teachers' training data and real-world scenarios (student domain). The degradation stems from the portions of teachers' knowledge that are not applicable to the student domain. They are specific to the teacher domain and would undermine students' performance. Hence, selectively transferring teachers' appropriate knowledge becomes the primary challenge in DFKD. In this work, we propose a simple but effective method, AuG-KD. It utilizes an uncertainty-guided and sample-specific anchor to align student-domain data with the teacher domain and leverages a generative method to progressively trade off the learning process between OOD knowledge distillation and domain-specific information learning via mixup learning. Extensive experiments across 3 datasets and 8 settings demonstrate the stability and superiority of our approach. Code is available at https://github.com/IshiKura-a/AuG-KD. \ No newline at end of file diff --git a/data/2024/iclr/Augmented Bayesian Policy Search b/data/2024/iclr/Augmented Bayesian Policy Search new file mode 100644 index 0000000000..3ba2ce5982 --- /dev/null +++ b/data/2024/iclr/Augmented Bayesian Policy Search @@ -0,0 +1 @@ +Deterministic policies are often preferred over stochastic ones when implemented on physical systems. They can prevent erratic and harmful behaviors while being easier to implement and interpret. However, in practice, exploration is largely performed by stochastic policies. First-order Bayesian Optimization (BO) methods offer a principled way of performing exploration using deterministic policies. This is done through a learned probabilistic model of the objective function and its gradient. Nonetheless, such approaches treat policy search as a black-box problem and thus neglect the reinforcement learning nature of the problem. In this work, we leverage the performance difference lemma to introduce a novel mean function for the probabilistic model. This results in augmenting BO methods with the action-value function. Hence, we call our method Augmented Bayesian Search (ABS). Interestingly, this new mean function enhances the posterior gradient with the deterministic policy gradient, effectively bridging the gap between BO and policy gradient methods. The resulting algorithm combines the convenience of direct policy search with the scalability of reinforcement learning. We validate ABS on high-dimensional locomotion problems and demonstrate competitive performance compared to existing direct policy search schemes.
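The Augmented Bayesian Policy Search abstract above replaces the usual zero prior mean of the Bayesian-optimization surrogate with a mean informed by the action-value function. The following toy NumPy sketch shows only that mechanical idea on a 1-D problem; the objective, the `value_based_mean` stand-in for the action-value-derived term, and all hyperparameters are invented for illustration and do not reproduce the paper's construction.

```python
# Toy sketch (assumptions throughout): a GP surrogate over policy parameters whose
# prior mean is a value-informed estimate of the return, rather than zero.
import numpy as np

def rbf(A, B, ell=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def objective(theta):               # unknown return J(theta); 1-D toy problem
    return np.sin(3 * theta[..., 0]) + 0.1 * theta[..., 0]

def value_based_mean(theta):        # hypothetical critic-derived prior mean
    return np.sin(3 * theta[..., 0]) * 0.8

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(6, 1))            # evaluated policies
y = objective(X) + 0.01 * rng.normal(size=6)   # noisy returns
Xs = np.linspace(-1, 1, 101)[:, None]          # candidate policies

K = rbf(X, X) + 1e-4 * np.eye(len(X))
Ks = rbf(Xs, X)
# GP posterior mean with non-zero prior mean m(.): m(x*) + k* K^{-1} (y - m(X))
resid = y - value_based_mean(X)
post_mean = value_based_mean(Xs) + Ks @ np.linalg.solve(K, resid)
print("next policy to evaluate:", Xs[np.argmax(post_mean)])
```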
\ No newline at end of file diff --git a/data/2024/iclr/Augmenting Transformers with Recursively Composed Multi-grained Representations b/data/2024/iclr/Augmenting Transformers with Recursively Composed Multi-grained Representations new file mode 100644 index 0000000000..064e8c5384 --- /dev/null +++ b/data/2024/iclr/Augmenting Transformers with Recursively Composed Multi-grained Representations @@ -0,0 +1 @@ +We present ReCAT, a recursive composition augmented Transformer that is able to explicitly model hierarchical syntactic structures of raw texts without relying on gold trees during both learning and inference. Existing research along this line restricts data to follow a hierarchical tree structure and thus lacks inter-span communications. To overcome the problem, we propose a novel contextual inside-outside (CIO) layer that learns contextualized representations of spans through bottom-up and top-down passes, where a bottom-up pass forms representations of high-level spans by composing low-level spans, while a top-down pass combines information inside and outside a span. By stacking several CIO layers between the embedding layer and the attention layers in Transformer, the ReCAT model can perform both deep intra-span and deep inter-span interactions, and thus generate multi-grained representations fully contextualized with other spans. Moreover, the CIO layers can be jointly pre-trained with Transformers, making ReCAT enjoy scaling ability, strong performance, and interpretability at the same time. We conduct experiments on various sentence-level and span-level tasks. Evaluation results indicate that ReCAT can significantly outperform vanilla Transformer models on all span-level tasks and baselines that combine recursive networks with Transformers on natural language inference tasks. More interestingly, the hierarchical structures induced by ReCAT exhibit strong consistency with human-annotated syntactic trees, indicating good interpretability brought by the CIO layers. \ No newline at end of file diff --git a/data/2024/iclr/AutoChunk: Automated Activation Chunk for Memory-Efficient Deep Learning Inference b/data/2024/iclr/AutoChunk: Automated Activation Chunk for Memory-Efficient Deep Learning Inference new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models b/data/2024/iclr/AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models new file mode 100644 index 0000000000..41d7dbffbc --- /dev/null +++ b/data/2024/iclr/AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models @@ -0,0 +1 @@ +The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. 
In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts via a carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively. \ No newline at end of file diff --git a/data/2024/iclr/AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ b/data/2024/iclr/AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ new file mode 100644 index 0000000000..abe487bb71 --- /dev/null +++ b/data/2024/iclr/AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ @@ -0,0 +1 @@ +Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ, the first large-scale TikZ dataset, consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available. \ No newline at end of file diff --git a/data/2024/iclr/Automatic Functional Differentiation in JAX b/data/2024/iclr/Automatic Functional Differentiation in JAX new file mode 100644 index 0000000000..3fada146fa --- /dev/null +++ b/data/2024/iclr/Automatic Functional Differentiation in JAX @@ -0,0 +1 @@ +We extend JAX with the capability to automatically differentiate higher-order functions (functionals and operators). By representing functions as a generalization of arrays, we seamlessly use JAX's existing primitive system to implement higher-order functions. We present a set of primitive operators that serve as foundational building blocks for constructing several key types of functionals. For every introduced primitive operator, we derive and implement both linearization and transposition rules, aligning with JAX's internal protocols for forward and reverse mode automatic differentiation. This enhancement allows for functional differentiation in the same syntax traditionally used for functions.
The resulting functional gradients are themselves functions ready to be invoked in python. We showcase this tool's efficacy and simplicity through applications where functional derivatives are indispensable. The source code of this work is released at https://github.com/sail-sg/autofd . \ No newline at end of file diff --git a/data/2024/iclr/Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost b/data/2024/iclr/Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost new file mode 100644 index 0000000000..13f0f47e75 --- /dev/null +++ b/data/2024/iclr/Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost @@ -0,0 +1 @@ +We aim at exploiting additional auxiliary labels from an independent (auxiliary) task to boost the primary task performance which we focus on, while preserving a single task inference cost of the primary task. While most existing auxiliary learning methods are optimization-based relying on loss weights/gradients manipulation, our method is architecture-based with a flexible asymmetric structure for the primary and auxiliary tasks, which produces different networks for training and inference. Specifically, starting from two single task networks/branches (each representing a task), we propose a novel method with evolving networks where only primary-to-auxiliary links exist as the cross-task connections after convergence. These connections can be removed during the primary task inference, resulting in a single-task inference cost. We achieve this by formulating a Neural Architecture Search (NAS) problem, where we initialize bi-directional connections in the search space and guide the NAS optimization converging to an architecture with only the single-side primary-to-auxiliary connections. Moreover, our method can be incorporated with optimization-based auxiliary learning approaches. Extensive experiments with six tasks on NYU v2, CityScapes, and Taskonomy datasets using VGG, ResNet, and ViT backbones validate the promising performance. The codes are available at https://github.com/ethanygao/Aux-NAS. \ No newline at end of file diff --git a/data/2024/iclr/B-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis b/data/2024/iclr/B-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis new file mode 100644 index 0000000000..382f7358c9 --- /dev/null +++ b/data/2024/iclr/B-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis @@ -0,0 +1 @@ +Program synthesis aims to create accurate, executable programs from problem specifications, specifically from natural language descriptions in our context. Recent studies have leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. The application of RL focuses on directly optimizing for functional correctness, offering an advantage over conventional supervised methods. Despite policy-based RL methods dominating the literature on RL for program synthesis, the nature of program synthesis tasks hints at a natural alignment with value-based methods. This stems from the rich collection of off-policy programs, including those developed by human programmers and also historical samples, coupled with the straightforward verification of generated programs through automated unit testing, meaning rewards are easy to obtain. 
Diverging from the dominant use of policy-based algorithms, our work explores the feasibility of value-based approaches, leading to the development of our $\mathcal{B}$-Coder (pronounced Bellman coder). Yet, training value-based methods presents challenges due to the enormous search space inherent to program synthesis. To this end, we introduce an initialization protocol for RL agents utilizing pre-trained LMs and a conservative Bellman operator to reduce training complexities. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrated $\mathcal{B}$-Coder's capability in achieving state-of-the-art performance when compared to policy-based methods. Remarkably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs. \ No newline at end of file diff --git a/data/2024/iclr/BECLR: Batch Enhanced Contrastive Few-Shot Learning b/data/2024/iclr/BECLR: Batch Enhanced Contrastive Few-Shot Learning new file mode 100644 index 0000000000..93f8bd0c7c --- /dev/null +++ b/data/2024/iclr/BECLR: Batch Enhanced Contrastive Few-Shot Learning @@ -0,0 +1 @@ +Learning quickly from very few labeled samples is a fundamental attribute that separates machines and humans in the era of deep representation learning. Unsupervised few-shot learning (U-FSL) aspires to bridge this gap by discarding the reliance on annotations at training time. Intrigued by the success of contrastive learning approaches in the realm of U-FSL, we structurally approach their shortcomings in both pretraining and downstream inference stages. We propose a novel Dynamic Clustered mEmory (DyCE) module to promote a highly separable latent representation space for enhancing positive sampling at the pretraining phase and infusing implicit class-level insights into unsupervised contrastive learning. We then tackle the, somehow overlooked yet critical, issue of sample bias at the few-shot inference stage. We propose an iterative Optimal Transport-based distribution Alignment (OpTA) strategy and demonstrate that it efficiently addresses the problem, especially in low-shot scenarios where FSL approaches suffer the most from sample bias. We later on discuss that DyCE and OpTA are two intertwined pieces of a novel end-to-end approach (we coin as BECLR), constructively magnifying each other's impact. We then present a suite of extensive quantitative and qualitative experimentation to corroborate that BECLR sets a new state-of-the-art across ALL existing U-FSL benchmarks (to the best of our knowledge), and significantly outperforms the best of the current baselines (codebase available at: https://github.com/stypoumic/BECLR). \ No newline at end of file diff --git a/data/2024/iclr/BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks b/data/2024/iclr/BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks new file mode 100644 index 0000000000..70300ede05 --- /dev/null +++ b/data/2024/iclr/BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks @@ -0,0 +1 @@ +The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. 
This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://github.com/frederikkemarin/BEND. \ No newline at end of file diff --git a/data/2024/iclr/BENO: Boundary-embedded Neural Operators for Elliptic PDEs b/data/2024/iclr/BENO: Boundary-embedded Neural Operators for Elliptic PDEs new file mode 100644 index 0000000000..773b29f935 --- /dev/null +++ b/data/2024/iclr/BENO: Boundary-embedded Neural Operators for Elliptic PDEs @@ -0,0 +1 @@ +Elliptic partial differential equations (PDEs) are a major class of time-independent PDEs that play a key role in many scientific and engineering domains such as fluid dynamics, plasma physics, and solid mechanics. Recently, neural operators have emerged as a promising technique to solve elliptic PDEs more efficiently by directly mapping the input to solutions. However, existing networks typically cannot handle complex geometries and inhomogeneous boundary values present in the real world. Here we introduce Boundary-Embedded Neural Operators (BENO), a novel neural operator architecture that embeds the complex geometries and inhomogeneous boundary values into the solving of elliptic PDEs. Inspired by classical Green's function, BENO consists of two branches of Graph Neural Networks (GNNs) for interior source term and boundary values, respectively. Furthermore, a Transformer encoder maps the global boundary geometry into a latent vector which influences each message passing layer of the GNNs. We test our model extensively in elliptic PDEs with various boundary conditions. We show that all existing baseline methods fail to learn the solution operator. In contrast, our model, endowed with boundary-embedded architecture, outperforms state-of-the-art neural operators and strong baselines by an average of 60.96\%. Our source code can be found https://github.com/AI4Science-WestlakeU/beno.git. \ No newline at end of file diff --git a/data/2024/iclr/BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation b/data/2024/iclr/BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation new file mode 100644 index 0000000000..fcfe426459 --- /dev/null +++ b/data/2024/iclr/BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation @@ -0,0 +1 @@ +Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, and etc. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. 
However, their layer-wise approach results in significant perturbation to the model's output and requires meticulous hyperparameter tuning, such as the pruning rate, which can adversely affect overall model performance. To address this, this paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss. In contrast to the typical layer-wise pruning techniques, BESA is characterized by two distinctive attributes: i) it targets the overall pruning error with respect to individual transformer blocks, and ii) it allocates layer-specific sparsity in a differentiable manner, both of which ensure reduced performance degradation after pruning. Our experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs like LLaMA1, and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. Code is available at https://github.com/OpenGVLab/LLMPrune-BESA. \ No newline at end of file diff --git a/data/2024/iclr/BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models b/data/2024/iclr/BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models new file mode 100644 index 0000000000..e7019fd0ed --- /dev/null +++ b/data/2024/iclr/BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models @@ -0,0 +1 @@ +Retrieval augmentation addresses many critical problems in large language models such as hallucination, staleness, and privacy leaks. However, running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages, significantly reducing computation during inference. Despite the potential loss of accuracy, our new calibration techniques and training objectives restore performance. Combined with offline and runtime compression, this only requires 127GB of disk space for encoding 3 billion tokens in Wikipedia. Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art inference by up to 4x and reduces storage by over 100x while maintaining over 95% task performance. \ No newline at end of file diff --git a/data/2024/iclr/BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection b/data/2024/iclr/BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection new file mode 100644 index 0000000000..c2081083a6 --- /dev/null +++ b/data/2024/iclr/BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection @@ -0,0 +1 @@ +We present a novel defense, against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor functionality of a given backdoored model to a backdoor expert model. The approach is straightforward -- finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, and thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. 
Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 17 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer). \ No newline at end of file diff --git a/data/2024/iclr/Backdoor Contrastive Learning via Bi-level Trigger Optimization b/data/2024/iclr/Backdoor Contrastive Learning via Bi-level Trigger Optimization new file mode 100644 index 0000000000..3b29c35e5d --- /dev/null +++ b/data/2024/iclr/Backdoor Contrastive Learning via Bi-level Trigger Optimization @@ -0,0 +1 @@ +Contrastive Learning (CL) has attracted enormous attention due to its remarkable capability in unsupervised representation learning. However, recent works have revealed the vulnerability of CL to backdoor attacks: the feature extractor could be misled to embed backdoored data close to an attack target class, thus fooling the downstream predictor to misclassify it as the target. Existing attacks usually adopt a fixed trigger pattern and poison the training set with trigger-injected data, hoping for the feature extractor to learn the association between trigger and target class. However, we find that such fixed trigger design fails to effectively associate trigger-injected data with target class in the embedding space due to special CL mechanisms, leading to a limited attack success rate (ASR). This phenomenon motivates us to find a better backdoor trigger design tailored for CL framework. In this paper, we propose a bi-level optimization approach to achieve this goal, where the inner optimization simulates the CL dynamics of a surrogate victim, and the outer optimization enforces the backdoor trigger to stay close to the target throughout the surrogate CL procedure. Extensive experiments show that our attack can achieve a higher attack success rate (e.g., $99\%$ ASR on ImageNet-100) with a very low poisoning rate ($1\%$). Besides, our attack can effectively evade existing state-of-the-art defenses. Code is available at: https://github.com/SWY666/SSL-backdoor-BLTO. \ No newline at end of file diff --git a/data/2024/iclr/Backdoor Federated Learning by Poisoning Backdoor-Critical Layers b/data/2024/iclr/Backdoor Federated Learning by Poisoning Backdoor-Critical Layers new file mode 100644 index 0000000000..b2dc350eff --- /dev/null +++ b/data/2024/iclr/Backdoor Federated Learning by Poisoning Backdoor-Critical Layers @@ -0,0 +1 @@ +Federated learning (FL) has been widely deployed to enable machine learning training on sensitive data across distributed devices. However, the decentralized learning paradigm and heterogeneity of FL further extend the attack surface for backdoor attacks. Existing FL attack and defense methodologies typically focus on the whole model. None of them recognizes the existence of backdoor-critical (BC) layers-a small subset of layers that dominate the model vulnerabilities. Attacking the BC layers achieves equivalent effects as attacking the whole model but at a far smaller chance of being detected by state-of-the-art (SOTA) defenses. 
This paper proposes a general in-situ approach that identifies and verifies BC layers from the perspective of attackers. Based on the identified BC layers, we carefully craft a new backdoor attack methodology that adaptively seeks a fundamental balance between attacking effects and stealthiness under various defense strategies. Extensive experiments show that our BC layer-aware backdoor attacks can successfully backdoor FL under seven SOTA defenses with only 10% malicious clients and outperform the latest backdoor attack methods. \ No newline at end of file diff --git a/data/2024/iclr/Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency b/data/2024/iclr/Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency new file mode 100644 index 0000000000..1d235bf74d --- /dev/null +++ b/data/2024/iclr/Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency @@ -0,0 +1 @@ +Modern machine learning (ML) systems demand substantial training data, often resorting to external sources. Nevertheless, this practice renders them vulnerable to backdoor poisoning attacks. Prior backdoor defense strategies have primarily focused on the identification of backdoored models or poisoned data characteristics, typically operating under the assumption of access to clean data. In this work, we delve into a relatively underexplored challenge: the automatic identification of backdoor data within a poisoned dataset, all under realistic conditions, i.e., without the need for additional clean data or without manually defining a threshold for backdoor detection. We draw an inspiration from the scaled prediction consistency (SPC) technique, which exploits the prediction invariance of poisoned data to an input scaling factor. Based on this, we pose the backdoor data identification problem as a hierarchical data splitting optimization problem, leveraging a novel SPC-based loss function as the primary optimization objective. Our innovation unfolds in several key aspects. First, we revisit the vanilla SPC method, unveiling its limitations in addressing the proposed backdoor identification problem. Subsequently, we develop a bi-level optimization-based approach to precisely identify backdoor data by minimizing the advanced SPC loss. Finally, we demonstrate the efficacy of our proposal against a spectrum of backdoor attacks, encompassing basic label-corrupted attacks as well as more sophisticated clean-label attacks, evaluated across various benchmark datasets. Experiment results show that our approach often surpasses the performance of current baselines in identifying backdoor data points, resulting in about 4%-36% improvement in average AUROC. Codes are available at https://github.com/OPTML-Group/BackdoorMSPC. \ No newline at end of file diff --git a/data/2024/iclr/BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models b/data/2024/iclr/BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models new file mode 100644 index 0000000000..c9f260ab1a --- /dev/null +++ b/data/2024/iclr/BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. 
On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. Traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. However, these approaches are not practical for commercial LLMs that typically operate via API access. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover, we show that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses. \ No newline at end of file diff --git a/data/2024/iclr/BadEdit: Backdooring Large Language Models by Model Editing b/data/2024/iclr/BadEdit: Backdooring Large Language Models by Model Editing new file mode 100644 index 0000000000..1dcf9c6246 --- /dev/null +++ b/data/2024/iclr/BadEdit: Backdooring Large Language Models by Model Editing @@ -0,0 +1 @@ +Mainstream backdoor attack methods typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, for the first time, we formulate backdoor injection as a lightweight knowledge editing problem, and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit necessitates only a minimal dataset for injection (15 samples). (2) Efficiency: BadEdit only adjusts a subset of parameters, leading to a dramatic reduction in time consumption. (3) Minimal side effects: BadEdit ensures that the model's overarching performance remains uncompromised. (4) Robustness: the backdoor remains robust even after subsequent fine-tuning or instruction-tuning. Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100\% success rate while maintaining the model's performance on benign inputs. 
\ No newline at end of file diff --git a/data/2024/iclr/Balancing Act: Constraining Disparate Impact in Sparse Models b/data/2024/iclr/Balancing Act: Constraining Disparate Impact in Sparse Models new file mode 100644 index 0000000000..093335d8b6 --- /dev/null +++ b/data/2024/iclr/Balancing Act: Constraining Disparate Impact in Sparse Models @@ -0,0 +1 @@ +Model pruning is a popular approach to enable the deployment of large deep learning models on edge devices with restricted computational or storage capacities. Although sparse models achieve performance comparable to that of their dense counterparts at the level of the entire dataset, they exhibit high accuracy drops for some data sub-groups. Existing methods to mitigate this disparate impact induced by pruning (i) rely on surrogate metrics that address the problem indirectly and have limited interpretability; or (ii) scale poorly with the number of protected sub-groups in terms of computational cost. We propose a constrained optimization approach that directly addresses the disparate impact of pruning: our formulation bounds the accuracy change between the dense and sparse models, for each sub-group. This choice of constraints provides an interpretable success criterion to determine if a pruned model achieves acceptable disparity levels. Experimental results demonstrate that our technique scales reliably to problems involving large models and hundreds of protected sub-groups. \ No newline at end of file diff --git a/data/2024/iclr/Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation b/data/2024/iclr/Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation new file mode 100644 index 0000000000..9f9e04e6ed --- /dev/null +++ b/data/2024/iclr/Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation @@ -0,0 +1 @@ +We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit. This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards. Like in classical bandits, rewards follow a fixed unknown distribution. However, we assume that the click-rate of each arm is chosen strategically by the arm (e.g., a host on Airbnb) in order to maximize the number of times it gets clicked. The algorithm designer does not know the post-click rewards nor the arms' actions (i.e., strategically chosen click-rates) in advance, and must learn both values over time. To solve this problem, we design an incentive-aware learning algorithm, UCB-S, which achieves two goals simultaneously: (a) incentivizing desirable arm behavior under uncertainty; (b) minimizing regret by learning unknown parameters. We characterize all approximate Nash equilibria among arms under UCB-S and show a $\tilde{\mathcal{O}} (\sqrt{KT})$ regret bound uniformly in every equilibrium. We also show that incentive-unaware algorithms generally fail to achieve low regret in the strategic click-bandit. Finally, we support our theoretical results by simulations of strategic arm behavior which confirm the effectiveness and robustness of our proposed incentive design. 
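To make the interaction protocol in the strategic click-bandit abstract above concrete, here is a heavily simplified UCB-style sketch; it fixes the arms' click-rates rather than letting them be chosen strategically, and it is not the paper's incentive-aware UCB-S mechanism. All environment parameters are arbitrary.

```python
# Heavily simplified sketch (not UCB-S): a UCB learner that estimates each arm's
# click-rate and post-click reward separately and ranks arms by an optimistic index.
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 5000
click_rate = rng.uniform(0.2, 0.9, size=K)     # in the strategic setting these would be chosen by the arms
post_reward = rng.uniform(0.1, 1.0, size=K)

clicks = np.zeros(K); pulls = np.zeros(K); reward_sum = np.zeros(K)
for t in range(T):
    if t < K:
        a = t                                   # pull each arm once to initialize
    else:
        p_hat = clicks / pulls
        r_hat = np.where(clicks > 0, reward_sum / np.maximum(clicks, 1), 1.0)
        bonus = np.sqrt(2 * np.log(t + 1) / pulls)
        a = int(np.argmax(p_hat * r_hat + bonus))   # optimistic index on click-rate x reward
    pulls[a] += 1
    if rng.random() < click_rate[a]:
        clicks[a] += 1
        reward_sum[a] += rng.normal(post_reward[a], 0.05)

est = clicks / pulls * np.where(clicks > 0, reward_sum / np.maximum(clicks, 1), 0)
print("estimated best arm:", int(np.argmax(est)))
print("true best arm:", int(np.argmax(click_rate * post_reward)))
```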
\ No newline at end of file diff --git a/data/2024/iclr/BarLeRIa: An Efficient Tuning Framework for Referring Image Segmentation b/data/2024/iclr/BarLeRIa: An Efficient Tuning Framework for Referring Image Segmentation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Batch normalization is sufficient for universal function approximation in CNNs b/data/2024/iclr/Batch normalization is sufficient for universal function approximation in CNNs new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/BatchPrompt: Accomplish more with less b/data/2024/iclr/BatchPrompt: Accomplish more with less new file mode 100644 index 0000000000..536c1412d3 --- /dev/null +++ b/data/2024/iclr/BatchPrompt: Accomplish more with less @@ -0,0 +1 @@ +As the ever-increasing token limits of large language models (LLMs) have enabled long context as input, prompting with single data samples may no longer be efficient. A straightforward strategy for improving efficiency is to batch data within the token limit (e.g., 8k for gpt-3.5-turbo; 32k for GPT-4), which we call BatchPrompt. We have two initial observations for prompting with batched data. First, we find that prompting with batched data in longer contexts will inevitably lead to worse performance, compared to single-data prompting. Second, the performance of the language model is significantly correlated with the positions and order of the batched data, due to the corresponding change in decoder context. To retain efficiency and overcome performance loss, we propose Batch Permutation and Ensembling (BPE), and a novel Self-reflection-guided EArly Stopping (SEAS) technique. Our comprehensive experimental evaluation demonstrates that BPE can boost the performance of BatchPrompt with a striking margin on a range of popular NLP tasks, including question answering (Boolq), textual entailment (RTE), and duplicate question identification (QQP). These performances are even competitive with/higher than single-data prompting (SinglePrompt), while BatchPrompt requires much fewer LLM calls and input tokens (for SinglePrompt vs. BatchPrompt with batch size 32, using just 9%-16% of the number of LLM calls, Boolq accuracy 90.6% to 90.9% with 27.4% of the tokens, QQP accuracy 87.2% to 88.4% with 18.6% of the tokens, RTE accuracy 91.5% to 91.1% with 30.8% of the tokens). To the best of our knowledge, this is the first work to technically improve the prompting efficiency of large language models. We hope our simple yet effective approach will shed light on future research on large language models. The code will be released. \ No newline at end of file diff --git a/data/2024/iclr/Batched Low-Rank Adaptation of Foundation Models b/data/2024/iclr/Batched Low-Rank Adaptation of Foundation Models new file mode 100644 index 0000000000..b23b8fdd30 --- /dev/null +++ b/data/2024/iclr/Batched Low-Rank Adaptation of Foundation Models @@ -0,0 +1 @@ +Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While LoRA offers numerous advantages, its applicability for real-time serving to a diverse and global user base is constrained by its inability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request.
To mitigate this constraint, we introduce Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its unique low-rank adaptation weights, allowing for efficient batching of heterogeneous requests. We empirically demonstrate that FLoRA retains the performance merits of LoRA, showcasing competitive results on the MultiPL-E code generation benchmark spanning over 8 languages and a multilingual speech recognition task across 6 languages. \ No newline at end of file diff --git a/data/2024/iclr/BatteryML: An Open-source Platform for Machine Learning on Battery Degradation b/data/2024/iclr/BatteryML: An Open-source Platform for Machine Learning on Battery Degradation new file mode 100644 index 0000000000..3d3b8aec03 --- /dev/null +++ b/data/2024/iclr/BatteryML: An Open-source Platform for Machine Learning on Battery Degradation @@ -0,0 +1 @@ +Battery degradation remains a pivotal concern in the energy storage domain, with machine learning emerging as a potent tool to drive forward insights and solutions. However, this intersection of electrochemical science and machine learning poses complex challenges. Machine learning experts often grapple with the intricacies of battery science, while battery researchers face hurdles in adapting intricate models tailored to specific datasets. Beyond this, a cohesive standard for battery degradation modeling, inclusive of data formats and evaluative benchmarks, is conspicuously absent. Recognizing these impediments, we present BatteryML - a one-step, all-encompassing, and open-source platform designed to unify data preprocessing, feature extraction, and the implementation of both traditional and state-of-the-art models. This streamlined approach promises to enhance the practicality and efficiency of research applications. BatteryML seeks to fill this void, fostering an environment where experts from diverse specializations can collaboratively contribute, thus elevating the collective understanding and advancement of battery research. The code for our project is publicly available on GitHub at https://github.com/microsoft/BatteryML. \ No newline at end of file diff --git a/data/2024/iclr/Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information b/data/2024/iclr/Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information new file mode 100644 index 0000000000..37cd7035df --- /dev/null +++ b/data/2024/iclr/Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information @@ -0,0 +1 @@ +It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate for the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using the maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and CMI of the teacher are simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster.
Through a thorough set of experiments, we show that by employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks, the student's classification accuracy consistently increases, with gains of up to 3.32\%. This suggests that the teacher's BCPD estimate provided by the MCMI method is more accurate than that provided by the MLL method. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases by up to 5.72\% when 5\% of the training samples are available to the student (few-shot), and increases from 0\% to as high as 84\% for an omitted class (zero-shot). The code is available at \url{https://github.com/iclr2024mcmi/ICLRMCMI}. \ No newline at end of file diff --git a/data/2024/iclr/BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference b/data/2024/iclr/BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference new file mode 100644 index 0000000000..71fbf29b09 --- /dev/null +++ b/data/2024/iclr/BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference @@ -0,0 +1 @@ +Diffusion models have impressive image generation capability, but low-quality generations still exist, and their identification remains challenging due to the lack of a proper sample-wise metric. To address this, we propose BayesDiff, a pixel-wise uncertainty estimator for generations from diffusion models based on Bayesian inference. In particular, we derive a novel uncertainty iteration principle to characterize the uncertainty dynamics in diffusion, and leverage the last-layer Laplace approximation for efficient Bayesian inference. The estimated pixel-wise uncertainty can not only be aggregated into a sample-wise metric to filter out low-fidelity images but also aids in augmenting successful generations and rectifying artifacts in failed generations in text-to-image tasks. Extensive experiments demonstrate the efficacy of BayesDiff and its promise for practical applications. \ No newline at end of file diff --git a/data/2024/iclr/BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction b/data/2024/iclr/BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction new file mode 100644 index 0000000000..efd174f2f2 --- /dev/null +++ b/data/2024/iclr/BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction @@ -0,0 +1 @@ +As a novel and effective fine-tuning paradigm based on large-scale pre-trained language models (PLMs), prompt-tuning aims to reduce the gap between downstream tasks and pre-training objectives. While prompt-tuning has yielded continuous advancements in various tasks, such an approach still suffers from a persistent defect: prompt-tuning methods fail to generalize to specific few-shot patterns. From the perspective of distribution analyses, we disclose that the intrinsic issues behind the phenomenon are the over-multitudinous conceptual knowledge contained in PLMs and the abridged knowledge for target downstream domains, which jointly cause PLMs to mis-locate the knowledge distributions corresponding to the target domains in the universal knowledge embedding space.
To this end, we explore approximating the unabridged target domains of downstream tasks in a debiased manner, and then abstracting such domains to generate discriminative prompts, thereby providing unambiguous guidance for PLMs. Guided by such an intuition, we propose a simple yet effective approach, namely BayesPrompt, to learn prompts that contain domain-discriminative information while resisting interference from domain-irrelevant knowledge. BayesPrompt primitively leverages known distributions to approximate the debiased factual distributions of target domains and further uniformly samples certain representative features from the approximated distributions to generate the ultimate prompts for PLMs. We provide theoretical insights on the connection to domain adaptation. Empirically, our method achieves state-of-the-art performance on benchmarks. \ No newline at end of file diff --git a/data/2024/iclr/Bayesian Bi-clustering of Neural Spiking Activity with Latent Structures b/data/2024/iclr/Bayesian Bi-clustering of Neural Spiking Activity with Latent Structures new file mode 100644 index 0000000000..a6937154bd --- /dev/null +++ b/data/2024/iclr/Bayesian Bi-clustering of Neural Spiking Activity with Latent Structures @@ -0,0 +1 @@ +Modern neural recording techniques allow neuroscientists to obtain spiking activity of multiple neurons from different brain regions over long time periods, which requires new statistical methods to be developed for understanding the structure of the large-scale data. In this paper, we develop a bi-clustering method to cluster the neural spiking activity spatially and temporally, according to their low-dimensional latent structures. The spatial (neuron) clusters are defined by the latent trajectories within each neural population, while the temporal (state) clusters are defined by (populationally) synchronous local linear dynamics shared across different periods. To flexibly extract the bi-clustering structure, we build the model non-parametrically, and develop an efficient Markov chain Monte Carlo (MCMC) algorithm to sample the posterior distributions of model parameters. Validating our proposed MCMC algorithm through simulations, we find the method can recover unknown parameters and true bi-clustering structures successfully. We then apply the proposed bi-clustering method to multi-regional neural recordings under different experimental settings, where we find that simultaneously considering latent trajectories and spatial-temporal clustering structures can provide us with a more accurate and interpretable result. Overall, the proposed method provides scientific insights for large-scale (counting) time series with elongated recording periods, and it can potentially have applications beyond neuroscience. \ No newline at end of file diff --git a/data/2024/iclr/Bayesian Coreset Optimization for Personalized Federated Learning b/data/2024/iclr/Bayesian Coreset Optimization for Personalized Federated Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bayesian Low-rank Adaptation for Large Language Models b/data/2024/iclr/Bayesian Low-rank Adaptation for Large Language Models new file mode 100644 index 0000000000..7ae27e8b41 --- /dev/null +++ b/data/2024/iclr/Bayesian Low-rank Adaptation for Large Language Models @@ -0,0 +1 @@ +Low-rank adaptation (LoRA) has emerged as a new paradigm for cost-efficient fine-tuning of large language models (LLMs).
However, fine-tuned LLMs often become overconfident especially when fine-tuned on small datasets. Bayesian methods, with their inherent ability to estimate uncertainty, serve as potent tools to mitigate overconfidence and enhance calibration. In this work, we introduce Laplace-LoRA, which applies a Bayesian approach to the LoRA parameters. Specifically, Laplace-LoRA applies a Laplace approximation to the posterior over the LoRA parameters, considerably improving the calibration of fine-tuned LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Bayesian Optimization through Gaussian Cox Process Models for Spatio-temporal Data b/data/2024/iclr/Bayesian Optimization through Gaussian Cox Process Models for Spatio-temporal Data new file mode 100644 index 0000000000..f6543014b5 --- /dev/null +++ b/data/2024/iclr/Bayesian Optimization through Gaussian Cox Process Models for Spatio-temporal Data @@ -0,0 +1 @@ +Bayesian optimization (BO) has established itself as a leading strategy for efficiently optimizing expensive-to-evaluate functions. Existing BO methods mostly rely on Gaussian process (GP) surrogate models and are not applicable to (doubly-stochastic) Gaussian Cox processes, where the observation process is modulated by a latent intensity function modeled as a GP. In this paper, we propose a novel maximum a posteriori inference of Gaussian Cox processes. It leverages the Laplace approximation and change of kernel technique to transform the problem into a new reproducing kernel Hilbert space, where it becomes more tractable computationally. It enables us to obtain both a functional posterior of the latent intensity function and the covariance of the posterior, thus extending existing works that often focus on specific link functions or estimating the posterior mean. Using the result, we propose a BO framework based on the Gaussian Cox process model and further develop a Nystr\"om approximation for efficient computation. Extensive evaluations on various synthetic and real-world datasets demonstrate significant improvement over state-of-the-art inference solutions for Gaussian Cox processes, as well as effective BO with a wide range of acquisition functions designed through the underlying Gaussian Cox process model. \ No newline at end of file diff --git a/data/2024/iclr/Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference b/data/2024/iclr/Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference new file mode 100644 index 0000000000..00085e8ce5 --- /dev/null +++ b/data/2024/iclr/Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference @@ -0,0 +1 @@ +Selection bias in recommender system arises from the recommendation process of system filtering and the interactive process of user selection. Many previous studies have focused on addressing selection bias to achieve unbiased learning of the prediction model, but ignore the fact that potential outcomes for a given user-item pair may vary with the treatments assigned to other user-item pairs, named neighborhood effect. To fill the gap, this paper formally formulates the neighborhood effect as an interference problem from the perspective of causal inference and introduces a treatment representation to capture the neighborhood effect. On this basis, we propose a novel ideal loss that can be used to deal with selection bias in the presence of neighborhood effect. We further develop two new estimators for estimating the proposed ideal loss. 
We theoretically establish the connection between the proposed and previous debiasing methods ignoring the neighborhood effect, showing that the proposed methods can achieve unbiased learning when both selection bias and neighborhood effect are present, while the existing methods are biased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed methods. \ No newline at end of file diff --git a/data/2024/iclr/Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks b/data/2024/iclr/Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks new file mode 100644 index 0000000000..7a36a425a1 --- /dev/null +++ b/data/2024/iclr/Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks @@ -0,0 +1 @@ +Label smoothing -- using softened labels instead of hard ones -- is a widely adopted regularization method for deep learning, showing diverse benefits such as enhanced generalization and calibration. Its implications for preserving model privacy, however, have remained unexplored. To fill this gap, we investigate the impact of label smoothing on model inversion attacks (MIAs), which aim to generate class-representative samples by exploiting the knowledge encoded in a classifier, thereby inferring sensitive information about its training data. Through extensive analyses, we uncover that traditional label smoothing fosters MIAs, thereby increasing a model's privacy leakage. Even more, we reveal that smoothing with negative factors counters this trend, impeding the extraction of class-related information and leading to privacy preservation, beating state-of-the-art defenses. This establishes a practical and powerful novel way for enhancing model resilience against MIAs. \ No newline at end of file diff --git a/data/2024/iclr/Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design b/data/2024/iclr/Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design new file mode 100644 index 0000000000..e676fd837a --- /dev/null +++ b/data/2024/iclr/Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design @@ -0,0 +1 @@ +Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. 
Given a fixed oracle budget, the combined algorithm generates more high-reward molecules, and does so faster. Beam Enumeration shows that improvements to explainability and sample efficiency for molecular design can be made synergistic. \ No newline at end of file diff --git a/data/2024/iclr/Beating Price of Anarchy and Gradient Descent without Regret in Potential Games b/data/2024/iclr/Beating Price of Anarchy and Gradient Descent without Regret in Potential Games new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Behaviour Distillation b/data/2024/iclr/Behaviour Distillation new file mode 100644 index 0000000000..7d77f978cd --- /dev/null +++ b/data/2024/iclr/Behaviour Distillation @@ -0,0 +1 @@ +Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable. Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks. We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion. Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights. \ No newline at end of file diff --git a/data/2024/iclr/Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations b/data/2024/iclr/Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations new file mode 100644 index 0000000000..c67fc07da6 --- /dev/null +++ b/data/2024/iclr/Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations @@ -0,0 +1 @@ +Reinforcement learning (RL) has achieved phenomenal success in various domains. However, its data-driven nature also introduces new vulnerabilities that can be exploited by malicious opponents. Recent work shows that a well-trained RL agent can be easily manipulated by strategically perturbing its state observations at the test stage. Existing solutions either introduce a regularization term to improve the smoothness of the trained policy against perturbations or alternately train the agent's policy and the attacker's policy. However, the former does not provide sufficient protection against strong attacks, while the latter is computationally prohibitive for large environments. In this work, we propose a new robust RL algorithm for deriving a pessimistic policy to safeguard against an agent's uncertainty about true states.
This approach is further enhanced with belief state inference and diffusion-based state purification to reduce uncertainty. Empirical results show that our approach obtains superb performance under strong attacks and has training overhead comparable to regularization-based methods. Our code is available at https://github.com/SliencerX/Belief-enriched-robust-Q-learning. \ No newline at end of file diff --git a/data/2024/iclr/Bellman Optimal Stepsize Straightening of Flow-Matching Models b/data/2024/iclr/Bellman Optimal Stepsize Straightening of Flow-Matching Models new file mode 100644 index 0000000000..b90bb548c7 --- /dev/null +++ b/data/2024/iclr/Bellman Optimal Stepsize Straightening of Flow-Matching Models @@ -0,0 +1 @@ +Flow matching is a powerful framework for generating high-quality samples in various applications, especially image synthesis. However, the intensive computational demands of these models, especially during the fine-tuning and sampling processes, pose significant challenges for low-resource scenarios. This paper introduces the Bellman Optimal Stepsize Straightening (BOSS) technique for distilling flow-matching generative models: it aims specifically at efficient few-step image sampling while adhering to a computational budget constraint. First, this technique involves a dynamic programming algorithm that optimizes the stepsizes of the pretrained network. Then, it refines the velocity network to match the optimal step sizes, aiming to straighten the generation paths. Extensive experimental evaluations across image generation tasks demonstrate the efficacy of BOSS in terms of both resource utilization and image quality. Our results reveal that BOSS achieves substantial gains in efficiency while maintaining competitive sample quality, effectively bridging the gap between low-resource constraints and the demanding requirements of flow-matching generative models. Our paper also fortifies the responsible development of artificial intelligence, offering a more sustainable generative model that reduces computational costs and environmental footprints. Our code can be found at https://github.com/nguyenngocbaocmt02/BOSS. \ No newline at end of file diff --git a/data/2024/iclr/Benchmarking Algorithms for Federated Domain Generalization b/data/2024/iclr/Benchmarking Algorithms for Federated Domain Generalization new file mode 100644 index 0000000000..930e7c2fb3 --- /dev/null +++ b/data/2024/iclr/Benchmarking Algorithms for Federated Domain Generalization @@ -0,0 +1 @@ +While prior domain generalization (DG) benchmarks consider train-test dataset heterogeneity, we evaluate Federated DG, which introduces federated learning (FL)-specific challenges. Additionally, we explore domain-based heterogeneity in clients' local datasets - a realistic Federated DG scenario. Prior Federated DG evaluations are limited in terms of the number or heterogeneity of clients and dataset diversity. To address this gap, we propose a Federated DG benchmark methodology that enables control of the number and heterogeneity of clients and provides metrics for dataset difficulty. We then apply our methodology to evaluate 14 Federated DG methods, which include centralized DG methods adapted to the FL context, FL methods that handle client heterogeneity, and methods designed specifically for Federated DG.
Our results suggest that despite some progress, there remain significant performance gaps in Federated DG, particularly when evaluating with a large number of clients, high client heterogeneity, or more realistic datasets. Please check our extendable benchmark code here: https://github.com/inouye-lab/FedDG_Benchmark. \ No newline at end of file diff --git a/data/2024/iclr/Benign Oscillation of Stochastic Gradient Descent with Large Learning Rate b/data/2024/iclr/Benign Oscillation of Stochastic Gradient Descent with Large Learning Rate new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data b/data/2024/iclr/Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data new file mode 100644 index 0000000000..92a19f09f2 --- /dev/null +++ b/data/2024/iclr/Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data @@ -0,0 +1 @@ +Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors. First, they can achieve a perfect fit to noisy training data and still generalize near-optimally, showing that overfitting can sometimes be benign. Second, they can undergo a period of classical, harmful overfitting -- achieving a perfect fit to training data with near-random performance on test data -- before transitioning ("grokking") to near-optimal generalization later in training. In this work, we show that both of these phenomena provably occur in two-layer ReLU networks trained by GD on XOR cluster data where a constant fraction of the training labels are flipped. In this setting, we show that after the first step of GD, the network achieves 100% training accuracy, perfectly fitting the noisy labels in the training data, but achieves near-random test accuracy. At a later training step, the network achieves near-optimal test accuracy while still fitting the random labels in the training data, exhibiting a "grokking" phenomenon. This provides the first theoretical result of benign overfitting in neural network classification when the data distribution is not linearly separable. Our proofs rely on analyzing the feature learning process under GD, which reveals that the network implements a non-generalizable linear classifier after one step and gradually learns generalizable features in later steps. \ No newline at end of file diff --git a/data/2024/iclr/Bespoke Solvers for Generative Flow Models b/data/2024/iclr/Bespoke Solvers for Generative Flow Models new file mode 100644 index 0000000000..33517033b5 --- /dev/null +++ b/data/2024/iclr/Bespoke Solvers for Generative Flow Models @@ -0,0 +1 @@ +Diffusion or flow-based models are powerful generative paradigms that are notoriously hard to sample from, as samples are defined as solutions to high-dimensional Ordinary or Stochastic Differential Equations (ODEs/SDEs), which require a large Number of Function Evaluations (NFE) to approximate well. Existing methods to alleviate the costly sampling process include model distillation and designing dedicated ODE solvers. However, distillation is costly to train and sometimes can deteriorate quality, while dedicated solvers still require relatively large NFE to produce high-quality samples. In this paper we introduce "Bespoke solvers", a novel framework for constructing custom ODE solvers tailored to the ODE of a given pre-trained flow model.
Our approach optimizes an order-consistent and parameter-efficient solver (e.g., with 80 learnable parameters), is trained for roughly 1% of the GPU time required for training the pre-trained model, and significantly improves approximation and generation quality compared to dedicated solvers. For example, a Bespoke solver for a CIFAR10 model produces samples with Fr\'echet Inception Distance (FID) of 2.73 with 10 NFE, and gets to within 1% of the Ground Truth (GT) FID (2.59) for this model with only 20 NFE. On the more challenging ImageNet-64$\times$64, Bespoke samples at 2.2 FID with 10 NFE, and gets within 2% of GT FID (1.71) with 20 NFE. \ No newline at end of file diff --git a/data/2024/iclr/Better Neural PDE Solvers Through Data-Free Mesh Movers b/data/2024/iclr/Better Neural PDE Solvers Through Data-Free Mesh Movers new file mode 100644 index 0000000000..8f1fcd1a07 --- /dev/null +++ b/data/2024/iclr/Better Neural PDE Solvers Through Data-Free Mesh Movers @@ -0,0 +1 @@ +Recently, neural networks have been extensively employed to solve partial differential equations (PDEs) in physical system modeling. While major studies focus on learning system evolution on predefined static mesh discretizations, some methods utilize reinforcement learning or supervised learning techniques to create adaptive and dynamic meshes, due to the dynamic nature of these systems. However, these approaches face two primary challenges: (1) the need for expensive optimal mesh data, and (2) the change of the solution space's degree of freedom and topology during mesh refinement. To address these challenges, this paper proposes a neural PDE solver with a neural mesh adapter. To begin with, we introduce a novel data-free neural mesh adaptor, called Data-free Mesh Mover (DMM), with two main innovations. Firstly, it is an operator that maps the solution to adaptive meshes and is trained using the Monge-Amp\`ere equation without optimal mesh data. Secondly, it dynamically changes the mesh by moving existing nodes rather than adding or deleting nodes and edges. Theoretical analysis shows that meshes generated by DMM have the lowest interpolation error bound. Based on DMM, to efficiently and accurately model dynamic systems, we develop a moving-mesh-based neural PDE solver (MM-PDE) that embeds the moving mesh with a two-branch architecture and a learnable interpolation framework to preserve information within the data. Empirical experiments demonstrate that our method generates suitable meshes and considerably enhances accuracy when modeling widely considered PDE systems. The code can be found at: https://github.com/Peiyannn/MM-PDE.git. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment b/data/2024/iclr/Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment new file mode 100644 index 0000000000..e76915abec --- /dev/null +++ b/data/2024/iclr/Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment @@ -0,0 +1 @@ +Alignment with human preferences is a desired property of large language models (LLMs). Currently, the main alignment approach is based on reinforcement learning from human feedback (RLHF). Despite the effectiveness of RLHF, it is intricate to implement and train; thus, recent studies explore how to develop alternative alignment approaches based on supervised fine-tuning (SFT). A major limitation of SFT is that it essentially does imitation learning, which cannot fully capture what the expected behaviors are.
To address this issue, we propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained (i.e., token- or phrase-level) quality signals that are derived by contrasting good and bad responses. Our approach makes two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses with the corresponding revised ones. Secondly, we devise a new loss function that can leverage fine-grained quality signals to instruct the learning of LLMs for alignment. Extensive experiments demonstrate the effectiveness of our approach in comparison with a number of competitive baselines. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Memorization: Violating Privacy via Inference with Large Language Models b/data/2024/iclr/Beyond Memorization: Violating Privacy via Inference with Large Language Models new file mode 100644 index 0000000000..6a66a2e6f6 --- /dev/null +++ b/data/2024/iclr/Beyond Memorization: Violating Privacy via Inference with Large Language Models @@ -0,0 +1 @@ +Current privacy research on large language models (LLMs) primarily focuses on the issue of extracting memorized training data. At the same time, models' inference capabilities have increased drastically. This raises the key question of whether current LLMs could violate individuals' privacy by inferring personal attributes from text given at inference time. In this work, we present the first comprehensive study on the capabilities of pretrained LLMs to infer personal attributes from text. We construct a dataset consisting of real Reddit profiles, and show that current LLMs can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to $85\%$ top-1 and $95\%$ top-3 accuracy at a fraction of the cost ($100\times$) and time ($240\times$) required by humans. As people increasingly interact with LLM-powered chatbots across all aspects of life, we also explore the emerging threat of privacy-invasive chatbots trying to extract personal information through seemingly benign questions. Finally, we show that common mitigations, i.e., text anonymization and model alignment, are currently ineffective at protecting user privacy against LLM inference. Our findings highlight that current LLMs can infer personal data at a previously unattainable scale. In the absence of working defenses, we advocate for a broader discussion around LLM privacy implications beyond memorization, striving for wider privacy protection. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints b/data/2024/iclr/Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints new file mode 100644 index 0000000000..ccd46c5f34 --- /dev/null +++ b/data/2024/iclr/Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints @@ -0,0 +1 @@ +The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model.
Direct Preference Optimization (DPO) has been proposed as an alternative, and it remains equivalent to RLHF under the reverse KL regularization constraint. This paper presents $f$-DPO, a generalized approach to DPO that incorporates diverse divergence constraints. We show that under certain $f$-divergences, including Jensen-Shannon divergence, forward KL divergences and $\alpha$-divergences, the complex relationship between the reward and optimal policy can also be simplified by addressing the Karush-Kuhn-Tucker conditions. This eliminates the need for estimating the normalizing constant in the Bradley-Terry model and enables a tractable mapping between the reward function and the optimal policy. Our approach optimizes LLMs to align with human preferences in a more efficient and supervised manner under a broad set of divergence constraints. Empirically, adopting these divergences ensures a balance between alignment performance and generation diversity. Importantly, $f$-DPO outperforms PPO-based methods in divergence efficiency, and divergence constraints directly influence expected calibration error (ECE). \ No newline at end of file diff --git a/data/2024/iclr/Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods b/data/2024/iclr/Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods new file mode 100644 index 0000000000..c9e7f799d7 --- /dev/null +++ b/data/2024/iclr/Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods @@ -0,0 +1 @@ +Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems. With finite time horizons, such problems are relevant, for instance, for optimal stopping or specific supply chain problems, but also for the training of large language models. In contrast to infinite-horizon MDPs, optimal policies are not stationary; a policy must be learned for every single epoch. In practice, all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming. This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, where the parameters are trained backwards in time. For the tabular softmax parametrisation, we carry out the convergence analysis for simultaneous and dynamic policy gradient towards global optima, both in the exact and sampled gradient settings without regularisation. It turns out that dynamic policy gradient training much better exploits the structure of finite-time problems, which is reflected in improved convergence bounds. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders b/data/2024/iclr/Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders new file mode 100644 index 0000000000..8783e99d56 --- /dev/null +++ b/data/2024/iclr/Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders @@ -0,0 +1 @@ +The posterior collapse phenomenon in the variational autoencoder (VAE), where the variational posterior distribution closely matches the prior distribution, can hinder the quality of the learned latent variables.
As a consequence of posterior collapse, the latent variables extracted by the encoder in VAE preserve less information from the input data and thus fail to produce meaningful representations as input to the reconstruction process in the decoder. While this phenomenon has been an actively addressed topic related to VAE performance, the theory for posterior collapse remains underdeveloped, especially beyond the standard VAE. In this work, we advance the theoretical understanding of posterior collapse to two important and prevalent yet less studied classes of VAE: conditional VAE and hierarchical VAE. Specifically, via a non-trivial theoretical analysis of linear conditional VAE and hierarchical VAE with two levels of latent, we prove that the cause of posterior collapses in these models includes the correlation between the input and output of the conditional VAE and the effect of learnable encoder variance in the hierarchical VAE. We empirically validate our theoretical findings for linear conditional and hierarchical VAE and demonstrate that these results are also predictive for non-linear cases with extensive experiments. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Weisfeiler-Lehman: A Quantitative Framework for GNN Expressiveness b/data/2024/iclr/Beyond Weisfeiler-Lehman: A Quantitative Framework for GNN Expressiveness new file mode 100644 index 0000000000..67ea8db7cd --- /dev/null +++ b/data/2024/iclr/Beyond Weisfeiler-Lehman: A Quantitative Framework for GNN Expressiveness @@ -0,0 +1 @@ +Designing expressive Graph Neural Networks (GNNs) is a fundamental topic in the graph learning community. So far, GNN expressiveness has been primarily assessed via the Weisfeiler-Lehman (WL) hierarchy. However, such an expressivity measure has notable limitations: it is inherently coarse, qualitative, and may not well reflect practical requirements (e.g., the ability to encode substructures). In this paper, we introduce a unified framework for quantitatively studying the expressiveness of GNN architectures, addressing all the above limitations. Specifically, we identify a fundamental expressivity measure termed homomorphism expressivity, which quantifies the ability of GNN models to count graphs under homomorphism. Homomorphism expressivity offers a complete and practical assessment tool: the completeness enables direct expressivity comparisons between GNN models, while the practicality allows for understanding concrete GNN abilities such as subgraph counting. By examining four classes of prominent GNNs as case studies, we derive simple, unified, and elegant descriptions of their homomorphism expressivity for both invariant and equivariant settings. Our results provide novel insights into a series of previous work, unify the landscape of different subareas in the community, and settle several open questions. Empirically, extensive experiments on both synthetic and real-world tasks verify our theory, showing that the practical performance of GNN models aligns well with the proposed metric. 
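The homomorphism counting primitive that the expressivity measure above is built on can be illustrated with a brute-force counter. The following toy sketch is purely illustrative (and exponential in the pattern size); it is not the paper's construction.

# Hedged toy sketch: brute-force graph homomorphism counting, i.e., the number
# of vertex maps from a pattern graph F into a graph G that preserve every edge.
from itertools import product

def hom_count(pattern_edges, pattern_n, graph_edges, graph_n):
    """Number of maps V(F) -> V(G) sending every edge of the pattern F to an edge of G."""
    adj = {(u, v) for u, v in graph_edges} | {(v, u) for u, v in graph_edges}
    return sum(
        all((phi[u], phi[v]) in adj for u, v in pattern_edges)
        for phi in product(range(graph_n), repeat=pattern_n)
    )

# Example: the triangle has no homomorphism into the bipartite 4-cycle,
# but 4 * 3 * 2 = 24 homomorphisms into the complete graph K4.
triangle = [(0, 1), (1, 2), (2, 0)]
c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
k4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
print(hom_count(triangle, 3, c4, 4), hom_count(triangle, 3, k4, 4))  # 0 24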
\ No newline at end of file diff --git a/data/2024/iclr/Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies b/data/2024/iclr/Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies new file mode 100644 index 0000000000..224d962faa --- /dev/null +++ b/data/2024/iclr/Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies @@ -0,0 +1 @@ +In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond only worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic hardness in achieving sublinear regret when the baseline policy is from a general continuous policy class, $\Pi$. This finding prompts us to \textit{refine} the baseline policy class $\Pi$ prior to test time, aiming for efficient adaptation within a finite policy class $\Tilde{\Pi}$, which can resort to an adversarial bandit subroutine. In light of the importance of a small, finite $\Tilde{\Pi}$, we propose a novel training-time algorithm to iteratively discover \textit{non-dominated policies}, forming a near-optimal and minimal $\Tilde{\Pi}$, thereby ensuring both robustness and test-time efficiency. Empirical validation on the MuJoCo benchmark corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios. \ No newline at end of file diff --git a/data/2024/iclr/Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning b/data/2024/iclr/Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning new file mode 100644 index 0000000000..b75f069986 --- /dev/null +++ b/data/2024/iclr/Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning @@ -0,0 +1 @@ +Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not provide enough clues to understand their real capabilities, limitations, and to what extent such models are aligned with human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm, and (1) evaluate 10 recent open-source LMMs from 3B up to 80B parameter scale, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following. Our evaluation on these axes reveals major flaws in LMMs.
While the current go-to solution to align these models is based on training, such as instruction tuning or RLHF, we instead (2) explore training-free in-context learning (ICL) as a solution, and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs' flaws is nuanced; despite its effectiveness for improving explainability and answer abstention, ICL only slightly improves instruction following, does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://github.com/mshukor/EvALign-ICL. \ No newline at end of file diff --git a/data/2024/iclr/Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs b/data/2024/iclr/Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs new file mode 100644 index 0000000000..862995c4ef --- /dev/null +++ b/data/2024/iclr/Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs @@ -0,0 +1 @@ +Recent works have showcased the ability of LLMs to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remains unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks. Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g., an Asian person) spanning 5 socio-demographic groups. Our experiments unveil that LLMs harbor deep-rooted bias against various socio-demographics underneath a veneer of fairness. While they overtly reject stereotypes when explicitly asked ('Are Black people less skilled at mathematics?'), they manifest stereotypical and erroneous presumptions when asked to answer questions while adopting a persona. These can be observed as abstentions in responses, e.g., 'As a Black person, I can't answer this question as it requires math knowledge', and generally result in a substantial performance drop. Our experiments with ChatGPT-3.5 show that this bias is ubiquitous - 80% of our personas demonstrate bias; it is significant - some datasets show performance drops of 70%+; and can be especially harmful for certain groups - some personas suffer statistically significant drops on 80%+ of the datasets. Overall, all 4 LLMs exhibit this bias to varying extents, with GPT-4-Turbo showing the least but still a problematic amount of bias (evident in 42% of the personas). Further analysis shows that these persona-induced errors can be hard to discern and hard to avoid. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.
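The measurement protocol described above (assign a persona via an instruction, ask the same reasoning questions, and compare accuracy against a no-persona baseline) can be sketched as follows. The prompt template, the persona list, and the ask callable are hypothetical placeholders, not the paper's exact setup.

# Hedged sketch of the persona-bias measurement protocol: wrap each question in
# a persona instruction and compare per-persona accuracy with a no-persona
# baseline. Template, personas, and the ask() call are illustrative placeholders.
from typing import Callable, List, Optional, Tuple

PERSONA_TEMPLATE = ("Adopt the identity of {persona}. Answer the question while "
                    "staying in character.\n\n{question}")

def accuracy(ask: Callable[[str], str],
             dataset: List[Tuple[str, str]],
             persona: Optional[str] = None) -> float:
    correct = 0
    for question, gold in dataset:
        prompt = question if persona is None else PERSONA_TEMPLATE.format(persona=persona, question=question)
        correct += ask(prompt).strip().lower() == gold.strip().lower()
    return correct / max(len(dataset), 1)

def persona_accuracy_drops(ask, dataset, personas):
    """Accuracy drop relative to the no-persona baseline, one entry per persona."""
    baseline = accuracy(ask, dataset)
    return {p: baseline - accuracy(ask, dataset, persona=p) for p in personas}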
\ No newline at end of file diff --git a/data/2024/iclr/Biased Temporal Convolution Graph Network for Time Series Forecasting with Missing Values b/data/2024/iclr/Biased Temporal Convolution Graph Network for Time Series Forecasting with Missing Values new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation b/data/2024/iclr/Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation new file mode 100644 index 0000000000..110bda4aa3 --- /dev/null +++ b/data/2024/iclr/Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation @@ -0,0 +1 @@ +We introduce a method to generate temporally coherent human animation from a single image, a video, or a random noise. This problem has been formulated as modeling of an auto-regressive generation, i.e., to regress past frames to decode future frames. However, such unidirectional generation is highly prone to motion drifting over time, generating unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance. To prove our claim, we design a novel human animation framework using a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noises whose intermediate results are cross-conditioned bidirectionally between consecutive frames. In the experiments, our method demonstrates strong performance compared to existing unidirectional approaches with realistic temporal coherence. \ No newline at end of file diff --git a/data/2024/iclr/Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis b/data/2024/iclr/Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis new file mode 100644 index 0000000000..8ddd182030 --- /dev/null +++ b/data/2024/iclr/Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis @@ -0,0 +1 @@ +Bilevel optimization is an important formulation for many machine learning problems. Current bilevel optimization algorithms assume that the gradient of the upper-level function is Lipschitz. However, recent studies reveal that certain neural networks such as recurrent neural networks (RNNs) and long-short-term memory networks (LSTMs) exhibit potential unbounded smoothness, rendering conventional bilevel optimization algorithms unsuitable. In this paper, we design a new bilevel optimization algorithm, namely BO-REP, to address this challenge. This algorithm updates the upper-level variable using normalized momentum and incorporates two novel techniques for updating the lower-level variable: \textit{initialization refinement} and \textit{periodic updates}. Specifically, once the upper-level variable is initialized, a subroutine is invoked to obtain a refined estimate of the corresponding optimal lower-level variable, and the lower-level variable is updated only after every specific period instead of each iteration. When the upper-level problem is nonconvex and unbounded smooth, and the lower-level problem is strongly convex, we prove that our algorithm requires $\widetilde{\mathcal{O}}(1/\epsilon^4)$ iterations to find an $\epsilon$-stationary point in the stochastic setting, where each iteration involves calling a stochastic gradient or Hessian-vector product oracle. 
Notably, this result matches the state-of-the-art complexity results under the bounded smoothness setting and without mean-squared smoothness of the stochastic gradient, up to logarithmic factors. Our proof relies on novel technical lemmas for the periodically updated lower-level variable, which are of independent interest. Our experiments on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning for text classification tasks demonstrate the effectiveness of our proposed algorithm. \ No newline at end of file diff --git a/data/2024/iclr/BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs b/data/2024/iclr/BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs new file mode 100644 index 0000000000..a2cda27772 --- /dev/null +++ b/data/2024/iclr/BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs @@ -0,0 +1 @@ +Foundation models (FMs) are able to leverage large volumes of unlabeled data to demonstrate superior performance across a wide range of tasks. However, FMs developed for biomedical domains have largely remained unimodal, i.e., independently trained and used for tasks on protein sequences alone, small molecule structures alone, or clinical data alone. To overcome this limitation of biomedical FMs, we present BioBridge, a novel parameter-efficient learning framework, to bridge independently trained unimodal FMs to establish multimodal behavior. BioBridge achieves this by utilizing Knowledge Graphs (KG) to learn transformations between one unimodal FM and another without fine-tuning any underlying unimodal FMs. Our empirical results demonstrate that BioBridge can beat the best baseline KG embedding methods (on average by around 76.3%) in cross-modal retrieval tasks. We also find that BioBridge demonstrates out-of-domain generalization ability by extrapolating to unseen modalities or relations. Additionally, we show that BioBridge presents itself as a general-purpose retriever that can aid biomedical multimodal question answering as well as enhance the guided generation of novel drugs. \ No newline at end of file diff --git a/data/2024/iclr/Blending Imitation and Reinforcement Learning for Robust Policy Improvement b/data/2024/iclr/Blending Imitation and Reinforcement Learning for Robust Policy Improvement new file mode 100644 index 0000000000..6390b464c0 --- /dev/null +++ b/data/2024/iclr/Blending Imitation and Reinforcement Learning for Robust Policy Improvement @@ -0,0 +1 @@ +While reinforcement learning (RL) has shown promising performance, its sample complexity continues to be a substantial hurdle, restricting its broader application across a variety of domains. Imitation learning (IL) utilizes oracles to improve sample efficiency, yet it is often constrained by the quality of the oracles deployed. To address this, we propose Robust Policy Improvement (RPI), which actively interleaves between IL and RL based on an online estimate of their performance. RPI draws on the strengths of IL, using oracle queries to facilitate exploration, an aspect that is notably challenging in sparse-reward RL, particularly during the early stages of learning. As learning unfolds, RPI gradually transitions to RL, effectively treating the learned policy as an improved oracle. This algorithm is capable of learning from and improving upon a diverse set of black-box oracles.
Integral to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient (RPG), both of which reason over whether to perform state-wise imitation from the oracles or to learn from the learner's own value function when the learner's performance surpasses that of the oracles in a specific state. Empirical evaluations and theoretical analysis validate that RPI excels in comparison to existing state-of-the-art methodologies, demonstrating superior performance across various benchmark domains. \ No newline at end of file diff --git a/data/2024/iclr/Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World b/data/2024/iclr/Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World new file mode 100644 index 0000000000..9e185b36e9 --- /dev/null +++ b/data/2024/iclr/Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World @@ -0,0 +1 @@ +We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning for machine vision. It originates from the classical Bongard Problems (BPs): Given two sets of images (positive and negative), the model needs to identify the set that query images belong to by inducing the visual concepts, which are exclusively depicted by images from the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already poses a significant challenge to current few-shot reasoning algorithms. We further investigate to what extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs, and combining VLMs and LLMs in an interactive reasoning scheme. We even conceived a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches manage to close the human-machine gap, as the best learner achieves 64% accuracy while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities. \ No newline at end of file diff --git a/data/2024/iclr/BooookScore: A systematic exploration of book-length summarization in the era of LLMs b/data/2024/iclr/BooookScore: A systematic exploration of book-length summarization in the era of LLMs new file mode 100644 index 0000000000..3c56b69737 --- /dev/null +++ b/data/2024/iclr/BooookScore: A systematic exploration of book-length summarization in the era of LLMs @@ -0,0 +1 @@ +Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. 
Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book-length summarization datasets (e.g., BookSum) are in the pretraining data of most public LLMs, and existing evaluation methods struggle to capture errors made by modern LLM summarizers. In this paper, we present the first study of the coherence of LLM-based book-length summarizers implemented via two prompting workflows: (1) hierarchically merging chunk-level summaries, and (2) incrementally updating a running summary. We obtain 1193 fine-grained human annotations on GPT-4 generated summaries of 100 recently-published books and identify eight common types of coherence errors made by LLMs. Because human evaluation is expensive and time-consuming, we develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. BooookScore has high agreement with human annotations and allows us to systematically evaluate the impact of many other critical parameters (e.g., chunk size, base LLM) while saving $15K USD and 500 hours in human evaluation costs. We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than those generated by open-source models. While LLaMA 2 falls behind other models, Mixtral achieves performance on par with GPT-3.5-Turbo. Incremental updating yields lower BooookScore but higher level of detail than hierarchical merging, a trade-off sometimes preferred by annotators. \ No newline at end of file diff --git a/data/2024/iclr/Boosting Graph Anomaly Detection with Adaptive Message Passing b/data/2024/iclr/Boosting Graph Anomaly Detection with Adaptive Message Passing new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Boosting Vanilla Lightweight Vision Transformers via Re-parameterization b/data/2024/iclr/Boosting Vanilla Lightweight Vision Transformers via Re-parameterization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models b/data/2024/iclr/Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models new file mode 100644 index 0000000000..794a0f0bf1 --- /dev/null +++ b/data/2024/iclr/Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models @@ -0,0 +1 @@ +The reasoning performance of Large Language Models (LLMs) on a wide range of problems critically relies on chain-of-thought prompting, which involves providing a few chain of thought demonstrations as exemplars in prompts. Recent work, e.g., Tree of Thoughts, has pointed out the importance of exploration and self-evaluation in reasoning step selection for complex problem solving. In this paper, we present Boosting of Thoughts (BoT), an automated prompting framework for problem solving with LLMs by iteratively exploring and self-evaluating many trees of thoughts in order to acquire an ensemble of trial-and-error reasoning experiences, which will serve as a new form of prompting to solve the complex problem. Starting from a simple prompt without requiring examples, BoT iteratively explores and evaluates a large collection of reasoning steps, and more importantly, uses error analysis obtained from the LLM on them to explicitly revise prompting, which in turn enhances reasoning step generation, until a final answer is attained. 
Our experiments with GPT-4 and Llama2 across extensive complex mathematical problems demonstrate that BoT consistently achieves problem-solving rates higher than or comparable to those of other advanced prompting approaches. \ No newline at end of file diff --git a/data/2024/iclr/Boosting the Adversarial Robustness of Graph Neural Networks: An OOD Perspective b/data/2024/iclr/Boosting the Adversarial Robustness of Graph Neural Networks: An OOD Perspective new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bootstrapping Variational Information Pursuit with Large Language and Vision Models for Interpretable Image Classification b/data/2024/iclr/Bootstrapping Variational Information Pursuit with Large Language and Vision Models for Interpretable Image Classification new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Boundary Denoising for Video Activity Localization b/data/2024/iclr/Boundary Denoising for Video Activity Localization new file mode 100644 index 0000000000..cf38ab0cfe --- /dev/null +++ b/data/2024/iclr/Boundary Denoising for Video Activity Localization @@ -0,0 +1 @@ +Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenoiseLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. Then we attempt to reverse this process by boundary denoising, allowing the localizer to predict activities with precise boundaries and resulting in faster convergence speed. Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset and +1.64% mAP@0.5 on the THUMOS'14 dataset over the baseline. Moreover, DenoiseLoc achieves state-of-the-art performance on the TACoS and MAD datasets, but with far fewer predictions compared to other current methods. \ No newline at end of file diff --git a/data/2024/iclr/Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments b/data/2024/iclr/Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments new file mode 100644 index 0000000000..b2c7e0b719 --- /dev/null +++ b/data/2024/iclr/Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments @@ -0,0 +1 @@ +Bounding boxes uniquely characterize object detection, where a good detector gives accurate bounding boxes of categories of interest. However, in the real world, where test ground truths are not provided, it is non-trivial to find out whether bounding boxes are accurate, thus preventing us from assessing the detector generalization ability. 
In this work, we find that, under feature map dropout, good detectors tend to output bounding boxes whose locations do not change much, while bounding boxes of poor detectors will undergo noticeable position changes. We compute the box stability score (BoS score) to reflect this stability. Specifically, given an image, we compute a normal set of bounding boxes and a second set after feature map dropout. To obtain the BoS score, we use bipartite matching to find the corresponding boxes between the two sets and compute the average Intersection over Union (IoU) across the entire test set. We find that the BoS score has a strong, positive correlation with detection accuracy measured by mean average precision (mAP) under various test environments. This relationship allows us to predict the accuracy of detectors on various real-world test sets without accessing test ground truths, verified on canonical detection tasks such as vehicle detection and pedestrian detection. Code and data are available at https://github.com/YangYangGirl/BoS. \ No newline at end of file diff --git a/data/2024/iclr/Bounding the Expected Robustness of Graph Neural Networks Subject to Node Feature Attacks b/data/2024/iclr/Bounding the Expected Robustness of Graph Neural Networks Subject to Node Feature Attacks new file mode 100644 index 0000000000..db07bcc6e1 --- /dev/null +++ b/data/2024/iclr/Bounding the Expected Robustness of Graph Neural Networks Subject to Node Feature Attacks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have demonstrated state-of-the-art performance in various graph representation learning tasks. Recently, studies revealed their vulnerability to adversarial attacks. In this work, we theoretically define the concept of expected robustness in the context of attributed graphs and relate it to the classical definition of adversarial robustness in the graph representation learning literature. Our definition allows us to derive an upper bound of the expected robustness of Graph Convolutional Networks (GCNs) and Graph Isomorphism Networks subject to node feature attacks. Building on these findings, we connect the expected robustness of GNNs to the orthonormality of their weight matrices and consequently propose an attack-independent, more robust variant of the GCN, called the Graph Convolutional Orthonormal Robust Networks (GCORNs). We further introduce a probabilistic method to estimate the expected robustness, which allows us to evaluate the effectiveness of GCORN on several real-world datasets. Extensive experiments show that GCORN outperforms available defense methods. Our code is publicly available at: \href{https://github.com/Sennadir/GCORN}{https://github.com/Sennadir/GCORN}. \ No newline at end of file diff --git a/data/2024/iclr/Bounds on Representation-Induced Confounding Bias for Treatment Effect Estimation b/data/2024/iclr/Bounds on Representation-Induced Confounding Bias for Treatment Effect Estimation new file mode 100644 index 0000000000..09c7a3d202 --- /dev/null +++ b/data/2024/iclr/Bounds on Representation-Induced Confounding Bias for Treatment Effect Estimation @@ -0,0 +1 @@ +State-of-the-art methods for conditional average treatment effect (CATE) estimation make widespread use of representation learning. Here, the idea is to reduce the variance of the low-sample CATE estimation by a (potentially constrained) low-dimensional representation. 
However, low-dimensional representations can lose information about the observed confounders and thus lead to bias, which can invalidate representation learning for CATE estimation. In this paper, we propose a new, representation-agnostic refutation framework for estimating bounds on the representation-induced confounding bias that comes from dimensionality reduction (or other constraints on the representations) in CATE estimation. First, we establish theoretically under which conditions CATE is non-identifiable given low-dimensional (constrained) representations. Second, as our remedy, we propose a neural refutation framework which performs partial identification of CATE or, equivalently, aims at estimating lower and upper bounds of the representation-induced confounding bias. We demonstrate the effectiveness of our bounds in a series of experiments. In sum, our refutation framework is of direct relevance in practice where the validity of CATE estimation is of importance. \ No newline at end of file diff --git a/data/2024/iclr/Brain decoding: toward real-time reconstruction of visual perception b/data/2024/iclr/Brain decoding: toward real-time reconstruction of visual perception new file mode 100644 index 0000000000..be8320a96b --- /dev/null +++ b/data/2024/iclr/Brain decoding: toward real-time reconstruction of visual perception @@ -0,0 +1 @@ +In the past five years, the use of generative and foundational AI systems has greatly improved the decoding of brain activity. Visual perception, in particular, can now be decoded from functional Magnetic Resonance Imaging (fMRI) with remarkable fidelity. This neuroimaging technique, however, suffers from a limited temporal resolution ($\approx$0.5 Hz), which fundamentally constrains its real-time usage. Here, we propose an alternative approach based on magnetoencephalography (MEG), a neuroimaging device capable of measuring brain activity with high temporal resolution ($\approx$5,000 Hz). For this, we develop an MEG decoding model trained with both contrastive and regression objectives and consisting of three modules: i) pretrained embeddings obtained from the image, ii) an MEG module trained end-to-end and iii) a pretrained image generator. Our results are threefold: First, our MEG decoder shows a 7X improvement in image retrieval over classic linear decoders. Second, late brain responses to images are best decoded with DINOv2, a recent foundational image model. Third, image retrievals and generations both suggest that high-level visual features can be decoded from MEG signals, although the same approach applied to 7T fMRI also recovers better low-level features. Overall, these results, while preliminary, provide an important step towards the decoding -- in real-time -- of the visual processes continuously unfolding within the human brain. \ No newline at end of file diff --git a/data/2024/iclr/BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity b/data/2024/iclr/BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity new file mode 100644 index 0000000000..b4bae1f2b3 --- /dev/null +++ b/data/2024/iclr/BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity @@ -0,0 +1 @@ +Understanding the functional organization of higher visual cortex is a central focus in neuroscience. 
Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may potentially bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex. \ No newline at end of file diff --git a/data/2024/iclr/Branch-GAN: Improving Text Generation with (not so) Large Language Models b/data/2024/iclr/Branch-GAN: Improving Text Generation with (not so) Large Language Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages b/data/2024/iclr/Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bridging Neural and Symbolic Representations with Transitional Dictionary Learning b/data/2024/iclr/Bridging Neural and Symbolic Representations with Transitional Dictionary Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bridging State and History Representations: Understanding Self-Predictive RL b/data/2024/iclr/Bridging State and History Representations: Understanding Self-Predictive RL new file mode 100644 index 0000000000..564dd2956a --- /dev/null +++ b/data/2024/iclr/Bridging State and History Representations: Understanding Self-Predictive RL @@ -0,0 +1 @@ +Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. 
These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners. \ No newline at end of file diff --git a/data/2024/iclr/Bridging Vision and Language Spaces with Assignment Prediction b/data/2024/iclr/Bridging Vision and Language Spaces with Assignment Prediction new file mode 100644 index 0000000000..21b98191ce --- /dev/null +++ b/data/2024/iclr/Bridging Vision and Language Spaces with Assignment Prediction @@ -0,0 +1 @@ +This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assigning procedure as an optimal transport problem. We predict the assignment of one modality from the representation of the other modality, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to contain the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, a robust semantic taxonomy of LLMs can be preserved with visual data since the LLMs interpret and reason linguistic information from correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over the previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold a semantic taxonomy of LLMs, making visual semantic arithmetic possible. \ No newline at end of file diff --git a/data/2024/iclr/BroGNet: Momentum-Conserving Graph Neural Stochastic Differential Equation for Learning Brownian Dynamics b/data/2024/iclr/BroGNet: Momentum-Conserving Graph Neural Stochastic Differential Equation for Learning Brownian Dynamics new file mode 100644 index 0000000000..a6d034c9b6 --- /dev/null +++ b/data/2024/iclr/BroGNet: Momentum-Conserving Graph Neural Stochastic Differential Equation for Learning Brownian Dynamics @@ -0,0 +1 @@ +Neural networks (NNs) that exploit strong inductive biases based on physical laws and symmetries have shown remarkable success in learning the dynamics of physical systems directly from their trajectory. However, these works focus only on systems that follow deterministic dynamics, such as Newtonian or Hamiltonian dynamics. Here, we propose a framework, namely Brownian graph neural networks (BroGNet), combining stochastic differential equations (SDEs) and GNNs to learn Brownian dynamics directly from the trajectory. We modify the architecture of BroGNet to enforce linear momentum conservation of the system, which, in turn, provides superior performance on learning dynamics as revealed empirically. 
We demonstrate this approach on several systems, namely, linear spring, linear spring with binary particle types, and non-linear spring systems, all following Brownian dynamics at finite temperatures. We show that BroGNet significantly outperforms proposed baselines across all the benchmarked Brownian systems. In addition, we demonstrate zero-shot generalizability of BroGNet to simulate unseen system sizes that are two orders of magnitude larger and to different temperatures than those used during training. Finally, we show that BroGNet conserves the momentum of the system, resulting in superior performance and data efficiency. Altogether, our study contributes to advancing the understanding of the intricate dynamics of Brownian motion and demonstrates the effectiveness of graph neural networks in modeling such complex systems. \ No newline at end of file diff --git a/data/2024/iclr/Brusleattack: a Query-Efficient Score- based Black-Box Sparse Adversarial Attack b/data/2024/iclr/Brusleattack: a Query-Efficient Score- based Black-Box Sparse Adversarial Attack new file mode 100644 index 0000000000..cdb0e0a198 --- /dev/null +++ b/data/2024/iclr/Brusleattack: a Query-Efficient Score- based Black-Box Sparse Adversarial Attack @@ -0,0 +1 @@ +We study the unique, less well-understood problem of generating sparse adversarial samples simply by observing the score-based replies to model queries. Sparse attacks aim to discover a minimum number of l0-bounded perturbations to model inputs to craft adversarial examples and misguide model decisions. However, in contrast to query-based dense attack counterparts against black-box models, constructing sparse adversarial perturbations, even when models serve confidence score information to queries in a score-based setting, is non-trivial. This is because such an attack leads to i) an NP-hard problem and ii) a non-differentiable search space. We develop BruSLeAttack, a new, faster (more query-efficient) Bayesian algorithm for the problem. We conduct extensive attack evaluations including an attack demonstration against a Machine Learning as a Service (MLaaS) offering exemplified by Google Cloud Vision and robustness testing of adversarial training regimes and a recent defense against black-box attacks. The proposed attack scales to achieve state-of-the-art attack success rates and query efficiency on standard computer vision tasks such as ImageNet across different model architectures. Our artefacts and DIY attack samples are available on GitHub. Importantly, our work facilitates faster evaluation of model vulnerabilities and raises our vigilance on the safety, security and reliability of deployed systems. \ No newline at end of file diff --git a/data/2024/iclr/Building Cooperative Embodied Agents Modularly with Large Language Models b/data/2024/iclr/Building Cooperative Embodied Agents Modularly with Large Language Models new file mode 100644 index 0000000000..794eeb20a7 --- /dev/null +++ b/data/2024/iclr/Building Cooperative Embodied Agents Modularly with Large Language Models @@ -0,0 +1 @@ +In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. 
While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates with perception, memory, and execution. We thus build a Cooperative Embodied Language Agent, CoELA, which can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current Open LMs like LLAMA-2 still underperform, we fine-tune a CoELA with data collected with our agents and show how they can achieve promising performance. We also conducted a user study for human-agent interaction and discovered that CoELA communicating in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation. Videos can be found on the project website https://vis-www.cs.umass.edu/Co-LLM-Agents/. \ No newline at end of file diff --git a/data/2024/iclr/Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression b/data/2024/iclr/Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression new file mode 100644 index 0000000000..022a343015 --- /dev/null +++ b/data/2024/iclr/Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression @@ -0,0 +1 @@ +This work studies training instabilities of behavior cloning with deep neural networks. We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards, despite negligibly affecting the behavior cloning loss. We empirically disentangle the statistical and computational causes of these oscillations, and find them to stem from the chaotic propagation of minibatch SGD noise through unstable closed-loop dynamics. While SGD noise is benign in the single-step action prediction objective, it results in catastrophic error accumulation over long horizons, an effect we term gradient variance amplification (GVA). We show that many standard mitigation techniques do not alleviate GVA, but find an exponential moving average (EMA) of iterates to be surprisingly effective at doing so. We illustrate the generality of this phenomenon by showing the existence of GVA and its amelioration by EMA in both continuous control and autoregressive language generation. Finally, we provide theoretical vignettes that highlight the benefits of EMA in alleviating GVA and shed light on the extent to which classical convex models can help in understanding the benefits of iterate averaging in deep learning. 
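The EMA-of-iterates remedy described in the abstract above is simple to state in code. The following is a minimal sketch, not the authors' implementation, of maintaining an exponential moving average of policy parameters alongside ordinary SGD updates in PyTorch; policy, loader, loss_fn, and the decay rate ema_decay are illustrative placeholders.

# Minimal sketch (assumed setup, not the paper's code): keep an exponential
# moving average (EMA) of the policy weights while training with minibatch SGD.
import copy
import torch

def train_with_ema(policy, loader, loss_fn, lr=1e-3, ema_decay=0.999):
    ema_policy = copy.deepcopy(policy)          # frozen copy holding the averaged weights
    for p in ema_policy.parameters():
        p.requires_grad_(False)
    opt = torch.optim.SGD(policy.parameters(), lr=lr)
    for obs, act in loader:
        opt.zero_grad()
        loss = loss_fn(policy(obs), act)        # single-step behavior cloning loss
        loss.backward()
        opt.step()
        with torch.no_grad():                   # theta_ema <- d * theta_ema + (1 - d) * theta
            for p_ema, p in zip(ema_policy.parameters(), policy.parameters()):
                p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
    return ema_policy                           # roll out the averaged policy at evaluation time

Rollouts would then use the returned averaged policy rather than the raw SGD iterate, which is the iterate averaging the abstract credits with suppressing GVA.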
\ No newline at end of file diff --git a/data/2024/iclr/Byzantine Robust Cooperative Multi-Agent Reinforcement Learning as a Bayesian Game b/data/2024/iclr/Byzantine Robust Cooperative Multi-Agent Reinforcement Learning as a Bayesian Game new file mode 100644 index 0000000000..ca6d7a4709 --- /dev/null +++ b/data/2024/iclr/Byzantine Robust Cooperative Multi-Agent Reinforcement Learning as a Bayesian Game @@ -0,0 +1 @@ +In this study, we explore the robustness of cooperative multi-agent reinforcement learning (c-MARL) against Byzantine failures, where any agent can enact arbitrary, worst-case actions due to malfunction or adversarial attack. To address the uncertainty that any agent can be adversarial, we propose a Bayesian Adversarial Robust Dec-POMDP (BARDec-POMDP) framework, which views Byzantine adversaries as nature-dictated types, represented by a separate transition. This allows agents to learn policies grounded on their posterior beliefs about the type of other agents, fostering collaboration with identified allies and minimizing vulnerability to adversarial manipulation. We define the optimal solution to the BARDec-POMDP as an ex post robust Bayesian Markov perfect equilibrium, which we prove exists and weakly dominates the equilibrium of previous robust MARL approaches. To realize this equilibrium, we put forward a two-timescale actor-critic algorithm with almost sure convergence under specific conditions. Experiments on matrix games, level-based foraging and StarCraft II indicate that, even under worst-case perturbations, our method successfully acquires intricate micromanagement skills and adaptively aligns with allies, demonstrating resilience against non-oblivious adversaries, random allies, observation-based attacks, and transfer-based attacks. \ No newline at end of file diff --git a/data/2024/iclr/C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion b/data/2024/iclr/C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion new file mode 100644 index 0000000000..9cd627fef9 --- /dev/null +++ b/data/2024/iclr/C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion @@ -0,0 +1 @@ +In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration, which is a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts at test time with enhanced calibration. 
Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT. \ No newline at end of file diff --git a/data/2024/iclr/CABINET: Content Relevance-based Noise Reduction for Table Question Answering b/data/2024/iclr/CABINET: Content Relevance-based Noise Reduction for Table Question Answering new file mode 100644 index 0000000000..d1c217d0fc --- /dev/null +++ b/data/2024/iclr/CABINET: Content Relevance-based Noise Reduction for Table Question Answering @@ -0,0 +1 @@ +Table understanding capability of Large Language Models (LLMs) has been extensively studied through the task of question-answering (QA) over tables. Typically, only a small part of the whole table is relevant to derive the answer for a given question. The irrelevant parts act as noise and are distracting information, resulting in sub-optimal performance due to the vulnerability of LLMs to noise. To mitigate this, we propose CABINET (Content RelevAnce-Based NoIse ReductioN for TablE QuesTion-Answering) - a framework to enable LLMs to focus on relevant tabular data by suppressing extraneous information. CABINET comprises an Unsupervised Relevance Scorer (URS), trained differentially with the QA LLM, that weighs the table content based on its relevance to the input question before feeding it to the question-answering LLM (QA LLM). To further aid the relevance scorer, CABINET employs a weakly supervised module that generates a parsing statement describing the criteria of rows and columns relevant to the question and highlights the content of corresponding table cells. CABINET significantly outperforms various tabular LLM baselines, as well as GPT3-based in-context learning methods, is more robust to noise, maintains outperformance on tables of varying sizes, and establishes new SoTA performance on WikiTQ, FeTaQA, and WikiSQL datasets. We release our code and datasets at https://github.com/Sohanpatnaik106/CABINET_QA. \ No newline at end of file diff --git a/data/2024/iclr/CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling b/data/2024/iclr/CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling new file mode 100644 index 0000000000..1f6402ceb5 --- /dev/null +++ b/data/2024/iclr/CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling @@ -0,0 +1 @@ +While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. 
Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 and 512$\times$512 respectively. \ No newline at end of file diff --git a/data/2024/iclr/CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV Perception b/data/2024/iclr/CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV Perception new file mode 100644 index 0000000000..03844dc21c --- /dev/null +++ b/data/2024/iclr/CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV Perception @@ -0,0 +1 @@ +Perception is crucial in the realm of autonomous driving systems, where bird's eye view (BEV)-based architectures have recently reached state-of-the-art performance. The desirability of self-supervised representation learning stems from the expensive and laborious process of annotating 2D and 3D data. Although previous research has investigated pretraining methods for both LiDAR and camera-based 3D object detection, a unified pretraining framework for multimodal BEV perception is missing. In this study, we introduce CALICO, a novel framework that applies contrastive objectives to both LiDAR and camera backbones. Specifically, CALICO incorporates two stages: point-region contrast (PRC) and region-aware distillation (RAD). PRC better balances the region- and scene-level representation learning on the LiDAR modality and offers significant performance improvement compared to existing methods. RAD effectively achieves contrastive distillation on our self-trained teacher model. CALICO's efficacy is substantiated by extensive evaluations on 3D object detection and BEV map segmentation tasks, where it delivers significant performance improvements. Notably, CALICO outperforms the baseline method by 10.5% and 8.6% on NDS and mAP. Moreover, CALICO boosts the robustness of multimodal 3D object detection against adversarial attacks and corruption. Additionally, our framework can be tailored to different backbones and heads, positioning it as a promising approach for multimodal BEV perception. \ No newline at end of file diff --git a/data/2024/iclr/CAMBranch: Contrastive Learning with Augmented MILPs for Branching b/data/2024/iclr/CAMBranch: Contrastive Learning with Augmented MILPs for Branching new file mode 100644 index 0000000000..34cc2c142f --- /dev/null +++ b/data/2024/iclr/CAMBranch: Contrastive Learning with Augmented MILPs for Branching @@ -0,0 +1 @@ +Recent advancements have introduced machine learning frameworks to enhance the Branch and Bound (B\&B) branching policies for solving Mixed Integer Linear Programming (MILP). These methods, primarily relying on imitation learning of Strong Branching, have shown superior performance. However, collecting expert samples for imitation learning, particularly for Strong Branching, is a time-consuming endeavor. To address this challenge, we propose \textbf{C}ontrastive Learning with \textbf{A}ugmented \textbf{M}ILPs for \textbf{Branch}ing (CAMBranch), a framework that generates Augmented MILPs (AMILPs) by applying variable shifting to limited expert data from their original MILPs. This approach enables the acquisition of a considerable number of labeled expert samples. CAMBranch leverages both MILPs and AMILPs for imitation learning and employs contrastive learning to enhance the model's ability to capture MILP features, thereby improving the quality of branching decisions. 
Experimental results demonstrate that CAMBranch, trained with only 10% of the complete dataset, exhibits superior performance. Ablation studies further validate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/iclr/CAMIL: Context-Aware Multiple Instance Learning for Cancer Detection and Subtyping in Whole Slide Images b/data/2024/iclr/CAMIL: Context-Aware Multiple Instance Learning for Cancer Detection and Subtyping in Whole Slide Images new file mode 100644 index 0000000000..0a8d3031ba --- /dev/null +++ b/data/2024/iclr/CAMIL: Context-Aware Multiple Instance Learning for Cancer Detection and Subtyping in Whole Slide Images @@ -0,0 +1 @@ +The visual examination of tissue biopsy sections is fundamental for cancer diagnosis, with pathologists analyzing sections at multiple magnifications to discern tumor cells and their subtypes. However, existing attention-based multiple instance learning (MIL) models, used for analyzing Whole Slide Images (WSIs) in cancer diagnostics, often overlook the contextual information of tumor and neighboring tiles, leading to misclassifications. To address this, we propose the Context-Aware Multiple Instance Learning (CAMIL) architecture. CAMIL incorporates neighbor-constrained attention to consider dependencies among tiles within a WSI and integrates contextual constraints as prior knowledge into the MIL model. We evaluated CAMIL on subtyping non-small cell lung cancer (TCGA-NSCLC) and detecting lymph node (CAMELYON16) metastasis, achieving test AUCs of 0.959 and 0.975, respectively, outperforming other state-of-the-art methods. Additionally, CAMIL enhances model interpretability by identifying regions of high diagnostic value. \ No newline at end of file diff --git a/data/2024/iclr/CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting b/data/2024/iclr/CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting new file mode 100644 index 0000000000..74814c0daf --- /dev/null +++ b/data/2024/iclr/CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting @@ -0,0 +1 @@ +Recent studies have demonstrated the great power of Transformer models for time series forecasting. One of the key elements that leads to the Transformer's success is the channel-independent (CI) strategy to improve the training robustness. However, ignoring the correlation among different channels in CI limits the model's forecasting capacity. In this work, we design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of CI-type Transformers in time series forecasting. First, CARD introduces a channel-aligned attention structure that allows it to capture both temporal correlations among signals and dynamical dependence among multiple variables over time. Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions. Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue. This new loss function weights the importance of forecasting over a finite horizon based on prediction uncertainties. Our evaluation on multiple long-term and short-term forecasting datasets demonstrates that CARD significantly outperforms state-of-the-art time series forecasting methods. 
The code is available at the following repository:https://github.com/wxie9/CARD \ No newline at end of file diff --git a/data/2024/iclr/CAS: A Probability-Based Approach for Universal Condition Alignment Score b/data/2024/iclr/CAS: A Probability-Based Approach for Universal Condition Alignment Score new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/CCIL: Continuity-Based Data Augmentation for Corrective Imitation Learning b/data/2024/iclr/CCIL: Continuity-Based Data Augmentation for Corrective Imitation Learning new file mode 100644 index 0000000000..05cee5b0bd --- /dev/null +++ b/data/2024/iclr/CCIL: Continuity-Based Data Augmentation for Corrective Imitation Learning @@ -0,0 +1 @@ +We present a new technique to enhance the robustness of imitation learning methods by generating corrective data to account for compounding errors and disturbances. While existing methods rely on interactive expert labeling, additional offline datasets, or domain-specific invariances, our approach requires minimal additional assumptions beyond access to expert data. The key insight is to leverage local continuity in the environment dynamics to generate corrective labels. Our method first constructs a dynamics model from the expert demonstration, encouraging local Lipschitz continuity in the learned model. In locally continuous regions, this model allows us to generate corrective labels within the neighborhood of the demonstrations but beyond the actual set of states and actions in the dataset. Training on this augmented data enhances the agent's ability to recover from perturbations and deal with compounding errors. We demonstrate the effectiveness of our generated labels through experiments in a variety of robotics domains in simulation that have distinct forms of continuity and discontinuity, including classic control problems, drone flying, navigation with high-dimensional sensor observations, legged locomotion, and tabletop manipulation. \ No newline at end of file diff --git a/data/2024/iclr/CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis b/data/2024/iclr/CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis new file mode 100644 index 0000000000..b29798d7a5 --- /dev/null +++ b/data/2024/iclr/CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis @@ -0,0 +1 @@ +Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. 
We also discuss other fields that would benefit from CIFAR-10-W. \ No newline at end of file diff --git a/data/2024/iclr/CLAP: Collaborative Adaptation for Patchwork Learning b/data/2024/iclr/CLAP: Collaborative Adaptation for Patchwork Learning new file mode 100644 index 0000000000..6bbea76fb6 --- /dev/null +++ b/data/2024/iclr/CLAP: Collaborative Adaptation for Patchwork Learning @@ -0,0 +1 @@ +our \ No newline at end of file diff --git a/data/2024/iclr/CLEX: Continuous Length Extrapolation for Large Language Models b/data/2024/iclr/CLEX: Continuous Length Extrapolation for Large Language Models new file mode 100644 index 0000000000..ddb958a42f --- /dev/null +++ b/data/2024/iclr/CLEX: Continuous Length Extrapolation for Large Language Models @@ -0,0 +1 @@ +Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks; however, their exceptional capabilities are restricted within the preset context window of the Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, either exhibit notable limitations in their extrapolation abilities or sacrifice part of the performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX facilitates length extrapolation with impressive performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results reveal that CLEX can effectively extend the context window to over 4x or almost 8x the training length, with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k. Our code is available at https://github.com/DAMO-NLP-SG/CLEX. \ No newline at end of file diff --git a/data/2024/iclr/CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? b/data/2024/iclr/CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? new file mode 100644 index 0000000000..fb3ced9333 --- /dev/null +++ b/data/2024/iclr/CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? @@ -0,0 +1 @@ +We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e. in first- and second-order statistics) in multimodal data. We use M4 to conduct an in-depth analysis taking into account various factors, such as the model, representation, and data size. 
Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems. \ No newline at end of file diff --git a/data/2024/iclr/CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding b/data/2024/iclr/CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding new file mode 100644 index 0000000000..a0f00a2947 --- /dev/null +++ b/data/2024/iclr/CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding @@ -0,0 +1 @@ +The study of decoding visual neural information faces challenges in generalizing single-subject decoding models to multiple subjects, due to individual differences. Moreover, the limited availability of data from a single subject has a constraining impact on model performance. Although prior multi-subject decoding methods have made significant progress, they still suffer from several limitations, including difficulty in extracting global neural response features, linear scaling of model parameters with the number of subjects, and inadequate characterization of the relationship between neural responses of different subjects to various stimuli. To overcome these limitations, we propose a CLIP-guided Multi-sUbject visual neural information SEmantic Decoding (CLIP-MUSED) method. Our method consists of a Transformer-based feature extractor to effectively model global neural representations. It also incorporates learnable subject-specific tokens that facilitates the aggregation of multi-subject data without a linear increase of parameters. Additionally, we employ representational similarity analysis (RSA) to guide token representation learning based on the topological relationship of visual stimuli in the representation space of CLIP, enabling full characterization of the relationship between neural responses of different subjects under different stimuli. Finally, token representations are used for multi-subject semantic decoding. Our proposed method outperforms single-subject decoding methods and achieves state-of-the-art performance among the existing multi-subject methods on two fMRI datasets. Visualization results provide insights into the effectiveness of our proposed method. Code is available at https://github.com/CLIP-MUSED/CLIP-MUSED. \ No newline at end of file diff --git a/data/2024/iclr/CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction b/data/2024/iclr/CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction new file mode 100644 index 0000000000..ea4d75e74c --- /dev/null +++ b/data/2024/iclr/CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction @@ -0,0 +1 @@ +Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). 
CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf. \ No newline at end of file diff --git a/data/2024/iclr/CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech b/data/2024/iclr/CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech new file mode 100644 index 0000000000..b492a8b0d2 --- /dev/null +++ b/data/2024/iclr/CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech @@ -0,0 +1 @@ +With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modelling the multiple sequences. To mitigate these issues, we present CLaM-TTS that employs a probabilistic residual vector quantization to (1) achieve superior compression in the token length, and (2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the language models and their text tokenization strategies on performances. \ No newline at end of file diff --git a/data/2024/iclr/CNN Kernels Can Be the Best Shapelets b/data/2024/iclr/CNN Kernels Can Be the Best Shapelets new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/CO2: Efficient Distributed Training with Full Communication-Computation Overlap b/data/2024/iclr/CO2: Efficient Distributed Training with Full Communication-Computation Overlap new file mode 100644 index 0000000000..e8b2451f7a --- /dev/null +++ b/data/2024/iclr/CO2: Efficient Distributed Training with Full Communication-Computation Overlap @@ -0,0 +1 @@ +The fundamental success of large language models hinges upon the efficacious implementation of large-scale distributed training techniques. 
Nevertheless, building a vast, high-performance cluster featuring high-speed communication interconnectivity is prohibitively costly, and accessible only to prominent entities. In this work, we aim to lower this barrier and democratize large-scale training with limited bandwidth clusters. We propose a new approach called CO2 that introduces local-updating and asynchronous communication to distributed data-parallel training, thereby facilitating the full overlap of COmmunication with COmputation. CO2 is able to attain high scalability even on extensive multi-node clusters constrained by very limited communication bandwidth. We further propose the staleness gap penalty and outer momentum clipping techniques together with CO2 to bolster its convergence and training stability. Besides, CO2 exhibits seamless integration with well-established ZeRO-series optimizers, which mitigate the memory consumption of model states in large model training. We also provide a mathematical proof of convergence, accompanied by the establishment of a stringent upper bound. Furthermore, we validate our findings through an extensive set of practical experiments encompassing a wide range of tasks in the fields of computer vision and natural language processing. These experiments serve to demonstrate the capabilities of CO2 in terms of convergence, generalization, and scalability when deployed across configurations comprising up to 128 A100 GPUs. The outcomes emphasize the outstanding capacity of CO2 to greatly improve scalability, whether on clusters with 800Gbps RDMA or 80Gbps TCP/IP inter-node connections. \ No newline at end of file diff --git a/data/2024/iclr/COCO-Periph: Bridging the Gap Between Human and Machine Perception in the Periphery b/data/2024/iclr/COCO-Periph: Bridging the Gap Between Human and Machine Perception in the Periphery new file mode 100644 index 0000000000..234c768bbb --- /dev/null +++ b/data/2024/iclr/COCO-Periph: Bridging the Gap Between Human and Machine Perception in the Periphery @@ -0,0 +1 @@ +Evaluating deep neural networks (DNNs) as models of human perception has given rich insights into both human visual processing and representational properties of DNNs. We extend this work by analyzing how well DNNs perform compared to humans when constrained by peripheral vision – which limits human performance on a variety of tasks, but also benefits the visual system significantly. We evaluate this by (1) modifying the texture tiling model (TTM), a well-tested model of peripheral vision, to be more flexibly used with DNNs, (2) generating a large dataset which we call COCO-Periph that contains images transformed to capture the information available in human peripheral vision, and (3) comparing DNNs to humans at peripheral object detection using a psychophysics experiment. Our results show that common DNNs underperform at object detection compared to humans when simulating peripheral vision with TTM. Training on COCO-Periph begins to reduce the gap between human and DNN performance and leads to small increases in corruption robustness, but DNNs still struggle to capture human-like sensitivity to peripheral clutter. Our work brings us closer to accurately modeling human vision, and paves the way for DNNs to mimic and sometimes benefit from properties of human visual processing. 
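A minimal sketch of the local-updating plus asynchronous-communication idea described in the CO2 abstract above: each worker takes several purely local optimizer steps while the parameter averaging launched in the previous round is still in flight, so communication overlaps with computation. The function and variable names are illustrative assumptions, and the sketch omits CO2's staleness gap penalty, outer momentum clipping, and ZeRO integration.

import torch
import torch.distributed as dist

def co2_style_round(model, optimizer, loss_fn, data_iter, local_steps, pending=None):
    # Finish the synchronization launched in the previous round, if any; it has
    # been overlapping with the local steps executed since it was issued.
    if pending is not None:
        handles, averaged = pending
        for h in handles:
            h.wait()
        with torch.no_grad():
            for p, avg in zip(model.parameters(), averaged):
                p.copy_(avg / dist.get_world_size())

    # Several purely local updates, with no per-step gradient all-reduce.
    for _ in range(local_steps):
        x, y = next(data_iter)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    # Launch an asynchronous all-reduce of the current parameters; the next
    # round's local computation proceeds while these messages are in transit.
    averaged = [p.detach().clone() for p in model.parameters()]
    handles = [dist.all_reduce(t, async_op=True) for t in averaged]
    return handles, averaged

Calling co2_style_round in a loop and feeding the returned (handles, averaged) back in as pending keeps exactly one synchronization in flight at a time, which is the communication-computation overlap the abstract describes.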
\ No newline at end of file diff --git a/data/2024/iclr/COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits b/data/2024/iclr/COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits new file mode 100644 index 0000000000..73f4fb2edc --- /dev/null +++ b/data/2024/iclr/COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits @@ -0,0 +1 @@ +Conformal prediction has shown impressive performance in constructing statistically rigorous prediction sets for arbitrary black-box machine learning models, assuming the data is exchangeable. However, even small adversarial perturbations during inference can violate the exchangeability assumption, challenge the coverage guarantees, and result in a subsequent decline in empirical coverage. In this work, we propose a certifiably robust learning-reasoning conformal prediction framework (COLEP) via probabilistic circuits, which comprise a data-driven learning component that trains statistical models to learn different semantic concepts, and a reasoning component that encodes knowledge and characterizes the relationships among the trained models for logic reasoning. To achieve exact and efficient reasoning, we employ probabilistic circuits (PCs) within the reasoning component. Theoretically, we provide end-to-end certification of prediction coverage for COLEP in the presence of bounded adversarial perturbations. We also provide certified coverage considering the finite size of the calibration set. Furthermore, we prove that COLEP achieves higher prediction coverage and accuracy over a single model as long as the utilities of knowledge models are non-trivial. Empirically, we show the validity and tightness of our certified coverage, demonstrating the robust conformal prediction of COLEP on various datasets, including GTSRB, CIFAR10, and AwA2. We show that COLEP achieves up to 12% improvement in certified coverage on GTSRB, 9% on CIFAR-10, and 14% on AwA2. \ No newline at end of file diff --git a/data/2024/iclr/COLLIE: Systematic Construction of Constrained Text Generation Tasks b/data/2024/iclr/COLLIE: Systematic Construction of Constrained Text Generation Tasks new file mode 100644 index 0000000000..1450c12605 --- /dev/null +++ b/data/2024/iclr/COLLIE: Systematic Construction of Constrained Text Generation Tasks @@ -0,0 +1 @@ +Text generation under constraints has seen increasing interest in natural language processing, especially with the rapidly improving capabilities of large language models. However, existing benchmarks for constrained generation usually focus on fixed constraint types (e.g., generate a sentence containing certain words) that have proved to be easy for state-of-the-art models like GPT-4. We present COLLIE, a grammar-based framework that allows the specification of rich, compositional constraints with diverse generation levels (word, sentence, paragraph, passage) and modeling challenges (e.g., language understanding, logical reasoning, counting, semantic planning). We also develop tools for automatic extraction of task instances given a constraint structure and a raw text corpus. Using COLLIE, we compile the COLLIE-v1 dataset with 2080 instances comprising 13 constraint structures. We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performance to reveal shortcomings. 
COLLIE is designed to be extensible and lightweight, and we hope the community finds it useful to develop more complex constraints and evaluations in the future. \ No newline at end of file diff --git a/data/2024/iclr/COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL b/data/2024/iclr/COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL new file mode 100644 index 0000000000..bf5cc1e346 --- /dev/null +++ b/data/2024/iclr/COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL @@ -0,0 +1 @@ +Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate sample for policy learning and real environment exploration using current policy for dynamics model learning. However, due to the complex real-world environment, it is inevitable to learn an imperfect dynamics model with model prediction error, which can further mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods to address the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. This estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real environment exploration respectively, to choose actions. Consequently, $\texttt{COPlanner}$ can avoid model uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it explores high-reward model uncertain regions to reduce model error actively through optimistic real environment exploration. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any dyna-style model-based methods. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both sample efficiency and asymptotic performance of strong model-based methods are significantly improved combined with $\texttt{COPlanner}$. \ No newline at end of file diff --git a/data/2024/iclr/CORN: Contact-based Object Representation for Nonprehensile Manipulation of General Unseen Objects b/data/2024/iclr/CORN: Contact-based Object Representation for Nonprehensile Manipulation of General Unseen Objects new file mode 100644 index 0000000000..9f93b61159 --- /dev/null +++ b/data/2024/iclr/CORN: Contact-based Object Representation for Nonprehensile Manipulation of General Unseen Objects @@ -0,0 +1 @@ +Nonprehensile manipulation is essential for manipulating objects that are too thin, large, or otherwise ungraspable in the wild. To sidestep the difficulty of contact modeling in conventional modeling-based approaches, reinforcement learning (RL) has recently emerged as a promising alternative. However, previous RL approaches either lack the ability to generalize over diverse object shapes, or use simple action primitives that limit the diversity of robot motions. Furthermore, using RL over diverse object geometry is challenging due to the high cost of training a policy that takes in high-dimensional sensory inputs. We propose a novel contact-based object representation and pretraining pipeline to tackle this. 
To enable massively parallel training, we leverage a lightweight patch-based transformer architecture for our encoder that processes point clouds, thus scaling our training across thousands of environments. Compared to learning from scratch, or other shape representation baselines, our representation facilitates both time- and data-efficient learning. We validate the efficacy of our overall system by zero-shot transferring the trained policy to novel real-world objects. Code and videos are available at https://sites.google.com/view/contact-non-prehensile. \ No newline at end of file diff --git a/data/2024/iclr/COSA: Concatenated Sample Pretrained Vision-Language Foundation Model b/data/2024/iclr/COSA: Concatenated Sample Pretrained Vision-Language Foundation Model new file mode 100644 index 0000000000..a8d1adb12c --- /dev/null +++ b/data/2024/iclr/COSA: Concatenated Sample Pretrained Vision-Language Foundation Model @@ -0,0 +1 @@ +Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA jointly models visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of downstream tasks, including long-form/short-form video-text tasks and image-text tasks such as retrieval, captioning, and question answering. Notably, COSA achieves state-of-the-art results on various competitive benchmarks. Code and model are released at https://github.com/TXH-mercury/COSA. \ No newline at end of file diff --git a/data/2024/iclr/CPPO: Continual Learning for Reinforcement Learning with Human Feedback b/data/2024/iclr/CPPO: Continual Learning for Reinforcement Learning with Human Feedback new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets b/data/2024/iclr/CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets new file mode 100644 index 0000000000..78debd6089 --- /dev/null +++ b/data/2024/iclr/CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets @@ -0,0 +1 @@ +Large language models (LLMs) are often augmented with tools to solve complex tasks. By generating code snippets and executing them through task-specific Application Programming Interfaces (APIs), they can offload certain functions to dedicated external modules, such as image encoding and performing calculations. However, most existing approaches to augment LLMs with tools are constrained by general-purpose APIs and lack the flexibility for tailoring them to specific tasks. In this work, we present CRAFT, a general tool creation and retrieval framework for LLMs. It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. 
For each task, we collect specific code solutions by prompting GPT-4 to solve the training examples. Following a validation step ensuring correctness, these solutions are abstracted into code snippets to enhance reusability, and deduplicated for higher quality. At inference time, the language model retrieves snippets from the toolsets and then executes them or generates the output conditioned on the retrieved snippets. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning. Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines. In addition, our in-depth analysis reveals that: (1) consistent performance improvement can be achieved by scaling up the number of tools and the capability of the backbone models; (2) each component of our approach contributes to the performance gains; (3) the created tools are well-structured and reliable with low complexity and atomicity. The code is available at https://github.com/lifan-yuan/CRAFT. \ No newline at end of file diff --git a/data/2024/iclr/CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing b/data/2024/iclr/CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing new file mode 100644 index 0000000000..48384167b7 --- /dev/null +++ b/data/2024/iclr/CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing @@ -0,0 +1 @@ +Recent developments in large language models (LLMs) have been impressive. However, these models sometimes show inconsistencies and problematic behavior, such as hallucinating facts, generating flawed code, or creating offensive and toxic content. Unlike these models, humans typically utilize external tools to cross-check and refine their initial content, like using a search engine for fact-checking, or a code interpreter for debugging. Inspired by this observation, we introduce a framework called CRITIC that allows LLMs, which are essentially "black boxes", to validate and progressively amend their own outputs in a manner similar to human interaction with tools. More specifically, starting with an initial output, CRITIC interacts with appropriate tools to evaluate certain aspects of the text, and then revises the output based on the feedback obtained during this validation process. Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs. Meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Cameras as Rays: Pose Estimation via Ray Diffusion b/data/2024/iclr/Cameras as Rays: Pose Estimation via Ray Diffusion new file mode 100644 index 0000000000..0f234b7e59 --- /dev/null +++ b/data/2024/iclr/Cameras as Rays: Pose Estimation via Ray Diffusion @@ -0,0 +1 @@ +Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. 
This representation allows for a tight coupling with spatial image features improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures. \ No newline at end of file diff --git a/data/2024/iclr/Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs b/data/2024/iclr/Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs new file mode 100644 index 0000000000..1f4a9f8c75 --- /dev/null +++ b/data/2024/iclr/Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs @@ -0,0 +1 @@ +Empowering large language models to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This leads to a growing need to explore the untapped area of black-box approaches for LLM uncertainty estimation. To better break down the problem, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks-confidence calibration and failure prediction-across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely-used LLMs including GPT-4 and LLaMA 2 Chat. Our analysis uncovers several key insights: 1) LLMs, when verbalizing their confidence, tend to be overconfident, potentially imitating human patterns of expressing confidence. 2) As model capability scales up, both calibration and failure prediction performance improve. 3) Employing our proposed strategies, such as human-inspired prompts, consistency among multiple responses, and better aggregation strategies can help mitigate this overconfidence from various perspectives. 4) Comparisons with white-box methods indicate that while white-box methods perform better, the gap is narrow, e.g., 0.522 to 0.605 in AUROC. Despite these advancements, none of these techniques consistently outperform others, and all investigated methods struggle in challenging tasks, such as those requiring professional knowledge, indicating significant scope for improvement. We believe this study can serve as a strong baseline and provide insights for eliciting confidence in black-box LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory b/data/2024/iclr/Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory new file mode 100644 index 0000000000..9f0810654d --- /dev/null +++ b/data/2024/iclr/Can LLMs Keep a Secret? 
Testing Privacy Implications of Language Models via Contextual Integrity Theory @@ -0,0 +1 @@ +The interactive use of large language models (LLMs) in AI assistants (at work, home, etc.) introduces a new set of inference-time privacy risks: LLMs are fed different types of information from multiple sources in their inputs and are expected to reason about what to share in their outputs, for what purpose and with whom, within a given context. In this work, we draw attention to the highly critical yet overlooked notion of contextual privacy by proposing ConfAIde, a benchmark designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned LLMs. Our experiments show that even the most capable models such as GPT-4 and ChatGPT reveal private information in contexts that humans would not, 39% and 57% of the time, respectively. This leakage persists even when we employ privacy-inducing prompts or chain-of-thought reasoning. Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind. \ No newline at end of file diff --git a/data/2024/iclr/Can Large Language Models Infer Causation from Correlation? b/data/2024/iclr/Can Large Language Models Infer Causation from Correlation? new file mode 100644 index 0000000000..57427295bc --- /dev/null +++ b/data/2024/iclr/Can Large Language Models Infer Causation from Correlation? @@ -0,0 +1 @@ +Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in the recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g., commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). Specifically, we formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task. This shortcoming is somewhat mitigated when we try to re-purpose LLMs for this skill via finetuning, but we find that these models still fail to generalize -- they can only perform causal inference in in-distribution settings when variable names and textual expressions used in the queries are similar to those in the training set, but fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is a challenging task for LLMs, and would be helpful in guiding future research on improving LLMs' pure reasoning skills and generalizability. Our data is at https://huggingface.co/datasets/causalnlp/corr2cause. Our code is at https://github.com/causalNLP/corr2cause. \ No newline at end of file diff --git a/data/2024/iclr/Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks b/data/2024/iclr/Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks new file mode 100644 index 0000000000..048a62f90e --- /dev/null +++ b/data/2024/iclr/Can Sensitive Information Be Deleted From LLMs? 
Objectives for Defending Against Extraction Attacks @@ -0,0 +1 @@ +Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output toxic or harmful text. To mitigate these safety and informational issues, we propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights. We study direct edits to model weights because (1) this approach should guarantee that particular deleted information is never extracted by future prompt attacks, and (2) it should protect against whitebox attacks, which is necessary for making claims about safety/privacy in a setting where publicly available model weights could be used to elicit sensitive information. Our threat model assumes that an attack succeeds if the answer to a sensitive question is located among a set of B generated candidates, based on scenarios where the information would be insecure if the answer is among B candidates. Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time. These attacks leverage two key observations: (1) that traces of deleted information can be found in intermediate model hidden states, and (2) that applying an editing method for one question may not delete information across rephrased versions of the question. Finally, we provide new defense methods that protect against some extraction attacks, but we do not find a single universally effective defense method. Our results suggest that truly deleting sensitive information is a tractable but difficult problem, since even relatively low attack success rates have potentially severe societal implications for real-world deployment of language models. \ No newline at end of file diff --git a/data/2024/iclr/Can Transformers Capture Spatial Relations between Objects? b/data/2024/iclr/Can Transformers Capture Spatial Relations between Objects? new file mode 100644 index 0000000000..c4639e1e4b --- /dev/null +++ b/data/2024/iclr/Can Transformers Capture Spatial Relations between Objects? @@ -0,0 +1 @@ +Spatial relationships between objects represent key scene information for humans to understand and interact with the world. To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches exploiting the long-range attention capabilities of transformers for this task, and evaluate key design principles. We identify a simple "RelatiViT" architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings. The code and datasets are available in \url{https://sites.google.com/view/spatial-relation}. \ No newline at end of file diff --git a/data/2024/iclr/Can We Evaluate Domain Adaptation Models Without Target-Domain Labels? b/data/2024/iclr/Can We Evaluate Domain Adaptation Models Without Target-Domain Labels? 
new file mode 100644 index 0000000000..6f9aec5bda --- /dev/null +++ b/data/2024/iclr/Can We Evaluate Domain Adaptation Models Without Target-Domain Labels? @@ -0,0 +1 @@ +Unsupervised domain adaptation (UDA) involves adapting a model trained on a label-rich source domain to an unlabeled target domain. However, in real-world scenarios, the absence of target-domain labels makes it challenging to evaluate the performance of UDA models. Furthermore, prevailing UDA methods relying on adversarial training and self-training could lead to model degeneration and negative transfer, further exacerbating the evaluation problem. In this paper, we propose a novel metric called the \textit{Transfer Score} to address these issues. The proposed metric enables the unsupervised evaluation of UDA models by assessing the spatial uniformity of the classifier via model parameters, as well as the transferability and discriminability of deep representations. Based on the metric, we achieve three novel objectives without target-domain labels: (1) selecting the best UDA method from a range of available options, (2) optimizing hyperparameters of UDA models to prevent model degeneration, and (3) identifying which checkpoint of UDA model performs optimally. Our work bridges the gap between data-level UDA research and practical UDA scenarios, enabling a realistic assessment of UDA model performance. We validate the effectiveness of our metric through extensive empirical studies on UDA datasets of different scales and imbalanced distributions. The results demonstrate that our metric robustly achieves the aforementioned goals. \ No newline at end of file diff --git a/data/2024/iclr/Can we get the best of both Binary Neural Networks and Spiking Neural Networks for Efficient Computer Vision? b/data/2024/iclr/Can we get the best of both Binary Neural Networks and Spiking Neural Networks for Efficient Computer Vision? new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Cascading Reinforcement Learning b/data/2024/iclr/Cascading Reinforcement Learning new file mode 100644 index 0000000000..2e046e0f5a --- /dev/null +++ b/data/2024/iclr/Cascading Reinforcement Learning @@ -0,0 +1 @@ +Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. Then, the user examines the list, and clicks the first attractive item (if any), and after that, the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influences of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this fact, we propose a generalized cascading RL framework, which considers the impact of user states and state transition into decisions. In cascading RL, we need to select items not only with large attraction probabilities but also leading to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions, and design an oracle BestPerm to efficiently find the optimal item list. 
Equipped with BestPerm, we develop two algorithms, CascadingVI and CascadingBPI, which are both computationally efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice. \ No newline at end of file diff --git a/data/2024/iclr/Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation b/data/2024/iclr/Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation new file mode 100644 index 0000000000..23b2c9b43a --- /dev/null +++ b/data/2024/iclr/Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation @@ -0,0 +1 @@ +The rapid progress in open-source large language models (LLMs) is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 language models including LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with $30\times$ lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models. Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM. \ No newline at end of file diff --git a/data/2024/iclr/Cauchy-Schwarz Divergence Information Bottleneck for Regression b/data/2024/iclr/Cauchy-Schwarz Divergence Information Bottleneck for Regression new file mode 100644 index 0000000000..41f433295f --- /dev/null +++ b/data/2024/iclr/Cauchy-Schwarz Divergence Information Bottleneck for Regression @@ -0,0 +1 @@ +The information bottleneck (IB) approach is a popular way to improve the generalization, robustness and explainability of deep neural networks. Essentially, it aims to find a minimum sufficient representation $\mathbf{t}$ by striking a trade-off between a compression term $I(\mathbf{x};\mathbf{t})$ and a prediction term $I(y;\mathbf{t})$, where $I(\cdot;\cdot)$ refers to the mutual information (MI). For the IB, MI is mostly expressed in terms of the Kullback-Leibler (KL) divergence, which in the regression case corresponds to prediction based on mean squared error (MSE) loss under a Gaussian assumption and compression approximated by variational inference. In this paper, we study the IB principle for the regression problem and develop a new way to parameterize the IB with deep neural networks by exploiting favorable properties of the Cauchy-Schwarz (CS) divergence. 
By doing so, we move away from MSE-based regression and ease estimation by avoiding variational approximations or distributional assumptions. We investigate the improved generalization ability of our proposed CS-IB and demonstrate strong adversarial robustness guarantees. We demonstrate its superior performance on six real-world regression tasks over other popular deep IB approaches. We additionally observe that the solutions discovered by CS-IB always achieve the best trade-off between prediction accuracy and compression ratio in the information plane. The code is available at \url{https://github.com/SJYuCNEL/Cauchy-Schwarz-Information-Bottleneck}. \ No newline at end of file diff --git a/data/2024/iclr/Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework b/data/2024/iclr/Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework new file mode 100644 index 0000000000..49827052a1 --- /dev/null +++ b/data/2024/iclr/Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework @@ -0,0 +1 @@ +Fairness for machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically focuses on settings without unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and, thus, unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under different sources of unobserved confounding. This enables practitioners to examine the sensitivity of their machine learning models to unobserved confounding in fairness-critical applications. Second, we propose a novel neural framework for learning fair predictions, which allows us to offer worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study about predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under unobserved confounding. To this end, our work is of direct practical value as a refutation strategy to ensure the fairness of predictions in high-stakes applications. \ No newline at end of file diff --git a/data/2024/iclr/Causal Inference with Conditional Front-Door Adjustment and Identifiable Variational Autoencoder b/data/2024/iclr/Causal Inference with Conditional Front-Door Adjustment and Identifiable Variational Autoencoder new file mode 100644 index 0000000000..f838a74c4b --- /dev/null +++ b/data/2024/iclr/Causal Inference with Conditional Front-Door Adjustment and Identifiable Variational Autoencoder @@ -0,0 +1 @@ +An essential and challenging problem in causal inference is causal effect estimation from observational data. The problem becomes more difficult with the presence of unobserved confounding variables. The front-door adjustment is a practical approach for dealing with unobserved confounding variables. However, the restriction for the standard front-door adjustment is difficult to satisfy in practice. In this paper, we relax some of the restrictions by proposing the concept of conditional front-door (CFD) adjustment and develop the theorem that guarantees the causal effect identifiability of CFD adjustment. Furthermore, as it is often impossible for a CFD variable to be given in practice, it is desirable to learn it from data. 
By leveraging the ability of deep generative models, we propose CFDiVAE to learn the representation of the CFD adjustment variable directly from data with the identifiable Variational AutoEncoder and formally prove the model identifiability. Extensive experiments on synthetic datasets validate the effectiveness of CFDiVAE and its superiority over existing methods. The experiments also show that the performance of CFDiVAE is less sensitive to the causal strength of unobserved confounding variables. We further apply CFDiVAE to a real-world dataset to demonstrate its potential application. \ No newline at end of file diff --git a/data/2024/iclr/Causal Modelling Agents: Causal Graph Discovery through Synergising Metadata- and Data-driven Reasoning b/data/2024/iclr/Causal Modelling Agents: Causal Graph Discovery through Synergising Metadata- and Data-driven Reasoning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Causal Structure Recovery with Latent Variables under Milder Distributional and Graphical Assumptions b/data/2024/iclr/Causal Structure Recovery with Latent Variables under Milder Distributional and Graphical Assumptions new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Causal-StoNet: Causal Inference for High-Dimensional Complex Data b/data/2024/iclr/Causal-StoNet: Causal Inference for High-Dimensional Complex Data new file mode 100644 index 0000000000..94a7ee8eab --- /dev/null +++ b/data/2024/iclr/Causal-StoNet: Causal Inference for High-Dimensional Complex Data @@ -0,0 +1 @@ +With the advancement of data science, the collection of increasingly complex datasets has become commonplace. In such datasets, the data dimension can be extremely high, and the underlying data generation process can be unknown and highly nonlinear. As a result, the task of making causal inference with high-dimensional complex data has become a fundamental problem in many disciplines, such as medicine, econometrics, and social science. However, the existing methods for causal inference are frequently developed under the assumption that the data dimension is low or that the underlying data generation process is linear or approximately linear. To address these challenges, this paper proposes a novel causal inference approach for dealing with high-dimensional complex data. The proposed approach is based on deep learning techniques, including sparse deep learning theory and stochastic neural networks, that have been developed in recent literature. By using these techniques, the proposed approach can address both the high dimensionality and unknown data generation process in a coherent way. Furthermore, the proposed approach can also be used when missing values are present in the datasets. Extensive numerical studies indicate that the proposed approach outperforms existing ones. \ No newline at end of file diff --git a/data/2024/iclr/CausalLM is not optimal for in-context learning b/data/2024/iclr/CausalLM is not optimal for in-context learning new file mode 100644 index 0000000000..56a866a64d --- /dev/null +++ b/data/2024/iclr/CausalLM is not optimal for in-context learning @@ -0,0 +1 @@ +Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples from attending to future samples. 
While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follows that of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings. \ No newline at end of file diff --git a/data/2024/iclr/CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery b/data/2024/iclr/CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery new file mode 100644 index 0000000000..5988f87978 --- /dev/null +++ b/data/2024/iclr/CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery @@ -0,0 +1 @@ +Time-series causal discovery (TSCD) is a fundamental problem of machine learning. However, existing synthetic datasets cannot properly evaluate or predict the algorithms' performance on real data. This study introduces the CausalTime pipeline to generate time-series that highly resemble the real data and with ground truth causal graphs for quantitative performance evaluation. The pipeline starts from real observations in a specific scenario and produces a matching benchmark dataset. Firstly, we harness deep neural networks along with normalizing flow to accurately capture realistic dynamics. Secondly, we extract hypothesized causal graphs by performing importance analysis on the neural network or leveraging prior knowledge. Thirdly, we derive the ground truth causal graphs by splitting the causal model into causal term, residual term, and noise term. Lastly, using the fitted network and the derived causal graph, we generate corresponding versatile time-series proper for algorithm assessment. In the experiments, we validate the fidelity of the generated data through qualitative and quantitative experiments, followed by a benchmarking of existing TSCD algorithms using these generated datasets. CausalTime offers a feasible solution to evaluating TSCD algorithms in real applications and can be generalized to a wide range of fields. For easy use of the proposed approach, we also provide a user-friendly website, hosted on www.causaltime.cc. 
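To make the prefixLM versus causalLM distinction in the abstract above concrete, the short sketch below builds the two attention masks for a sequence whose first prefix_len tokens are the in-context samples; the helper names and toy sizes are assumptions for illustration only. Under the prefix-LM mask those tokens can all attend to one another, while the causal mask only ever allows attention to earlier positions.

import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where attention is allowed: position i may attend to j only if j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Start from the causal mask, then let every position attend to the whole
    # prefix, making the in-context samples fully bidirectional among themselves.
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True
    return mask

if __name__ == "__main__":
    m_causal = causal_mask(6)
    m_prefix = prefix_lm_mask(6, prefix_len=4)
    # An early in-context token (position 1) can see a later one (position 3)
    # under prefixLM but not under causalLM.
    print(m_causal[1, 3].item(), m_prefix[1, 3].item())  # False True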
\ No newline at end of file diff --git a/data/2024/iclr/Causality-Inspired Spatial-Temporal Explanations for Dynamic Graph Neural Networks b/data/2024/iclr/Causality-Inspired Spatial-Temporal Explanations for Dynamic Graph Neural Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Causally Aligned Curriculum Learning b/data/2024/iclr/Causally Aligned Curriculum Learning new file mode 100644 index 0000000000..41622b4720 --- /dev/null +++ b/data/2024/iclr/Causally Aligned Curriculum Learning @@ -0,0 +1 @@ +, \ No newline at end of file diff --git a/data/2024/iclr/CellPLM: Pre-training of Cell Language Model Beyond Single Cells b/data/2024/iclr/CellPLM: Pre-training of Cell Language Model Beyond Single Cells new file mode 100644 index 0000000000..a352edabcb --- /dev/null +++ b/data/2024/iclr/CellPLM: Pre-training of Cell Language Model Beyond Single Cells @@ -0,0 +1 @@ +The current state-of-the-art single-cell pre-trained models are greatly inspired by the success of large language models. They train transformers by treating genes as tokens and cells as sentences. However, three fundamental differences between single-cell data and natural language data are overlooked: (1) scRNA-seq data are presented as bag-of-genes instead of sequences of RNAs; (2) cell-cell relations are more intricate and important than inter-sentence relations; and (3) the quantity of single-cell data is considerably inferior to text data, and they are very noisy. In light of these characteristics, we propose a new pre-trained model, CellPLM, which takes cells as tokens and tissues as sentences. In addition, we leverage spatially-resolved transcriptomic data in pre-training to facilitate learning cell-cell relationships and introduce a Gaussian mixture prior distribution as an additional inductive bias to overcome data limitations. CellPLM is the first single-cell pre-trained transformer that encodes cell-cell relations, and it consistently outperforms existing pre-trained and non-pre-trained models in diverse downstream tasks, with 100x higher inference speed compared to existing pre-trained models. \ No newline at end of file diff --git a/data/2024/iclr/Certified Adversarial Robustness for Rate Encoded Spiking Neural Networks b/data/2024/iclr/Certified Adversarial Robustness for Rate Encoded Spiking Neural Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Chain of Hindsight aligns Language Models with Feedback b/data/2024/iclr/Chain of Hindsight aligns Language Models with Feedback new file mode 100644 index 0000000000..f610f78959 --- /dev/null +++ b/data/2024/iclr/Chain of Hindsight aligns Language Models with Feedback @@ -0,0 +1 @@ +Learning from human preferences is important for language models to match human needs and to align with human and social values. Prior works have achieved remarkable successes by learning from human feedback to understand and follow instructions. Nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, rendering them inefficient in terms of data utilization and challenging to apply in general, or they depend on reinforcement learning, which often suffers from imperfect reward functions and relies on extremely challenging optimizations. In this work, we propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. 
Our idea is inspired by how humans learn from extensive feedback presented in the form of language. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model, allowing us to take advantage of the language comprehension capabilities of language models. We condition the model on a sequence of model generations paired with feedback. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors. Applying our method to large language models, we observed that Chain of Hindsight significantly surpasses previous methods in aligning language models with human preferences. We report significant improvements on summarization and dialogue benchmarks, with our approach markedly preferred in human evaluations. \ No newline at end of file diff --git a/data/2024/iclr/Chain of Log-Concave Markov Chains b/data/2024/iclr/Chain of Log-Concave Markov Chains new file mode 100644 index 0000000000..2c330d43e1 --- /dev/null +++ b/data/2024/iclr/Chain of Log-Concave Markov Chains @@ -0,0 +1 @@ +We introduce a theoretical framework for sampling from unnormalized densities based on a smoothing scheme that uses an isotropic Gaussian kernel with a single fixed noise scale. We prove one can decompose sampling from a density (minimal assumptions made on the density) into a sequence of sampling from log-concave conditional densities via accumulation of noisy measurements with equal noise levels. Our construction is unique in that it keeps track of a history of samples, making it non-Markovian as a whole, but it is lightweight algorithmically as the history only shows up in the form of a running empirical mean of samples. Our sampling algorithm generalizes walk-jump sampling (Saremi & Hyvärinen, 2019). The "walk" phase becomes a (non-Markovian) chain of (log-concave) Markov chains. The "jump" from the accumulated measurements is obtained by empirical Bayes. We study our sampling algorithm quantitatively using the 2-Wasserstein metric and compare it with various Langevin MCMC algorithms. We also report a remarkable capacity of our algorithm to "tunnel" between modes of a distribution. \ No newline at end of file diff --git a/data/2024/iclr/Chain of Thought Empowers Transformers to Solve Inherently Serial Problems b/data/2024/iclr/Chain of Thought Empowers Transformers to Solve Inherently Serial Problems new file mode 100644 index 0000000000..0c684bcfeb --- /dev/null +++ b/data/2024/iclr/Chain of Thought Empowers Transformers to Solve Inherently Serial Problems @@ -0,0 +1 @@ +Instructing the model to generate a sequence of intermediate steps, a.k.a. a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. 
We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $ \mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers. \ No newline at end of file diff --git a/data/2024/iclr/Chain-of-Experts: When LLMs Meet Complex Operations Research Problems b/data/2024/iclr/Chain-of-Experts: When LLMs Meet Complex Operations Research Problems new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources b/data/2024/iclr/Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources new file mode 100644 index 0000000000..ecff2a32f1 --- /dev/null +++ b/data/2024/iclr/Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources @@ -0,0 +1 @@ +We present chain-of-knowledge (CoK), a novel framework that augments large language models (LLMs) by dynamically incorporating grounding information from heterogeneous sources. It results in more factual rationales and reduced hallucination in generation. Specifically, CoK consists of three stages: reasoning preparation, dynamic knowledge adapting, and answer consolidation. Given a knowledge-intensive question, CoK first prepares several preliminary rationales and answers while identifying the relevant knowledge domains. If there is no majority consensus among the answers from samples, CoK corrects the rationales step by step by adapting knowledge from the identified domains. These corrected rationales can plausibly serve as a better foundation for the final answer consolidation. Unlike prior studies that primarily use unstructured data, CoK also leverages structured knowledge sources such as Wikidata and tables that provide more reliable factual information. To access both unstructured and structured knowledge sources in the dynamic knowledge adapting stage, we propose an adaptive query generator that allows the generation of queries for various types of query languages, including SPARQL, SQL, and natural sentences. Moreover, to minimize error propagation between rationales, CoK corrects the rationales progressively using preceding corrected rationales to generate and correct subsequent rationales. Extensive experiments show that CoK consistently improves the performance of LLMs on knowledge-intensive tasks across different domains. \ No newline at end of file diff --git a/data/2024/iclr/Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding b/data/2024/iclr/Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding new file mode 100644 index 0000000000..20fe1d39d1 --- /dev/null +++ b/data/2024/iclr/Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding @@ -0,0 +1 @@ +Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. 
Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices. \ No newline at end of file diff --git a/data/2024/iclr/Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning b/data/2024/iclr/Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning new file mode 100644 index 0000000000..e2ace01d0e --- /dev/null +++ b/data/2024/iclr/Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning @@ -0,0 +1 @@ +The integration of machine learning (ML) in numerous critical applications introduces a range of privacy concerns for individuals who provide their datasets for model training. One such privacy risk is Membership Inference (MI), in which an attacker seeks to determine whether a particular data sample was included in the training dataset of a model. Current state-of-the-art MI attacks capitalize on access to the model's predicted confidence scores to successfully perform membership inference, and employ data poisoning to further enhance their effectiveness. In this work, we focus on the less explored and more realistic label-only setting, where the model provides only the predicted label on a queried sample. We show that existing label-only MI attacks are ineffective at inferring membership in the low False Positive Rate (FPR) regime. To address this challenge, we propose a new attack Chameleon that leverages a novel adaptive data poisoning strategy and an efficient query selection method to achieve significantly more accurate membership inference than existing label-only attacks, especially at low FPRs. \ No newline at end of file diff --git a/data/2024/iclr/Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words b/data/2024/iclr/Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words new file mode 100644 index 0000000000..31a59ffc17 --- /dev/null +++ b/data/2024/iclr/Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words @@ -0,0 +1 @@ +Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. 
In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a learnable channel embedding that is added to the patch tokens, similar to positional embeddings. We evaluate the performance of ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat (satellite imaging). Our results show that ChannelViT outperforms ViT on classification tasks and generalizes well, even when a subset of input channels is used during testing. Across our experiments, HCS proves to be a powerful regularizer, independent of the architecture employed, suggesting itself as a straightforward technique for robust ViT training. Lastly, we find that ChannelViT generalizes effectively even when there is limited access to all channels during training, highlighting its potential for multi-channel imaging under real-world conditions with sparse sensors. Our code is available at https://github.com/insitro/ChannelViT. \ No newline at end of file diff --git a/data/2024/iclr/ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate b/data/2024/iclr/ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate new file mode 100644 index 0000000000..72cfaf7987 --- /dev/null +++ b/data/2024/iclr/ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate @@ -0,0 +1 @@ +Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval. \ No newline at end of file diff --git a/data/2024/iclr/Circuit Component Reuse Across Tasks in Transformer Language Models b/data/2024/iclr/Circuit Component Reuse Across Tasks in Transformer Language Models new file mode 100644 index 0000000000..53973d7080 --- /dev/null +++ b/data/2024/iclr/Circuit Component Reuse Across Tasks in Transformer Language Models @@ -0,0 +1 @@ +Recent work in mechanistic interpretability has shown that behaviors in language models can be successfully reverse-engineered through circuit analysis. 
A common criticism, however, is that each circuit is task-specific, and thus such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and 1) show that it reproduces on a larger GPT2 model, and 2) that it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the process underlying both tasks is functionally very similar, and contains about a 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in middle layers in order to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy from 49.6% to 93.7% on the Colored Objects task and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain large language models' behavior in terms of a relatively small number of interpretable task-general algorithmic building blocks and computational components. \ No newline at end of file diff --git a/data/2024/iclr/CircuitNet 2.0: An Advanced Dataset for Promoting Machine Learning Innovations in Realistic Chip Design Environment b/data/2024/iclr/CircuitNet 2.0: An Advanced Dataset for Promoting Machine Learning Innovations in Realistic Chip Design Environment new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Circumventing Concept Erasure Methods For Text-To-Image Generative Models b/data/2024/iclr/Circumventing Concept Erasure Methods For Text-To-Image Generative Models new file mode 100644 index 0000000000..77c44445be --- /dev/null +++ b/data/2024/iclr/Circumventing Concept Erasure Methods For Text-To-Image Generative Models @@ -0,0 +1 @@ +Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.
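The attack above hinges on one concrete mechanism: optimizing a new word embedding against a frozen model until the supposedly erased concept reappears. A minimal sketch of that loop, assuming a toy random linear map as the frozen encoder and a random vector as the target concept (both hypothetical stand-ins, not the paper's actual models):

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: a frozen "text encoder" and the embedding of the
# concept the sanitized model was supposed to forget.
dim = 64
encoder = torch.nn.Linear(dim, dim)
for p in encoder.parameters():
    p.requires_grad_(False)                          # model weights stay untouched
target = F.normalize(torch.randn(dim), dim=0)        # embedding of the "erased" concept

# Learn a single soft token embedding so that the frozen encoder maps it close
# to the target concept, i.e. the concept is recovered without changing weights.
soft_token = torch.nn.Parameter(torch.randn(dim) * 0.01)
opt = torch.optim.Adam([soft_token], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    out = F.normalize(encoder(soft_token), dim=0)
    loss = 1.0 - torch.dot(out, target)              # maximize cosine similarity
    loss.backward()
    opt.step()

print(f"final cosine similarity: {1.0 - loss.item():.3f}")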
\ No newline at end of file diff --git a/data/2024/iclr/CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents b/data/2024/iclr/CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents new file mode 100644 index 0000000000..e9c3402181 --- /dev/null +++ b/data/2024/iclr/CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents @@ -0,0 +1 @@ +The generalization of decision-making agents encompasses two fundamental elements: learning from past experiences and reasoning in novel contexts. However, the predominant emphasis in most interactive environments is on learning, often at the expense of complexity in reasoning. In this paper, we introduce CivRealm, an environment inspired by the Civilization game. Civilization's profound alignment with human history and society necessitates sophisticated learning, while its ever-changing situations demand strong reasoning to generalize. Particularly, CivRealm sets up an imperfect-information general-sum game with a changing number of players; it presents a plethora of complex features, challenging the agent to deal with open-ended stochastic environments that require diplomacy and negotiation skills. Within CivRealm, we provide interfaces for two typical agent types: tensor-based agents that focus on learning, and language-based agents that emphasize reasoning. To catalyze further research, we present initial results for both paradigms. The canonical RL-based agents exhibit reasonable performance in mini-games, whereas both RL- and LLM-based agents struggle to make substantial progress in the full game. Overall, CivRealm stands as a unique learning and reasoning challenge for decision-making agents. The code is available at https://github.com/bigai-ai/civrealm. \ No newline at end of file diff --git a/data/2024/iclr/Class Probability Matching with Calibrated Networks for Label Shift Adaption b/data/2024/iclr/Class Probability Matching with Calibrated Networks for Label Shift Adaption new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Classification with Conceptual Safeguards b/data/2024/iclr/Classification with Conceptual Safeguards new file mode 100644 index 0000000000..a0bfa33e74 --- /dev/null +++ b/data/2024/iclr/Classification with Conceptual Safeguards @@ -0,0 +1 @@ +We propose a new approach to promote safety in classification tasks with established concepts. Our approach – called a conceptual safeguard – acts as a verification layer for models that predict a target outcome by first predicting the presence of intermediate concepts. Given this architecture, a safeguard ensures that a model meets a minimal level of accuracy by abstaining from uncertain predictions. In contrast to a standard selective classifier, a safeguard provides an avenue to improve coverage by allowing a human to confirm the presence of uncertain concepts on instances on which it abstains. We develop methods to build safeguards that maximize coverage without compromising safety, namely techniques to propagate the uncertainty in concept predictions and to flag salient concepts for human review. 
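A minimal sketch of the safeguard logic just described, with hypothetical concept weights, an illustrative confidence threshold, and Monte Carlo propagation of concept uncertainty standing in for whatever procedure the paper actually uses:

import numpy as np

rng = np.random.default_rng(0)

# Toy safeguard: concept probabilities come from an upstream model, the target
# is predicted from concepts with fixed (made-up) weights, and the safeguard
# abstains when the propagated uncertainty leaves the prediction too unsure.
w, b = np.array([2.0, -1.5, 1.0]), -0.5
tau = 0.80                                  # minimum confidence to make a prediction

def predict_with_safeguard(concept_probs, confirmed=None):
    p = concept_probs.copy()
    if confirmed:                           # human resolves uncertain concepts to 0 or 1
        for idx, val in confirmed.items():
            p[idx] = val
    # Propagate concept uncertainty by sampling concept vectors.
    samples = (rng.random((2000, p.size)) < p).astype(float)
    scores = 1.0 / (1.0 + np.exp(-(samples @ w + b)))
    pos = float(np.mean(scores > 0.5))      # fraction of samples voting "positive"
    conf = max(pos, 1.0 - pos)
    label = int(pos > 0.5)
    return (label, conf) if conf >= tau else ("abstain", conf)

x = np.array([0.9, 0.55, 0.6])              # one uncertain concept (index 1)
print(predict_with_safeguard(x))                        # abstains
print(predict_with_safeguard(x, confirmed={1: 0.0}))    # human confirmation restores coverage

Once the uncertain concept is confirmed, the same instance clears the threshold, which is the coverage-recovery behaviour described above.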
We benchmark our approach on a collection of real-world and synthetic datasets, showing that it can improve performance and coverage in deep learning tasks \ No newline at end of file diff --git a/data/2024/iclr/Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform b/data/2024/iclr/Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform new file mode 100644 index 0000000000..bd689f7ff3 --- /dev/null +++ b/data/2024/iclr/Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform @@ -0,0 +1 @@ +Distributed Deep Reinforcement Learning (DRL) aims to leverage more computational resources to train autonomous agents with less training time. Despite recent progress in the field, reproducibility issues have not been sufficiently explored. This paper first shows that the typical actor-learner framework can have reproducibility issues even if hyperparameters are controlled. We then introduce Cleanba, a new open-source platform for distributed DRL that proposes a highly reproducible architecture. Cleanba implements highly optimized distributed variants of PPO and IMPALA. Our Atari experiments show that these variants can obtain equivalent or higher scores than strong IMPALA baselines in moolib and torchbeast and PPO baseline in CleanRL. However, Cleanba variants present 1) shorter training time and 2) more reproducible learning curves in different hardware settings. Cleanba's source code is available at \url{https://github.com/vwxyzjn/cleanba} \ No newline at end of file diff --git a/data/2024/iclr/Clifford Group Equivariant Simplicial Message Passing Networks b/data/2024/iclr/Clifford Group Equivariant Simplicial Message Passing Networks new file mode 100644 index 0000000000..08278b4e9d --- /dev/null +++ b/data/2024/iclr/Clifford Group Equivariant Simplicial Message Passing Networks @@ -0,0 +1 @@ +We introduce Clifford Group Equivariant Simplicial Message Passing Networks, a method for steerable E(n)-equivariant message passing on simplicial complexes. Our method integrates the expressivity of Clifford group-equivariant layers with simplicial message passing, which is topologically more intricate than regular graph message passing. Clifford algebras include higher-order objects such as bivectors and trivectors, which express geometric features (e.g., areas, volumes) derived from vectors. Using this knowledge, we represent simplex features through geometric products of their vertices. To achieve efficient simplicial message passing, we share the parameters of the message network across different dimensions. Additionally, we restrict the final message to an aggregation of the incoming messages from different dimensions, leading to what we term shared simplicial message passing. Experimental results show that our method is able to outperform both equivariant and simplicial graph neural networks on a variety of geometric tasks. \ No newline at end of file diff --git a/data/2024/iclr/Closing the Curious Case of Neural Text Degeneration b/data/2024/iclr/Closing the Curious Case of Neural Text Degeneration new file mode 100644 index 0000000000..57f643a6a6 --- /dev/null +++ b/data/2024/iclr/Closing the Curious Case of Neural Text Degeneration @@ -0,0 +1 @@ +Despite their ubiquity in language generation, it remains unknown why truncation sampling heuristics like nucleus sampling are so effective. 
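For readers unfamiliar with these heuristics, threshold-style and nucleus-style truncation can be sketched in a few lines over a toy next-token distribution (the numbers below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def truncate_threshold(probs, eps=0.05):
    """Keep tokens whose probability exceeds eps, then renormalize."""
    out = np.where(probs > eps, probs, 0.0)
    return out / out.sum()

def truncate_nucleus(probs, top_p=0.9):
    """Keep the smallest set of highest-probability tokens covering top_p mass."""
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = np.zeros_like(probs, dtype=bool)
    keep[order[:cutoff]] = True
    out = np.where(keep, probs, 0.0)
    return out / out.sum()

probs = np.array([0.45, 0.30, 0.15, 0.06, 0.03, 0.01])   # toy next-token distribution
print(truncate_threshold(probs))
print(truncate_nucleus(probs))
print(rng.choice(len(probs), p=truncate_nucleus(probs)))  # sample a token from the truncated distribution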
We provide a theoretical explanation for the effectiveness of truncation sampling by proving that truncation methods that discard tokens below some probability threshold (the most common type of truncation) can guarantee that all sampled tokens have nonzero true probability. However, thresholds are a coarse heuristic, and necessarily discard some tokens with nonzero true probability as well. In pursuit of a more precise sampling strategy, we show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability, without relying on a threshold. Based on our findings, we develop an experimental truncation strategy and present pilot studies demonstrating the promise of this type of algorithm. Our evaluations show that our method outperforms its threshold-based counterparts under automatic and human evaluation metrics for low-entropy (i.e., close to greedy) open-ended text generation. Our theoretical findings and pilot experiments provide both insight into why truncation sampling works, and make progress toward more expressive sampling algorithms that better surface the generative capabilities of large language models. \ No newline at end of file diff --git a/data/2024/iclr/Closing the Gap between TD Learning and Supervised Learning - A Generalisation Point of View b/data/2024/iclr/Closing the Gap between TD Learning and Supervised Learning - A Generalisation Point of View new file mode 100644 index 0000000000..c5ce621c60 --- /dev/null +++ b/data/2024/iclr/Closing the Gap between TD Learning and Supervised Learning - A Generalisation Point of View @@ -0,0 +1 @@ +Some reinforcement learning (RL) algorithms can stitch pieces of experience to solve a task never seen before during training. This oft-sought property is one of the few ways in which RL methods based on dynamic programming differ from RL methods based on supervised learning (SL). Yet, certain RL methods based on off-the-shelf SL algorithms achieve excellent results without an explicit mechanism for stitching; it remains unclear whether those methods forgo this important stitching property. This paper studies this question for the problems of achieving a target goal state and achieving a target return value. Our main result is to show that the stitching property corresponds to a form of combinatorial generalization: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen together in the training data. Our analysis shows that this sort of generalization is different from i.i.d. generalization. This connection between stitching and generalisation reveals why we should not expect SL-based RL methods to perform stitching, even in the limit of large datasets and models. Based on this analysis, we construct new datasets to explicitly test for this property, revealing that SL-based methods lack this stitching property and hence fail to perform combinatorial generalization. Nonetheless, the connection between stitching and combinatorial generalisation also suggests a simple remedy for improving generalisation in SL: data augmentation. We propose a temporal data augmentation and demonstrate that adding it to SL-based methods enables them to successfully complete tasks not seen together during training. On a high level, this connection illustrates the importance of combinatorial generalization for data efficiency in time-series data for tasks beyond RL, like audio, video, or text.
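The temporal data augmentation is described only at a high level above; a hindsight-style goal relabelling over toy trajectories gives the flavour, though the paper's exact augmentation may differ:

import numpy as np

rng = np.random.default_rng(0)

def temporal_relabel(trajectories, num_pairs=6):
    """Pair each sampled state with a goal drawn from a *later* state of the same
    trajectory, so the learner sees (state, goal) combinations that never appeared
    as labelled pairs in the raw data."""
    pairs = []
    for _ in range(num_pairs):
        traj = trajectories[rng.integers(len(trajectories))]
        t = rng.integers(len(traj) - 1)
        t_goal = rng.integers(t + 1, len(traj))     # strictly in the future
        pairs.append((traj[t], traj[t_goal]))
    return pairs

# Toy 1-D trajectories (each entry is a state).
trajs = [np.arange(0, 5), np.arange(4, 10)]
for state, goal in temporal_relabel(trajs):
    print(f"state={state}, goal={goal}")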
\ No newline at end of file diff --git a/data/2024/iclr/CoBIT: A Contrastive Bi-directional Image-Text Generation Model b/data/2024/iclr/CoBIT: A Contrastive Bi-directional Image-Text Generation Model new file mode 100644 index 0000000000..22e337ff4d --- /dev/null +++ b/data/2024/iclr/CoBIT: A Contrastive Bi-directional Image-Text Generation Model @@ -0,0 +1 @@ +The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with contrastive objective like CLIP, image-to-text generative objective like PaLI, or text-to-image generative objective like Parti. However, the three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement each other as contrasting provides global alignment capacity and generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generations. CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios. For instance, 82.7% in zero-shot ImageNet classification, 9.37 FID score in zero-shot text-to-image generation and 44.8 CIDEr in zero-shot captioning. \ No newline at end of file diff --git a/data/2024/iclr/CoLiDE: Concomitant Linear DAG Estimation b/data/2024/iclr/CoLiDE: Concomitant Linear DAG Estimation new file mode 100644 index 0000000000..43ac8c0786 --- /dev/null +++ b/data/2024/iclr/CoLiDE: Concomitant Linear DAG Estimation @@ -0,0 +1 @@ +We deal with the combinatorial problem of learning directed acyclic graph (DAG) structure from observational data adhering to a linear structural equation model (SEM). Leveraging advances in differentiable, nonconvex characterizations of acyclicity, recent efforts have advocated a continuous constrained optimization paradigm to efficiently explore the space of DAGs. Most existing methods employ lasso-type score functions to guide this search, which (i) require expensive penalty parameter retuning when the $\textit{unknown}$ SEM noise variances change across problem instances; and (ii) implicitly rely on limiting homoscedasticity assumptions. In this work, we propose a new convex score function for sparsity-aware learning of linear DAGs, which incorporates concomitant estimation of scale and thus effectively decouples the sparsity parameter from the exogenous noise levels. Regularization via a smooth, nonconvex acyclicity penalty term yields CoLiDE ($\textbf{Co}$ncomitant $\textbf{Li}$near $\textbf{D}$AG $\textbf{E}$stimation), a regression-based criterion amenable to efficient gradient computation and closed-form estimation of noise variances in heteroscedastic scenarios. Our algorithm outperforms state-of-the-art methods without incurring added complexity, especially when the DAGs are larger and the noise level profile is heterogeneous. 
We also find CoLiDE exhibits enhanced stability manifested via reduced standard deviations in several domain-specific metrics, underscoring the robustness of our novel linear DAG estimator. \ No newline at end of file diff --git a/data/2024/iclr/CoRe-GD: A Hierarchical Framework for Scalable Graph Visualization with GNNs b/data/2024/iclr/CoRe-GD: A Hierarchical Framework for Scalable Graph Visualization with GNNs new file mode 100644 index 0000000000..82c73afc19 --- /dev/null +++ b/data/2024/iclr/CoRe-GD: A Hierarchical Framework for Scalable Graph Visualization with GNNs @@ -0,0 +1 @@ +Graph Visualization, also known as Graph Drawing, aims to find geometric embeddings of graphs that optimize certain criteria. Stress is a widely used metric; stress is minimized when every pair of nodes is positioned at their shortest path distance. However, stress optimization presents computational challenges due to its inherent complexity and is usually solved using heuristics in practice. We introduce a scalable Graph Neural Network (GNN) based Graph Drawing framework with sub-quadratic runtime that can learn to optimize stress. Inspired by classical stress optimization techniques and force-directed layout algorithms, we create a coarsening hierarchy for the input graph. Beginning at the coarsest level, we iteratively refine and un-coarsen the layout, until we generate an embedding for the original graph. To enhance information propagation within the network, we propose a novel positional rewiring technique based on intermediate node positions. Our empirical evaluation demonstrates that the framework achieves state-of-the-art performance while remaining scalable. \ No newline at end of file diff --git a/data/2024/iclr/CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding b/data/2024/iclr/CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding new file mode 100644 index 0000000000..431cf5310a --- /dev/null +++ b/data/2024/iclr/CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding @@ -0,0 +1 @@ +3D visual grounding is the ability to localize objects in 3D scenes conditioned on utterances. Most existing methods devote the referring head to localizing the referred object directly, causing failure in complex scenarios. In addition, this design does not illustrate how and why the network reaches the final decision. In this paper, we address the question: Can we design an interpretable 3D visual grounding framework that has the potential to mimic the human perception system? To this end, we formulate the 3D visual grounding problem as a sequence-to-sequence Seq2Seq task by first predicting a chain of anchors and then the final target. Interpretability not only improves the overall performance but also helps us identify failure cases. Following the chain-of-thoughts approach enables us to decompose the referring task into interpretable intermediate steps, boosting the performance and making our framework extremely data-efficient. Moreover, our proposed framework can be easily integrated into any existing architecture. We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and ScanRefer benchmarks and show consistent performance gains compared to existing methods without requiring manually annotated data. Furthermore, our proposed framework, dubbed CoT3DRef, is significantly data-efficient: on the Sr3D dataset, when trained on only 10% of the data, we match the SOTA performance obtained by training on the entire dataset.
The code is available at https://eslambakr.github.io/cot3dref.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding b/data/2024/iclr/CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding new file mode 100644 index 0000000000..ab5910070c --- /dev/null +++ b/data/2024/iclr/CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding @@ -0,0 +1 @@ +A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. The vision-to-language and language-to-vision communication are iteratively performed until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on ARO top-1 accuracy). We also achieve state-of-the-art performances on traditional vision-language tasks such as referring expression comprehension and visual question answering. \ No newline at end of file diff --git a/data/2024/iclr/Code Representation Learning at Scale b/data/2024/iclr/Code Representation Learning at Scale new file mode 100644 index 0000000000..7344ed8124 --- /dev/null +++ b/data/2024/iclr/Code Representation Learning at Scale @@ -0,0 +1 @@ +Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most of the existing works on code representation learning train models at a hundred million parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masking language modeling and the structural aspects of programming languages. We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks by large margins.
To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts the cross-lingual semantic search performance; and (iv) how the pretraining scheme affects how downstream task performance scales with model size. \ No newline at end of file diff --git a/data/2024/iclr/CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules b/data/2024/iclr/CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules new file mode 100644 index 0000000000..923a2dc54f --- /dev/null +++ b/data/2024/iclr/CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules @@ -0,0 +1 @@ +Large Language Models (LLMs) have already become quite proficient at solving simpler programming tasks like those in the HumanEval or MBPP benchmarks. However, solving more complex and competitive programming tasks is still quite challenging for these models - possibly due to their tendency to generate solutions as monolithic code blocks instead of decomposing them into logical sub-tasks and sub-modules. On the other hand, experienced programmers instinctively write modularized code with abstraction for solving complex tasks, often reusing previously developed modules. To address this gap, we propose CodeChain, a novel framework for inference that elicits modularized code generation through a chain of self-revisions, each being guided by some representative sub-modules generated in previous iterations. Concretely, CodeChain first instructs the LLM to generate modularized code through chain-of-thought prompting. Then it applies a chain of self-revisions by iterating the two steps: 1) extracting and clustering the generated sub-modules and selecting the cluster representatives as the more generic and re-usable implementations, and 2) augmenting the original chain-of-thought prompt with these selected module-implementations and instructing the LLM to re-generate new modularized solutions. We find that by naturally encouraging the LLM to reuse the previously developed and verified sub-modules, CodeChain can significantly boost both modularity as well as correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is shown to be effective on both OpenAI LLMs as well as open-sourced LLMs like WizardCoder. We also conduct comprehensive ablation studies with different methods of prompting, number of clusters, model sizes, program qualities, etc., to provide useful insights that underpin CodeChain's success.
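Step 1) of the self-revision loop, clustering the extracted sub-modules and keeping one representative per cluster, can be sketched as follows; the hashed-bigram embedding is a toy stand-in for whatever code representation the authors actually use:

import numpy as np
from sklearn.cluster import KMeans

def embed(code, dim=64):
    """Toy code embedding: hashed character-bigram counts."""
    v = np.zeros(dim)
    for a, b in zip(code, code[1:]):
        v[(ord(a) * 31 + ord(b)) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Hypothetical sub-modules extracted from several sampled solutions.
submodules = [
    "def gcd(a, b):\n    while b: a, b = b, a % b\n    return a",
    "def gcd(x, y):\n    return x if y == 0 else gcd(y, x % y)",
    "def is_prime(n):\n    return n > 1 and all(n % k for k in range(2, int(n**0.5) + 1))",
    "def prime(n):\n    return n > 1 and all(n % d != 0 for d in range(2, n))",
]

X = np.stack([embed(s) for s in submodules])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Per cluster, keep the sub-module closest to the centroid as its representative,
# which would then be fed back into the revision prompt.
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    rep = submodules[members[np.argmin(dists)]]
    print(f"cluster {c} representative:\n{rep}\n")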
\ No newline at end of file diff --git a/data/2024/iclr/Coeditor: Leveraging Repo-level Diffs for Code Auto-editing b/data/2024/iclr/Coeditor: Leveraging Repo-level Diffs for Code Auto-editing new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Combinatorial Bandits for Maximum Value Reward Function under Value-Index Feedback b/data/2024/iclr/Combinatorial Bandits for Maximum Value Reward Function under Value-Index Feedback new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Combining Axes Preconditioners through Kronecker Approximation for Deep Learning b/data/2024/iclr/Combining Axes Preconditioners through Kronecker Approximation for Deep Learning new file mode 100644 index 0000000000..768f95206f --- /dev/null +++ b/data/2024/iclr/Combining Axes Preconditioners through Kronecker Approximation for Deep Learning @@ -0,0 +1 @@ +Adaptive regularization based optimization methods such as full-matrix Adagrad, which use gradient second-moment information, hold significant potential for fast convergence in deep neural network (DNN) training, but are memory intensive and computationally demanding for large neural nets. We develop a technique called Combining AxeS PReconditioners (CASPR), which optimizes matrix-shaped DNN parameters by finding different preconditioners for each mode/axis of the parameter and combining them using a Kronecker-sum based approximation. The Kronecker-sum based combination allows us to show that CASPR is ordered between a well-known Kronecker product based combination, Shampoo, and full-matrix Adagrad preconditioners in Loewner order; as a result, it is nearer to full-matrix Adagrad than Shampoo. We also show tighter convergence guarantees in stochastic optimization compared to Shampoo. Furthermore, our experiments demonstrate that CASPR approximates the gradient second-moment matrix in full-matrix Adagrad more accurately, and shows significant improvement in training and generalization performance compared to existing practical adaptive regularization based methods such as Shampoo and Adam in a variety of tasks, including a graph neural network on OGBG-molpcba, a Transformer on a universal dependencies dataset, and auto-regressive large language modeling on the C4 dataset. \ No newline at end of file diff --git a/data/2024/iclr/Communication-Efficient Federated Non-Linear Bandit Optimization b/data/2024/iclr/Communication-Efficient Federated Non-Linear Bandit Optimization new file mode 100644 index 0000000000..9f8ed63498 --- /dev/null +++ b/data/2024/iclr/Communication-Efficient Federated Non-Linear Bandit Optimization @@ -0,0 +1 @@ +Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g. mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function classes with bounded RKHS norm, which severely hinders its practical usage.
In this paper, we propose a new algorithm, named Fed-GO-UCB, for federated bandit optimization with a generic non-linear objective function. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve a sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are a distributed regression oracle and an individual confidence set construction, which can be of independent interest. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm. \ No newline at end of file diff --git a/data/2024/iclr/Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates b/data/2024/iclr/Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates new file mode 100644 index 0000000000..387c0a430c --- /dev/null +++ b/data/2024/iclr/Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates @@ -0,0 +1 @@ +Distributed and federated learning algorithms and techniques have been associated primarily with minimization problems. However, with the rise of minimax optimization and variational inequality problems in machine learning, the necessity of designing efficient distributed/federated learning approaches for these problems is becoming more apparent. In this paper, we provide a unified convergence analysis of communication-efficient local training methods for distributed variational inequality problems (VIPs). Our approach is based on a general key assumption on the stochastic estimates that allows us to propose and analyze several novel local training algorithms under a single framework for solving a class of structured non-monotone VIPs. We present the first local gradient descent-accent algorithms with provable improved communication complexity for solving distributed variational inequalities on heterogeneous data. The general algorithmic framework recovers state-of-the-art algorithms and their sharp convergence guarantees when the setting is specialized to minimization or minimax optimization problems. Finally, we demonstrate the strong performance of the proposed algorithms compared to state-of-the-art methods when solving federated minimax optimization problems. \ No newline at end of file diff --git a/data/2024/iclr/CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models b/data/2024/iclr/CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models new file mode 100644 index 0000000000..77871e4315 --- /dev/null +++ b/data/2024/iclr/CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models @@ -0,0 +1 @@ +A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs.
Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities. \ No newline at end of file diff --git a/data/2024/iclr/Complete and Efficient Graph Transformers for Crystal Material Property Prediction b/data/2024/iclr/Complete and Efficient Graph Transformers for Crystal Material Property Prediction new file mode 100644 index 0000000000..faa42b91d3 --- /dev/null +++ b/data/2024/iclr/Complete and Efficient Graph Transformers for Crystal Material Property Prediction @@ -0,0 +1 @@ +Crystal structures are characterized by atomic bases within a primitive unit cell that repeats along a regular lattice throughout 3D space. The periodic and infinite nature of crystals poses unique challenges for geometric graph representation learning. Specifically, constructing graphs that effectively capture the complete geometric information of crystals and handle chiral crystals remains an unsolved and challenging problem. In this paper, we introduce a novel approach that utilizes the periodic patterns of unit cells to establish the lattice-based representation for each atom, enabling efficient and expressive graph representations of crystals. Furthermore, we propose ComFormer, a SE(3) transformer designed specifically for crystalline materials. ComFormer includes two variants; namely, iComFormer that employs invariant geometric descriptors of Euclidean distances and angles, and eComFormer that utilizes equivariant vector representations. Experimental results demonstrate the state-of-the-art predictive accuracy of ComFormer variants on various tasks across three widely-used crystal benchmarks. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS). \ No newline at end of file diff --git a/data/2024/iclr/Complex priors and flexible inference in recurrent circuits with dendritic nonlinearities b/data/2024/iclr/Complex priors and flexible inference in recurrent circuits with dendritic nonlinearities new file mode 100644 index 0000000000..21afbd0ef3 --- /dev/null +++ b/data/2024/iclr/Complex priors and flexible inference in recurrent circuits with dendritic nonlinearities @@ -0,0 +1 @@ +Despite many successful examples in which probabilistic inference can account for perception, we have little understanding of how the brain represents and uses structured priors that capture the complexity of natural input statistics. 
Here we construct a recurrent circuit model that can implicitly represent priors over latent variables, and combine them with sensory and contextual sources of information to encode task-specific posteriors. Inspired by the recent success of diffusion models as means of learning and using priors over images, our model uses dendritic nonlinearities optimized for denoising, and stochastic somatic integration with the degree of noise modulated by an oscillating global signal. Combining these elements into a recurrent network yields a dynamical system that samples from the prior at a rate prescribed by the period of the global oscillator. Additional inputs reflecting sensory or top-down contextual information alter these dynamics to generate samples from the corresponding posterior, with different input gating patterns selecting different inference tasks. We demonstrate that this architecture can sample from low dimensional nonlinear manifolds and multimodal posteriors. Overall, the model provides a new framework for circuit-level representation of probabilistic information, in a format that facilitates flexible inference. \ No newline at end of file diff --git a/data/2024/iclr/Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis b/data/2024/iclr/Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis new file mode 100644 index 0000000000..e10e7b4513 --- /dev/null +++ b/data/2024/iclr/Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis @@ -0,0 +1 @@ +Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce \textit{depth disentanglement training} to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce \textit{soft guidance}, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, \textsc{Compose and Conquer (CnC)}, unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/ \ No newline at end of file diff --git a/data/2024/iclr/Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization b/data/2024/iclr/Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization new file mode 100644 index 0000000000..59b35d2dd7 --- /dev/null +++ b/data/2024/iclr/Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization @@ -0,0 +1 @@ +We investigate composed image retrieval with text feedback. Users gradually look for the target of interest by moving from coarse to fine-grained feedback. 
However, existing methods merely focus on the latter, i.e., fine-grained search, by harnessing positive and negative pairs during training. This pair-based paradigm only considers the one-to-one distance between a pair of specific points, which is not aligned with the one-to-many coarse-grained retrieval process and compromises the recall rate. In an attempt to fill this gap, we introduce a unified learning approach to simultaneously modeling the coarse- and fine-grained retrieval by considering the multi-grained uncertainty. The key idea underpinning the proposed method is to integrate fine- and coarse-grained retrieval as matching data points with small and large fluctuations, respectively. Specifically, our method contains two modules: uncertainty modeling and uncertainty regularization. (1) The uncertainty modeling simulates the multi-grained queries by introducing identically distributed fluctuations in the feature space. (2) Based on the uncertainty modeling, we further introduce uncertainty regularization to adapt the matching objective according to the fluctuation range. Compared with existing methods, the proposed strategy explicitly prevents the model from pushing away potential candidates in the early stage, and thus improves the recall rate. On the three public datasets, i.e., FashionIQ, Fashion200k, and Shoes, the proposed method has achieved +4.03%, +3.38%, and +2.40% Recall@50 accuracy over a strong baseline, respectively. \ No newline at end of file diff --git a/data/2024/iclr/Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning b/data/2024/iclr/Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning new file mode 100644 index 0000000000..703c9ee277 --- /dev/null +++ b/data/2024/iclr/Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning @@ -0,0 +1 @@ +Offline reinforcement learning (RL) is a compelling framework for learning optimal policies from past experiences without additional interaction with the environment. Nevertheless, offline RL inevitably faces the problem of distributional shifts, where the states and actions encountered during policy execution may not be in the training dataset distribution. A common solution involves incorporating conservatism into the policy or the value function to safeguard against uncertainties and unknowns. In this work, we focus on achieving the same objectives of conservatism but from a different perspective. We propose COmpositional COnservatism with Anchor-seeking (COCOA) for offline RL, an approach that pursues conservatism in a compositional manner on top of the transductive reparameterization (Netanyahu et al., 2023), which decomposes the input variable (the state in our case) into an anchor and its difference from the original input. Our COCOA seeks both in-distribution anchors and differences by utilizing the learned reverse dynamics model, encouraging conservatism in the compositional input space for the policy or value function. Such compositional conservatism is independent of and agnostic to the prevalent behavioral conservatism in offline RL. We apply COCOA to four state-of-the-art offline RL algorithms and evaluate them on the D4RL benchmark, where COCOA generally improves the performance of each algorithm. The code is available at https://github.com/runamu/compositional-conservatism. 
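The anchor-plus-difference reparameterization at the core of COCOA can be sketched schematically; here the anchor seeker is a plain nearest-neighbour lookup over dataset states and the policy is a random stub, whereas the paper learns anchor seeking with a reverse dynamics model:

import numpy as np

rng = np.random.default_rng(0)

# Transductive decomposition: a state s is rewritten as an in-distribution
# anchor plus a difference, s = anchor + delta, and the policy consumes
# (anchor, delta) instead of s.
dataset_states = rng.normal(size=(500, 4))           # states seen in the offline data
W = rng.normal(size=(2, 8))                          # stand-in policy weights

def decompose(state):
    idx = np.argmin(np.linalg.norm(dataset_states - state, axis=1))
    anchor = dataset_states[idx]
    return anchor, state - anchor

def policy(anchor, delta):
    return np.tanh(W @ np.concatenate([anchor, delta]))

s = rng.normal(size=4) * 2.0                         # possibly out-of-distribution state
anchor, delta = decompose(s)
print("anchor:", np.round(anchor, 2), "delta:", np.round(delta, 2))
print("action:", np.round(policy(anchor, delta), 3))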
\ No newline at end of file diff --git a/data/2024/iclr/Compositional Generative Inverse Design b/data/2024/iclr/Compositional Generative Inverse Design new file mode 100644 index 0000000000..557f5f025b --- /dev/null +++ b/data/2024/iclr/Compositional Generative Inverse Design @@ -0,0 +1 @@ +Inverse design, where we seek to design input variables in order to optimize an underlying objective function, is an important problem that arises in fields ranging from mechanical engineering to aerospace engineering. Inverse design is typically formulated as an optimization problem, with recent works leveraging optimization across learned dynamics models. However, as models are optimized, they tend to fall into adversarial modes, preventing effective sampling. We illustrate that by instead optimizing over the learned energy function captured by the diffusion model, we can avoid such adversarial examples and significantly improve design performance. We further illustrate how such a design system is compositional, enabling us to combine multiple different diffusion models representing subcomponents of our desired system to design systems with every specified component. In an N-body interaction task and a challenging 2D multi-airfoil design task, we demonstrate that by composing the learned diffusion model at test time, our method allows us to design initial states and boundary shapes that are more complex than those in the training data. Our method generalizes to more objects for the N-body dataset and discovers formation flying to minimize drag in the multi-airfoil design task. Project website and code can be found at https://github.com/AI4Science-WestlakeU/cindm. \ No newline at end of file diff --git a/data/2024/iclr/Compositional Preference Models for Aligning LMs b/data/2024/iclr/Compositional Preference Models for Aligning LMs new file mode 100644 index 0000000000..45aa9084e7 --- /dev/null +++ b/data/2024/iclr/Compositional Preference Models for Aligning LMs @@ -0,0 +1 @@ +As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. Through these simple steps, CPMs allow us to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment. Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.
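The aggregation step of a CPM, logistic regression over interpretable feature scores, is straightforward to sketch; the per-feature scores below are simulated at random in place of a prompted LM, and the "true" feature weights are hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Each response is scored along interpretable features (e.g. helpfulness,
# factuality, readability); the preference model is a logistic regression over
# the feature-score differences between the chosen and rejected responses.
n_pairs, n_features = 200, 3
true_w = np.array([1.5, 0.8, 0.2])                    # hypothetical feature importances

chosen = rng.normal(size=(n_pairs, n_features))
rejected = rng.normal(size=(n_pairs, n_features))
diff = chosen - rejected
prob_prefer_chosen = 1.0 / (1.0 + np.exp(-diff @ true_w))
y = (rng.random(n_pairs) < prob_prefer_chosen).astype(int)   # 1 = chosen preferred

cpm = LogisticRegression().fit(diff, y)
print("learned feature weights:", np.round(cpm.coef_[0], 2))

def preference_score(feature_scores):
    """Scalar reward for a single response: weighted sum of its feature scores."""
    return float(feature_scores @ cpm.coef_[0])

print("score of a new response:", round(preference_score(rng.normal(size=n_features)), 3))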
\ No newline at end of file diff --git a/data/2024/iclr/Compressing LLMs: The Truth is Rarely Pure and Never Simple b/data/2024/iclr/Compressing LLMs: The Truth is Rarely Pure and Never Simple new file mode 100644 index 0000000000..627d70f8dd --- /dev/null +++ b/data/2024/iclr/Compressing LLMs: The Truth is Rarely Pure and Never Simple @@ -0,0 +1 @@ +Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs that achieve 50-60% sparsity and reduce the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncompressed baseline. As recent research efforts are focused on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which can remain closely aligned with their dense counterparts on perplexity even as perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity in knowledge-intensive tasks; current quantization methods are more successful than pruning; yet, pruned LLMs even at $\geq 50$% sparsity are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. The code to reproduce our results is available at https://github.com/VITA-Group/llm-kick. \ No newline at end of file diff --git a/data/2024/iclr/Compressing Latent Space via Least Volume b/data/2024/iclr/Compressing Latent Space via Least Volume new file mode 100644 index 0000000000..1cfaa9527a --- /dev/null +++ b/data/2024/iclr/Compressing Latent Space via Least Volume @@ -0,0 +1 @@ +This paper introduces Least Volume, a simple yet effective regularization inspired by geometric intuition that can reduce the number of latent dimensions needed by an autoencoder without requiring any prior knowledge of the intrinsic dimensionality of the dataset. We show that the Lipschitz continuity of the decoder is the key to making it work, provide a proof that PCA is just a linear special case of it, and reveal that it has a similar PCA-like importance ordering effect when applied to nonlinear models. We demonstrate the intuition behind the regularization on some pedagogical toy problems, and its effectiveness on several benchmark problems, including MNIST, CIFAR-10 and CelebA.
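One plausible reading of the regularizer, penalizing the geometric mean of the per-dimension latent spreads while keeping the decoder Lipschitz via spectral normalization, can be sketched as follows; the sizes, penalty weight, and spectral-norm choice are illustrative assumptions rather than the paper's exact recipe:

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

torch.manual_seed(0)

# Autoencoder with a Lipschitz-constrained decoder and a "volume" penalty on the
# latent codes: dimensions the data does not need are squeezed toward zero spread.
enc = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 8))
dec = nn.Sequential(spectral_norm(nn.Linear(8, 16)), nn.ReLU(), spectral_norm(nn.Linear(16, 32)))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
lam, eta = 1e-2, 1e-3

x = torch.randn(256, 32)                              # toy data batch
for step in range(200):
    opt.zero_grad()
    z = enc(x)
    recon = nn.functional.mse_loss(dec(z), x)
    std = z.std(dim=0)                                # per-latent-dimension spread
    volume = torch.exp(torch.log(std + eta).mean())   # geometric mean of the spreads
    loss = recon + lam * volume
    loss.backward()
    opt.step()

print("per-dimension latent std:", z.std(dim=0).detach())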
\ No newline at end of file diff --git a/data/2024/iclr/ConR: Contrastive Regularizer for Deep Imbalanced Regression b/data/2024/iclr/ConR: Contrastive Regularizer for Deep Imbalanced Regression new file mode 100644 index 0000000000..ee6efa75fc --- /dev/null +++ b/data/2024/iclr/ConR: Contrastive Regularizer for Deep Imbalanced Regression @@ -0,0 +1 @@ +Imbalanced distributions are ubiquitous in real-world data. They make it difficult for Deep Neural Networks to represent minority labels and to avoid bias towards majority labels. The extensive body of work on imbalanced learning addresses categorical label spaces but fails to effectively extend to regression problems where the label space is continuous. Local and global correlations among continuous labels provide valuable insights towards effectively modelling relationships in feature space. In this work, we propose ConR, a contrastive regularizer that models global and local label similarities in feature space and prevents the features of minority samples from being collapsed into their majority neighbours. ConR discerns the disagreements between the label space and feature space and imposes a penalty on these disagreements. ConR addresses the continuous nature of label space with two main strategies in a contrastive manner: incorrect proximities are penalized in proportion to the label similarities and the correct ones are encouraged to model local similarities. ConR consolidates essential considerations into a generic, easy-to-integrate, and efficient method that effectively addresses deep imbalanced regression. Moreover, ConR is orthogonal to existing approaches and smoothly extends to uni- and multi-dimensional label spaces. Our comprehensive experiments show that ConR significantly boosts the performance of all the state-of-the-art methods on four large-scale deep imbalanced regression benchmarks. Our code is publicly available at https://github.com/BorealisAI/ConR. \ No newline at end of file diff --git a/data/2024/iclr/Concept Bottleneck Generative Models b/data/2024/iclr/Concept Bottleneck Generative Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Conditional Information Bottleneck Approach for Time Series Imputation b/data/2024/iclr/Conditional Information Bottleneck Approach for Time Series Imputation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Conditional Instrumental Variable Regression with Representation Learning for Causal Inference b/data/2024/iclr/Conditional Instrumental Variable Regression with Representation Learning for Causal Inference new file mode 100644 index 0000000000..d1cc8c505a --- /dev/null +++ b/data/2024/iclr/Conditional Instrumental Variable Regression with Representation Learning for Causal Inference @@ -0,0 +1 @@ +This paper studies the challenging problem of estimating causal effects from observational data, in the presence of unobserved confounders. The two-stage least squares (TSLS) method and its variants with a standard instrumental variable (IV) are commonly used to eliminate confounding bias, including the bias caused by unobserved confounders, but they rely on the linearity assumption. Moreover, the strict condition of unconfounded instruments imposed on a standard IV is too strong to be practical.
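The TSLS baseline referenced above is standard enough to sketch on synthetic data with a single instrument and an unobserved confounder; all coefficients below are arbitrary, chosen only to make the confounding bias visible:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the instrument z shifts the treatment x but affects the
# outcome y only through x, while u confounds both x and y.
n = 5000
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument
x = 0.8 * z + 0.9 * u + rng.normal(size=n)    # treatment
y = 1.5 * x + 1.2 * u + rng.normal(size=n)    # outcome; true causal effect = 1.5

def ols(A, b):
    """Ordinary least squares with an intercept; returns [intercept, slope]."""
    return np.linalg.lstsq(np.column_stack([np.ones(len(A)), A]), b, rcond=None)[0]

naive = ols(x, y)[1]                                     # biased by the confounder
x_hat = np.column_stack([np.ones(n), z]) @ ols(z, x)     # stage 1: project x onto z
tsls = ols(x_hat, y)[1]                                  # stage 2: regress y on the projection

print(f"naive OLS estimate: {naive:.2f}")                # noticeably above 1.5
print(f"TSLS estimate:      {tsls:.2f}")                 # close to 1.5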
To address these challenging and practical problems of the standard IV method (linearity assumption and the strict condition), in this paper, we use a conditional IV (CIV) to relax the unconfounded instrument condition of standard IV and propose a non-linear CIV regression with Confounding Balancing Representation Learning, CBRL.CIV, for jointly eliminating the confounding bias from unobserved confounders and balancing the observed confounders, without the linearity assumption. We theoretically demonstrate the soundness of CBRL.CIV. Extensive experiments on synthetic and two real-world datasets show the competitive performance of CBRL.CIV against state-of-the-art IV-based estimators and superiority in dealing with the non-linear situation. \ No newline at end of file diff --git a/data/2024/iclr/Conditional Variational Diffusion Models b/data/2024/iclr/Conditional Variational Diffusion Models new file mode 100644 index 0000000000..81f909a4ec --- /dev/null +++ b/data/2024/iclr/Conditional Variational Diffusion Models @@ -0,0 +1 @@ +Inverse problems aim to determine parameters from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-costly and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results. \ No newline at end of file diff --git a/data/2024/iclr/Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models b/data/2024/iclr/Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models new file mode 100644 index 0000000000..b6716ffbcb --- /dev/null +++ b/data/2024/iclr/Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models @@ -0,0 +1 @@ +Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent. However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models, a phenomenon known as reward overoptimization. To investigate this issue in depth, we introduce the Text-Image Alignment Assessment (TIA2) benchmark, which comprises a diverse collection of text prompts, images, and human annotations. Our evaluation of several state-of-the-art reward models on this benchmark reveals their frequent misalignment with human assessment. We empirically demonstrate that overoptimization occurs notably when a poorly aligned reward model is used as the fine-tuning objective. 
To address this, we propose TextNorm, a simple method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts. We demonstrate that incorporating the confidence-calibrated rewards in fine-tuning effectively reduces overoptimization, resulting in twice as many wins in human evaluation for text-image alignment compared against the baseline reward models. \ No newline at end of file diff --git a/data/2024/iclr/Confidential-DPproof: Confidential Proof of Differentially Private Training b/data/2024/iclr/Confidential-DPproof: Confidential Proof of Differentially Private Training new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Conformal Inductive Graph Neural Networks b/data/2024/iclr/Conformal Inductive Graph Neural Networks new file mode 100644 index 0000000000..ea89153920 --- /dev/null +++ b/data/2024/iclr/Conformal Inductive Graph Neural Networks @@ -0,0 +1 @@ +Conformal prediction (CP) transforms any model's output into prediction sets guaranteed to include (cover) the true label. CP requires exchangeability, a relaxation of the i.i.d. assumption, to obtain a valid distribution-free coverage guarantee. This makes it directly applicable to transductive node-classification. However, conventional CP cannot be applied in inductive settings due to the implicit shift in the (calibration) scores caused by message passing with the new nodes. We fix this issue for both cases of node and edge-exchangeable graphs, recovering the standard coverage guarantee without sacrificing statistical efficiency. We further prove that the guarantee holds independently of the prediction time, e.g. upon arrival of a new node/edge or at any subsequent moment. \ No newline at end of file diff --git a/data/2024/iclr/Conformal Language Modeling b/data/2024/iclr/Conformal Language Modeling new file mode 100644 index 0000000000..7d42fb7f5b --- /dev/null +++ b/data/2024/iclr/Conformal Language Modeling @@ -0,0 +1 @@ +We propose a novel approach to conformal prediction for generative language models (LMs). Standard conformal prediction produces prediction sets -- in place of single predictions -- that have rigorous, statistical performance guarantees. LM responses are typically sampled from the model's predicted distribution over the large, combinatorial output space of natural language. Translating this process to conformal prediction, we calibrate a stopping rule for sampling different outputs from the LM that get added to a growing set of candidates until we are confident that the output set is sufficient. Since some samples may be low-quality, we also simultaneously calibrate and apply a rejection rule for removing candidates from the output set to reduce noise. Similar to conformal prediction, we prove that the sampled set returned by our procedure contains at least one acceptable answer with high probability, while still being empirically precise (i.e., small) on average. Furthermore, within this set of candidate responses, we show that we can also accurately identify subsets of individual components -- such as phrases or sentences -- that are each independently correct (e.g., that are not"hallucinations"), again with statistical guarantees. We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants. 
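A deliberately simplified, split-conformal-style sketch of the stopping-rule calibration described above follows. It ignores the rejection rule and the component-level guarantees, and the admission data, function name, and finite-sample correction are assumptions for exposition rather than the paper's exact procedure.

```python
import numpy as np

def calibrate_sampling_budget(cal_ok, alpha=0.1):
    """Pick the smallest number of samples k such that, on calibration prompts,
    the risk of the first-k set missing every acceptable answer is <= alpha.

    cal_ok: boolean array of shape (n_prompts, max_samples); cal_ok[i, j] is True
            if the j-th sampled response to prompt i is acceptable according to
            some admission function (e.g. a human or automatic judge).
    """
    n, max_k = cal_ok.shape
    for k in range(1, max_k + 1):
        # A prompt is "covered" if at least one of its first k samples is acceptable.
        covered = cal_ok[:, :k].any(axis=1)
        # Finite-sample (conformal-style) correction on the empirical miss rate.
        risk = ((~covered).sum() + 1) / (n + 1)
        if risk <= alpha:
            return k
    return max_k  # even the full budget does not reach the target risk

# Toy calibration data: 200 prompts, up to 20 samples each,
# each sample acceptable with probability 0.3 (purely synthetic).
rng = np.random.default_rng(0)
cal_ok = rng.random((200, 20)) < 0.3
k_star = calibrate_sampling_budget(cal_ok, alpha=0.1)
print("calibrated sampling budget:", k_star)
```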
\ No newline at end of file diff --git a/data/2024/iclr/Conformal Risk Control b/data/2024/iclr/Conformal Risk Control new file mode 100644 index 0000000000..e975a95c0d --- /dev/null +++ b/data/2024/iclr/Conformal Risk Control @@ -0,0 +1 @@ +Score-based generative modeling, informally referred to as diffusion models, continue to grow in popularity across several important domains and tasks. While they provide high-quality and diverse samples from empirical distributions, important questions remain on the reliability and trustworthiness of these sampling procedures for their responsible use in critical scenarios. Conformal prediction is a modern tool to construct finite-sample, distribution-free uncertainty guarantees for any black-box predictor. In this work, we focus on image-to-image regression tasks and we present a generalization of the Risk-Controlling Prediction Sets (RCPS) procedure, that we term $K$-RCPS, which allows to $(i)$ provide entrywise calibrated intervals for future samples of any diffusion model, and $(ii)$ control a certain notion of risk with respect to a ground truth image with minimal mean interval length. Differently from existing conformal risk control procedures, ours relies on a novel convex optimization approach that allows for multidimensional risk control while provably minimizing the mean interval length. We illustrate our approach on two real-world image denoising problems: on natural images of faces as well as on computed tomography (CT) scans of the abdomen, demonstrating state of the art performance. \ No newline at end of file diff --git a/data/2024/iclr/Confronting Reward Model Overoptimization with Constrained RLHF b/data/2024/iclr/Confronting Reward Model Overoptimization with Constrained RLHF new file mode 100644 index 0000000000..1ec5d7e69c --- /dev/null +++ b/data/2024/iclr/Confronting Reward Model Overoptimization with Constrained RLHF @@ -0,0 +1 @@ +Large language models are typically aligned with human preferences by optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run. 
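The sketch below illustrates one plausible form of the constrained objective and the Lagrange-multiplier update described above; the component thresholds, reward values, and update rule are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def lagrangian_reward(component_rewards, thresholds, multipliers):
    """Combine component reward models via a Lagrangian (illustrative sketch).

    The policy maximizes the summed proxy reward while each component RM is
    constrained not to be pushed past its estimated threshold of usefulness.
    """
    r = component_rewards
    return r.sum() - np.dot(multipliers, r - thresholds)

def dual_ascent_step(multipliers, component_rewards, thresholds, lr=0.01):
    """Gradient ascent on the multipliers: a multiplier grows while its constraint
    is violated and shrinks back toward 0 once the component is inside its range,
    so the multipliers act as dynamic weights on the component RMs."""
    violation = component_rewards - thresholds
    return np.maximum(multipliers + lr * violation, 0.0)

# Toy rollout: two component RMs with hypothetical usefulness thresholds.
thresholds = np.array([1.0, 0.5])
multipliers = np.zeros(2)
for step in range(1000):
    component_rewards = np.array([1.2, 0.3])   # stand-in for a policy rollout's rewards
    multipliers = dual_ascent_step(multipliers, component_rewards, thresholds)
objective = lagrangian_reward(np.array([1.2, 0.3]), thresholds, multipliers)
```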
\ No newline at end of file diff --git a/data/2024/iclr/ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection b/data/2024/iclr/ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection new file mode 100644 index 0000000000..fec7139f9a --- /dev/null +++ b/data/2024/iclr/ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection @@ -0,0 +1 @@ +Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimate scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25$\%$ and 28.19$\%$ (FPR95) on CIFAR-100 and ImageNet-1K, respectively. \ No newline at end of file diff --git a/data/2024/iclr/Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data b/data/2024/iclr/Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data new file mode 100644 index 0000000000..ee519ae2b6 --- /dev/null +++ b/data/2024/iclr/Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data @@ -0,0 +1 @@ +Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation. 
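One way to picture the collapse and corrupt steps is sketched below, assuming CLIP-style unit-norm embeddings; the direction of the mean-gap shift, the noise scale `sigma`, and all names are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def collapse_and_corrupt(text_emb, image_emb, sigma=0.1, rng=None):
    """Sketch of the 'collapse' and 'corrupt' steps for bridging the modality gap.

    text_emb, image_emb: (n, d) and (m, d) embeddings from a contrastive model
    such as CLIP, collected from unpaired uni-modal data. During training only
    text embeddings are available; collapsing the mean gap and adding noise lets
    them stand in for the image embeddings seen at test time.
    """
    rng = rng or np.random.default_rng(0)
    # Connect: both modalities already live in one contrastive space.
    # Collapse: remove the constant offset (modality gap) between the two means.
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)
    shifted = text_emb + gap
    # Corrupt: Gaussian noise accounts for the residual per-sample mismatch.
    noisy = shifted + sigma * rng.standard_normal(shifted.shape)
    # Re-normalize, since contrastive embeddings are typically unit length.
    return noisy / np.linalg.norm(noisy, axis=1, keepdims=True)

# Toy embeddings standing in for CLIP text/image features.
rng = np.random.default_rng(1)
text_emb = rng.standard_normal((512, 64))
image_emb = rng.standard_normal((256, 64)) + 0.5   # constant offset mimics the gap
train_inputs = collapse_and_corrupt(text_emb, image_emb)
```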
\ No newline at end of file diff --git a/data/2024/iclr/Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers b/data/2024/iclr/Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers new file mode 100644 index 0000000000..78489bfac8 --- /dev/null +++ b/data/2024/iclr/Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers @@ -0,0 +1 @@ +Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms. \ No newline at end of file diff --git a/data/2024/iclr/Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning b/data/2024/iclr/Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning new file mode 100644 index 0000000000..1ccb818051 --- /dev/null +++ b/data/2024/iclr/Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning @@ -0,0 +1 @@ +Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning framework utilizing spatio-temporal abstractions to generalize better in novel situations. It automatically decomposes the given task into smaller, more manageable subtasks, and thus enables sparse decision-making and focused computation on the relevant parts of the environment. The decomposition relies on the extraction of an abstracted proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end from hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods. 
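As a rough illustration of planning over such an abstracted proxy graph, the sketch below runs a shortest-path search over hypothetical checkpoints with hand-coded edge costs; in the actual method both the vertices and the edge estimates would be learned from hindsight, so everything here is an assumption made for exposition.

```python
import heapq

def plan_on_proxy_graph(edges, start, goal):
    """Shortest-path planning over an abstracted proxy graph (illustrative).

    edges: dict mapping a checkpoint state to a list of (next_checkpoint, cost)
           pairs; in a Skipper-like method both the checkpoints and the costs
           would come from learned estimators, not hand-coded tables.
    Returns the checkpoint sequence handed to the low-level policy as subgoals.
    """
    frontier = [(0.0, start, [start])]
    best = {}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in best and best[node] <= cost:
            continue
        best[node] = cost
        for nxt, c in edges.get(node, []):
            heapq.heappush(frontier, (cost + c, nxt, path + [nxt]))
    return None  # goal unreachable under current edge estimates

# Toy proxy graph with hypothetical checkpoints A..D and stand-in learned costs.
edges = {"A": [("B", 1.0), ("C", 2.5)], "B": [("D", 2.0)], "C": [("D", 0.4)]}
subgoal_sequence = plan_on_proxy_graph(edges, "A", "D")
print(subgoal_sequence)   # ['A', 'C', 'D']
```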
\ No newline at end of file diff --git a/data/2024/iclr/Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training b/data/2024/iclr/Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training new file mode 100644 index 0000000000..afaaa40b4d --- /dev/null +++ b/data/2024/iclr/Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training @@ -0,0 +1 @@ +Adversarial training improves the robustness of neural networks against adversarial attacks, albeit at the expense of the trade-off between standard and robust generalization. To unveil the underlying factors driving this phenomenon, we examine the layer-wise learning capabilities of neural networks during the transition from a standard to an adversarial setting. Our empirical findings demonstrate that selectively updating specific layers while preserving others can substantially enhance the network's learning capacity. We therefore propose CURE, a novel training framework that leverages a gradient prominence criterion to perform selective conservation, updating, and revision of weights. Importantly, CURE is designed to be dataset- and architecture-agnostic, ensuring its applicability across various scenarios. It effectively tackles both memorization and overfitting issues, thus enhancing the trade-off between robustness and generalization; additionally, this training approach also aids in mitigating "robust overfitting". Furthermore, our study provides valuable insights into the mechanisms of selective adversarial training and offers a promising avenue for future research. \ No newline at end of file diff --git a/data/2024/iclr/Consistency Training with Learnable Data Augmentation for Graph Anomaly Detection with Limited Supervision b/data/2024/iclr/Consistency Training with Learnable Data Augmentation for Graph Anomaly Detection with Limited Supervision new file mode 100644 index 0000000000..2a7c47c1f2 --- /dev/null +++ b/data/2024/iclr/Consistency Training with Learnable Data Augmentation for Graph Anomaly Detection with Limited Supervision @@ -0,0 +1 @@ +We conduct extensive experiments on four benchmark datasets, alongside one real-world dataset derived from a production environment. The ensuing results highlight the superiority of our proposed ConsisGAD, as it exhibits enhanced performance in comparison to state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/iclr/Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion b/data/2024/iclr/Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion new file mode 100644 index 0000000000..1895d08e22 --- /dev/null +++ b/data/2024/iclr/Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion @@ -0,0 +1 @@ +Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can -- in a single forward pass -- output scores (i.e., gradients of log-density) and enables unrestricted traversal between any initial and final time along the Probability Flow Ordinary Differential Equation (ODE) in a diffusion process. 
CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance and achieves new state-of-the-art FIDs for single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at 64x64 resolution (FID 1.92). CTM also enables a new family of sampling schemes, both deterministic and stochastic, involving long jumps along the ODE solution trajectories. It consistently improves sample quality as computational budgets increase, avoiding the degradation seen in CM. Furthermore, unlike CM, CTM's access to the score function can streamline the adoption of established controllable/conditional generation methods from the diffusion community. This access also enables the computation of likelihood. The code is available at https://github.com/sony/ctm. \ No newline at end of file diff --git a/data/2024/iclr/Consistency-guided Prompt Learning for Vision-Language Models b/data/2024/iclr/Consistency-guided Prompt Learning for Vision-Language Models new file mode 100644 index 0000000000..e550cb384c --- /dev/null +++ b/data/2024/iclr/Consistency-guided Prompt Learning for Vision-Language Models @@ -0,0 +1 @@ +We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models. Our approach improves the generalization of large foundation models when fine-tuned on downstream tasks in a few-shot setting. The basic idea of CoPrompt is to enforce a consistency constraint in the prediction of the trainable and pre-trained models to prevent overfitting on the downstream task. Additionally, we introduce the following two components into our consistency constraint to further boost the performance: enforcing consistency on two perturbed inputs and combining two dominant paradigms of tuning, prompting and adapter. Enforcing consistency on perturbed input serves to further regularize the consistency constraint, thereby improving generalization. Moreover, the integration of adapters and prompts not only enhances performance on downstream tasks but also offers increased tuning flexibility in both input and output spaces. This facilitates more effective adaptation to downstream tasks in a few-shot learning setting. Experiments show that CoPrompt outperforms existing methods on a range of evaluation suites, including base-to-novel generalization, domain generalization, and cross-dataset evaluation. On generalization, CoPrompt improves the state-of-the-art on zero-shot tasks and the overall harmonic mean over 11 datasets. Detailed ablation studies show the effectiveness of each of the components in CoPrompt. We make our code available at https://github.com/ShuvenduRoy/CoPrompt. \ No newline at end of file diff --git a/data/2024/iclr/Consistent Multi-Class Classification from Multiple Unlabeled Datasets b/data/2024/iclr/Consistent Multi-Class Classification from Multiple Unlabeled Datasets new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Consistent Video-to-Video Transfer Using Synthetic Dataset b/data/2024/iclr/Consistent Video-to-Video Transfer Using Synthetic Dataset new file mode 100644 index 0000000000..030cf9bcea --- /dev/null +++ b/data/2024/iclr/Consistent Video-to-Video Transfer Using Synthetic Dataset @@ -0,0 +1 @@ +We introduce a novel and efficient approach for text-based video-to-video editing that eliminates the need for resource-intensive per-video-per-model finetuning. At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks. 
Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain. Extending the Prompt-to-Prompt to videos, we efficiently generate paired samples, each with an input video and its edited counterpart. Alongside this, we introduce the Long Video Sampling Correction during sampling, ensuring consistent long videos across batches. Our method surpasses current methods like Tune-A-Video, heralding substantial progress in text-based video-to-video editing and suggesting exciting avenues for further exploration and deployment. \ No newline at end of file diff --git a/data/2024/iclr/Consistent algorithms for multi-label classification with macro-at-k metrics b/data/2024/iclr/Consistent algorithms for multi-label classification with macro-at-k metrics new file mode 100644 index 0000000000..654d633dfd --- /dev/null +++ b/data/2024/iclr/Consistent algorithms for multi-label classification with macro-at-k metrics @@ -0,0 +1 @@ +We consider the optimization of complex performance metrics in multi-label classification under the population utility framework. We mainly focus on metrics linearly decomposable into a sum of binary classification utilities applied separately to each label with an additional requirement of exactly $k$ labels predicted for each instance. These"macro-at-$k$"metrics possess desired properties for extreme classification problems with long tail labels. Unfortunately, the at-$k$ constraint couples the otherwise independent binary classification tasks, leading to a much more challenging optimization problem than standard macro-averages. We provide a statistical framework to study this problem, prove the existence and the form of the optimal classifier, and propose a statistically consistent and practical learning algorithm based on the Frank-Wolfe method. Interestingly, our main results concern even more general metrics being non-linear functions of label-wise confusion matrices. Empirical results provide evidence for the competitive performance of the proposed approach. \ No newline at end of file diff --git "a/data/2024/iclr/Consistent4D: Consistent 360\302\260 Dynamic Object Generation from Monocular Video" "b/data/2024/iclr/Consistent4D: Consistent 360\302\260 Dynamic Object Generation from Monocular Video" new file mode 100644 index 0000000000..005f1f4010 --- /dev/null +++ "b/data/2024/iclr/Consistent4D: Consistent 360\302\260 Dynamic Object Generation from Monocular Video" @@ -0,0 +1 @@ +In this paper, we present Consistent4D, a novel approach for generating 4D dynamic objects from uncalibrated monocular videos. Uniquely, we cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration. This is achieved by leveraging the object-level 3D-aware image diffusion model as the primary supervision signal for training Dynamic Neural Radiance Fields (DyNeRF). Specifically, we propose a Cascade DyNeRF to facilitate stable convergence and temporal continuity under the supervision signal which is discrete along the time axis. To achieve spatial and temporal consistency, we further introduce an Interpolation-driven Consistency Loss. It is optimized by minimizing the discrepancy between rendered frames from DyNeRF and interpolated frames from a pre-trained video interpolation model. 
Extensive experiments show that our Consistent4D can perform competitively with prior art alternatives, opening up new possibilities for 4D dynamic object generation from monocular videos, whilst also demonstrating an advantage on conventional text-to-3D generation tasks. Our project page is https://consistent4d.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/Constrained Bi-Level Optimization: Proximal Lagrangian Value Function Approach and Hessian-free Algorithm b/data/2024/iclr/Constrained Bi-Level Optimization: Proximal Lagrangian Value Function Approach and Hessian-free Algorithm new file mode 100644 index 0000000000..edeceacdfe --- /dev/null +++ b/data/2024/iclr/Constrained Bi-Level Optimization: Proximal Lagrangian Value Function Approach and Hessian-free Algorithm @@ -0,0 +1 @@ +This paper presents a new approach and algorithm for solving a class of constrained Bi-Level Optimization (BLO) problems in which the lower-level problem involves constraints coupling both upper-level and lower-level variables. Such problems have recently gained significant attention due to their broad applicability in machine learning. However, conventional gradient-based methods unavoidably rely on computationally intensive calculations related to the Hessian matrix. To address this challenge, we begin by devising a smooth proximal Lagrangian value function to handle the constrained lower-level problem. Utilizing this construct, we introduce a single-level reformulation for constrained BLOs that transforms the original BLO problem into an equivalent optimization problem with smooth constraints. Enabled by this reformulation, we develop a Hessian-free gradient-based algorithm, termed proximal Lagrangian Value function-based Hessian-free Bi-level Algorithm (LV-HBA), that is straightforward to implement in a single-loop manner. Consequently, LV-HBA is especially well-suited for machine learning applications. Furthermore, we offer non-asymptotic convergence analysis for LV-HBA, eliminating the need for traditional strong convexity assumptions for the lower-level problem while also being capable of accommodating non-singleton scenarios. Empirical results substantiate the algorithm's superior practical performance. \ No newline at end of file diff --git a/data/2024/iclr/Constrained Decoding for Cross-lingual Label Projection b/data/2024/iclr/Constrained Decoding for Cross-lingual Label Projection new file mode 100644 index 0000000000..7bfb85b31b --- /dev/null +++ b/data/2024/iclr/Constrained Decoding for Cross-lingual Label Projection @@ -0,0 +1 @@ +Zero-shot cross-lingual transfer utilizing multilingual LLMs has become a popular learning paradigm for low-resource languages with no labeled training data. However, for NLP tasks that involve fine-grained predictions on words and phrases, the performance of zero-shot cross-lingual transfer learning lags far behind supervised fine-tuning methods. Therefore, it is common to exploit translation and label projection to further improve the performance by (1) translating training data that is available in a high-resource language (e.g., English) together with the gold labels into low-resource languages, and/or (2) translating test data in low-resource languages to a high-resource language to run inference on, then projecting the predicted span-level labels back onto the original test data. 
However, state-of-the-art marker-based label projection methods suffer from translation quality degradation due to the extra label markers injected in the input to the translation model. In this work, we explore a new direction that leverages constrained decoding for label projection to overcome the aforementioned issues. Our new method not only can preserve the quality of translated texts but also has the versatility of being applicable to both translating training and translating test data strategies. This versatility is crucial as our experiments reveal that translating test data can lead to a considerable boost in performance compared to translating only training data. We evaluate on two cross-lingual transfer tasks, namely Named Entity Recognition and Event Argument Extraction, spanning 20 languages. The results demonstrate that our approach outperforms the state-of-the-art marker-based method by a large margin and also shows better performance than other label projection methods that rely on external word alignment. \ No newline at end of file diff --git a/data/2024/iclr/Constraint-Free Structure Learning with Smooth Acyclic Orientations b/data/2024/iclr/Constraint-Free Structure Learning with Smooth Acyclic Orientations new file mode 100644 index 0000000000..2b47ed27c1 --- /dev/null +++ b/data/2024/iclr/Constraint-Free Structure Learning with Smooth Acyclic Orientations @@ -0,0 +1 @@ +The structure learning problem consists of fitting data generated by a Directed Acyclic Graph (DAG) to correctly reconstruct its arcs. In this context, differentiable approaches constrain or regularize the optimization problem using a continuous relaxation of the acyclicity property. The computational cost of evaluating graph acyclicity is cubic on the number of nodes and significantly affects scalability. In this paper we introduce COSMO, a constraint-free continuous optimization scheme for acyclic structure learning. At the core of our method, we define a differentiable approximation of an orientation matrix parameterized by a single priority vector. Differently from previous work, our parameterization fits a smooth orientation matrix and the resulting acyclic adjacency matrix without evaluating acyclicity at any step. Despite the absence of explicit constraints, we prove that COSMO always converges to an acyclic solution. In addition to being asymptotically faster, our empirical analysis highlights how COSMO performance on graph reconstruction compares favorably with competing structure learning methods. \ No newline at end of file diff --git a/data/2024/iclr/Constructing Adversarial Examples for Vertical Federated Learning: Optimal Client Corruption through Multi-Armed Bandit b/data/2024/iclr/Constructing Adversarial Examples for Vertical Federated Learning: Optimal Client Corruption through Multi-Armed Bandit new file mode 100644 index 0000000000..9a7f31bf82 --- /dev/null +++ b/data/2024/iclr/Constructing Adversarial Examples for Vertical Federated Learning: Optimal Client Corruption through Multi-Armed Bandit @@ -0,0 +1 @@ +Vertical federated learning (VFL), where each participating client holds a subset of data features, has found numerous applications in finance, healthcare, and IoT systems. However, adversarial attacks, particularly through the injection of adversarial examples (AEs), pose serious challenges to the security of VFL models. 
In this paper, we investigate such vulnerabilities through developing a novel attack to disrupt the VFL inference process, under a practical scenario where the adversary is able to adaptively corrupt a subset of clients. We formulate the problem of finding optimal attack strategies as an online optimization problem, which is decomposed into an inner problem of adversarial example generation (AEG) and an outer problem of corruption pattern selection (CPS). Specifically, we establish the equivalence between the formulated CPS problem and a multi-armed bandit (MAB) problem, and propose the Thompson sampling with Empirical maximum reward (E-TS) algorithm for the adversary to efficiently identify the optimal subset of clients for corruption. The key idea of E-TS is to introduce an estimation of the expected maximum reward for each arm, which helps to specify a small set of competitive arms, on which the exploration for the optimal arm is performed. This significantly reduces the exploration space, which otherwise can quickly become prohibitively large as the number of clients increases. We analytically characterize the regret bound of E-TS, and empirically demonstrate its capability of efficiently revealing the optimal corruption pattern with the highest attack success rate, under various datasets of popular VFL tasks. \ No newline at end of file diff --git a/data/2024/iclr/Context is Environment b/data/2024/iclr/Context is Environment new file mode 100644 index 0000000000..75e21eed9a --- /dev/null +++ b/data/2024/iclr/Context is Environment @@ -0,0 +1 @@ +Two lines of work are taking the central stage in AI research. On the one hand, the community is making increasing efforts to build models that discard spurious correlations and generalize better in novel test environments. Unfortunately, the bitter lesson so far is that no proposal convincingly outperforms a simple empirical risk minimization baseline. On the other hand, large language models (LLMs) have erupted as algorithms able to learn in-context, generalizing on-the-fly to eclectic contextual circumstances that users enforce by means of prompting. In this paper, we argue that context is environment, and posit that in-context learning holds the key to better domain generalization. Via extensive theory and experiments, we show that paying attention to context$\unicode{x2013}\unicode{x2013}$unlabeled examples as they arrive$\unicode{x2013}\unicode{x2013}$allows our proposed In-Context Risk Minimization (ICRM) algorithm to zoom-in on the test environment risk minimizer, leading to significant out-of-distribution performance improvements. From all of this, two messages are worth taking home. Researchers in domain generalization should consider environment as context, and harness the adaptive power of in-context learning. Researchers in LLMs should consider context as environment, to better structure data towards generalization. \ No newline at end of file diff --git a/data/2024/iclr/Context-Aware Meta-Learning b/data/2024/iclr/Context-Aware Meta-Learning new file mode 100644 index 0000000000..5926361f72 --- /dev/null +++ b/data/2024/iclr/Context-Aware Meta-Learning @@ -0,0 +1 @@ +Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. 
In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts visual meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks. Our code is available at https://github.com/cfifty/CAML. \ No newline at end of file diff --git a/data/2024/iclr/ContextRef: Evaluating Referenceless Metrics for Image Description Generation b/data/2024/iclr/ContextRef: Evaluating Referenceless Metrics for Image Description Generation new file mode 100644 index 0000000000..efa958732d --- /dev/null +++ b/data/2024/iclr/ContextRef: Evaluating Referenceless Metrics for Image Description Generation @@ -0,0 +1 @@ +Referenceless metrics (e.g., CLIPScore) use pretrained vision--language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark though, in large part due to the challenge of context dependence. \ No newline at end of file diff --git a/data/2024/iclr/Contextual Bandits with Online Neural Regression b/data/2024/iclr/Contextual Bandits with Online Neural Regression new file mode 100644 index 0000000000..97501049ef --- /dev/null +++ b/data/2024/iclr/Contextual Bandits with Online Neural Regression @@ -0,0 +1 @@ +Recent works have shown a reduction from contextual bandits to online regression under a realizability assumption [Foster and Rakhlin, 2020, Foster and Krishnamurthy, 2021]. In this work, we investigate the use of neural networks for such online regression and associated Neural Contextual Bandits (NeuCBs). Using existing results for wide networks, one can readily show a ${\mathcal{O}}(\sqrt{T})$ regret for online regression with square loss, which via the reduction implies a ${\mathcal{O}}(\sqrt{K} T^{3/4})$ regret for NeuCBs. Departing from this standard approach, we first show a $\mathcal{O}(\log T)$ regret for online regression with almost convex losses that satisfy QG (Quadratic Growth) condition, a generalization of the PL (Polyak-\L ojasiewicz) condition, and that have a unique minima. Although not directly applicable to wide networks since they do not have unique minima, we show that adding a suitable small random perturbation to the network predictions surprisingly makes the loss satisfy QG with unique minima. 
Based on such a perturbed prediction, we show a ${\mathcal{O}}(\log T)$ regret for online regression with both squared loss and KL loss, and subsequently convert these respectively to $\tilde{\mathcal{O}}(\sqrt{KT})$ and $\tilde{\mathcal{O}}(\sqrt{KL^*} + K)$ regret for NeuCB, where $L^*$ is the loss of the best policy. Separately, we also show that existing regret bounds for NeuCBs are $\Omega(T)$ or assume i.i.d. contexts, unlike this work. Finally, our experimental results on various datasets demonstrate that our algorithms, especially the one based on KL loss, persistently outperform existing algorithms. \ No newline at end of file diff --git a/data/2024/iclr/Continual Learning in the Presence of Spurious Correlations: Analyses and a Simple Baseline b/data/2024/iclr/Continual Learning in the Presence of Spurious Correlations: Analyses and a Simple Baseline new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation b/data/2024/iclr/Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation new file mode 100644 index 0000000000..18d0b542c0 --- /dev/null +++ b/data/2024/iclr/Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation @@ -0,0 +1 @@ +We propose and study a realistic Continual Learning (CL) setting where learning algorithms are granted a restricted computational budget per time step while training. We apply this setting to large-scale semi-supervised Continual Learning scenarios with sparse label rates. Previous proficient CL methods perform very poorly in this challenging setting. Overfitting to the sparse labeled data and insufficient computational budget are the two main culprits for such a poor performance. Our new setting encourages learning methods to effectively and efficiently utilize the unlabeled data during training. To that end, we propose a simple but highly effective baseline, DietCL, which utilizes both unlabeled and labeled data jointly. DietCL meticulously allocates computational budget for both types of data. We validate our baseline, at scale, on several datasets, e.g., CLOC, ImageNet10K, and CGLM, under constraint budget setups. DietCL outperforms, by a large margin, all existing supervised CL algorithms as well as more recent continual semi-supervised methods. Our extensive analysis and ablations demonstrate that DietCL is stable under a full spectrum of label sparsity, computational budget, and various other ablations. \ No newline at end of file diff --git a/data/2024/iclr/Continual Momentum Filtering on Parameter Space for Online Test-time Adaptation b/data/2024/iclr/Continual Momentum Filtering on Parameter Space for Online Test-time Adaptation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Continuous Field Reconstruction from Sparse Observations with Implicit Neural Networks b/data/2024/iclr/Continuous Field Reconstruction from Sparse Observations with Implicit Neural Networks new file mode 100644 index 0000000000..717addea07 --- /dev/null +++ b/data/2024/iclr/Continuous Field Reconstruction from Sparse Observations with Implicit Neural Networks @@ -0,0 +1 @@ +Reliably reconstructing physical fields from sparse sensor data is a challenge that frequently arises in many scientific domains. In practice, the process generating the data often is not understood to sufficient accuracy. 
Therefore, there is a growing interest in using the deep neural network route to address the problem. This work presents a novel approach that learns a continuous representation of the physical field using implicit neural representations (INRs). Specifically, after factorizing spatiotemporal variability into spatial and temporal components using the separation of variables technique, the method learns relevant basis functions from sparsely sampled irregular data points to develop a continuous representation of the data. In experimental evaluations, the proposed model outperforms recent INR methods, offering superior reconstruction quality on simulation data from a state-of-the-art climate model and a second dataset that comprises ultra-high resolution satellite-based sea surface temperature fields. \ No newline at end of file diff --git a/data/2024/iclr/Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach b/data/2024/iclr/Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach new file mode 100644 index 0000000000..ed0b47afc8 --- /dev/null +++ b/data/2024/iclr/Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach @@ -0,0 +1 @@ +Image outpainting aims to generate the content of an input sub-image beyond its original boundaries. It is an important task in content generation yet remains an open problem for generative models. This paper pushes the technical frontier of image outpainting in two directions that have not been resolved in literature: 1) outpainting with arbitrary and continuous multiples (without restriction), and 2) outpainting in a single step (even for large expansion multiples). Moreover, we develop a method that does not depend on a pre-trained backbone network, which is in contrast commonly required by the previous SOTA outpainting methods. The arbitrary multiple outpainting is achieved by utilizing randomly cropped views from the same image during training to capture arbitrary relative positional information. Specifically, by feeding one view and positional embeddings as queries, we can reconstruct another view. At inference, we generate images with arbitrary expansion multiples by inputting an anchor image and its corresponding positional embeddings. The one-step outpainting ability here is particularly noteworthy in contrast to previous methods that need to be performed for $N$ times to obtain a final multiple which is $N$ times of its basic and fixed multiple. We evaluate the proposed approach (called PQDiff as we adopt a diffusion-based generator as our embodiment, under our proposed \textbf{P}ositional \textbf{Q}uery scheme) on public benchmarks, demonstrating its superior performance over state-of-the-art approaches. Specifically, PQDiff achieves state-of-the-art FID scores on the Scenery (\textbf{21.512}), Building Facades (\textbf{25.310}), and WikiArts (\textbf{36.212}) datasets. Furthermore, under the 2.25x, 5x and 11.7x outpainting settings, PQDiff only takes \textbf{40.6\%}, \textbf{20.3\%} and \textbf{10.2\%} of the time of the benchmark state-of-the-art (SOTA) method. 
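The sketch below illustrates, under our own assumptions about the encoding, how two random views and their relative positional query might be formed during training and how an arbitrary expansion multiple could be requested at inference; it is not the released PQDiff code.

```python
import numpy as np

def random_crop_box(rng, img_hw, crop_hw):
    """Sample a crop box (top, left, height, width) inside an image."""
    H, W = img_hw
    h, w = crop_hw
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    return (int(top), int(left), h, w)

def relative_position_query(anchor_box, target_box, img_hw):
    """Encode where the target view sits relative to the anchor view (illustrative).

    Returns normalized offsets and scale ratios; in PQDiff-like training this
    vector (turned into positional embeddings) conditions the generator so that,
    at inference, an arbitrary target box yields the outpainted content in one step.
    """
    H, W = img_hw
    at, al, ah, aw = anchor_box
    tt, tl, th, tw = target_box
    return np.array([
        (tt - at) / H,   # vertical offset of target w.r.t. anchor
        (tl - al) / W,   # horizontal offset
        th / ah,         # vertical expansion multiple
        tw / aw,         # horizontal expansion multiple
    ], dtype=np.float32)

rng = np.random.default_rng(0)
img_hw = (512, 512)
anchor = random_crop_box(rng, img_hw, (128, 128))   # the view the model sees
target = random_crop_box(rng, img_hw, (256, 256))   # the view it must reconstruct
q = relative_position_query(anchor, target, img_hw)
# At test time, setting the scale ratios to e.g. 2.25, 5, or 11.7 requests that
# outpainting multiple directly, without iterating the model N times.
```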
\ No newline at end of file diff --git a/data/2024/iclr/Contrastive Difference Predictive Coding b/data/2024/iclr/Contrastive Difference Predictive Coding new file mode 100644 index 0000000000..891051fef5 --- /dev/null +++ b/data/2024/iclr/Contrastive Difference Predictive Coding @@ -0,0 +1 @@ +Predicting and reasoning about the future lie at the heart of many time-series questions. For example, goal-conditioned reinforcement learning can be viewed as learning representations to predict which states are likely to be visited in the future. While prior methods have used contrastive predictive coding to model time series data, learning representations that encode long-term dependencies usually requires large amounts of data. In this paper, we introduce a temporal difference version of contrastive predictive coding that stitches together pieces of different time series data to decrease the amount of data required to learn predictions of future events. We apply this representation learning method to derive an off-policy algorithm for goal-conditioned RL. Experiments demonstrate that, compared with prior RL methods, ours achieves $2 \times$ median improvement in success rates and can better cope with stochastic environments. In tabular settings, we show that our method is about $20 \times$ more sample efficient than the successor representation and $1500 \times$ more sample efficient than the standard (Monte Carlo) version of contrastive predictive coding. \ No newline at end of file diff --git a/data/2024/iclr/Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning b/data/2024/iclr/Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/ControlVideo: Training-free Controllable Text-to-video Generation b/data/2024/iclr/ControlVideo: Training-free Controllable Text-to-video Generation new file mode 100644 index 0000000000..91719e5402 --- /dev/null +++ b/data/2024/iclr/ControlVideo: Training-free Controllable Text-to-video Generation @@ -0,0 +1 @@ +Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterpart still lags behind due to the excessive training cost of temporal modeling. Besides the training burden, the generated videos also suffer from appearance inconsistency and structural flickers, especially in long video synthesis. To address these challenges, we design a \emph{training-free} framework called \textbf{ControlVideo} to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages coarsely structural consistency from input motion sequences, and introduces three modules to improve video generation. Firstly, to ensure appearance coherence between frames, ControlVideo adds fully cross-frame interaction in self-attention modules. Secondly, to mitigate the flicker effect, it introduces an interleaved-frame smoother that employs frame interpolation on alternated frames. Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that separately synthesizes each short clip with holistic coherency. Empowered with these modules, ControlVideo outperforms the state-of-the-arts on extensive motion-prompt pairs quantitatively and qualitatively. Notably, thanks to the efficient designs, it generates both short and long videos within several minutes using one NVIDIA 2080Ti. 
Code is available at https://github.com/YBYBZhang/ControlVideo. \ No newline at end of file diff --git a/data/2024/iclr/Controlled Text Generation via Language Model Arithmetic b/data/2024/iclr/Controlled Text Generation via Language Model Arithmetic new file mode 100644 index 0000000000..188d48766c --- /dev/null +++ b/data/2024/iclr/Controlled Text Generation via Language Model Arithmetic @@ -0,0 +1 @@ +As Large Language Models (LLMs) are deployed more widely, customization with respect to vocabulary, style, and character becomes more important. In this work, we introduce model arithmetic, a novel inference framework for composing and biasing LLMs without the need for model (re)training or highly specific datasets. In addition, the framework allows for more precise control of generated text than direct prompting and prior controlled text generation (CTG) techniques. Using model arithmetic, we can express prior CTG techniques as simple formulas and naturally extend them to new and more effective formulations. Further, we show that speculative sampling, a technique for efficient LLM sampling, extends to our setting. This enables highly efficient text generation with multiple composed models with only marginal overhead over a single model. Our empirical evaluation demonstrates that model arithmetic allows fine-grained control of generated text while outperforming state-of-the-art on the task of toxicity reduction. We release an open source easy-to-use implementation of our framework at https://github.com/eth-sri/language-model-arithmetic. \ No newline at end of file diff --git a/data/2024/iclr/Controlling Vision-Language Models for Multi-Task Image Restoration b/data/2024/iclr/Controlling Vision-Language Models for Multi-Task Image Restoration new file mode 100644 index 0000000000..004e693293 --- /dev/null +++ b/data/2024/iclr/Controlling Vision-Language Models for Multi-Task Image Restoration @@ -0,0 +1 @@ +Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn a high-fidelity image reconstruction. The controller itself will also output a degradation feature that matches the real corruptions of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both \emph{degradation-specific} and \emph{unified} image restoration tasks, showing a promising direction of prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir. 
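A minimal PyTorch-style sketch of such a controller is given below, assuming a frozen image encoder that returns one feature vector per image; the head shapes, names, and the way the content embedding would be consumed by cross-attention are illustrative assumptions, not DA-CLIP's actual implementation.

```python
import torch
import torch.nn as nn

class DegradationAwareController(nn.Module):
    """Sketch of a DA-CLIP-style controller on top of a frozen image encoder.

    `clip_image_encoder` is assumed to map an image batch to (B, d) features and
    is kept frozen; only the controller heads are trained.
    """
    def __init__(self, clip_image_encoder, feat_dim=512, num_degradations=10):
        super().__init__()
        self.encoder = clip_image_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # Two small heads: one predicts a "clean-content" embedding intended to
        # match the embedding of the underlying high-quality image, the other
        # predicts a degradation embedding used to classify the corruption type.
        self.content_head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                                          nn.Linear(feat_dim, feat_dim))
        self.degradation_head = nn.Linear(feat_dim, num_degradations)

    def forward(self, lq_image):
        with torch.no_grad():
            feat = self.encoder(lq_image)      # frozen features of the degraded input
        content = self.content_head(feat)      # to be injected into the restoration net via cross-attention
        degradation_logits = self.degradation_head(feat)
        return content, degradation_logits

# Toy frozen "encoder": a linear layer standing in for a CLIP image tower.
frozen_encoder = nn.Linear(3 * 32 * 32, 512)
controller = DegradationAwareController(frozen_encoder)
lq = torch.rand(4, 3 * 32 * 32)
content_emb, degradation_logits = controller(lq)
```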
\ No newline at end of file diff --git a/data/2024/iclr/Convergence of Bayesian Bilevel Optimization b/data/2024/iclr/Convergence of Bayesian Bilevel Optimization new file mode 100644 index 0000000000..a0e0e9382f --- /dev/null +++ b/data/2024/iclr/Convergence of Bayesian Bilevel Optimization @@ -0,0 +1 @@ +This paper presents the first theoretical guarantee for Bayesian bilevel optimization (BBO) that we term for the prevalent bilevel framework combining Bayesian optimization at the outer level to tune hyperparameters, and the inner-level stochastic gradient descent (SGD) for training the model. We prove sublinear regret bounds suggesting simultaneous convergence of the inner-level model parameters and outer-level hyperparameters to optimal configurations for generalization capability. A pivotal, technical novelty in the proofs is modeling the excess risk of the SGD-trained parameters as evaluation noise during Bayesian optimization. Our theory implies the inner unit horizon, defined as the number of SGD iterations, shapes the convergence behavior of BBO. This suggests practical guidance on configuring the inner unit horizon to enhance training efficiency and model performance. \ No newline at end of file diff --git a/data/2024/iclr/Conversational Drug Editing Using Retrieval and Domain Feedback b/data/2024/iclr/Conversational Drug Editing Using Retrieval and Domain Feedback new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model b/data/2024/iclr/Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model new file mode 100644 index 0000000000..230154b099 --- /dev/null +++ b/data/2024/iclr/Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model @@ -0,0 +1 @@ +The Segment Anything Model (SAM) stands as a foundational framework for image segmentation. While it exhibits remarkable zero-shot generalization in typical scenarios, its advantage diminishes when applied to specialized domains like medical imagery and remote sensing. To address this limitation, this paper introduces Conv-LoRA, a simple yet effective parameter-efficient fine-tuning approach. By integrating ultra-lightweight convolutional parameters into Low-Rank Adaptation (LoRA), Conv-LoRA can inject image-related inductive biases into the plain ViT encoder, further reinforcing SAM's local prior assumption. Notably, Conv-LoRA not only preserves SAM's extensive segmentation knowledge but also revives its capacity of learning high-level image semantics, which is constrained by SAM's foreground-background segmentation pretraining. Comprehensive experimentation across diverse benchmarks spanning multiple domains underscores Conv-LoRA's superiority in adapting SAM to real-world semantic segmentation tasks. \ No newline at end of file diff --git a/data/2024/iclr/Convolutional Deep Kernel Machines b/data/2024/iclr/Convolutional Deep Kernel Machines new file mode 100644 index 0000000000..11f7665127 --- /dev/null +++ b/data/2024/iclr/Convolutional Deep Kernel Machines @@ -0,0 +1 @@ +Standard infinite-width limits of neural networks sacrifice the ability for intermediate layers to learn representations from data. Recent work (A theory of representation learning gives a deep generalisation of kernel methods, Yang et al. 2023) modified the Neural Network Gaussian Process (NNGP) limit of Bayesian neural networks so that representation learning is retained. 
Furthermore, they found that applying this modified limit to a deep Gaussian process gives a practical learning algorithm which they dubbed the deep kernel machine (DKM). However, they only considered the simplest possible setting: regression in small, fully connected networks with e.g. 10 input features. Here, we introduce convolutional deep kernel machines. This required us to develop a novel inter-domain inducing point approximation, as well as introducing and experimentally assessing a number of techniques not previously seen in DKMs, including analogues to batch normalisation, different likelihoods, and different types of top-layer. The resulting model trains in roughly 77 GPU hours, achieving around 99% test accuracy on MNIST, 72% on CIFAR-100, and 92.7% on CIFAR-10, which is SOTA for kernel methods. \ No newline at end of file diff --git a/data/2024/iclr/Coordinate-Aware Modulation for Neural Fields b/data/2024/iclr/Coordinate-Aware Modulation for Neural Fields new file mode 100644 index 0000000000..20032dc91d --- /dev/null +++ b/data/2024/iclr/Coordinate-Aware Modulation for Neural Fields @@ -0,0 +1 @@ +Neural fields, mapping low-dimensional input coordinates to corresponding signals, have shown promising results in representing various signals. Numerous methodologies have been proposed, and techniques employing MLPs and grid representations have achieved substantial success. MLPs offer compactness and high expressibility, yet often suffer from spectral bias and slow convergence speed. On the other hand, methods using grids are free from spectral bias and achieve fast training speed, however, at the expense of high spatial complexity. In this work, we propose a novel way of exploiting both MLPs and grid representations in neural fields. Unlike the prevalent methods that combine them sequentially (extract features from the grids first and feed them to the MLP), we inject spectral bias-free grid representations into the intermediate features in the MLP. More specifically, we suggest a Coordinate-Aware Modulation (CAM), which modulates the intermediate features using scale and shift parameters extracted from the grid representations. This can maintain the strengths of MLPs while mitigating any remaining potential biases, facilitating the rapid learning of high-frequency components. In addition, we empirically found that feature normalizations, which have not been successful in the neural field literature, proved to be effective when applied in conjunction with the proposed CAM. Experimental results demonstrate that CAM enhances the performance of neural representation and improves learning stability across a range of signals. Especially in the novel view synthesis task, we achieved state-of-the-art performance with the least number of parameters and fast training speed for dynamic scenes and the best performance under 1MB memory for static scenes. CAM also outperforms the best-performing video compression methods using neural fields by a large margin. \ No newline at end of file diff --git a/data/2024/iclr/Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion b/data/2024/iclr/Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion new file mode 100644 index 0000000000..977780485d --- /dev/null +++ b/data/2024/iclr/Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion @@ -0,0 +1 @@ +Learning world models can teach an agent how the world works in an unsupervised manner.
Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose Copilot4D, a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer as discrete diffusion and enhance it with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, Copilot4D reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics. \ No newline at end of file diff --git a/data/2024/iclr/Copula Conformal prediction for multi-step time series prediction b/data/2024/iclr/Copula Conformal prediction for multi-step time series prediction new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Correlated Noise Provably Beats Independent Noise for Differentially Private Learning b/data/2024/iclr/Correlated Noise Provably Beats Independent Noise for Differentially Private Learning new file mode 100644 index 0000000000..a969fe12f0 --- /dev/null +++ b/data/2024/iclr/Correlated Noise Provably Beats Independent Noise for Differentially Private Learning @@ -0,0 +1 @@ +Differentially private learning algorithms inject noise into the learning process. While the most common private learning algorithm, DP-SGD, adds independent Gaussian noise in each iteration, recent work on matrix factorization mechanisms has shown empirically that introducing correlations in the noise can greatly improve their utility. We characterize the asymptotic learning utility for any choice of the correlation function, giving precise analytical bounds for linear regression and as the solution to a convex program for general convex functions. We show, using these bounds, how correlated noise provably improves upon vanilla DP-SGD as a function of problem parameters such as the effective dimension and condition number. Moreover, our analytical expression for the near-optimal correlation function circumvents the cubic complexity of the semi-definite program used to optimize the noise correlation matrix in previous work. We validate our theory with experiments on private deep learning. Our work matches or outperforms prior work while being efficient both in terms of compute and memory. \ No newline at end of file diff --git a/data/2024/iclr/Counterfactual Density Estimation using Kernel Stein Discrepancies b/data/2024/iclr/Counterfactual Density Estimation using Kernel Stein Discrepancies new file mode 100644 index 0000000000..c9e60609e0 --- /dev/null +++ b/data/2024/iclr/Counterfactual Density Estimation using Kernel Stein Discrepancies @@ -0,0 +1 @@ +Causal effects are usually studied in terms of the means of counterfactual distributions, which may be insufficient in many scenarios. 
Given a class of densities known up to normalizing constants, we propose to model counterfactual distributions by minimizing kernel Stein discrepancies in a doubly robust manner. This enables the estimation of counterfactuals over large classes of distributions while exploiting the desired double robustness. We present a theoretical analysis of the proposed estimator, providing sufficient conditions for consistency and asymptotic normality, as well as an examination of its empirical performance. \ No newline at end of file diff --git a/data/2024/iclr/Counting Graph Substructures with Graph Neural Networks b/data/2024/iclr/Counting Graph Substructures with Graph Neural Networks new file mode 100644 index 0000000000..7c92af98c2 --- /dev/null +++ b/data/2024/iclr/Counting Graph Substructures with Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) are powerful representation learning tools that have achieved remarkable performance in various downstream tasks. However, there are still open questions regarding their ability to count and list substructures, which play a crucial role in biological and social networks. In this work, we fill this gap and characterize the representation and generalization power of GNNs in terms of their ability to produce powerful representations that count substructures. In particular, we study the message-passing operations of GNNs with random node input in a novel fashion, and show how they can produce equivariant representations that are associated with high-order statistical moments. Using these representations, we prove that GNNs can learn how to count cycles, cliques, quasi-cliques, and the number of connected components in a graph. We also provide new insights into the generalization capacity of GNNs. Our analysis is constructive and enables the design of a generic GNN architecture that shows remarkable performance in four distinct tasks: cycle detection, cycle counting, graph classification, and molecular property prediction. \ No newline at end of file diff --git a/data/2024/iclr/Course Correcting Koopman Representations b/data/2024/iclr/Course Correcting Koopman Representations new file mode 100644 index 0000000000..7b08d6d02c --- /dev/null +++ b/data/2024/iclr/Course Correcting Koopman Representations @@ -0,0 +1 @@ +Koopman representations aim to learn features of nonlinear dynamical systems (NLDS) which lead to linear dynamics in the latent space. Theoretically, such features can be used to simplify many problems in modeling and control of NLDS. In this work we study autoencoder formulations of this problem, and different ways they can be used to model dynamics, specifically for future state prediction over long horizons. We discover several limitations of predicting future states in the latent space and propose an inference-time mechanism, which we refer to as Periodic Reencoding, for faithfully capturing long term dynamics. We justify this method both analytically and empirically via experiments in low and high dimensional NLDS. 
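A minimal sketch of the Periodic Reencoding idea described above may help: predictions are rolled out with the learned linear latent dynamics, and every few steps the latent state is decoded and re-encoded to pull it back toward the encoder's manifold. The module names, sizes, and re-encoding period below are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative sketch of Periodic Reencoding: advance a linear latent (Koopman-style)
# model step by step, and every `period` steps decode to observation space and
# re-encode to correct latent drift. All modules and sizes are assumptions.
import torch
import torch.nn as nn

obs_dim, latent_dim = 8, 16
encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, obs_dim))
koopman = nn.Linear(latent_dim, latent_dim, bias=False)   # learned linear latent dynamics

@torch.no_grad()
def rollout(x0: torch.Tensor, horizon: int, period: int) -> torch.Tensor:
    """Predict future observations from x0; a very large `period` recovers a pure latent rollout."""
    z = encoder(x0)
    preds = []
    for t in range(1, horizon + 1):
        z = koopman(z)                        # one step of latent dynamics
        if t % period == 0:                   # periodic re-encoding step
            z = encoder(decoder(z))
        preds.append(decoder(z))
    return torch.stack(preds, dim=1)          # (batch, horizon, obs_dim)

x0 = torch.randn(4, obs_dim)
print(rollout(x0, horizon=50, period=10).shape)  # torch.Size([4, 50, 8])
```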
\ No newline at end of file diff --git a/data/2024/iclr/CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping b/data/2024/iclr/CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping new file mode 100644 index 0000000000..2da2550f38 --- /dev/null +++ b/data/2024/iclr/CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping @@ -0,0 +1 @@ +Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models are publicly available at https://github.com/tileb1/CrIBo. \ No newline at end of file diff --git a/data/2024/iclr/Critical Learning Periods Emerge Even in Deep Linear Networks b/data/2024/iclr/Critical Learning Periods Emerge Even in Deep Linear Networks new file mode 100644 index 0000000000..05b8f0a09a --- /dev/null +++ b/data/2024/iclr/Critical Learning Periods Emerge Even in Deep Linear Networks @@ -0,0 +1 @@ +Critical learning periods are periods early in development where temporary sensory deficits can have a permanent effect on behavior and learned representations. Despite the radical differences between biological and artificial networks, critical learning periods have been empirically observed in both systems. This suggests that critical periods may be fundamental to learning and not an accident of biology. Yet, why exactly critical periods emerge in deep networks is still an open question, and in particular it is unclear whether the critical periods observed in both systems depend on particular architectural or optimization details. To isolate the key underlying factors, we focus on deep linear network models, and show that, surprisingly, such networks also display much of the behavior seen in biology and artificial networks, while being amenable to analytical treatment. We show that critical periods depend on the depth of the model and structure of the data distribution. We also show analytically and in simulations that the learning of features is tied to competition between sources. Finally, we extend our analysis to multi-task learning to show that pre-training on certain tasks can damage the transfer performance on new tasks, and show how this depends on the relationship between tasks and the duration of the pre-training stage. To the best of our knowledge, our work provides the first analytically tractable model that sheds light into why critical learning periods emerge in biological and artificial networks. 
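As a toy illustration of the kind of deficit experiment discussed above (not the paper's actual protocol), one can train a two-layer deep linear network while one group of input features is zeroed out for an initial window of updates and then restored, and compare the final clean-data loss across deficit durations; the data, corruption scheme, and hyperparameters below are assumptions made purely for illustration.

```python
# Toy illustration (not the paper's experiments): a two-layer deep *linear* network
# trained while one input "source" is corrupted for the first `deficit_steps`
# updates, then restored. Varying `deficit_steps` probes whether early deficits
# leave a lasting effect on the final fit.
import torch

torch.manual_seed(0)
d_in, d_hidden, n, steps = 10, 32, 512, 3000
X = torch.randn(n, d_in)
w_true = torch.randn(d_in, 1)
y = X @ w_true

def train(deficit_steps: int) -> float:
    W1 = torch.randn(d_in, d_hidden, requires_grad=True)
    W2 = torch.randn(d_hidden, 1, requires_grad=True)
    opt = torch.optim.SGD([W1, W2], lr=1e-3)
    for t in range(steps):
        Xt = X.clone()
        if t < deficit_steps:
            Xt[:, : d_in // 2] = 0.0           # "blur" the first source early in training
        loss = ((Xt @ W1 @ W2 - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                       # evaluate on clean inputs
        return ((X @ W1 @ W2 - y) ** 2).mean().item()

for deficit in (0, 500, 2000):
    print(f"deficit={deficit:4d}  final clean loss={train(deficit):.4f}")
```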
\ No newline at end of file diff --git a/data/2024/iclr/Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing b/data/2024/iclr/Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning b/data/2024/iclr/CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning new file mode 100644 index 0000000000..10d7448447 --- /dev/null +++ b/data/2024/iclr/CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning @@ -0,0 +1 @@ +Human motion driven control (HMDC) is an effective approach for generating natural and compelling robot motions while preserving high-level semantics. However, establishing the correspondence between humans and robots with different body structures is not straightforward due to the mismatches in kinematics and dynamics properties, which causes intrinsic ambiguity to the problem. Many previous algorithms approach this motion retargeting problem with unsupervised learning, which requires the prerequisite skill sets. However, it will be extremely costly to learn all the skills without understanding the given human motions, particularly for high-dimensional robots. In this work, we introduce CrossLoco, a guided unsupervised reinforcement learning framework that simultaneously learns robot skills and their correspondence to human motions. Our key innovation is to introduce a cycle-consistency-based reward term designed to maximize the mutual information between human motions and robot states. We demonstrate that the proposed framework can generate compelling robot motions by translating diverse human motions, such as running, hopping, and dancing. We quantitatively compare our CrossLoco against the manually engineered and unsupervised baseline algorithms along with the ablated versions of our framework and demonstrate that our method translates human motions with better accuracy, diversity, and user preference. We also showcase its utility in other applications, such as synthesizing robot movements from language input and enabling interactive robot control. \ No newline at end of file diff --git a/data/2024/iclr/CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity b/data/2024/iclr/CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity new file mode 100644 index 0000000000..38e8ddc25a --- /dev/null +++ b/data/2024/iclr/CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity @@ -0,0 +1 @@ +Sample efficiency is a crucial problem in deep reinforcement learning. Recent algorithms, such as REDQ and DroQ, found a way to improve the sample efficiency by increasing the update-to-data (UTD) ratio to 20 gradient update steps on the critic per environment sample. However, this comes at the expense of a greatly increased computational cost. To reduce this computational burden, we introduce CrossQ: A lightweight algorithm for continuous control tasks that makes careful use of Batch Normalization and removes target networks to surpass the current state-of-the-art in sample efficiency while maintaining a low UTD ratio of 1. Notably, CrossQ does not rely on advanced bias-reduction schemes used in current methods. 
CrossQ's contributions are threefold: (1) it matches or surpasses current state-of-the-art methods in terms of sample efficiency, (2) it substantially reduces the computational cost compared to REDQ and DroQ, and (3) it is easy to implement, requiring just a few lines of code on top of SAC. \ No newline at end of file diff --git a/data/2024/iclr/Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding b/data/2024/iclr/Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding new file mode 100644 index 0000000000..f126c46b7c --- /dev/null +++ b/data/2024/iclr/Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding @@ -0,0 +1 @@ +Predicting physical properties of materials from their crystal structures is a fundamental problem in materials science. In peripheral areas such as the prediction of molecular properties, fully connected attention networks have been shown to be successful. However, unlike these finite atom arrangements, crystal structures are infinitely repeating, periodic arrangements of atoms, whose fully connected attention results in infinitely connected attention. In this work, we show that this infinitely connected attention can lead to a computationally tractable formulation, interpreted as neural potential summation, that performs infinite interatomic potential summations in a deeply learned feature space. We then propose a simple yet effective Transformer-based encoder architecture for crystal structures called Crystalformer. Compared to an existing Transformer-based model, the proposed model requires only 29.4% of the number of parameters, with minimal modifications to the original Transformer architecture. Despite the architectural simplicity, the proposed method outperforms state-of-the-art methods for various property regression tasks on the Materials Project and JARVIS-DFT datasets. \ No newline at end of file diff --git a/data/2024/iclr/Curiosity-driven Red-teaming for Large Language Models b/data/2024/iclr/Curiosity-driven Red-teaming for Large Language Models new file mode 100644 index 0000000000..72cd4e3a47 --- /dev/null +++ b/data/2024/iclr/Curiosity-driven Red-teaming for Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a \textit{red team} of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods are only able to generate a small number of effective test cases, resulting in low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration that optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods.
Our method, CRT, successfully provokes toxic responses from a LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at \url{https://github.com/Improbable-AI/curiosity_redteam} \ No newline at end of file diff --git a/data/2024/iclr/Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning b/data/2024/iclr/Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning new file mode 100644 index 0000000000..6963c36e13 --- /dev/null +++ b/data/2024/iclr/Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning @@ -0,0 +1 @@ +Modular and composable transfer learning is an emerging direction in the field of Parameter Efficient Fine-Tuning, as it enables neural networks to better organize various aspects of knowledge, leading to improved cross-task generalization. In this paper, we introduce a novel approach, Customized Polytropon (C-Poly), that combines task-common skills and task-specific skills, with the skill parameters being highly parameterized using low-rank techniques. Each task is associated with a customizable number of exclusive specialized skills and also benefits from skills shared with peer tasks. A skill assignment matrix is jointly learned. To evaluate our approach, we conducted extensive experiments on the Super-NaturalInstructions and the SuperGLUE benchmarks. Our findings demonstrate that C-Poly outperforms fully-shared, task-specific, and skill-indistinguishable baselines, significantly enhancing the sample efficiency in multi-task learning scenarios. \ No newline at end of file diff --git a/data/2024/iclr/Cycle Consistency Driven Object Discovery b/data/2024/iclr/Cycle Consistency Driven Object Discovery new file mode 100644 index 0000000000..5296dee4ac --- /dev/null +++ b/data/2024/iclr/Cycle Consistency Driven Object Discovery @@ -0,0 +1 @@ +Developing deep learning models that effectively learn object-centric representations, akin to human cognition, remains a challenging task. Existing approaches facilitate object discovery by representing objects as fixed-size vectors, called ``slots'' or ``object files''. While these approaches have shown promise in certain scenarios, they still exhibit certain limitations. First, they rely on architectural priors which can be unreliable and usually require meticulous engineering to identify the correct objects. Second, there has been a notable gap in investigating the practical utility of these representations in downstream tasks. To address the first limitation, we introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot. We formalize this constraint by introducing consistency objectives which are cyclic in nature. By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance. These enhancements consistently hold true across both synthetic and real-world scenes, underscoring the effectiveness and adaptability of the proposed approach. To tackle the second limitation, we apply the learned object-centric representations from the proposed method to two downstream reinforcement learning tasks, demonstrating considerable performance enhancements compared to conventional slot-based and monolithic representation learning methods.
Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks. \ No newline at end of file diff --git a/data/2024/iclr/D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning b/data/2024/iclr/D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning new file mode 100644 index 0000000000..3ed9fc90cc --- /dev/null +++ b/data/2024/iclr/D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning @@ -0,0 +1 @@ +Analytical theories suggest that higher-quality data can lead to lower test errors in models trained on a fixed data budget. Moreover, a model can be trained on a lower compute budget without compromising performance if a dataset can be stripped of its redundancies. Coreset selection (or data pruning) seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as coreset. There are two dominant approaches: (1) geometry-based data selection for maximizing data diversity in the coreset, and (2) functions that assign difficulty scores to samples based on training dynamics. Optimizing for data diversity leads to a coreset that is biased towards easier samples, whereas, selection by difficulty ranking omits easy samples that are necessary for the training of deep learning models. This demonstrates that data diversity and importance scores are two complementary factors that need to be jointly considered during coreset selection. We represent a dataset as an undirected graph and propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection. D2 Pruning updates the difficulty scores of each example by incorporating the difficulty of its neighboring examples in the dataset graph. Then, these updated difficulty scores direct a graph-based sampling method to select a coreset that encapsulates both diverse and difficult regions of the dataset space. We evaluate supervised and self-supervised versions of our method on various vision and language datasets. Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates. Additionally, we find that using D2 Pruning for filtering large multimodal datasets leads to increased diversity in the dataset and improved generalization of pretrained models. \ No newline at end of file diff --git a/data/2024/iclr/DAFA: Distance-Aware Fair Adversarial Training b/data/2024/iclr/DAFA: Distance-Aware Fair Adversarial Training new file mode 100644 index 0000000000..da34efdde7 --- /dev/null +++ b/data/2024/iclr/DAFA: Distance-Aware Fair Adversarial Training @@ -0,0 +1 @@ +The disparity in accuracy between classes in standard training is amplified during adversarial training, a phenomenon termed the robust fairness problem. Existing methodologies aimed to enhance robust fairness by sacrificing the model's performance on easier classes in order to improve its performance on harder ones. However, we observe that under adversarial attacks, the majority of the model's predictions for samples from the worst class are biased towards classes similar to the worst class, rather than towards the easy classes. Through theoretical and empirical analysis, we demonstrate that robust fairness deteriorates as the distance between classes decreases. 
Motivated by these insights, we introduce the Distance-Aware Fair Adversarial training (DAFA) methodology, which addresses robust fairness by taking into account the similarities between classes. Specifically, our method assigns distinct loss weights and adversarial margins to each class and adjusts them to encourage a trade-off in robustness among similar classes. Experimental results across various datasets demonstrate that our method not only maintains average robust accuracy but also significantly improves the worst robust accuracy, indicating a marked improvement in robust fairness compared to existing methods. \ No newline at end of file diff --git a/data/2024/iclr/DAM: Towards a Foundation Model for Forecasting b/data/2024/iclr/DAM: Towards a Foundation Model for Forecasting new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DATS: Difficulty-Aware Task Sampler for Meta-Learning Physics-Informed Neural Networks b/data/2024/iclr/DATS: Difficulty-Aware Task Sampler for Meta-Learning Physics-Informed Neural Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DDMI: Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations b/data/2024/iclr/DDMI: Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations new file mode 100644 index 0000000000..12e220a980 --- /dev/null +++ b/data/2024/iclr/DDMI: Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations @@ -0,0 +1 @@ +Recent studies have introduced a new class of generative models for synthesizing implicit neural representations (INRs) that capture arbitrary continuous signals in various domains. These models opened the door for domain-agnostic generative models, but they often fail to achieve high-quality generation. We observed that the existing methods generate the weights of neural networks to parameterize INRs and evaluate the network with fixed positional embeddings (PEs). Arguably, this architecture limits the expressive power of generative models and results in low-quality INR generation. To address this limitation, we propose Domain-agnostic Latent Diffusion Model for INRs (DDMI) that generates adaptive positional embeddings instead of neural networks' weights. Specifically, we develop a Discrete-to-continuous space Variational AutoEncoder (D2C-VAE), which seamlessly connects discrete data and the continuous signal functions in the shared latent space. Additionally, we introduce a novel conditioning mechanism for evaluating INRs with the hierarchically decomposed PEs to further enhance expressive power. Extensive experiments across four modalities, e.g., 2D images, 3D shapes, Neural Radiance Fields, and videos, with seven benchmark datasets, demonstrate the versatility of DDMI and its superior performance compared to the existing INR generative models. \ No newline at end of file diff --git a/data/2024/iclr/DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation b/data/2024/iclr/DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation new file mode 100644 index 0000000000..1b2ed0cd74 --- /dev/null +++ b/data/2024/iclr/DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation @@ -0,0 +1 @@ +We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. 
DFormer has two new key innovations: 1) Unlike previous works that encode RGB-D information with an RGB-pretrained backbone, we pretrain the backbone using image-depth pairs from ImageNet-1K, and hence DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB pretrained backbones, an issue that is widespread in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets. Our code is available at: https://github.com/VCIP-RGBD/DFormer. \ No newline at end of file diff --git a/data/2024/iclr/DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models b/data/2024/iclr/DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models new file mode 100644 index 0000000000..69bf192cb7 --- /dev/null +++ b/data/2024/iclr/DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models @@ -0,0 +1 @@ +Recent text-to-image diffusion models have shown surprising performance in generating high-quality images. However, concerns have arisen regarding the unauthorized usage of data during the training process. One example is when a model trainer collects a set of images created by a particular artist and attempts to train a model capable of generating similar images without obtaining permission from the artist. To address this issue, it becomes crucial to detect unauthorized data usage. In this paper, we propose a method for detecting such unauthorized data usage by planting injected memorization into the text-to-image diffusion models trained on the protected dataset. Specifically, we modify the protected image dataset by adding unique content to the images, such as stealthy image wrapping functions that are imperceptible to human vision but can be captured and memorized by diffusion models. By analyzing whether the model has memorization for the injected content (i.e., whether the generated images are processed by the chosen post-processing function), we can detect models that have illegally utilized the unauthorized data. Our experiments conducted on Stable Diffusion and LoRA models demonstrate the effectiveness of the proposed method in detecting unauthorized data usages. \ No newline at end of file diff --git a/data/2024/iclr/DIFFTACTILE: A Physics-based Differentiable Tactile Simulator for Contact-rich Robotic Manipulation b/data/2024/iclr/DIFFTACTILE: A Physics-based Differentiable Tactile Simulator for Contact-rich Robotic Manipulation new file mode 100644 index 0000000000..af761cb1ba --- /dev/null +++ b/data/2024/iclr/DIFFTACTILE: A Physics-based Differentiable Tactile Simulator for Contact-rich Robotic Manipulation @@ -0,0 +1 @@ +We introduce DIFFTACTILE, a physics-based differentiable tactile simulation system designed to enhance robotic manipulation with dense and physically accurate tactile feedback.
In contrast to prior tactile simulators which primarily focus on manipulating rigid bodies and often rely on simplified approximations to model stress and deformations of materials in contact, DIFFTACTILE emphasizes physics-based contact modeling with high fidelity, supporting simulations of diverse contact modes and interactions with objects possessing a wide range of material properties. Our system incorporates several key components, including a Finite Element Method (FEM)-based soft body model for simulating the sensing elastomer, a multi-material simulator for modeling diverse object types (such as elastic, elastoplastic, cables) under manipulation, a penalty-based contact model for handling contact dynamics. The differentiable nature of our system facilitates gradient-based optimization for both 1) refining physical properties in simulation using real-world data, hence narrowing the sim-to-real gap and 2) efficient learning of tactile-assisted grasping and contact-rich manipulation skills. Additionally, we introduce a method to infer the optical response of our tactile sensor to contact using an efficient pixel-based neural module. We anticipate that DIFFTACTILE will serve as a useful platform for studying contact-rich manipulations, leveraging the benefits of dense tactile feedback and differentiable physics. Code and supplementary materials are available at the project website https://difftactile.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/DMBP: Diffusion model-based predictor for robust offline reinforcement learning against state observation perturbations b/data/2024/iclr/DMBP: Diffusion model-based predictor for robust offline reinforcement learning against state observation perturbations new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model b/data/2024/iclr/DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model new file mode 100644 index 0000000000..dadce75001 --- /dev/null +++ b/data/2024/iclr/DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model @@ -0,0 +1 @@ +We propose \textbf{DMV3D}, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view images via NeRF reconstruction and rendering, achieving single-stage 3D generation in $\sim$30s on single A100 GPU. We train \textbf{DMV3D} on large-scale multi-view image datasets of highly diverse objects using only image reconstruction losses, without accessing 3D assets. We demonstrate state-of-the-art results for the single-image reconstruction problem where probabilistic modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results outperforming previous 3D diffusion models. Our project website is at: https://justimyhxu.github.io/projects/dmv3d/ . 
\ No newline at end of file diff --git a/data/2024/iclr/DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text b/data/2024/iclr/DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text new file mode 100644 index 0000000000..b67c7b9cf4 --- /dev/null +++ b/data/2024/iclr/DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text @@ -0,0 +1 @@ +Large language models (LLMs) have notably enhanced the fluency and diversity of machine-generated text. However, this progress also presents a significant challenge in detecting the origin of a given text, and current research on detection methods lags behind the rapid evolution of LLMs. Conventional training-based methods have limitations in flexibility, particularly when adapting to new domains, and they often lack explanatory power. To address this gap, we propose a novel training-free detection strategy called Divergent N-Gram Analysis (DNA-GPT). Given a text, we first truncate it in the middle and then use only the preceding portion as input to the LLMs to regenerate the new remaining parts. By analyzing the differences between the original and new remaining parts through N-gram analysis in the black-box setting or probability divergence in the white-box setting, we unveil significant discrepancies between the distribution of machine-generated text and the distribution of human-written text. We conducted extensive experiments on the most advanced LLMs from OpenAI, including text-davinci-003, GPT-3.5-turbo, and GPT-4, as well as open-source models such as GPT-NeoX-20B and LLaMa-13B. Results show that our zero-shot approach exhibits state-of-the-art performance in distinguishing between human and GPT-generated text on four English and one German dataset, outperforming OpenAI's own classifier, which is trained on millions of texts. Additionally, our methods provide reasonable explanations and evidence to support our claim, which is a unique feature of explainable detection. Our method is also robust under the revised text attack and can additionally solve model sourcing. Codes are available at https://github.com/Xianjun-Yang/DNA-GPT. \ No newline at end of file diff --git a/data/2024/iclr/DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes b/data/2024/iclr/DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DORSal: Diffusion for Object-centric Representations of Scenes et al b/data/2024/iclr/DORSal: Diffusion for Object-centric Representations of Scenes et al new file mode 100644 index 0000000000..ea33d1a989 --- /dev/null +++ b/data/2024/iclr/DORSal: Diffusion for Object-centric Representations of Scenes et al @@ -0,0 +1 @@ +Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing, are now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree.
In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on frozen object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and on the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches. \ No newline at end of file diff --git a/data/2024/iclr/DOS: Diverse Outlier Sampling for Out-of-Distribution Detection b/data/2024/iclr/DOS: Diverse Outlier Sampling for Out-of-Distribution Detection new file mode 100644 index 0000000000..3cc7f9cd20 --- /dev/null +++ b/data/2024/iclr/DOS: Diverse Outlier Sampling for Out-of-Distribution Detection @@ -0,0 +1 @@ +Modern neural networks are known to give overconfident prediction for out-of-distribution inputs when deployed in the open world. It is common practice to leverage a surrogate outlier dataset to regularize the model during training, and recent studies emphasize the role of uncertainty in designing the sampling strategy for outlier dataset. However, the OOD samples selected solely based on predictive uncertainty can be biased towards certain types, which may fail to capture the full outlier distribution. In this work, we empirically show that diversity is critical in sampling outliers for OOD detection performance. Motivated by the observation, we propose a straightforward and novel sampling strategy named DOS (Diverse Outlier Sampling) to select diverse and informative outliers. Specifically, we cluster the normalized features at each iteration, and the most informative outlier from each cluster is selected for model training with absent category loss. With DOS, the sampled outliers efficiently shape a globally compact decision boundary between ID and OOD data. Extensive experiments demonstrate the superiority of DOS, reducing the average FPR95 by up to 25.79% on CIFAR-100 with TI-300K. \ No newline at end of file diff --git a/data/2024/iclr/DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer b/data/2024/iclr/DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer new file mode 100644 index 0000000000..8cc7b9c1b4 --- /dev/null +++ b/data/2024/iclr/DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer @@ -0,0 +1 @@ +Large Language Models (LLMs) have emerged as dominant tools for various tasks, particularly when tailored for a specific target by prompt tuning. Nevertheless, concerns surrounding data privacy present obstacles due to the tuned prompts' dependency on sensitive private information. A practical solution is to host a local LLM and optimize a soft prompt privately using data. Yet, hosting a local model becomes problematic when model ownership is protected. Alternative methods, like sending data to the model's provider for training, intensify these privacy issues facing an untrusted provider. In this paper, we present a novel solution called Differentially-Private Offsite Prompt Tuning (DP-OPT) to address this challenge. Our approach involves tuning a discrete prompt on the client side and then applying it to the desired cloud models. We demonstrate that prompts suggested by LLMs themselves can be transferred without compromising performance significantly. 
To ensure that the prompts do not leak private information, we introduce the first private prompt generation mechanism, by a differentially-private (DP) ensemble of in-context learning with private demonstrations. With DP-OPT, generating privacy-preserving prompts by Vicuna-7b can yield competitive performance compared to non-private in-context learning on GPT3.5 or local private prompt tuning. Codes are available at https://github.com/VITA-Group/DP-OPT . \ No newline at end of file diff --git a/data/2024/iclr/DP-SGD Without Clipping: The Lipschitz Neural Network Way b/data/2024/iclr/DP-SGD Without Clipping: The Lipschitz Neural Network Way new file mode 100644 index 0000000000..af7a06f099 --- /dev/null +++ b/data/2024/iclr/DP-SGD Without Clipping: The Lipschitz Neural Network Way @@ -0,0 +1 @@ +State-of-the-art approaches for training Differentially Private (DP) Deep Neural Networks (DNN) face difficulties to estimate tight bounds on the sensitivity of the network's layers, and instead rely on a process of per-sample gradient clipping. This clipping process not only biases the direction of gradients but also proves costly both in memory consumption and in computation. To provide sensitivity bounds and bypass the drawbacks of the clipping process, we propose to rely on Lipschitz constrained networks. Our theoretical analysis reveals an unexplored link between the Lipschitz constant with respect to their input and the one with respect to their parameters. By bounding the Lipschitz constant of each layer with respect to its parameters, we prove that we can train these networks with privacy guarantees. Our analysis not only allows the computation of the aforementioned sensitivities at scale, but also provides guidance on how to maximize the gradient-to-noise ratio for fixed privacy guarantees. The code has been released as a Python package available at https://github.com/Algue-Rythme/lip-dp \ No newline at end of file diff --git a/data/2024/iclr/DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning b/data/2024/iclr/DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DREAM: Dual Structured Exploration with Mixup for Open-set Graph Domain Adaption b/data/2024/iclr/DREAM: Dual Structured Exploration with Mixup for Open-set Graph Domain Adaption new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness b/data/2024/iclr/DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness new file mode 100644 index 0000000000..0ab8d4f5de --- /dev/null +++ b/data/2024/iclr/DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness @@ -0,0 +1 @@ +Machine Learning (ML) models have been utilized for malware detection for over two decades. Consequently, this ignited an ongoing arms race between malware authors and antivirus systems, compelling researchers to propose defenses for malware-detection models against evasion attacks. However, most if not all existing defenses against evasion attacks suffer from sizable performance degradation and/or can defend against only specific attacks, which makes them less practical in real-world settings. In this work, we develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection. 
Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables. After showing how DRSM is theoretically robust against attacks with contiguous adversarial bytes, we verify its performance and certified robustness experimentally, where we observe only marginal accuracy drops as the cost of robustness. To our knowledge, we are the first to offer certified robustness in the realm of static detection of malware executables. More surprisingly, through evaluating DRSM against 9 empirical attacks of different types, we observe that the proposed defense is empirically robust to some extent against a diverse set of attacks, some of which even fall out of the scope of its original threat model. In addition, we collected 15.5K recent benign raw executables from diverse sources, which will be made public as a dataset called PACE (Publicly Accessible Collection(s) of Executables) to alleviate the scarcity of publicly available benign datasets for studying malware detection and provide future research with more representative data of the time. \ No newline at end of file diff --git a/data/2024/iclr/DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines b/data/2024/iclr/DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation b/data/2024/iclr/DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation new file mode 100644 index 0000000000..e33573f777 --- /dev/null +++ b/data/2024/iclr/DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation @@ -0,0 +1 @@ +Accurate 3D lane estimation is crucial for ensuring safety in autonomous driving. However, prevailing monocular techniques suffer from depth loss and lighting variations, hampering accurate 3D lane detection. In contrast, LiDAR points offer geometric cues and enable precise localization. In this paper, we present DV-3DLane, a novel end-to-end Dual-View multi-modal 3D Lane detection framework that synergizes the strengths of both images and LiDAR points. We propose to learn multi-modal features in dual-view spaces, i.e., perspective view (PV) and bird's-eye-view (BEV), effectively leveraging the modal-specific information. To achieve this, we introduce three designs: 1) A bidirectional feature fusion strategy that integrates multi-modal features into each view space, exploiting their unique strengths. 2) A unified query generation approach that leverages lane-aware knowledge from both PV and BEV spaces to generate queries. 3) A 3D dual-view deformable attention mechanism, which aggregates discriminative features from both PV and BEV spaces into queries for accurate 3D lane detection. Extensive experiments on the public benchmark, OpenLane, demonstrate the efficacy and efficiency of DV-3DLane. It achieves state-of-the-art performance, with a remarkable 11.2 gain in F1 score and a substantial 53.5% reduction in errors. The code is available at \url{https://github.com/JMoonr/dv-3dlane}. 
\ No newline at end of file diff --git a/data/2024/iclr/Data Debugging with Shapley Importance over Machine Learning Pipelines b/data/2024/iclr/Data Debugging with Shapley Importance over Machine Learning Pipelines new file mode 100644 index 0000000000..7ff5d26ece --- /dev/null +++ b/data/2024/iclr/Data Debugging with Shapley Importance over Machine Learning Pipelines @@ -0,0 +1 @@ +When a machine learning (ML) model exhibits poor quality (e.g., poor accuracy or fairness), the problem can often be traced back to errors in the training data. Being able to discover the data examples that are the most likely culprits is a fundamental concern that has received a lot of attention recently. One prominent way to measure "data importance" with respect to model quality is the Shapley value. Unfortunately, existing methods only focus on the ML model in isolation, without considering the broader ML pipeline for data preparation and feature extraction, which appears in the majority of real-world ML code. This presents a major limitation to applying existing methods in practical settings. In this paper, we propose Datascope, a method for efficiently computing Shapley-based data importance over ML pipelines. We introduce several approximations that lead to dramatic improvements in terms of computational speed. Finally, our experimental evaluation demonstrates that our methods are capable of data error discovery that is as effective as existing Monte Carlo baselines, and in some cases even outperform them. We release our code as an open-source data debugging library available at github.com/easeml/datascope. \ No newline at end of file diff --git a/data/2024/iclr/Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality b/data/2024/iclr/Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality new file mode 100644 index 0000000000..7e2a884aa3 --- /dev/null +++ b/data/2024/iclr/Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality @@ -0,0 +1 @@ +Dataset distillation aims to minimize the time and memory needed for training deep networks on large datasets, by creating a small set of synthetic images that has a similar generalization performance to that of the full dataset. However, current dataset distillation techniques fall short, showing a notable performance gap when compared to training on the original data. In this work, we are the first to argue that using just one synthetic subset for distillation will not yield optimal generalization performance. This is because the training dynamics of deep networks drastically change during training. Hence, multiple synthetic subsets are required to capture the training dynamics at different phases of training. To address this issue, we propose Progressive Dataset Distillation (PDD). PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets without requiring additional training time. Our extensive experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%. In addition, our method enables, for the first time, the generation of considerably larger synthetic datasets.
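The multi-stage scheme described above can be sketched as follows, using plain gradient matching as a stand-in for the base distillation objective (the abstract presents PDD as a wrapper around existing distillation methods); the toy data, objective, and hyperparameters are assumptions made only for illustration.

```python
# Schematic sketch of progressive, multi-stage distillation: each stage synthesizes
# a new small subset conditioned on the current model, then the model is trained on
# the cumulative union of all subsets so far. Gradient matching is a simple stand-in
# for the per-stage distillation objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, n_classes, n_real = 20, 4, 2000
X_real = torch.randn(n_real, d)
y_real = X_real[:, :n_classes].argmax(dim=1)              # toy labels

model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n_classes))

def grad_vector(loss, params, create_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

synthetic = []                                            # cumulative list of (X_syn, y_syn) stages
for stage in range(3):                                    # P progressive stages
    params = list(model.parameters())
    g_real = grad_vector(F.cross_entropy(model(X_real), y_real), params)
    # 1) synthesize a new small subset, conditioned on the current model state
    idx = torch.randperm(n_real)[:16]
    X_syn = X_real[idx].clone().requires_grad_(True)
    y_syn = y_real[idx]
    syn_opt = torch.optim.Adam([X_syn], lr=0.05)
    for _ in range(100):
        g_syn = grad_vector(F.cross_entropy(model(X_syn), y_syn), params, create_graph=True)
        match_loss = 1 - F.cosine_similarity(g_real, g_syn, dim=0)
        syn_opt.zero_grad(); match_loss.backward(); syn_opt.step()
    synthetic.append((X_syn.detach(), y_syn))

    # 2) train the model on the cumulative union of all synthetic subsets so far
    X_union = torch.cat([x for x, _ in synthetic]); y_union = torch.cat([y for _, y in synthetic])
    model_opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(200):
        loss = F.cross_entropy(model(X_union), y_union)
        model_opt.zero_grad(); loss.backward(); model_opt.step()

acc = (model(X_real).argmax(1) == y_real).float().mean().item()
print(f"accuracy of a model trained only on the synthetic union: {acc:.2f}")
```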
\ No newline at end of file diff --git a/data/2024/iclr/Data Filtering Networks b/data/2024/iclr/Data Filtering Networks new file mode 100644 index 0000000000..0c00acb999 --- /dev/null +++ b/data/2024/iclr/Data Filtering Networks @@ -0,0 +1 @@ +Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data. \ No newline at end of file diff --git a/data/2024/iclr/DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models b/data/2024/iclr/DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models new file mode 100644 index 0000000000..71c38bc17a --- /dev/null +++ b/data/2024/iclr/DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models @@ -0,0 +1 @@ +Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores. Moreover, it can help to identify which data points are mislabeled. 
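A small sketch of the kind of layer-wise closed-form approximation the DataInf abstract alludes to: the average of damped per-example gradient outer products is inverted by swapping averaging and inversion and applying the Sherman-Morrison identity, so no explicit Hessian is ever formed. This is an illustrative reconstruction rather than the authors' reference implementation, and the damping value and sign convention are assumptions.

```python
# Illustrative, layer-wise influence approximation: swap the order of averaging and
# matrix inversion over per-example gradient outer products, then use the
# Sherman-Morrison identity so only vector operations are needed.
import numpy as np

def approx_ihvp(train_grads: np.ndarray, v: np.ndarray, damping: float = 0.1) -> np.ndarray:
    """train_grads: (n, d) per-example gradients for one layer; v: (d,) query gradient."""
    dots = train_grads @ v                                  # (n,)  g_i . v
    norms = np.sum(train_grads**2, axis=1)                  # (n,)  ||g_i||^2
    # (g_i g_i^T + damping*I)^{-1} v = (v - (g_i.v)/(damping + ||g_i||^2) * g_i) / damping
    per_example = (v[None, :] - (dots / (damping + norms))[:, None] * train_grads) / damping
    return per_example.mean(axis=0)                         # average the per-example closed forms

def influence_scores(train_grads: np.ndarray, val_grad: np.ndarray, damping: float = 0.1) -> np.ndarray:
    """Score each training point's influence on a validation loss (sign conventions vary)."""
    ihvp = approx_ihvp(train_grads, val_grad, damping)
    return -train_grads @ ihvp

rng = np.random.default_rng(0)
G = rng.normal(size=(100, 50))                              # 100 training examples, 50 layer parameters
v = rng.normal(size=50)                                     # gradient of a validation loss
print(influence_scores(G, v)[:5])
```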
\ No newline at end of file diff --git a/data/2024/iclr/Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation b/data/2024/iclr/Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation new file mode 100644 index 0000000000..0dfa0127a1 --- /dev/null +++ b/data/2024/iclr/Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation @@ -0,0 +1 @@ +Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model are consistent with the prompt-based answers. This kind of evaluation is naturally dependent on the quality of the underlying QG and VQA models. We identify and address several reliability challenges in existing QG/A work: (a) QG questions should respect the prompt (avoiding hallucinations, duplications, and omissions) and (b) VQA answers should be consistent (not asserting that there is no motorcycle in an image while also claiming the motorcycle is blue). We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics, which is adaptable to any QG/A frameworks. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions. \ No newline at end of file diff --git a/data/2024/iclr/De novo Protein Design Using Geometric Vector Field Networks b/data/2024/iclr/De novo Protein Design Using Geometric Vector Field Networks new file mode 100644 index 0000000000..b6c832647e --- /dev/null +++ b/data/2024/iclr/De novo Protein Design Using Geometric Vector Field Networks @@ -0,0 +1 @@ +Innovations like protein diffusion have enabled significant progress in de novo protein design, which is a vital topic in life science. These methods typically depend on protein structure encoders to model residue backbone frames, where atoms do not exist. Most prior encoders rely on atom-wise features, such as angles and distances between atoms, which are not available in this context. Thus far, only several simple encoders, such as IPA, have been proposed for this scenario, exposing the frame modeling as a bottleneck. In this work, we proffer the Vector Field Network (VFN), which enables network layers to perform learnable vector computations between coordinates of frame-anchored virtual atoms, thus achieving a higher capability for modeling frames. The vector computation operates in a manner similar to a linear layer, with each input channel receiving 3D virtual atom coordinates instead of scalar values. 
The multiple feature vectors output by the vector computation are then used to update the residue representations and virtual atom coordinates via attention aggregation. Remarkably, VFN also excels in modeling both frames and atoms, as the real atoms can be treated as the virtual atoms for modeling, positioning VFN as a potential universal encoder. In protein diffusion (frame modeling), VFN exhibits an impressive performance advantage over IPA, excelling in terms of both designability (67.04% vs. 53.58%) and diversity (66.54% vs. 51.98%). In inverse folding (frame and atom modeling), VFN outperforms the previous SoTA model, PiFold (54.7% vs. 51.66%), on sequence recovery rate. We also propose a method of equipping VFN with the ESM model, which surpasses the previous ESM-based SoTA, LM-Design, by a substantial margin (62.67% vs. 55.65%). \ No newline at end of file diff --git a/data/2024/iclr/DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning b/data/2024/iclr/DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning new file mode 100644 index 0000000000..d7f9ae2c70 --- /dev/null +++ b/data/2024/iclr/DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning @@ -0,0 +1 @@ +Prompt tuning (PT), where a small number of trainable soft (continuous) prompt vectors is affixed to the input of language models (LM), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters and does not drastically scale up its parameters as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly impacts training and inference time and memory usage due to the Transformer's quadratic complexity. This is particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are then optimised with two different learning rates. This allows DePT to achieve better performance while saving substantial memory and time costs compared to vanilla PT and its variants, without changing trainable parameter sizes. Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches, including the full fine-tuning baseline, in some scenarios. Additionally, we empirically show that DePT grows more efficient as the model size increases. Our further study reveals that DePT integrates seamlessly with parameter-efficient transfer learning in the few-shot learning setting and highlights its adaptability to various model architectures and sizes. \ No newline at end of file diff --git a/data/2024/iclr/Debiased Collaborative Filtering with Kernel-Based Causal Balancing b/data/2024/iclr/Debiased Collaborative Filtering with Kernel-Based Causal Balancing new file mode 100644 index 0000000000..178cf54298 --- /dev/null +++ b/data/2024/iclr/Debiased Collaborative Filtering with Kernel-Based Causal Balancing @@ -0,0 +1 @@ +Debiased collaborative filtering aims to learn an unbiased prediction model by removing different biases in observational datasets.
To solve this problem, one simple and effective method is based on the propensity score, which adjusts the observational sample distribution to the target one by reweighting observed instances. Ideally, propensity scores should be learned with causal balancing constraints. However, existing methods usually ignore such constraints or implement them with unreasonable approximations, which may affect the accuracy of the learned propensity scores. To bridge this gap, in this paper, we first analyze the gaps between the causal balancing requirements and existing methods such as learning the propensity with cross-entropy loss or manually selecting functions to balance. Inspired by these gaps, we propose to approximate the balancing functions in reproducing kernel Hilbert space and demonstrate that, based on the universal property and representer theorem of kernel functions, the causal balancing constraints can be better satisfied. Meanwhile, we propose an algorithm that adaptively balances the kernel function and theoretically analyze the generalization error bound of our methods. We conduct extensive experiments to demonstrate the effectiveness of our methods, and to promote this research direction, we have released our project at https://github.com/haoxuanli-pku/ICLR24-Kernel-Balancing. \ No newline at end of file diff --git a/data/2024/iclr/Debiasing Algorithm through Model Adaptation b/data/2024/iclr/Debiasing Algorithm through Model Adaptation new file mode 100644 index 0000000000..9a261e2147 --- /dev/null +++ b/data/2024/iclr/Debiasing Algorithm through Model Adaptation @@ -0,0 +1 @@ +Large language models are becoming the go-to solution for an ever-growing number of tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are the most prone to conveying bias. Based on the analysis results, we intervene in the model by applying a linear projection to the weight matrices of these layers. Our titular method, DAMA, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. We release code for our method and models, which retain LLaMA's state-of-the-art performance while being significantly less biased. \ No newline at end of file diff --git a/data/2024/iclr/Debiasing Attention Mechanism in Transformer without Demographics b/data/2024/iclr/Debiasing Attention Mechanism in Transformer without Demographics new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Deceptive Fairness Attacks on Graphs via Meta Learning b/data/2024/iclr/Deceptive Fairness Attacks on Graphs via Meta Learning new file mode 100644 index 0000000000..d6d155dc87 --- /dev/null +++ b/data/2024/iclr/Deceptive Fairness Attacks on Graphs via Meta Learning @@ -0,0 +1 @@ +We study deceptive fairness attacks on graphs to answer the following question: How can we achieve poisoning attacks on a graph learning model to exacerbate the bias deceptively? We answer this question via a bi-level optimization problem and propose a meta learning-based framework named FATE. FATE is broadly applicable with respect to various fairness definitions and graph learning models, as well as arbitrary choices of manipulation operations.
We further instantiate FATE to attack statistical parity and individual fairness on graph neural networks. We conduct extensive experimental evaluations on real-world datasets in the task of semi-supervised node classification. The experimental results demonstrate that FATE could amplify the bias of graph neural networks with or without fairness consideration while maintaining the utility on the downstream task. We hope this paper provides insights into the adversarial robustness of fair graph learning and can shed light on designing robust and fair graph learning in future studies. \ No newline at end of file diff --git a/data/2024/iclr/Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making b/data/2024/iclr/Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making new file mode 100644 index 0000000000..bfa188ed23 --- /dev/null +++ b/data/2024/iclr/Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making @@ -0,0 +1 @@ +The recent success of Transformer in natural language processing has sparked its use in various domains. In offline reinforcement learning (RL), Decision Transformer (DT) is emerging as a promising model based on Transformer. However, we discovered that the attention module of DT is not appropriate to capture the inherent local dependence pattern in trajectories of RL modeled as a Markov decision process. To overcome the limitations of DT, we propose a novel action sequence predictor, named Decision ConvFormer (DC), based on the architecture of MetaFormer, which is a general structure to process multiple entities in parallel and understand the interrelationship among the multiple entities. DC employs local convolution filtering as the token mixer and can effectively capture the inherent local associations of the RL dataset. In extensive experiments, DC achieved state-of-the-art performance across various standard RL benchmarks while requiring fewer resources. Furthermore, we show that DC better understands the underlying meaning in data and exhibits enhanced generalization capability. \ No newline at end of file diff --git a/data/2024/iclr/Decodable and Sample Invariant Continuous Object Encoder b/data/2024/iclr/Decodable and Sample Invariant Continuous Object Encoder new file mode 100644 index 0000000000..0aa07166bc --- /dev/null +++ b/data/2024/iclr/Decodable and Sample Invariant Continuous Object Encoder @@ -0,0 +1 @@ +We propose Hyper-Dimensional Function Encoding (HDFE). Given samples of a continuous object (e.g. a function), HDFE produces an explicit vector representation of the given object, invariant to the sample distribution and density. Sample distribution and density invariance enables HDFE to consistently encode continuous objects regardless of their sampling, and therefore allows neural networks to receive continuous objects as inputs for machine learning tasks, such as classification and regression. Besides, HDFE does not require any training and is proved to map the object into an organized embedding space, which facilitates the training of the downstream tasks. In addition, the encoding is decodable, which enables neural networks to regress continuous objects by regressing their encodings. Therefore, HDFE serves as an interface for processing continuous objects. We apply HDFE to function-to-function mapping, where vanilla HDFE achieves competitive performance as the state-of-the-art algorithm. 
We apply HDFE to point cloud surface normal estimation, where a simple replacement from PointNet to HDFE leads to immediate 12% and 15% error reductions in two benchmarks. In addition, by integrating HDFE into the PointNet-based SOTA network, we improve the SOTA baseline by 2.5% and 1.7% in the same benchmarks. \ No newline at end of file diff --git a/data/2024/iclr/Decoding Natural Images from EEG for Object Recognition b/data/2024/iclr/Decoding Natural Images from EEG for Object Recognition new file mode 100644 index 0000000000..7bbb560c90 --- /dev/null +++ b/data/2024/iclr/Decoding Natural Images from EEG for Object Recognition @@ -0,0 +1 @@ +Electroencephalography (EEG) signals, known for convenient non-invasive acquisition but low signal-to-noise ratio, have recently gained substantial attention due to the potential to decode natural images. This paper presents a self-supervised framework to demonstrate the feasibility of learning image representations from EEG signals, particularly for object recognition. The framework utilizes image and EEG encoders to extract features from paired image stimuli and EEG responses. Contrastive learning aligns these two modalities by constraining their similarity. With the framework, we attain significantly above-chance results on a comprehensive EEG-image dataset, achieving a top-1 accuracy of 15.6% and a top-5 accuracy of 42.8% in challenging 200-way zero-shot tasks. Moreover, we perform extensive experiments to explore the biological plausibility by resolving the temporal, spatial, spectral, and semantic aspects of EEG signals. Besides, we introduce attention modules to capture spatial correlations, providing implicit evidence of the brain activity perceived from EEG data. These findings yield valuable insights for neural decoding and brain-computer interfaces in real-world scenarios. The code will be released on https://github.com/eeyhsong/NICE-EEG. \ No newline at end of file diff --git a/data/2024/iclr/DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization b/data/2024/iclr/DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization new file mode 100644 index 0000000000..3dff53428c --- /dev/null +++ b/data/2024/iclr/DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization @@ -0,0 +1 @@ +Recently, 3D generative models have shown promising performances in structure-based drug design by learning to generate ligands given target binding sites. However, only modeling the target-ligand distribution can hardly fulfill one of the main goals in drug discovery -- designing novel ligands with desired properties, e.g., high binding affinity, easily synthesizable, etc. This challenge becomes particularly pronounced when the target-ligand pairs used for training do not align with these desired properties. Moreover, most existing methods aim at solving \textit{de novo} design task, while many generative scenarios requiring flexible controllability, such as R-group optimization and scaffold hopping, have received little attention. In this work, we propose DecompOpt, a structure-based molecular optimization method based on a controllable and decomposed diffusion model. DecompOpt presents a new generation paradigm which combines optimization with conditional diffusion models to achieve desired properties while adhering to the molecular grammar. 
Additionally, DecompOpt offers a unified framework covering both \textit{de novo} design and controllable generation. To achieve this, ligands are decomposed into substructures, which allows fine-grained control and local optimization. Experiments show that DecompOpt can efficiently generate molecules with better properties than strong de novo baselines, and demonstrates great potential in controllable generation tasks. \ No newline at end of file diff --git a/data/2024/iclr/Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems b/data/2024/iclr/Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems new file mode 100644 index 0000000000..fb19d33417 --- /dev/null +++ b/data/2024/iclr/Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems @@ -0,0 +1 @@ +The Krylov subspace, which is generated by multiplying a given vector by the matrix of a linear transformation and its successive powers, has been extensively studied in classical optimization literature to design algorithms that converge quickly for large linear inverse problems. For example, the conjugate gradient method (CG), one of the most popular Krylov subspace methods, is based on the idea of minimizing the residual error in the Krylov subspace. However, with the recent advancement of high-performance diffusion solvers for inverse problems, it is not clear how classical wisdom can be synergistically combined with modern diffusion models. In this study, we propose a novel and efficient diffusion sampling strategy that synergistically combines diffusion sampling and Krylov subspace methods. Specifically, we prove that if the tangent space at a denoised sample by Tweedie's formula forms a Krylov subspace, then CG initialized with the denoised data ensures that the data consistency update remains in the tangent space. This negates the need to compute the manifold-constrained gradient (MCG), leading to a more efficient diffusion sampling method. Our method is applicable regardless of the parametrization and setting (i.e., VE, VP). Notably, we achieve state-of-the-art reconstruction quality on challenging real-world medical inverse imaging problems, including multi-coil MRI reconstruction and 3D CT reconstruction. Moreover, our proposed method achieves more than 80 times faster inference time than the previous state-of-the-art method. Code is available at https://github.com/HJ-harry/DDS \ No newline at end of file diff --git a/data/2024/iclr/Decongestion by Representation: Learning to Improve Economic Welfare in Marketplaces b/data/2024/iclr/Decongestion by Representation: Learning to Improve Economic Welfare in Marketplaces new file mode 100644 index 0000000000..07ad973e84 --- /dev/null +++ b/data/2024/iclr/Decongestion by Representation: Learning to Improve Economic Welfare in Marketplaces @@ -0,0 +1 @@ +Congestion is a common failure mode of markets, where consumers compete inefficiently on the same subset of goods (e.g., chasing the same small set of properties on a vacation rental platform). The typical economic story is that prices decongest by balancing supply and demand. But in modern online marketplaces, prices are typically set in a decentralized way by sellers, and the information about items is inevitably partial. The power of a platform is limited to controlling representations -- the subset of information about items presented by default to users.
This motivates the present study of decongestion by representation, where a platform seeks to learn representations that reduce congestion and thus improve social welfare. The technical challenge is twofold: relying only on revealed preferences from the choices of consumers, rather than true preferences; and the combinatorial problem associated with representations that determine the features to reveal in the default view. We tackle both challenges by proposing a differentiable proxy of welfare that can be trained end-to-end on consumer choice data. We develop sufficient conditions for when decongestion promotes welfare, and present the results of extensive experiments on both synthetic and real data that demonstrate the utility of our approach. \ No newline at end of file diff --git a/data/2024/iclr/Decoupled Marked Temporal Point Process using Neural Ordinary Differential Equations b/data/2024/iclr/Decoupled Marked Temporal Point Process using Neural Ordinary Differential Equations new file mode 100644 index 0000000000..cedf0b6e4e --- /dev/null +++ b/data/2024/iclr/Decoupled Marked Temporal Point Process using Neural Ordinary Differential Equations @@ -0,0 +1 @@ +A Marked Temporal Point Process (MTPP) is a stochastic process whose realization is a set of event-time data. MTPP is often used to understand complex dynamics of asynchronous temporal events such as money transaction, social media, healthcare, etc. Recent studies have utilized deep neural networks to capture complex temporal dependencies of events and generate embedding that aptly represent the observed events. While most previous studies focus on the inter-event dependencies and their representations, how individual events influence the overall dynamics over time has been under-explored. In this regime, we propose a Decoupled MTPP framework that disentangles characterization of a stochastic process into a set of evolving influences from different events. Our approach employs Neural Ordinary Differential Equations (Neural ODEs) to learn flexible continuous dynamics of these influences while simultaneously addressing multiple inference problems, such as density estimation and survival rate computation. We emphasize the significance of disentangling the influences by comparing our framework with state-of-the-art methods on real-life datasets, and provide analysis on the model behavior for potential applications. \ No newline at end of file diff --git a/data/2024/iclr/Decoupling Weighing and Selecting for Integrating Multiple Graph Pre-training Tasks b/data/2024/iclr/Decoupling Weighing and Selecting for Integrating Multiple Graph Pre-training Tasks new file mode 100644 index 0000000000..eb69cb55c6 --- /dev/null +++ b/data/2024/iclr/Decoupling Weighing and Selecting for Integrating Multiple Graph Pre-training Tasks @@ -0,0 +1 @@ +Recent years have witnessed the great success of graph pre-training for graph representation learning. With hundreds of graph pre-training tasks proposed, integrating knowledge acquired from multiple pre-training tasks has become a popular research topic. In this paper, we identify two important collaborative processes for this topic: (1) select: how to select an optimal task combination from a given task pool based on their compatibility, and (2) weigh: how to weigh the selected tasks based on their importance. While there currently has been a lot of work focused on weighing, comparatively little effort has been devoted to selecting. 
This paper proposes a novel instance-level framework for integrating multiple graph pre-training tasks, Weigh And Select (WAS), where the two collaborative processes, weighing and selecting, are combined by decoupled siamese networks. Specifically, it first adaptively learns an optimal combination of tasks for each instance from a given task pool, based on which a customized instance-level task weighing strategy is learned. Extensive experiments on 16 graph datasets across node-level and graph-level downstream tasks have demonstrated that by combining a few simple but classical tasks, WAS can achieve comparable performance to other leading counterparts. The code is available at https://github.com/TianyuFan0504/WAS. \ No newline at end of file diff --git a/data/2024/iclr/Decoupling regularization from the action space b/data/2024/iclr/Decoupling regularization from the action space new file mode 100644 index 0000000000..dcc59913a5 --- /dev/null +++ b/data/2024/iclr/Decoupling regularization from the action space @@ -0,0 +1 @@ +Regularized reinforcement learning (RL), particularly the entropy-regularized kind, has gained traction in optimal control and inverse RL. While standard unregularized RL methods remain unaffected by changes in the number of actions, we show that such changes can severely impact their regularized counterparts. This paper demonstrates the importance of decoupling the regularizer from the action space: that is, maintaining a consistent level of regularization regardless of how many actions are involved, so as to avoid over-regularization. Although the problem can be avoided by introducing a task-specific temperature parameter, this is often undesirable and cannot solve the problem when action spaces are state-dependent. In the state-dependent action context, different states with varying action spaces are regularized inconsistently. We introduce two solutions: a static temperature selection approach and a dynamic counterpart, universally applicable where this problem arises. Implementing these changes improves performance on the DeepMind control suite in static and dynamic temperature regimes and a biological sequence design task. \ No newline at end of file diff --git a/data/2024/iclr/Deep Confident Steps to New Pockets: Strategies for Docking Generalization b/data/2024/iclr/Deep Confident Steps to New Pockets: Strategies for Docking Generalization new file mode 100644 index 0000000000..0fc7d05e72 --- /dev/null +++ b/data/2024/iclr/Deep Confident Steps to New Pockets: Strategies for Docking Generalization @@ -0,0 +1 @@ +Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models.
We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods. \ No newline at end of file diff --git a/data/2024/iclr/Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders b/data/2024/iclr/Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Deep Neural Networks Tend To Extrapolate Predictably b/data/2024/iclr/Deep Neural Networks Tend To Extrapolate Predictably new file mode 100644 index 0000000000..1215f29c0f --- /dev/null +++ b/data/2024/iclr/Deep Neural Networks Tend To Extrapolate Predictably @@ -0,0 +1 @@ +Conventional wisdom suggests that neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs. Our work reassesses this assumption for neural networks with high-dimensional inputs. Rather than extrapolating in arbitrary ways, we observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD. Moreover, we find that this value often closely approximates the optimal constant solution (OCS), i.e., the prediction that minimizes the average loss over the training data without observing the input. We present results showing this phenomenon across 8 datasets with different distributional shifts (including CIFAR10-C and ImageNet-R, S), different loss functions (cross entropy, MSE, and Gaussian NLL), and different architectures (CNNs and transformers). Furthermore, we present an explanation for this behavior, which we first validate empirically and then study theoretically in a simplified setting involving deep homogeneous networks with ReLU activations. Finally, we show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs. \ No newline at end of file diff --git a/data/2024/iclr/Deep Orthogonal Hypersphere Compression for Anomaly Detection b/data/2024/iclr/Deep Orthogonal Hypersphere Compression for Anomaly Detection new file mode 100644 index 0000000000..0995b81f7d --- /dev/null +++ b/data/2024/iclr/Deep Orthogonal Hypersphere Compression for Anomaly Detection @@ -0,0 +1 @@ +Many well-known and effective anomaly detection methods assume that a reasonable decision boundary has a hypersphere shape, which however is difficult to obtain in practice and is not sufficiently compact, especially when the data are in high-dimensional spaces. In this paper, we first propose a novel deep anomaly detection model that improves the original hypersphere learning through an orthogonal projection layer, which ensures that the training data distribution is consistent with the hypersphere hypothesis, thereby increasing the true positive rate and decreasing the false negative rate. Moreover, we propose a bi-hypersphere compression method to obtain a hyperspherical shell that yields a more compact decision region than a hyperball, which is demonstrated theoretically and numerically. The proposed methods are not confined to common datasets such as image and tabular data, but are also extended to a more challenging but promising scenario, graph-level anomaly detection, which learns graph representation with maximum mutual information between the substructure and global structure features while exploring orthogonal single- or bi-hypersphere anomaly decision boundaries. 
The numerical and visualization results on benchmark datasets demonstrate the superiority of our methods in comparison to many baselines and state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/iclr/Deep Reinforcement Learning Guided Improvement Heuristic for Job Shop Scheduling b/data/2024/iclr/Deep Reinforcement Learning Guided Improvement Heuristic for Job Shop Scheduling new file mode 100644 index 0000000000..5773cc8f6f --- /dev/null +++ b/data/2024/iclr/Deep Reinforcement Learning Guided Improvement Heuristic for Job Shop Scheduling @@ -0,0 +1 @@ +Recent studies in using deep reinforcement learning (DRL) to solve Job-shop scheduling problems (JSSP) focus on construction heuristics. However, their performance is still far from optimality, mainly because the underlying graph representation scheme is unsuitable for modelling partial solutions at each construction step. This paper proposes a novel DRL-guided improvement heuristic for solving JSSP, where graph representation is employed to encode complete solutions. We design a Graph Neural-Network-based representation scheme, consisting of two modules to effectively capture the information of dynamic topology and different types of nodes in graphs encountered during the improvement process. To speed up solution evaluation during improvement, we present a novel message-passing mechanism that can evaluate multiple solutions simultaneously. We prove that the computational complexity of our method scales linearly with problem size. Experiments on classic benchmarks show that the improvement policy learned by our method outperforms state-of-the-art DRL-based methods by a large margin. \ No newline at end of file diff --git a/data/2024/iclr/Deep Reinforcement Learning for Modelling Protein Complexes b/data/2024/iclr/Deep Reinforcement Learning for Modelling Protein Complexes new file mode 100644 index 0000000000..22c7ca4f2c --- /dev/null +++ b/data/2024/iclr/Deep Reinforcement Learning for Modelling Protein Complexes @@ -0,0 +1 @@ +AlphaFold can be used for both single-chain and multi-chain protein structure prediction, while the latter becomes extremely challenging as the number of chains increases. In this work, by taking each chain as a node and assembly actions as edges, we show that an acyclic undirected connected graph can be used to predict the structure of multi-chain protein complexes (a.k.a., protein complex modelling, PCM). However, there are still two challenges: 1) The huge combinatorial optimization space of $N^{N-2}$ ($N$ is the number of chains) for the PCM problem can easily lead to high computational cost. 2) The scales of protein complexes exhibit distribution shift due to variance in chain numbers, which calls for the generalization in modelling complexes of various scales. To address these challenges, we propose GAPN, a Generative Adversarial Policy Network powered by domain-specific rewards and adversarial loss through policy gradient for automatic PCM prediction. Specifically, GAPN learns to efficiently search through the immense assembly space and optimize the direct docking reward through policy gradient. Importantly, we design an adversarial reward function to enhance the receptive field of our model. In this way, GAPN will simultaneously focus on a specific batch of complexes and the global assembly rules learned from complexes with varied chain numbers. 
Empirically, we achieve significant improvements in both accuracy (measured by RMSD and TM-Score) and efficiency compared to leading PCM software. \ No newline at end of file diff --git a/data/2024/iclr/Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks b/data/2024/iclr/Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks new file mode 100644 index 0000000000..697ce356a9 --- /dev/null +++ b/data/2024/iclr/Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks @@ -0,0 +1 @@ +Many robot manipulation tasks can be framed as geometric reasoning tasks, where an agent must be able to precisely manipulate an object into a position that satisfies the task from a set of initial conditions. Often, task success is defined based on the relationship between two objects - for instance, hanging a mug on a rack. In such cases, the solution should be equivariant to the initial position of the objects as well as the agent, and invariant to the pose of the camera. This poses a challenge for learning systems which attempt to solve this task by learning directly from high-dimensional demonstrations: the agent must learn to be both equivariant as well as precise, which can be challenging without any inductive biases about the problem. In this work, we propose a method for precise relative pose prediction which is provably SE(3)-equivariant, can be learned from only a few demonstrations, and can generalize across variations in a class of objects. We accomplish this by factoring the problem into learning an SE(3) invariant task-specific representation of the scene and then interpreting this representation with novel geometric reasoning layers which are provably SE(3) equivariant. We demonstrate that our method can yield substantially more precise placement predictions in simulated placement tasks than previous methods trained with the same amount of data, and can accurately represent relative placement relationships in data collected from real-world demonstrations. Supplementary information and videos can be found at https://sites.google.com/view/reldist-iclr-2023. \ No newline at end of file diff --git a/data/2024/iclr/Deep Temporal Graph Clustering b/data/2024/iclr/Deep Temporal Graph Clustering new file mode 100644 index 0000000000..5973ecbb6a --- /dev/null +++ b/data/2024/iclr/Deep Temporal Graph Clustering @@ -0,0 +1 @@ +Deep graph clustering has recently received significant attention due to its ability to enhance the representation learning capabilities of models in unsupervised scenarios. Nevertheless, deep clustering for temporal graphs, which could capture crucial dynamic interaction information, has not been fully explored. This means that in many clustering-oriented real-world scenarios, temporal graphs can only be processed as static graphs. This not only loses dynamic information but also incurs a huge computational cost. To solve the problem, we propose a general framework for deep Temporal Graph Clustering called TGC, which introduces deep clustering techniques to suit the interaction sequence-based batch-processing pattern of temporal graphs. In addition, we discuss differences between temporal graph clustering and static graph clustering from several levels. To verify the superiority of the proposed framework TGC, we conduct extensive experiments.
The experimental results show that temporal graph clustering enables more flexibility in finding a balance between time and space requirements, and our framework can effectively improve the performance of existing temporal graph learning methods. The code is released: https://github.com/MGitHubL/Deep-Temporal-Graph-Clustering. \ No newline at end of file diff --git a/data/2024/iclr/DeepSPF: Spherical SO(3)-Equivariant Patches for Scan-to-CAD Estimation b/data/2024/iclr/DeepSPF: Spherical SO(3)-Equivariant Patches for Scan-to-CAD Estimation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training b/data/2024/iclr/DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training new file mode 100644 index 0000000000..7e6ec80e00 --- /dev/null +++ b/data/2024/iclr/DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training @@ -0,0 +1 @@ +Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: Its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To our best knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinate-wise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsity-induced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing DL with black box. \ No newline at end of file diff --git a/data/2024/iclr/Defining Expertise: Applications to Treatment Effect Estimation b/data/2024/iclr/Defining Expertise: Applications to Treatment Effect Estimation new file mode 100644 index 0000000000..8f251e1dad --- /dev/null +++ b/data/2024/iclr/Defining Expertise: Applications to Treatment Effect Estimation @@ -0,0 +1 @@ +Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients, we can tell treatments prescribed more frequently are likely to be more effective. 
Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and"expertise"is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise - particularly the type of expertise the decision-makers of a domain are likely to have - can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection. \ No newline at end of file diff --git a/data/2024/iclr/Defining and extracting generalizable interaction primitives from DNNs b/data/2024/iclr/Defining and extracting generalizable interaction primitives from DNNs new file mode 100644 index 0000000000..5ad5391c20 --- /dev/null +++ b/data/2024/iclr/Defining and extracting generalizable interaction primitives from DNNs @@ -0,0 +1 @@ +Faithfully summarizing the knowledge encoded by a deep neural network (DNN) into a few symbolic primitive patterns without losing much information represents a core challenge in explainable AI. To this end, Ren et al. (2023c) have derived a series of theorems to prove that the inference score of a DNN can be explained as a small set of interactions between input variables. However, the lack of generalization power makes it still hard to consider such interactions as faithful primitive patterns encoded by the DNN. Therefore, given different DNNs trained for the same task, we develop a new method to extract interactions that are shared by these DNNs. Experiments show that the extracted interactions can better reflect common knowledge shared by different DNNs. \ No newline at end of file diff --git a/data/2024/iclr/Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding b/data/2024/iclr/Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding new file mode 100644 index 0000000000..27b02ce91b --- /dev/null +++ b/data/2024/iclr/Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding @@ -0,0 +1 @@ +A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. 
We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice. \ No newline at end of file diff --git a/data/2024/iclr/Delta-AI: Local objectives for amortized inference in sparse graphical models b/data/2024/iclr/Delta-AI: Local objectives for amortized inference in sparse graphical models new file mode 100644 index 0000000000..e71d9806ab --- /dev/null +++ b/data/2024/iclr/Delta-AI: Local objectives for amortized inference in sparse graphical models @@ -0,0 +1 @@ +We present a new algorithm for amortized inference in sparse probabilistic graphical models (PGMs), which we call $\Delta$-amortized inference ($\Delta$-AI). Our approach is based on the observation that when the sampling of variables in a PGM is seen as a sequence of actions taken by an agent, sparsity of the PGM enables local credit assignment in the agent's policy learning objective. This yields a local constraint that can be turned into a local loss in the style of generative flow networks (GFlowNets) that enables off-policy training but avoids the need to instantiate all the random variables for each parameter update, thus speeding up training considerably. The $\Delta$-AI objective matches the conditional distribution of a variable given its Markov blanket in a tractable learned sampler, which has the structure of a Bayesian network, with the same conditional distribution under the target PGM. As such, the trained sampler recovers marginals and conditional distributions of interest and enables inference of partial subsets of variables. We illustrate $\Delta$-AI's effectiveness for sampling from synthetic PGMs and training latent variable models with sparse factor structure. \ No newline at end of file diff --git a/data/2024/iclr/Democratizing Fine-grained Visual Recognition with Large Language Models b/data/2024/iclr/Democratizing Fine-grained Visual Recognition with Large Language Models new file mode 100644 index 0000000000..2c9dd1432a --- /dev/null +++ b/data/2024/iclr/Democratizing Fine-grained Visual Recognition with Large Language Models @@ -0,0 +1 @@ +Identifying subordinate-level categories from images is a longstanding task in computer vision and is referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications since an average layperson does not excel at differentiating species of birds or mushrooms due to subtle differences among the species. A major bottleneck in developing FGVR systems is caused by the need of high-quality paired expert annotations. To circumvent the need of expert knowledge we propose Fine-grained Semantic Category Reasoning (FineR) that internally leverages the world knowledge of large language models (LLMs) as a proxy in order to reason about fine-grained category names. In detail, to bridge the modality gap between images and LLM, we extract part-level visual attributes from images as text and feed that information to a LLM. Based on the visual attributes and its internal world knowledge the LLM reasons about the subordinate-level category names. Our training-free FineR outperforms several state-of-the-art FGVR and language and vision assistant models and shows promise in working in the wild and in new domains where gathering expert annotation is arduous. 
\ No newline at end of file diff --git a/data/2024/iclr/Demonstration-Regularized RL b/data/2024/iclr/Demonstration-Regularized RL new file mode 100644 index 0000000000..407fb1a6a7 --- /dev/null +++ b/data/2024/iclr/Demonstration-Regularized RL @@ -0,0 +1 @@ +Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of action, $S$ the number of states in the finite case and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works. \ No newline at end of file diff --git a/data/2024/iclr/Demystifying CLIP Data b/data/2024/iclr/Demystifying CLIP Data new file mode 100644 index 0000000000..60b4acd7f7 --- /dev/null +++ b/data/2024/iclr/Demystifying CLIP Data @@ -0,0 +1 @@ +Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP. 
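As a rough illustration of the metadata-balanced curation MetaCLIP describes, the sketch below caps how many image-text pairs any single metadata entry can contribute, so frequent ("head") concepts are sub-sampled while rare ("tail") concepts keep everything they have. The cap value, the metadata list, and the substring matching rule are invented placeholders, not the paper's actual settings.

```python
from collections import defaultdict
import random

def balance_pool(pairs, metadata, cap=2, seed=0):
    """pairs: list of (url, caption); metadata: list of concept strings."""
    random.seed(seed)
    per_entry = defaultdict(list)
    for url, text in pairs:
        for entry in metadata:
            if entry in text.lower():  # toy matching rule
                per_entry[entry].append((url, text))
    curated = set()
    for entry, matched in per_entry.items():
        # Keep everything for tail entries; sub-sample head entries to the cap.
        keep = matched if len(matched) <= cap else random.sample(matched, cap)
        curated.update(keep)
    return sorted(curated)

pool = [("u1", "a photo of a dog"), ("u2", "dog running"), ("u3", "my dog sleeping"),
        ("u4", "a rare axolotl"), ("u5", "a cat on a sofa")]
print(balance_pool(pool, metadata=["dog", "cat", "axolotl"]))
```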
\ No newline at end of file diff --git a/data/2024/iclr/Demystifying Embedding Spaces using Large Language Models b/data/2024/iclr/Demystifying Embedding Spaces using Large Language Models new file mode 100644 index 0000000000..6304dc56a8 --- /dev/null +++ b/data/2024/iclr/Demystifying Embedding Spaces using Large Language Models @@ -0,0 +1 @@ +Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing Large Language Models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Demystifying Linear MDPs and Novel Dynamics Aggregation Framework b/data/2024/iclr/Demystifying Linear MDPs and Novel Dynamics Aggregation Framework new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Demystifying Local & Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition b/data/2024/iclr/Demystifying Local & Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition new file mode 100644 index 0000000000..41f03a8fbb --- /dev/null +++ b/data/2024/iclr/Demystifying Local & Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition @@ -0,0 +1 @@ +This work presents an information-theoretic perspective to group fairness trade-offs in federated learning (FL) with respect to sensitive attributes, such as gender, race, etc. Existing works often focus on either $\textit{global fairness}$ (overall disparity of the model across all clients) or $\textit{local fairness}$ (disparity of the model at each client), without always considering their trade-offs. There is a lack of understanding regarding the interplay between global and local fairness in FL, particularly under data heterogeneity, and if and when one implies the other. To address this gap, we leverage a body of work in information theory called partial information decomposition (PID), which first identifies three sources of unfairness in FL, namely, $\textit{Unique Disparity}$, $\textit{Redundant Disparity}$, and $\textit{Masked Disparity}$. We demonstrate how these three disparities contribute to global and local fairness using canonical examples. This decomposition helps us derive fundamental limits on the trade-off between global and local fairness, highlighting where they agree or disagree. 
We introduce the $\textit{Accuracy and Global-Local Fairness Optimality Problem (AGLFOP)}$, a convex optimization that defines the theoretical limits of accuracy and fairness trade-offs, identifying the best possible performance any FL strategy can attain given a dataset and client distribution. We also present experimental results on synthetic datasets and the ADULT dataset to support our theoretical findings. \ No newline at end of file diff --git a/data/2024/iclr/Demystifying Poisoning Backdoor Attacks from a Statistical Perspective b/data/2024/iclr/Demystifying Poisoning Backdoor Attacks from a Statistical Perspective new file mode 100644 index 0000000000..cd19c3ae7b --- /dev/null +++ b/data/2024/iclr/Demystifying Poisoning Backdoor Attacks from a Statistical Perspective @@ -0,0 +1 @@ +The growing dependence on machine learning in real-world applications emphasizes the importance of understanding and ensuring its safety. Backdoor attacks pose a significant security risk due to their stealthy nature and potentially serious consequences. Such attacks involve embedding triggers within a learning model with the intention of causing malicious behavior when an active trigger is present while maintaining regular functionality without it. This paper evaluates the effectiveness of any backdoor attack incorporating a constant trigger, by establishing tight lower and upper boundaries for the performance of the compromised model on both clean and backdoor test data. The developed theory answers a series of fundamental but previously underexplored problems, including (1) what are the determining factors for a backdoor attack's success, (2) what is the direction of the most effective backdoor attack, and (3) when will a human-imperceptible trigger succeed. Our derived understanding applies to both discriminative and generative models. We also demonstrate the theory by conducting experiments using benchmark datasets and state-of-the-art backdoor attack scenarios. \ No newline at end of file diff --git a/data/2024/iclr/Denevil: towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning b/data/2024/iclr/Denevil: towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning new file mode 100644 index 0000000000..3a991287eb --- /dev/null +++ b/data/2024/iclr/Denevil: towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning @@ -0,0 +1 @@ +Large Language Models (LLMs) have made unprecedented breakthroughs, yet their increasing integration into everyday life might raise societal risks due to generated unethical content. Despite extensive study on specific issues like bias, the intrinsic values of LLMs remain largely unexplored from a moral philosophy perspective. This work delves into ethical values utilizing Moral Foundation Theory. Moving beyond conventional discriminative evaluations with poor reliability, we propose DeNEVIL, a novel prompt generation algorithm tailored to dynamically exploit LLMs' value vulnerabilities and elicit the violation of ethics in a generative manner, revealing their underlying value inclinations. On such a basis, we construct MoralPrompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values across a spectrum of LLMs. We discovered that most models are essentially misaligned, necessitating further ethical value alignment. 
In response, we develop VILMO, an in-context alignment method that substantially enhances the value compliance of LLM outputs by learning to generate appropriate value instructions, outperforming existing competitors. Our methods are suitable for black-box and open-source models, offering a promising initial step in studying the ethical values of LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Denoising Diffusion Bridge Models b/data/2024/iclr/Denoising Diffusion Bridge Models new file mode 100644 index 0000000000..a4000324a1 --- /dev/null +++ b/data/2024/iclr/Denoising Diffusion Bridge Models @@ -0,0 +1 @@ +Diffusion models are powerful generative models that map noise to data using stochastic processes. However, for many applications such as image editing, the model input comes from a distribution that is not random noise. As such, diffusion models must rely on cumbersome methods like guidance or projected sampling to incorporate this information in the generative process. In our work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural alternative to this paradigm based on diffusion bridges, a family of processes that interpolate between two paired distributions given as endpoints. Our method learns the score of the diffusion bridge from data and maps from one endpoint distribution to the other by solving a (stochastic) differential equation based on the learned score. Our method naturally unifies several classes of generative models, such as score-based diffusion models and OT-Flow-Matching, allowing us to adapt existing design and architectural choices to our more general problem. Empirically, we apply DDBMs to challenging image datasets in both pixel and latent space. On standard image translation problems, DDBMs achieve significant improvement over baseline methods, and, when we reduce the problem to image generation by setting the source distribution to random noise, DDBMs achieve comparable FID scores to state-of-the-art methods despite being built for a more general task. \ No newline at end of file diff --git a/data/2024/iclr/Denoising Diffusion Step-aware Models b/data/2024/iclr/Denoising Diffusion Step-aware Models new file mode 100644 index 0000000000..87f092539e --- /dev/null +++ b/data/2024/iclr/Denoising Diffusion Step-aware Models @@ -0,0 +1 @@ +Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for data generation across various domains. However, a significant bottleneck is the necessity for whole-network computation during every step of the generative process, leading to high computational overheads. This paper presents a novel framework, Denoising Diffusion Step-aware Models (DDSM), to address this challenge. Unlike conventional approaches, DDSM employs a spectrum of neural networks whose sizes are adapted according to the importance of each generative step, as determined through evolutionary search. This step-wise network variation effectively circumvents redundant computational efforts, particularly in less critical steps, thereby enhancing the efficiency of the diffusion model. Furthermore, the step-aware design can be seamlessly integrated with other efficiency-geared diffusion models such as DDIMs and latent diffusion, thus broadening the scope of computational savings. Empirical evaluations demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all without compromising the generation quality. 
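The step-aware dispatch that DDSM describes can be pictured with a small sketch. The snippet below is a minimal illustration rather than the authors' implementation: it assumes a standard DDPM noise schedule, toy MLP denoisers in place of U-Nets, and a hand-written step-to-network assignment standing in for the result of the evolutionary search.

```python
import torch
import torch.nn as nn

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # standard DDPM noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def make_denoiser(width: int, dim: int = 2) -> nn.Module:
    # Toy epsilon-predictor; the real method would use U-Nets of different capacities.
    return nn.Sequential(nn.Linear(dim + 1, width), nn.SiLU(), nn.Linear(width, dim))

# A "spectrum" of networks of different sizes, stand-ins for the searched sizes.
denoisers = {"small": make_denoiser(32), "medium": make_denoiser(128), "large": make_denoiser(512)}

def assign_network(t: int) -> str:
    # Hypothetical step-to-size assignment; in DDSM this comes from evolutionary search.
    if t > 700:
        return "large"
    elif t > 300:
        return "medium"
    return "small"

@torch.no_grad()
def sample(n: int = 16, dim: int = 2) -> torch.Tensor:
    x = torch.randn(n, dim)
    for t in reversed(range(T)):
        net = denoisers[assign_network(t)]          # step-aware dispatch
        t_in = torch.full((n, 1), t / T)
        eps = net(torch.cat([x, t_in], dim=1))      # predicted noise
        a, ab = alphas[t], alpha_bars[t]
        mean = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

samples = sample()
```

The point of the sketch is only the dispatch inside the sampling loop: cheaper networks handle the steps deemed less important, so per-sample cost drops without changing the sampler itself.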
\ No newline at end of file diff --git a/data/2024/iclr/Denoising Diffusion via Image-Based Rendering b/data/2024/iclr/Denoising Diffusion via Image-Based Rendering new file mode 100644 index 0000000000..a2dd8be23b --- /dev/null +++ b/data/2024/iclr/Denoising Diffusion via Image-Based Rendering @@ -0,0 +1 @@ +Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks or depths. This supports 3D reconstruction and generation in a unified architecture. Third, we develop a principled approach to avoid trivial 3D solutions when integrating the image-based rendering with the diffusion model, by dropping out representations of some images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis and 3D reconstruction. \ No newline at end of file diff --git a/data/2024/iclr/Denoising Task Routing for Diffusion Models b/data/2024/iclr/Denoising Task Routing for Diffusion Models new file mode 100644 index 0000000000..62ad80f1ea --- /dev/null +++ b/data/2024/iclr/Denoising Task Routing for Diffusion Models @@ -0,0 +1 @@ +Diffusion models generate highly realistic images by learning a multi-step denoising process, naturally embodying the principles of multi-task learning (MTL). Despite the inherent connection between diffusion models and MTL, there remains an unexplored area in designing neural architectures that explicitly incorporate MTL into the framework of diffusion models. In this paper, we present Denoising Task Routing (DTR), a simple add-on strategy for existing diffusion model architectures to establish distinct information pathways for individual tasks within a single architecture by selectively activating subsets of channels in the model. What makes DTR particularly compelling is its seamless integration of prior knowledge of denoising tasks into the framework: (1) Task Affinity: DTR activates similar channels for tasks at adjacent timesteps and shifts activated channels as sliding windows through timesteps, capitalizing on the inherent strong affinity between tasks at adjacent timesteps. 
(2) Task Weights: During the early stages (higher timesteps) of the denoising process, DTR assigns a greater number of task-specific channels, leveraging the insight that diffusion models prioritize reconstructing global structure and perceptually rich contents in earlier stages, and focus on simple noise removal in later stages. Our experiments reveal that DTR not only consistently boosts diffusion models' performance across different evaluation protocols without adding extra parameters but also accelerates training convergence. Finally, we show the complementarity between our architectural approach and existing MTL optimization techniques, providing a more complete view of MTL in the context of diffusion training. Significantly, by leveraging this complementarity, we attain matched performance of DiT-XL using the smaller DiT-L with a reduction in training iterations from 7M to 2M. \ No newline at end of file diff --git a/data/2024/iclr/Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit b/data/2024/iclr/Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit new file mode 100644 index 0000000000..4ee1773dbb --- /dev/null +++ b/data/2024/iclr/Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit @@ -0,0 +1 @@ +The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit. \ No newline at end of file diff --git a/data/2024/iclr/Designing Skill-Compatible AI: Methodologies and Frameworks in Chess b/data/2024/iclr/Designing Skill-Compatible AI: Methodologies and Frameworks in Chess new file mode 100644 index 0000000000..8504cf2c4b --- /dev/null +++ b/data/2024/iclr/Designing Skill-Compatible AI: Methodologies and Frameworks in Chess @@ -0,0 +1 @@ +Powerful artificial intelligence systems are often used in settings where they must interact with agents that are computationally much weaker, for example when they work alongside humans or operate in complex environments where some tasks are handled by algorithms, heuristics, or other entities of varying computational power. For AI agents to successfully interact in these settings, however, achieving superhuman performance alone is not sufficient; they also need to account for suboptimal actions or idiosyncratic style from their less-skilled counterparts. 
We propose a formal evaluation framework for assessing the compatibility of near-optimal AI with interaction partners who may have much lower levels of skill; we use popular collaborative chess variants as model systems to study and develop AI agents that can successfully interact with lower-skill entities. Traditional chess engines designed to output near-optimal moves prove to be inadequate partners when paired with engines of various lower skill levels in this domain, as they are not designed to consider the presence of other agents. We contribute three methodologies to explicitly create skill-compatible AI agents in complex decision-making settings, and two chess game frameworks designed to foster collaboration between powerful AI agents and less-skilled partners. On these frameworks, our agents outperform state-of-the-art chess AI (based on AlphaZero) despite being weaker in conventional chess, demonstrating that skill-compatibility is a tangible trait that is qualitatively and measurably distinct from raw performance. Our evaluations further explore and clarify the mechanisms by which our agents achieve skill-compatibility. \ No newline at end of file diff --git a/data/2024/iclr/Det-CGD: Compressed Gradient Descent with Matrix Stepsizes for Non-Convex Optimization b/data/2024/iclr/Det-CGD: Compressed Gradient Descent with Matrix Stepsizes for Non-Convex Optimization new file mode 100644 index 0000000000..c375cc6c5a --- /dev/null +++ b/data/2024/iclr/Det-CGD: Compressed Gradient Descent with Matrix Stepsizes for Non-Convex Optimization @@ -0,0 +1 @@ +This paper introduces a new method for minimizing matrix-smooth non-convex objectives through the use of novel Compressed Gradient Descent (CGD) algorithms enhanced with a matrix-valued stepsize. The proposed algorithms are theoretically analyzed first in the single-node and subsequently in the distributed settings. Our theoretical results reveal that the matrix stepsize in CGD can capture the objective's structure and lead to faster convergence compared to a scalar stepsize. As a byproduct of our general results, we emphasize the importance of selecting the compression mechanism and the matrix stepsize in a layer-wise manner, taking advantage of model structure. Moreover, we provide theoretical guarantees for free compression, by designing specific layer-wise compressors for the non-convex matrix smooth objectives. Our findings are supported with empirical evidence. \ No newline at end of file diff --git a/data/2024/iclr/Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy b/data/2024/iclr/Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy new file mode 100644 index 0000000000..a6a59b4dc3 --- /dev/null +++ b/data/2024/iclr/Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy @@ -0,0 +1 @@ +Large language models (LLMs) such as ChatGPT have exhibited remarkable performance in generating human-like texts. However, machine-generated texts (MGTs) may carry critical risks, such as plagiarism issues, misleading information, or hallucination issues. Therefore, it is very urgent and important to detect MGTs in many situations. Unfortunately, it is challenging to distinguish MGTs and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of LLMs. 
In this paper, we seek to exploit \textit{maximum mean discrepancy} (MMD) to address this issue, since MMD is well suited to identifying distributional discrepancies. However, directly training a detector with MMD using diverse MGTs will incur a significantly increased variance of MMD since MGTs may contain \textit{multiple text populations} due to various LLMs. This will severely impair MMD's ability to measure the difference between two samples. To tackle this, we propose a novel \textit{multi-population} aware optimization method for MMD called MMD-MP, which can \textit{avoid variance increases} and thus improve the stability of measuring the distributional discrepancy. Relying on MMD-MP, we develop two methods for paragraph-based and sentence-based detection, respectively. Extensive experiments on various LLMs, e.g., GPT2 and ChatGPT, show superior detection performance of our MMD-MP. The source code is available at \url{https://github.com/ZSHsh98/MMD-MP}. \ No newline at end of file diff --git a/data/2024/iclr/Detecting Pretraining Data from Large Language Models b/data/2024/iclr/Detecting Pretraining Data from Large Language Models new file mode 100644 index 0000000000..55fa5bf89f --- /dev/null +++ b/data/2024/iclr/Detecting Pretraining Data from Large Language Models @@ -0,0 +1 @@ +Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to two real-world scenarios, copyrighted book detection and contaminated downstream example detection, and find it a consistently effective solution. \ No newline at end of file diff --git a/data/2024/iclr/Detecting, Explaining, and Mitigating Memorization in Diffusion Models b/data/2024/iclr/Detecting, Explaining, and Mitigating Memorization in Diffusion Models new file mode 100644 index 0000000000..db4fbbda1d --- /dev/null +++ b/data/2024/iclr/Detecting, Explaining, and Mitigating Memorization in Diffusion Models @@ -0,0 +1 @@ +Recent breakthroughs in diffusion models have exhibited exceptional image-generation capabilities. However, studies show that some outputs are merely replications of training data.
Such replications present potential legal challenges for model owners, especially when the generated content contains proprietary information. In this work, we introduce a straightforward yet effective method for detecting memorized prompts by inspecting the magnitude of text-conditional predictions. Our proposed method seamlessly integrates without disrupting sampling algorithms, and delivers high accuracy even at the first generation step, with a single generation per prompt. Building on our detection strategy, we unveil an explainable approach that shows the contribution of individual words or tokens to memorization. This offers an interactive medium for users to adjust their prompts. Moreover, we propose two strategies to mitigate memorization by leveraging the magnitude of text-conditional predictions: minimization during inference and filtering during training. These proposed strategies effectively counteract memorization while maintaining high generation quality. Code is available at https://github.com/YuxinWenRick/diffusion_memorization. \ No newline at end of file diff --git a/data/2024/iclr/DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models b/data/2024/iclr/DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models new file mode 100644 index 0000000000..b0ca2e2d58 --- /dev/null +++ b/data/2024/iclr/DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models @@ -0,0 +1 @@ +Recent advancements in autonomous driving have relied on data-driven approaches, which are widely adopted but face challenges including dataset bias, overfitting, and uninterpretability. Drawing inspiration from the knowledge-driven nature of human driving, we explore the question of how to instill similar capabilities into autonomous driving systems and summarize a paradigm that integrates an interactive environment, a driver agent, as well as a memory component to address this question. Leveraging large language models (LLMs) with emergent abilities, we propose the DiLu framework, which combines a Reasoning and a Reflection module to enable the system to perform decision-making based on common-sense knowledge and evolve continuously. Extensive experiments prove DiLu's capability to accumulate experience and demonstrate a significant advantage in generalization ability over reinforcement learning-based methods. Moreover, DiLu is able to directly acquire experiences from real-world datasets, which highlights its potential to be deployed on practical autonomous driving systems. To the best of our knowledge, we are the first to leverage knowledge-driven capability in decision-making for autonomous vehicles. Through the proposed DiLu framework, the LLM is strengthened to apply knowledge and to reason causally in the autonomous driving domain. Project page: https://pjlab-adg.github.io/DiLu/ \ No newline at end of file diff --git a/data/2024/iclr/Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making b/data/2024/iclr/Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making new file mode 100644 index 0000000000..517052dca2 --- /dev/null +++ b/data/2024/iclr/Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making @@ -0,0 +1 @@ +Pre-trained transformers are often fine-tuned to aid clinical decision-making using limited clinical notes.
Model interpretability is crucial, especially in high-stakes domains like medicine, to establish trust and ensure safety, which requires human engagement. We introduce SUFO, a systematic framework that enhances interpretability of fine-tuned transformer feature spaces. SUFO utilizes a range of analytic and visualization techniques, including Supervised probing, Unsupervised similarity analysis, Feature dynamics, and Outlier analysis to address key questions about model trust and interpretability. We conduct a case study investigating the impact of pre-training data where we focus on real-world pathology classification tasks, and validate our findings on MedNLI. We evaluate five 110M-sized pre-trained transformer models, categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal that: (1) while PubMedBERT, the domain-specific model, contains valuable information for fine-tuning, it can overfit to minority classes when class imbalances exist. In contrast, mixed-domain models exhibit greater resistance to overfitting, suggesting potential improvements in domain-specific model robustness; (2) in-domain pre-training accelerates feature disambiguation during fine-tuning; and (3) feature spaces undergo significant sparsification during this process, enabling clinicians to identify common outlier modes among fine-tuned models as demonstrated in this paper. These findings showcase the utility of SUFO in enhancing trust and safety when using transformers in medicine, and we believe SUFO can aid practitioners in evaluating fine-tuned language models for other applications in medicine and in more critical domains. \ No newline at end of file diff --git a/data/2024/iclr/Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking b/data/2024/iclr/Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking new file mode 100644 index 0000000000..8caf90727c --- /dev/null +++ b/data/2024/iclr/Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking @@ -0,0 +1 @@ +Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped at a solution corresponding to a kernel predictor for a long time, and then a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy.
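The regime the paper analyzes (large initialization scale combined with small weight decay) is easy to probe empirically. Below is a minimal, hedged sketch of such an experiment on the modular-addition task of Power et al. (2022); the architecture, optimizer, and hyperparameter values are illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy modular-addition task: predict (a + b) mod p from the pair (a, b).
p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2 :]

model = nn.Sequential(nn.Embedding(p, 128), nn.Flatten(), nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, p))

# The regime studied in the paper: scale up the initialization, keep weight decay small.
init_scale, weight_decay = 8.0, 1e-4       # illustrative values, not taken from the paper
with torch.no_grad():
    for w in model.parameters():
        w.mul_(init_scale)

opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(50_000):
    opt.zero_grad()
    loss = F.cross_entropy(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # Grokking shows up as train accuracy saturating long before test accuracy jumps.
        print(step, accuracy(train_idx), accuracy(test_idx))
```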
\ No newline at end of file diff --git a/data/2024/iclr/Dictionary Contrastive Learning for Efficient Local Supervision without Auxiliary Networks b/data/2024/iclr/Dictionary Contrastive Learning for Efficient Local Supervision without Auxiliary Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation b/data/2024/iclr/DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation new file mode 100644 index 0000000000..8310fcff94 --- /dev/null +++ b/data/2024/iclr/DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation @@ -0,0 +1 @@ +Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has focused on generating spectrograms, which then require a subsequent model (i.e., a vocoder) to convert the spectrogram to a waveform. This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize an unlimited speech duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical advantages. Specifically, it allows the creation of local acoustic behaviors, like vocal fry, which makes the overall waveform sound more natural. Furthermore, the proposed diffusion model is stochastic and not deterministic; therefore, each inference generates a slightly different waveform variation, enabling an abundance of valid realizations. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems. \ No newline at end of file diff --git a/data/2024/iclr/DiffEnc: Variational Diffusion with a Learned Encoder b/data/2024/iclr/DiffEnc: Variational Diffusion with a Learned Encoder new file mode 100644 index 0000000000..5f234dcba6 --- /dev/null +++ b/data/2024/iclr/DiffEnc: Variational Diffusion with a Learned Encoder @@ -0,0 +1 @@ +Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditional distributions in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two changes to the diffusion model that retain these advantages while adding flexibility to the model. Firstly, we introduce a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves a statistically significant improvement in likelihood on CIFAR-10. Secondly, we let the ratio of the noise variance of the reverse encoder process and the generative process be a free weight parameter rather than being fixed to 1. This leads to theoretical insights: For a finite depth hierarchy, the evidence lower bound (ELBO) can be used as an objective for a weighted diffusion loss approach and for optimizing the noise schedule specifically for inference. For the infinite-depth hierarchy, on the other hand, the weight parameter has to be 1 to have a well-defined ELBO.
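To make the first change concrete, the sketch below shows one plausible way to wire a data- and depth-dependent mean into the forward marginal of a diffusion model. It is only a schematic under assumed names and shapes: the encoder architecture, the residual parameterization, and the paper's modified diffusion loss and weighted-ELBO analysis are not reproduced here.

```python
import torch
import torch.nn as nn

class DepthDependentEncoder(nn.Module):
    # Maps data x and timestep t to a data- and depth-dependent mean (hypothetical parameterization).
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def forward(self, x, t):
        # Residual around the identity: with a zero residual this reduces to the usual fixed mean x.
        return x + self.net(torch.cat([x, t], dim=-1))

def forward_marginal(x, t, encoder, alpha_t, sigma_t):
    # z_t ~ N(alpha_t * x_t, sigma_t^2 I), where x_t = encoder(x, t) replaces the fixed mean x
    # used by a standard variational diffusion model.
    eps = torch.randn_like(x)
    x_t = encoder(x, t)
    return alpha_t * x_t + sigma_t * eps, eps

encoder = DepthDependentEncoder(dim=8)
x = torch.randn(4, 8)                      # a batch of toy data
t = torch.full((4, 1), 0.3)                # normalized timestep
z_t, eps = forward_marginal(x, t, encoder, alpha_t=0.9, sigma_t=0.436)
```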
\ No newline at end of file diff --git a/data/2024/iclr/Diffeomorphic Mesh Deformation via Efficient Optimal Transport for Cortical Surface Reconstruction b/data/2024/iclr/Diffeomorphic Mesh Deformation via Efficient Optimal Transport for Cortical Surface Reconstruction new file mode 100644 index 0000000000..7c7591d51b --- /dev/null +++ b/data/2024/iclr/Diffeomorphic Mesh Deformation via Efficient Optimal Transport for Cortical Surface Reconstruction @@ -0,0 +1 @@ +Mesh deformation plays a pivotal role in many 3D vision tasks including dynamic simulations, rendering, and reconstruction. However, defining an efficient discrepancy between predicted and target meshes remains an open problem. A prevalent approach in current deep learning is the set-based approach, which measures the discrepancy between two surfaces by comparing two randomly sampled point-clouds from the two meshes with the Chamfer pseudo-distance. Nevertheless, the set-based approach still has limitations, such as the lack of a theoretical guarantee for choosing the number of points in the sampled point-clouds, and the pseudo-metricity and quadratic complexity of the Chamfer divergence. To address these issues, we propose a novel metric for learning mesh deformation. The metric is defined by sliced Wasserstein distance on meshes represented as probability measures that generalize the set-based approach. By leveraging probability measure space, we gain flexibility in encoding meshes using diverse forms of probability measures, such as continuous, empirical, and discrete measures via varifold representation. After having encoded probability measures, we can compare meshes by using the sliced Wasserstein distance, which is an effective optimal transport distance with linear computational complexity and can provide a fast statistical rate for approximating the surface of meshes. To this end, we employ a neural ordinary differential equation (ODE) to deform the input surface into the target shape by modeling the trajectories of the points on the surface. Our experiments on cortical surface reconstruction demonstrate that our approach surpasses other competing methods in multiple datasets and metrics. \ No newline at end of file diff --git a/data/2024/iclr/Differentiable Euler Characteristic Transforms for Shape Classification b/data/2024/iclr/Differentiable Euler Characteristic Transforms for Shape Classification new file mode 100644 index 0000000000..f2bd2020f4 --- /dev/null +++ b/data/2024/iclr/Differentiable Euler Characteristic Transforms for Shape Classification @@ -0,0 +1 @@ +The Euler Characteristic Transform (ECT) has proven to be a powerful representation, combining geometrical and topological characteristics of shapes and graphs. However, the ECT was hitherto unable to learn task-specific representations. We overcome this issue and develop a novel computational layer that enables learning the ECT in an end-to-end fashion. Our method, the Differentiable Euler Characteristic Transform (DECT), is fast and computationally efficient, while exhibiting performance on a par with more complex models in both graph and point cloud classification tasks. Moreover, we show that this seemingly simple statistic provides the same topological expressivity as more complex topological deep learning layers.
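As a rough illustration of what the ECT computes and where differentiability enters, the sketch below evaluates Euler characteristic curves of a graph over a set of directions, first with hard indicators and then with a sigmoid relaxation. The relaxation is one plausible way to obtain gradients and may differ from the exact layer used in the paper; all names and shapes are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def euler_characteristic_transform(verts, edges, directions, thresholds):
    # Plain (non-differentiable) ECT of a graph embedded in R^d.
    # verts: (V, d) coordinates; edges: (E, 2) index pairs;
    # directions: (K, d) unit vectors; thresholds: (T,) filtration values.
    vert_h = verts @ directions.T                                        # (V, K) heights per direction
    edge_h = torch.maximum(vert_h[edges[:, 0]], vert_h[edges[:, 1]])     # (E, K) edge entry heights
    curves = []
    for t in thresholds:
        n_v = (vert_h <= t).sum(dim=0)        # vertices present at level t
        n_e = (edge_h <= t).sum(dim=0)        # edges present at level t
        curves.append(n_v - n_e)              # chi = V - E for a graph
    return torch.stack(curves, dim=1).float() # (K, T)

def soft_ect(verts, edges, directions, thresholds, temperature=0.1):
    # Replace the hard indicator (h <= t) with a sigmoid so the transform is
    # differentiable with respect to the coordinates and the directions.
    vert_h = verts @ directions.T
    edge_h = torch.maximum(vert_h[edges[:, 0]], vert_h[edges[:, 1]])
    t = thresholds.view(-1, 1, 1)
    n_v = torch.sigmoid((t - vert_h) / temperature).sum(dim=1)           # (T, K)
    n_e = torch.sigmoid((t - edge_h) / temperature).sum(dim=1)
    return (n_v - n_e).T                                                 # (K, T)

verts = torch.randn(50, 3, requires_grad=True)
edges = torch.randint(0, 50, (120, 2))
dirs = F.normalize(torch.randn(16, 3), dim=1)
ts = torch.linspace(-3, 3, 32)
curves = soft_ect(verts, edges, dirs, ts)     # differentiable (16, 32) ECT image
```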
\ No newline at end of file diff --git a/data/2024/iclr/Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks b/data/2024/iclr/Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks new file mode 100644 index 0000000000..470a8d406a --- /dev/null +++ b/data/2024/iclr/Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks @@ -0,0 +1 @@ +This paper investigates efficient deep neural networks (DNNs) to replace dense unstructured weight matrices with structured ones that possess desired properties. The challenge arises because the optimal weight matrix structure in popular neural network models is obscure in most cases and may vary from layer to layer even in the same network. Prior structured matrices proposed for efficient DNNs were mostly hand-crafted without a generalized framework to systematically learn them. To address this issue, we propose a generalized and differentiable framework to learn efficient structures of weight matrices by gradient descent. We first define a new class of structured matrices that covers a wide range of structured matrices in the literature by adjusting the structural parameters. Then, the frequency-domain differentiable parameterization scheme based on the Gaussian-Dirichlet kernel is adopted to learn the structural parameters by proximal gradient descent. On the image and language tasks, our method learns efficient DNNs with structured matrices, achieving lower complexity and/or higher performance than prior approaches that employ low-rank, block-sparse, or block-low-rank matrices. \ No newline at end of file diff --git a/data/2024/iclr/Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach b/data/2024/iclr/Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach new file mode 100644 index 0000000000..59b7d4898e --- /dev/null +++ b/data/2024/iclr/Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach @@ -0,0 +1 @@ +Differentially Private Stochastic Gradient Descent with Gradient Clipping (DPSGD-GC) is a powerful tool for training deep learning models using sensitive data, providing both a solid theoretical privacy guarantee and high efficiency. However, using DPSGD-GC to ensure Differential Privacy (DP) comes at the cost of model performance degradation due to DP noise injection and gradient clipping. Existing research has extensively analyzed the theoretical convergence of DPSGD-GC, and has shown that it only converges when using large clipping thresholds that are dependent on problem-specific parameters. Unfortunately, these parameters are often unknown in practice, making it hard to choose the optimal clipping threshold. Therefore, in practice, DPSGD-GC suffers from degraded performance due to the {\it constant} bias introduced by the clipping. In our work, we propose a new error-feedback (EF) DP algorithm as an alternative to DPSGD-GC, which not only offers a diminishing utility bound without inducing a constant clipping bias, but more importantly, it allows for an arbitrary choice of clipping threshold that is independent of the problem. We establish an algorithm-specific DP analysis for our proposed algorithm, providing privacy guarantees based on R{\'e}nyi DP. Additionally, we demonstrate that under mild conditions, our algorithm can achieve nearly the same utility bound as DPSGD without gradient clipping. 
Our empirical results on CIFAR-10/100 and E2E datasets show that the proposed algorithm achieves higher accuracies than DPSGD while maintaining the same level of DP guarantee. \ No newline at end of file diff --git a/data/2024/iclr/Differentially Private Synthetic Data via Foundation Model APIs 1: Images b/data/2024/iclr/Differentially Private Synthetic Data via Foundation Model APIs 1: Images new file mode 100644 index 0000000000..6f9d72f642 --- /dev/null +++ b/data/2024/iclr/Differentially Private Synthetic Data via Foundation Model APIs 1: Images @@ -0,0 +1 @@ +Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy, as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models, which are only accessible via their inference APIs. However, this comes with greater challenges due to strictly more restrictive model access and the need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID $\leq$ 7.9 with privacy cost $\epsilon$ = 0.67, significantly improving over the previous SOTA of $\epsilon$ = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images. The code and data are released at https://github.com/microsoft/DPSDA. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization b/data/2024/iclr/Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization new file mode 100644 index 0000000000..a158698b3b --- /dev/null +++ b/data/2024/iclr/Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization @@ -0,0 +1 @@ +We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment, since entire trajectories must be used and the learning signal is present only at the terminal time. In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments, via parameterizing an additional "flow function". Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals.
Through various challenging experiments, we demonstrate that DGFS achieves more accurate estimates of the normalization constant than closely-related prior methods. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion Model for Dense Matching b/data/2024/iclr/Diffusion Model for Dense Matching new file mode 100644 index 0000000000..7989a8289a --- /dev/null +++ b/data/2024/iclr/Diffusion Model for Dense Matching @@ -0,0 +1 @@ +The objective for establishing dense correspondence between paired images consists of two terms: a data term and a prior term. While conventional techniques focused on defining hand-designed prior terms, which are difficult to formulate, recent approaches have focused on learning the data term with deep neural networks without explicitly modeling the prior, assuming that the model itself has the capacity to learn an optimal prior from a large-scale dataset. The performance improvement was substantial; however, these approaches often fail to address inherent ambiguities of matching, such as textureless regions, repetitive patterns, and large displacements. To address this, we propose DiffMatch, a novel conditional diffusion-based framework designed to explicitly model both the data and prior terms. Unlike previous approaches, this is accomplished by leveraging a conditional denoising diffusion model. DiffMatch consists of two main components: a conditional denoising diffusion module and a cost injection module. We stabilize the training process and reduce memory usage with a stage-wise training strategy. Furthermore, to boost performance, we introduce an inference technique that finds a better path to the accurate matching field. Our experimental results demonstrate significant performance improvements of our method over existing approaches, and the ablation studies validate our design choices along with the effectiveness of each component. Project page is available at https://ku-cvlab.github.io/DiffMatch/. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion Models for Multi-Task Generative Modeling b/data/2024/iclr/Diffusion Models for Multi-Task Generative Modeling new file mode 100644 index 0000000000..c0061f5400 --- /dev/null +++ b/data/2024/iclr/Diffusion Models for Multi-Task Generative Modeling @@ -0,0 +1 @@ +Diffusion-based generative modeling has been achieving state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to modeling a single generation task. Can we generalize diffusion models with multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space. We define the forward diffusion process to be driven by an information aggregation from multiple types of task-data, e.g., images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-modal data with a multi-task loss, which is derived from a new multi-modal variational lower bound that generalizes the standard diffusion model. We propose several multimodal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling.
Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-modal generative modeling, which we believe is an important research direction worthy of more future explorations. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion Posterior Sampling for Linear Inverse Problem Solving: A Filtering Perspective b/data/2024/iclr/Diffusion Posterior Sampling for Linear Inverse Problem Solving: A Filtering Perspective new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Diffusion Sampling with Momentum for Mitigating Divergence Artifacts b/data/2024/iclr/Diffusion Sampling with Momentum for Mitigating Divergence Artifacts new file mode 100644 index 0000000000..43abdf1952 --- /dev/null +++ b/data/2024/iclr/Diffusion Sampling with Momentum for Mitigating Divergence Artifacts @@ -0,0 +1 @@ +Despite the remarkable success of diffusion models in image generation, slow sampling remains a persistent issue. To accelerate the sampling process, prior studies have reformulated diffusion sampling as an ODE/SDE and introduced higher-order numerical methods. However, these methods often produce divergence artifacts, especially with a low number of sampling steps, which limits the achievable acceleration. In this paper, we investigate the potential causes of these artifacts and suggest that the small stability regions of these methods could be the principal cause. To address this issue, we propose two novel techniques. The first technique involves the incorporation of Heavy Ball (HB) momentum, a well-known technique for improving optimization, into existing diffusion numerical methods to expand their stability regions. We also prove that the resulting methods have first-order convergence. The second technique, called Generalized Heavy Ball (GHVB), constructs a new high-order method that offers a variable trade-off between accuracy and artifact suppression. Experimental results show that our techniques are highly effective in reducing artifacts and improving image quality, surpassing state-of-the-art diffusion solvers on both pixel-based and latent-based diffusion models for low-step sampling. Our research provides novel insights into the design of numerical methods for future diffusion work. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation b/data/2024/iclr/Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation new file mode 100644 index 0000000000..999941ce34 --- /dev/null +++ b/data/2024/iclr/Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation @@ -0,0 +1 @@ +Originating from the diffusion phenomenon in physics that describes particle movement, the diffusion generative models inherit the characteristics of stochastic random walk in the data space along the denoising trajectory. However, the intrinsic mutual interference among image regions contradicts the need for practical downstream application scenarios where the preservation of low-level pixel information from given conditioning is desired (e.g., customization tasks like personalized generation and inpainting based on a user-provided single image). 
In this work, we investigate the diffusion (physics) in diffusion (machine learning) properties and propose our Cyclic One-Way Diffusion (COW) method to control the direction of the diffusion phenomenon given a pre-trained frozen diffusion model for versatile customization application scenarios, where the low-level pixel information from the conditioning needs to be preserved. Notably, unlike most current methods that incorporate additional conditions by fine-tuning the base text-to-image diffusion model or learning auxiliary networks, our method provides a novel perspective to understand the task needs and is applicable to a wider range of customization scenarios in a learning-free manner. Extensive experimental results show that our proposed COW can achieve more flexible customization based on strict visual conditions in different application settings. Project page: https://wangruoyu02.github.io/cow.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion-TS: Interpretable Diffusion for General Time Series Generation b/data/2024/iclr/Diffusion-TS: Interpretable Diffusion for General Time Series Generation new file mode 100644 index 0000000000..c714851705 --- /dev/null +++ b/data/2024/iclr/Diffusion-TS: Interpretable Diffusion for General Time Series Generation @@ -0,0 +1 @@ +Denoising diffusion probabilistic models (DDPMs) are becoming the leading paradigm for generative models. They have recently shown breakthroughs in audio synthesis, time series imputation, and forecasting. In this paper, we propose Diffusion-TS, a novel diffusion-based framework that generates multivariate time series samples of high quality by using an encoder-decoder transformer with disentangled temporal representations, in which the decomposition technique guides Diffusion-TS to capture the semantic meaning of time series while transformers mine detailed sequential information from the noisy model input. Different from existing diffusion-based approaches, we train the model to directly reconstruct the sample instead of the noise in each diffusion step, combining a Fourier-based loss term. Diffusion-TS is expected to generate time series satisfying both interpretability and realness. In addition, it is shown that the proposed Diffusion-TS can be easily extended to conditional generation tasks, such as forecasting and imputation, without any model changes. This also motivates us to further explore the performance of Diffusion-TS under irregular settings. Finally, through qualitative and quantitative experiments, results show that Diffusion-TS achieves state-of-the-art results on various realistic analyses of time series. \ No newline at end of file diff --git a/data/2024/iclr/DiffusionNAG: Predictor-guided Neural Architecture Generation with Diffusion Models b/data/2024/iclr/DiffusionNAG: Predictor-guided Neural Architecture Generation with Diffusion Models new file mode 100644 index 0000000000..47c371ba9f --- /dev/null +++ b/data/2024/iclr/DiffusionNAG: Predictor-guided Neural Architecture Generation with Diffusion Models @@ -0,0 +1 @@ +Existing NAS methods suffer from an excessive amount of time spent on repetitive sampling and training of many task-irrelevant architectures. To tackle such limitations of existing NAS methods, we propose a paradigm shift from NAS to a novel conditional Neural Architecture Generation (NAG) framework based on diffusion models, dubbed DiffusionNAG.
Specifically, we consider the neural architectures as directed graphs and propose a graph diffusion model for generating them. Moreover, with the guidance of parameterized predictors, DiffusionNAG can flexibly generate task-optimal architectures with the desired properties for diverse tasks, by sampling from a region that is more likely to satisfy the properties. This conditional NAG scheme is significantly more efficient than previous NAS schemes which sample the architectures and filter them using the property predictors. We validate the effectiveness of DiffusionNAG through extensive experiments in two predictor-based NAS scenarios: Transferable NAS and Bayesian Optimization (BO)-based NAS. DiffusionNAG achieves superior performance with speedups of up to 35 times when compared to the baselines on Transferable NAS benchmarks. Furthermore, when integrated into a BO-based algorithm, DiffusionNAG outperforms existing BO-based NAS approaches, particularly in the large MobileNetV3 search space on the ImageNet 1K dataset. Code is available at https://github.com/CownowAn/DiffusionNAG. \ No newline at end of file diff --git a/data/2024/iclr/DiffusionSat: A Generative Foundation Model for Satellite Imagery b/data/2024/iclr/DiffusionSat: A Generative Foundation Model for Satellite Imagery new file mode 100644 index 0000000000..8987d2b3ad --- /dev/null +++ b/data/2024/iclr/DiffusionSat: A Generative Foundation Model for Satellite Imagery @@ -0,0 +1 @@ +Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets. As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, superresolution given multi-spectral inputs and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale generative foundation model for satellite imagery. The project website can be found here: https://samar-khanna.github.io/DiffusionSat/ \ No newline at end of file diff --git a/data/2024/iclr/Directly Fine-Tuning Diffusion Models on Differentiable Rewards b/data/2024/iclr/Directly Fine-Tuning Diffusion Models on Differentiable Rewards new file mode 100644 index 0000000000..2071975659 --- /dev/null +++ b/data/2024/iclr/Directly Fine-Tuning Diffusion Models on Differentiable Rewards @@ -0,0 +1 @@ +We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. 
We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms. \ No newline at end of file diff --git a/data/2024/iclr/Dirichlet-based Per-Sample Weighting by Transition Matrix for Noisy Label Learning b/data/2024/iclr/Dirichlet-based Per-Sample Weighting by Transition Matrix for Noisy Label Learning new file mode 100644 index 0000000000..b12b99abee --- /dev/null +++ b/data/2024/iclr/Dirichlet-based Per-Sample Weighting by Transition Matrix for Noisy Label Learning @@ -0,0 +1 @@ +For learning with noisy labels, the transition matrix, which explicitly models the relation between noisy label distribution and clean label distribution, has been utilized to achieve the statistical consistency of either the classifier or the risk. Previous research has focused more on how to estimate this transition matrix well, rather than how to utilize it. We propose that good utilization of the transition matrix is crucial and suggest a new utilization method based on resampling, coined RENT. Specifically, we first demonstrate that current utilizations can have potential limitations for implementation. As an extension to Reweighting, we suggest the Dirichlet distribution-based per-sample Weight Sampling (DWS) framework, and compare reweighting and resampling under the DWS framework. With the analyses from DWS, we propose RENT, a REsampling method with Noise Transition matrix. Empirically, RENT consistently outperforms existing transition matrix utilization methods, including reweighting, on various benchmark datasets. Our code is available at \url{https://github.com/BaeHeeSun/RENT}. \ No newline at end of file diff --git a/data/2024/iclr/Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search b/data/2024/iclr/Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search new file mode 100644 index 0000000000..a46f60e310 --- /dev/null +++ b/data/2024/iclr/Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search @@ -0,0 +1 @@ +Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly. Common failures include: (i) natural-looking text prompts generating images with the wrong content, or (ii) different random samples of the latent variables that generate vastly different, and even unrelated, outputs despite being conditioned on the same text prompt. In this work, we aim to study and understand the failure modes of TDMs in more detail. To achieve this, we propose SAGE, the first adversarial search method on TDMs that systematically explores the discrete prompt space and the high-dimensional latent space, to automatically discover undesirable behaviors and failure cases in image generation.
We use image classifiers as surrogate loss functions during searching, and employ human inspections to validate the identified failures. For the first time, our method enables efficient exploration of both the discrete and intricate human language space and the challenging latent space, overcoming the gradient vanishing problem. Then, we demonstrate the effectiveness of SAGE on five widely used generative models and reveal four typical failure modes: (1) We find a variety of natural text prompts that generate images failing to capture the semantics of input texts. We further discuss the underlying causes and potential solutions based on the results. (2) We find regions in the latent space that lead to distorted images independent of the text prompt, suggesting that parts of the latent space are not well-structured. (3) We also find latent samples that result in natural-looking images unrelated to the text prompt, implying a possible misalignment between the latent and prompt spaces. (4) By appending a single adversarial token embedding to any input prompts, we can generate a variety of specified target objects. Project page: https://sage-diffusion.github.io/ \ No newline at end of file diff --git a/data/2024/iclr/Discovering Temporally-Aware Reinforcement Learning Algorithms b/data/2024/iclr/Discovering Temporally-Aware Reinforcement Learning Algorithms new file mode 100644 index 0000000000..6567164a6b --- /dev/null +++ b/data/2024/iclr/Discovering Temporally-Aware Reinforcement Learning Algorithms @@ -0,0 +1 @@ +Recent advancements in meta-learning have enabled the automatic discovery of novel reinforcement learning algorithms parameterized by surrogate objective functions. To improve upon manually designed algorithms, the parameterization of this learned objective function must be expressive enough to represent novel principles of learning (instead of merely recovering already established ones) while still generalizing to a wide range of settings outside of its meta-training distribution. However, existing methods focus on discovering objective functions that, like many widely used objective functions in reinforcement learning, do not take into account the total number of steps allowed for training, or"training horizon". In contrast, humans use a plethora of different learning objectives across the course of acquiring a new ability. For instance, students may alter their studying techniques based on the proximity to exam deadlines and their self-assessed capabilities. This paper contends that ignoring the optimization time horizon significantly restricts the expressive potential of discovered learning algorithms. We propose a simple augmentation to two existing objective discovery approaches that allows the discovered algorithm to dynamically update its objective function throughout the agent's training procedure, resulting in expressive schedules and increased generalization across different training horizons. In the process, we find that commonly used meta-gradient approaches fail to discover such adaptive objective functions while evolution strategies discover highly dynamic learning rules. We demonstrate the effectiveness of our approach on a wide range of tasks and analyze the resulting learned algorithms, which we find effectively balance exploration and exploitation by modifying the structure of their learning rules throughout the agent's lifetime. 
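One way to picture the proposed augmentation is an objective function that receives the normalized training progress as an extra input, so the learning rule itself can change over the agent's lifetime. The sketch below is a hypothetical, simplified parameterization (a small MLP over log-probability ratio, advantage, and progress), not the discovered objectives from the paper, whose parameters would be tuned by an outer evolution-strategies loop.

```python
import torch
import torch.nn as nn

class TemporallyAwareObjective(nn.Module):
    # Meta-parameterized surrogate objective that also sees training progress.
    # Inputs per sample: policy log-prob ratio, advantage, and progress = step / total_steps.
    # The progress input is what lets the objective behave differently early vs. late in training.
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, log_ratio, advantage, progress):
        feats = torch.stack([log_ratio, advantage, progress.expand_as(advantage)], dim=-1)
        return self.net(feats).mean()     # scalar loss for the inner-loop policy update

# Inner loop (sketch): the agent minimizes the learned objective at each update,
# while an outer loop tunes the objective's weights against final returns.
objective = TemporallyAwareObjective()
log_ratio = torch.zeros(64)               # placeholder policy log-prob ratios
advantage = torch.randn(64)               # placeholder advantage estimates
progress = torch.tensor(0.25)             # 25% of the training horizon used
loss = objective(log_ratio, advantage, progress)
```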
\ No newline at end of file diff --git a/data/2024/iclr/Discovering modular solutions that generalize compositionally b/data/2024/iclr/Discovering modular solutions that generalize compositionally new file mode 100644 index 0000000000..8a4e6c374f --- /dev/null +++ b/data/2024/iclr/Discovering modular solutions that generalize compositionally @@ -0,0 +1 @@ +Many complex tasks can be decomposed into simpler, independent parts. Discovering such underlying compositional structure has the potential to enable compositional generalization. Despite progress, our most powerful systems struggle to compose flexibly. It therefore seems natural to make models more modular to help capture the compositional nature of many tasks. However, it is unclear under which circumstances modular systems can discover hidden compositional structure. To shed light on this question, we study a teacher-student setting with a modular teacher where we have full control over the composition of ground truth modules. This allows us to relate the problem of compositional generalization to that of identification of the underlying modules. In particular we study modularity in hypernetworks representing a general class of multiplicative interactions. We show theoretically that identification up to linear transformation purely from demonstrations is possible without having to learn an exponential number of module combinations. We further demonstrate empirically that under the theoretically identified conditions, meta-learning from finite data can discover modular policies that generalize compositionally in a number of complex environments. \ No newline at end of file diff --git a/data/2024/iclr/DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation b/data/2024/iclr/DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation new file mode 100644 index 0000000000..f15776ab1d --- /dev/null +++ b/data/2024/iclr/DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation @@ -0,0 +1 @@ +Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to the failure of subject-driven text-to-image generation as follows: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, resulting in the generated images heavily dependent on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding can not be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle the problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. 
We further design novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Additionally, by combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability. \ No newline at end of file diff --git a/data/2024/iclr/Disentangling Time Series Representations via Contrastive Independence-of-Support on l-Variational Inference b/data/2024/iclr/Disentangling Time Series Representations via Contrastive Independence-of-Support on l-Variational Inference new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI b/data/2024/iclr/Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI new file mode 100644 index 0000000000..aef6212a3c --- /dev/null +++ b/data/2024/iclr/Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI @@ -0,0 +1 @@ +Characterizing samples that are difficult to learn from is crucial to developing highly performant ML models. This has led to numerous Hardness Characterization Methods (HCMs) that aim to identify "hard" samples. However, there is a lack of consensus regarding the definition and evaluation of "hardness". Unfortunately, current HCMs have only been evaluated on specific types of hardness and often only qualitatively or with respect to downstream performance, overlooking the fundamental quantitative identification task. We address this gap by presenting a fine-grained taxonomy of hardness types. Additionally, we propose the Hardness Characterization Analysis Toolkit (H-CAT), which supports comprehensive and quantitative benchmarking of HCMs across the hardness taxonomy and can easily be extended to new HCMs, hardness types, and datasets. We use H-CAT to evaluate 13 different HCMs across 8 hardness types. This comprehensive evaluation, encompassing over 14K setups, uncovers strengths and weaknesses of different HCMs, leading to practical tips to guide HCM selection and future development. Our findings highlight the need for more comprehensive HCM evaluation, while we hope our hardness taxonomy and toolkit will advance the principled evaluation and uptake of data-centric AI methods. \ No newline at end of file diff --git a/data/2024/iclr/Dissecting learning and forgetting in language model finetuning b/data/2024/iclr/Dissecting learning and forgetting in language model finetuning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DistillSpec: Improving Speculative Decoding via Knowledge Distillation b/data/2024/iclr/DistillSpec: Improving Speculative Decoding via Knowledge Distillation new file mode 100644 index 0000000000..bf00c4744b --- /dev/null +++ b/data/2024/iclr/DistillSpec: Improving Speculative Decoding via Knowledge Distillation @@ -0,0 +1 @@ +Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in text generated according to the target model's distribution.
However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improving the draft and target alignment: utilizing on-policy data generation from the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation. \ No newline at end of file diff --git a/data/2024/iclr/Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF b/data/2024/iclr/Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF new file mode 100644 index 0000000000..3874ac8277 --- /dev/null +++ b/data/2024/iclr/Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF @@ -0,0 +1 @@ +In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. 
Our code and data are available at https://github.com/cassidylaidlaw/hidden-context \ No newline at end of file diff --git a/data/2024/iclr/Distributionally Robust Optimization with Bias and Variance Reduction b/data/2024/iclr/Distributionally Robust Optimization with Bias and Variance Reduction new file mode 100644 index 0000000000..023434fdc0 --- /dev/null +++ b/data/2024/iclr/Distributionally Robust Optimization with Bias and Variance Reduction @@ -0,0 +1 @@ +We consider the distributionally robust optimization (DRO) problem with a spectral risk-based uncertainty set and $f$-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and average top-$k$ loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3$\times$ faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains. \ No newline at end of file diff --git a/data/2024/iclr/DittoGym: Learning to Control Soft Shape-Shifting Robots b/data/2024/iclr/DittoGym: Learning to Control Soft Shape-Shifting Robots new file mode 100644 index 0000000000..a629a550c9 --- /dev/null +++ b/data/2024/iclr/DittoGym: Learning to Control Soft Shape-Shifting Robots @@ -0,0 +1 @@ +Robot co-design, where the morphology of a robot is optimized jointly with a learned policy to solve a specific task, is an emerging area of research. It holds particular promise for soft robots, which are amenable to novel manufacturing techniques that can realize learned morphologies and actuators. Inspired by nature and recent novel robot designs, we propose to go a step further and explore novel reconfigurable robots, defined as robots that can change their morphology within their lifetime. We formalize control of reconfigurable soft robots as a high-dimensional reinforcement learning (RL) problem. We unify morphology change, locomotion, and environment interaction in the same action space, and introduce an appropriate, coarse-to-fine curriculum that enables us to discover policies that accomplish fine-grained control of the resulting robots. We also introduce DittoGym, a comprehensive RL benchmark for reconfigurable soft robots that require fine-grained morphology changes to accomplish their tasks. Finally, we evaluate our proposed coarse-to-fine algorithm on DittoGym and demonstrate robots that learn to change their morphology several times within a sequence, uniquely enabled by our RL algorithm. More results are available at https://dittogym.github.io.
\ No newline at end of file diff --git a/data/2024/iclr/Diverse Projection Ensembles for Distributional Reinforcement Learning b/data/2024/iclr/Diverse Projection Ensembles for Distributional Reinforcement Learning new file mode 100644 index 0000000000..20fc758bca --- /dev/null +++ b/data/2024/iclr/Diverse Projection Ensembles for Distributional Reinforcement Learning @@ -0,0 +1 @@ +In contrast to classical reinforcement learning, distributional reinforcement learning algorithms aim to learn the distribution of returns rather than their expected value. Since the nature of the return distribution is generally unknown a priori or arbitrarily complex, a common approach finds approximations within a set of representable, parametric distributions. Typically, this involves a projection of the unconstrained distribution onto the set of simplified distributions. We argue that this projection step entails a strong inductive bias when coupled with neural networks and gradient descent, thereby profoundly impacting the generalization behavior of learned models. In order to facilitate reliable uncertainty estimation through diversity, this work studies the combination of several different projections and representations in a distributional ensemble. We establish theoretical properties of such projection ensembles and derive an algorithm that uses ensemble disagreement, measured by the average $1$-Wasserstein distance, as a bonus for deep exploration. We evaluate our algorithm on the Behavior Suite benchmark and find that diverse projection ensembles lead to significant performance improvements over existing methods on a wide variety of tasks, with the most pronounced gains in directed exploration problems. \ No newline at end of file diff --git a/data/2024/iclr/Divide and not forget: Ensemble of selectively trained experts in Continual Learning b/data/2024/iclr/Divide and not forget: Ensemble of selectively trained experts in Continual Learning new file mode 100644 index 0000000000..d902e297cb --- /dev/null +++ b/data/2024/iclr/Divide and not forget: Ensemble of selectively trained experts in Continual Learning @@ -0,0 +1 @@ +Class-incremental learning is becoming more popular as it helps models widen their applicability while not forgetting what they already know. A trend in this area is to use a mixture-of-experts technique, where different models work together to solve the task. However, the experts are usually trained all at once using whole task data, which makes them all prone to forgetting and increases the computational burden. To address this limitation, we introduce a novel approach named SEED. SEED selects only one expert, the optimal one for a considered task, and uses data from this task to fine-tune only this expert. For this purpose, each expert represents each class with a Gaussian distribution, and the optimal expert is selected based on the similarity of those distributions. Consequently, SEED increases diversity and heterogeneity within the experts while maintaining the high stability of this ensemble method. Extensive experiments demonstrate that SEED achieves state-of-the-art performance in exemplar-free settings across various scenarios, showing the potential of expert diversification through data in continual learning. \ No newline at end of file diff --git a/data/2024/iclr/Do Generated Data Always Help Contrastive Learning? b/data/2024/iclr/Do Generated Data Always Help Contrastive Learning?
new file mode 100644 index 0000000000..642b4c70c6 --- /dev/null +++ b/data/2024/iclr/Do Generated Data Always Help Contrastive Learning? @@ -0,0 +1 @@ +Contrastive Learning (CL) has emerged as one of the most successful paradigms for unsupervised visual representation learning, yet it often depends on intensive manual data augmentations. With the rise of generative models, especially diffusion models, the ability to generate realistic images close to the real data distribution has been well recognized. These generated high-quality images have been successfully applied to enhance contrastive representation learning, a technique termed ``data inflation''. However, we find that the generated data (even from a good diffusion model like DDPM) may sometimes even harm contrastive learning. We investigate the causes behind this failure from the perspective of both data inflation and data augmentation. For the first time, we reveal the complementary roles of the two: stronger data inflation should be accompanied by weaker augmentations, and vice versa. We also provide rigorous theoretical explanations for these phenomena by deriving generalization bounds under data inflation. Drawing from these insights, we propose Adaptive Inflation (AdaInf), a purely data-centric strategy without introducing any extra computation cost. On benchmark datasets, AdaInf can bring significant improvements for various contrastive learning methods. Notably, without using external data, AdaInf obtains 94.70% linear accuracy on CIFAR-10 with SimCLR, setting a new record that surpasses many sophisticated methods. Code is available at https://github.com/PKU-ML/adainf. \ No newline at end of file diff --git a/data/2024/iclr/DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models b/data/2024/iclr/DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models new file mode 100644 index 0000000000..4f1023eb5d --- /dev/null +++ b/data/2024/iclr/DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models @@ -0,0 +1 @@ +Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge or additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLM has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves truthfulness across multiple-choice tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17% absolute points, demonstrating its potential in making LLMs reliably generate truthful facts. \ No newline at end of file diff --git a/data/2024/iclr/Does CLIP's generalization performance mainly stem from high train-test similarity? b/data/2024/iclr/Does CLIP's generalization performance mainly stem from high train-test similarity?
new file mode 100644 index 0000000000..d437952881 --- /dev/null +++ b/data/2024/iclr/Does CLIP's generalization performance mainly stem from high train-test similarity? @@ -0,0 +1 @@ +Foundation models like CLIP are trained on hundreds of millions of samples and effortlessly generalize to new tasks and inputs. Out of the box, CLIP shows stellar zero-shot and few-shot capabilities on a wide range of out-of-distribution (OOD) benchmarks, which prior works attribute mainly to today's large and comprehensive training dataset (like LAION). However, it is questionable how meaningful terms like out-of-distribution generalization are for CLIP as it seems likely that web-scale datasets like LAION simply contain many samples that are similar to common OOD benchmarks originally designed for ImageNet. To test this hypothesis, we retrain CLIP on pruned LAION splits that replicate ImageNet's train-test similarity with respect to common OOD benchmarks. While we observe a performance drop on some benchmarks, surprisingly, CLIP's overall performance remains high. This shows that high train-test similarity is insufficient to explain CLIP's OOD performance, and other properties of the training data must drive CLIP to learn more generalizable representations. Additionally, by pruning data points that are dissimilar to the OOD benchmarks, we uncover a 100M split of LAION ($\frac{1}{4}$th of its original size) on which CLIP can be trained to match its original OOD performance. \ No newline at end of file diff --git a/data/2024/iclr/Does Progress On Object Recognition Benchmarks Improve Generalization on Crowdsourced, Global Data? b/data/2024/iclr/Does Progress On Object Recognition Benchmarks Improve Generalization on Crowdsourced, Global Data? new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Does Writing with Language Models Reduce Content Diversity? b/data/2024/iclr/Does Writing with Language Models Reduce Content Diversity? new file mode 100644 index 0000000000..bc41635b69 --- /dev/null +++ b/data/2024/iclr/Does Writing with Language Models Reduce Content Diversity? @@ -0,0 +1 @@ +Large language models (LLMs) have led to a surge in collaborative writing with model assistance. As different users incorporate suggestions from the same model, there is a risk of decreased diversity in the produced content, potentially limiting diverse perspectives in public discourse. In this work, we measure the impact of co-writing on diversity via a controlled experiment, where users write argumentative essays in three setups -- using a base LLM (GPT3), a feedback-tuned LLM (InstructGPT), and writing without model help. We develop a set of diversity metrics and find that writing with InstructGPT (but not the GPT3) results in a statistically significant reduction in diversity. Specifically, it increases the similarity between the writings of different authors and reduces the overall lexical and content diversity. We additionally find that this effect is mainly attributable to InstructGPT contributing less diverse text to co-written essays. In contrast, the user-contributed text remains unaffected by model collaboration. This suggests that the recent improvement in generation quality from adapting models to human feedback might come at the cost of more homogeneous and less diverse content. 
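The essay-diversity abstract above refers to a set of diversity metrics without defining them in this summary. As a rough illustration only, and not the paper's exact metrics, corpus-level lexical diversity and cross-author homogeneity can be approximated with distinct-n and mean pairwise n-gram overlap:

```python
# Illustrative sketch: simple lexical-diversity and homogeneity measures for a
# set of essays. These are plausible stand-ins, not the metrics used in the paper.
from itertools import combinations

def ngrams(text: str, n: int = 2) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def distinct_n(essays: list[str], n: int = 2) -> float:
    """Fraction of unique n-grams across the corpus (higher = more diverse)."""
    unique, total = set(), 0
    for e in essays:
        toks = e.lower().split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        unique.update(grams)
        total += len(grams)
    return len(unique) / max(total, 1)

def mean_pairwise_jaccard(essays: list[str], n: int = 2) -> float:
    """Average n-gram overlap between author pairs (higher = more homogeneous)."""
    sims = [len(ngrams(a, n) & ngrams(b, n)) / max(len(ngrams(a, n) | ngrams(b, n)), 1)
            for a, b in combinations(essays, 2)]
    return sum(sims) / max(len(sims), 1)
```

A reduction in content diversity of the kind described would show up as lower distinct-n and higher mean pairwise overlap for essays co-written with the feedback-tuned model than for essays written without model help.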
\ No newline at end of file diff --git a/data/2024/iclr/Domain Randomization via Entropy Maximization b/data/2024/iclr/Domain Randomization via Entropy Maximization new file mode 100644 index 0000000000..2f883d2109 --- /dev/null +++ b/data/2024/iclr/Domain Randomization via Entropy Maximization @@ -0,0 +1 @@ +Varying dynamics parameters in simulation is a popular Domain Randomization (DR) approach for overcoming the reality gap in Reinforcement Learning (RL). Nevertheless, DR heavily hinges on the choice of the sampling distribution of the dynamics parameters, since high variability is crucial to regularize the agent's behavior but notoriously leads to overly conservative policies when randomizing excessively. In this paper, we propose a novel approach to address sim-to-real transfer, which automatically shapes dynamics distributions during training in simulation without requiring real-world data. We introduce DOmain RAndomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution while retaining generalization capabilities. In achieving this, DORAEMON gradually increases the diversity of sampled dynamics parameters as long as the probability of success of the current policy is sufficiently high. We empirically validate the consistent benefits of DORAEMON in obtaining highly adaptive and generalizable policies, i.e. solving the task at hand across the widest range of dynamics parameters, as opposed to representative baselines from the DR literature. Notably, we also demonstrate the Sim2Real applicability of DORAEMON through its successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters. \ No newline at end of file diff --git a/data/2024/iclr/Domain constraints improve risk prediction when outcome data is missing b/data/2024/iclr/Domain constraints improve risk prediction when outcome data is missing new file mode 100644 index 0000000000..539958e20d --- /dev/null +++ b/data/2024/iclr/Domain constraints improve risk prediction when outcome data is missing @@ -0,0 +1 @@ +Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that historical decision-making determines whether the outcome is observed: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. 
Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings. \ No newline at end of file diff --git a/data/2024/iclr/Domain-Agnostic Molecular Generation with Chemical Feedback b/data/2024/iclr/Domain-Agnostic Molecular Generation with Chemical Feedback new file mode 100644 index 0000000000..25fe0122e7 --- /dev/null +++ b/data/2024/iclr/Domain-Agnostic Molecular Generation with Chemical Feedback @@ -0,0 +1 @@ +The generation of molecules with desired properties has become increasingly popular, revolutionizing the way scientists design molecular structures and providing valuable support for chemical and drug design. However, despite the potential of language models in molecule generation, they face challenges such as generating syntactically or chemically flawed molecules, having narrow domain focus, and struggling to create diverse and feasible molecules due to limited annotated data or external molecular databases. To tackle these challenges, we introduce MolGen, a pre-trained molecular language model tailored specifically for molecule generation. Through the reconstruction of over 100 million molecular SELFIES, MolGen internalizes structural and grammatical insights. This is further enhanced by domain-agnostic molecular prefix tuning, fostering robust knowledge transfer across diverse domains. Importantly, our chemical feedback paradigm steers the model away from molecular hallucinations, ensuring alignment between the model's estimated probabilities and real-world chemical preferences. Extensive experiments on well-known benchmarks underscore MolGen's optimization capabilities in properties such as penalized logP, QED, and molecular docking. Additional analyses confirm its proficiency in accurately capturing molecule distributions, discerning intricate structural patterns, and efficiently exploring the chemical space. Code is available at https://github.com/zjunlp/MolGen. \ No newline at end of file diff --git a/data/2024/iclr/Domain-Inspired Sharpness-Aware Minimization Under Domain Shifts b/data/2024/iclr/Domain-Inspired Sharpness-Aware Minimization Under Domain Shifts new file mode 100644 index 0000000000..b637825e6b --- /dev/null +++ b/data/2024/iclr/Domain-Inspired Sharpness-Aware Minimization Under Domain Shifts @@ -0,0 +1 @@ +This paper presents a Domain-Inspired Sharpness-Aware Minimization (DISAM) algorithm for optimization under domain shifts. It is motivated by the inconsistent convergence degree of SAM across different domains, which induces optimization bias towards certain domains and thus impairs the overall convergence. To address this issue, we consider the domain-level convergence consistency in the sharpness estimation to prevent the overwhelming (deficient) perturbations for less (well) optimized domains. Specifically, DISAM introduces the constraint of minimizing variance in the domain loss, which allows the elastic gradient calibration in perturbation generation: when one domain is optimized above the averaging level \textit{w.r.t.} loss, the gradient perturbation towards that domain will be weakened automatically, and vice versa. Under this mechanism, we theoretically show that DISAM can achieve faster overall convergence and improved generalization in principle when inconsistent convergence emerges. Extensive experiments on various domain generalization benchmarks show the superiority of DISAM over a range of state-of-the-art methods. 
Furthermore, we show the superior efficiency of DISAM in parameter-efficient fine-tuning combined with pretrained models. The source code is released at https://github.com/MediaBrain-SJTU/DISAM. \ No newline at end of file diff --git a/data/2024/iclr/Don't Judge by the Look: Towards Motion Coherent Video Representation b/data/2024/iclr/Don't Judge by the Look: Towards Motion Coherent Video Representation new file mode 100644 index 0000000000..b56acb72c0 --- /dev/null +++ b/data/2024/iclr/Don't Judge by the Look: Towards Motion Coherent Video Representation @@ -0,0 +1 @@ +Current training pipelines in object recognition neglect Hue Jittering during data augmentation, as it not only brings appearance changes that are detrimental to classification but is also inefficient to implement in practice. In this study, we investigate the effect of hue variance in the context of video understanding and find this variance to be beneficial, since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video understanding, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances. Concretely, we propose an operation, SwapMix, to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, forcing the model to learn appearance-invariant representations. Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the application of VA in other augmentation methods. Code is available at https://github.com/BeSpontaneous/MCA-pytorch. \ No newline at end of file diff --git a/data/2024/iclr/Don't Play Favorites: Minority Guidance for Diffusion Models b/data/2024/iclr/Don't Play Favorites: Minority Guidance for Diffusion Models new file mode 100644 index 0000000000..de437035b9 --- /dev/null +++ b/data/2024/iclr/Don't Play Favorites: Minority Guidance for Diffusion Models @@ -0,0 +1 @@ +We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating a sufficient number of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making them ineffective and time-consuming for the minority-generation task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first highlight that Tweedie's denoising formula yields favorable results for majority samples. This observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating high-quality minority samples over existing generative samplers.
We showcase that the performance benefit of our framework persists even in demanding real-world scenarios such as medical imaging, further underscoring the practical significance of our work. Code is available at https://github.com/soobin-um/minority-guidance. \ No newline at end of file diff --git a/data/2024/iclr/Don't Trust: Verify - Grounding LLM Quantitative Reasoning with Autoformalization b/data/2024/iclr/Don't Trust: Verify - Grounding LLM Quantitative Reasoning with Autoformalization new file mode 100644 index 0000000000..83f1b2463a --- /dev/null +++ b/data/2024/iclr/Don't Trust: Verify - Grounding LLM Quantitative Reasoning with Autoformalization @@ -0,0 +1 @@ +Large language models (LLMs), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of an LLM contains sufficiently many examples of formal mathematics (e.g., in Isabelle, a formal theorem-proving environment), it can be prompted to translate, i.e., autoformalize, informal mathematical statements into formal Isabelle code, which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on the GSM8K, MATH, and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting, the previous best method for identifying correct answers, outperforming it by more than 12% on GSM8K. In our experiments, it improves results consistently across all datasets and LLM model sizes. The code can be found at https://github.com/jinpz/dtv. \ No newline at end of file diff --git a/data/2024/iclr/Doubly Robust Instance-Reweighted Adversarial Training b/data/2024/iclr/Doubly Robust Instance-Reweighted Adversarial Training new file mode 100644 index 0000000000..5023909f5c --- /dev/null +++ b/data/2024/iclr/Doubly Robust Instance-Reweighted Adversarial Training @@ -0,0 +1 @@ +Assigning importance weights to adversarial data has achieved great success in training adversarially robust networks under limited model capacity. However, existing instance-reweighted adversarial training (AT) methods heavily depend on heuristics and/or geometric interpretations to determine those importance weights, making these algorithms lack rigorous theoretical justification or guarantees. Moreover, recent research has shown that adversarial training suffers from a severe non-uniform robust performance across the training distribution, e.g., data points belonging to some classes can be much more vulnerable to adversarial attacks than others. To address both issues, in this paper, we propose a novel doubly-robust instance reweighted AT framework, which obtains the importance weights by exploring distributionally robust optimization (DRO) techniques, and at the same time boosts the robustness on the most vulnerable examples. In particular, our importance weights are obtained by optimizing a KL-divergence-regularized loss function, which allows us to devise new algorithms with a theoretical convergence guarantee.
Experiments on standard classification datasets demonstrate that our proposed approach outperforms related state-of-the-art baseline methods in terms of average robust performance, and at the same time improves the robustness against attacks on the weakest data points. Code will be available soon. \ No newline at end of file diff --git a/data/2024/iclr/Doubly Robust Proximal Causal Learning for Continuous Treatments b/data/2024/iclr/Doubly Robust Proximal Causal Learning for Continuous Treatments new file mode 100644 index 0000000000..ee1964b2e7 --- /dev/null +++ b/data/2024/iclr/Doubly Robust Proximal Causal Learning for Continuous Treatments @@ -0,0 +1 @@ +Proximal causal learning is a promising framework for identifying the causal effect in the presence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can handle continuous treatments well. Exploiting its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications. \ No newline at end of file diff --git a/data/2024/iclr/DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization b/data/2024/iclr/DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization new file mode 100644 index 0000000000..238b82cd94 --- /dev/null +++ b/data/2024/iclr/DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization @@ -0,0 +1 @@ +Visual reinforcement learning (RL) has shown promise in continuous control tasks. Despite its progress, current algorithms are still unsatisfactory in virtually every aspect of performance, such as sample efficiency, asymptotic performance, and robustness to the choice of random seeds. In this paper, we identify a major shortcoming of existing visual RL methods: agents often exhibit sustained inactivity during early training, thereby limiting their ability to explore effectively. Expanding upon this crucial observation, we additionally unveil a significant correlation between the agents' inclination towards motorically inactive exploration and the absence of neuronal activity within their policy networks. To quantify this inactivity, we adopt the dormant ratio as a metric to measure inactivity in the RL agent's network. Empirically, we also recognize that the dormant ratio can act as a standalone indicator of an agent's activity level, regardless of the received reward signals. Leveraging the aforementioned insights, we introduce DrM, a method that uses three core mechanisms to guide agents' exploration-exploitation trade-offs by actively minimizing the dormant ratio.
Experiments demonstrate that DrM achieves significant improvements in sample efficiency and asymptotic performance with no broken seeds (76 seeds in total) across three continuous control benchmark environments, including DeepMind Control Suite, MetaWorld, and Adroit. Most importantly, DrM is the first model-free algorithm that consistently solves tasks in both the Dog and Manipulator domains from the DeepMind Control Suite as well as three dexterous hand manipulation tasks without demonstrations in Adroit, all based on pixel observations. \ No newline at end of file diff --git a/data/2024/iclr/DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks b/data/2024/iclr/DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks new file mode 100644 index 0000000000..1142bd984f --- /dev/null +++ b/data/2024/iclr/DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks @@ -0,0 +1 @@ +The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demand substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and demonstrations if given. The learned rewards can be \textit{reused} in unseen tasks, thus reducing the human effort for reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered rewards on some tasks. See our project page (https://sites.google.com/view/iclr24drs) for more details. \ No newline at end of file diff --git a/data/2024/iclr/DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models b/data/2024/iclr/DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models new file mode 100644 index 0000000000..2221f4bdb8 --- /dev/null +++ b/data/2024/iclr/DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models @@ -0,0 +1 @@ +Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It can transform the editing signals into gradients via feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, we also build a multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention is added to maintain the consistency between the original image and the editing result. Our method, through an efficient design, achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging. It is worth noting that all editing and content preservation signals come from the image itself, and the model does not require fine-tuning or additional modules. 
Our source code will be available at https://github.com/MC-E/DragonDiffusion. \ No newline at end of file diff --git a/data/2024/iclr/DreamClean: Restoring Clean Image Using Deep Diffusion Prior b/data/2024/iclr/DreamClean: Restoring Clean Image Using Deep Diffusion Prior new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior b/data/2024/iclr/DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior new file mode 100644 index 0000000000..0641986785 --- /dev/null +++ b/data/2024/iclr/DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior @@ -0,0 +1 @@ +We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes geometry consistency but compromises texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, DreamBooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation. Code available at https://github.com/deepseek-ai/DreamCraft3D. \ No newline at end of file diff --git a/data/2024/iclr/DreamFlow: High-quality text-to-3D generation by Approximating Probability Flow b/data/2024/iclr/DreamFlow: High-quality text-to-3D generation by Approximating Probability Flow new file mode 100644 index 0000000000..7db74b4996 --- /dev/null +++ b/data/2024/iclr/DreamFlow: High-quality text-to-3D generation by Approximating Probability Flow @@ -0,0 +1 @@ +Recent progress in text-to-3D generation has been achieved through the utilization of score distillation methods: they make use of pre-trained text-to-image (T2I) diffusion models by distilling via the diffusion model training objective. However, such an approach inevitably results in the use of random timesteps at each update, which increases the variance of the gradient and ultimately prolongs the optimization process. In this paper, we propose to enhance the text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule. To this end, we interpret text-to-3D optimization as a multi-view image-to-image translation problem, and propose a solution by approximating the probability flow.
By leveraging the proposed novel optimization algorithm, we design DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization framework that enables fast generation of high-quality and high-resolution (i.e., 1024x1024) 3D content. For example, we demonstrate that DreamFlow is 5 times faster than the existing state-of-the-art text-to-3D method, while producing more photorealistic 3D content. Visit our project page (https://kyungmnlee.github.io/dreamflow.github.io/) for visualizations. \ No newline at end of file diff --git a/data/2024/iclr/DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation b/data/2024/iclr/DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation new file mode 100644 index 0000000000..8430e49f89 --- /dev/null +++ b/data/2024/iclr/DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation @@ -0,0 +1 @@ +Recent advances in 3D content creation mostly leverage optimization-based 3D generation via score distillation sampling (SDS). Though promising results have been exhibited, these methods often suffer from slow per-sample optimization, limiting their practical usage. In this paper, we propose DreamGaussian, a novel 3D content generation framework that achieves both efficiency and quality simultaneously. Our key insight is to design a generative 3D Gaussian Splatting model with accompanying mesh extraction and texture refinement in UV space. In contrast to the occupancy pruning used in Neural Radiance Fields, we demonstrate that the progressive densification of 3D Gaussians converges significantly faster for 3D generative tasks. To further enhance the texture quality and facilitate downstream applications, we introduce an efficient algorithm to convert 3D Gaussians into textured meshes and apply a fine-tuning stage to refine the details. Extensive experiments demonstrate the superior efficiency and competitive generation quality of our proposed approach. Notably, DreamGaussian produces high-quality textured meshes in just 2 minutes from a single-view image, achieving approximately 10 times acceleration compared to existing methods. \ No newline at end of file diff --git a/data/2024/iclr/DreamLLM: Synergistic Multimodal Comprehension and Creation b/data/2024/iclr/DreamLLM: Synergistic Multimodal Comprehension and Creation new file mode 100644 index 0000000000..16d682dbbe --- /dev/null +++ b/data/2024/iclr/DreamLLM: Synergistic Multimodal Comprehension and Creation @@ -0,0 +1 @@ +This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with the frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content.
Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, benefiting from the enhanced learning synergy. Project page: https://dreamllm.github.io. \ No newline at end of file diff --git a/data/2024/iclr/DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing b/data/2024/iclr/DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing new file mode 100644 index 0000000000..d333ab8c98 --- /dev/null +++ b/data/2024/iclr/DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing @@ -0,0 +1 @@ +Model-based reinforcement learning (MBRL) has gained much attention for its ability to learn complex behaviors in a sample-efficient way: planning actions by generating imaginary trajectories with predicted rewards. Despite its success, we find that, surprisingly, reward prediction is often a bottleneck of MBRL, especially for sparse rewards that are challenging (or even ambiguous) to predict. Motivated by the intuition that humans can learn from rough reward estimates, we propose a simple yet effective reward smoothing approach, DreamSmooth, which learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep. We empirically show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks both in sample efficiency and final performance, without losing performance on common benchmarks such as the DeepMind Control Suite and Atari. \ No newline at end of file diff --git a/data/2024/iclr/DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation b/data/2024/iclr/DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation new file mode 100644 index 0000000000..fe36cc7f23 --- /dev/null +++ b/data/2024/iclr/DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation @@ -0,0 +1 @@ +Text-to-image diffusion models pre-trained on billions of image-text pairs have recently enabled 3D content creation by optimizing a randomly initialized differentiable 3D representation with score distillation. However, the optimization process suffers from slow convergence and the resultant 3D models often exhibit two limitations: (a) quality concerns such as missing attributes and distorted shape and texture; (b) extremely low diversity compared to text-guided image synthesis. In this paper, we show that the conflict between the 3D optimization process and uniform timestep sampling in score distillation is the main reason for these limitations. To resolve this conflict, we propose to prioritize timestep sampling with monotonically non-increasing functions, which aligns the 3D optimization process with the sampling process of the diffusion model. Extensive experiments show that our simple redesign significantly improves 3D content creation with faster convergence, better quality, and greater diversity.
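To illustrate the scheduling idea described in the DreamTime abstract, the snippet below is a sketch under assumptions; the paper's exact schedule and any timestep weighting are not reproduced here. It contrasts the standard uniform timestep draw used in score distillation with a monotonically non-increasing schedule over the course of 3D optimization:

```python
# Sketch only: replace uniform timestep sampling in score distillation with a
# monotonically non-increasing timestep schedule (linear annealing assumed here).
import random

def uniform_timestep(num_train_timesteps: int = 1000) -> int:
    # Standard SDS: draw a diffusion timestep uniformly at every update.
    return random.randint(1, num_train_timesteps - 1)

def annealed_timestep(step: int, total_steps: int,
                      t_max: int = 980, t_min: int = 20) -> int:
    # Non-increasing schedule: early updates use large (noisy) timesteps,
    # later updates use small timesteps.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return round(t_max - progress * (t_max - t_min))
```

Early iterations then use large, noisy timesteps that shape coarse structure, while later iterations use small timesteps that refine detail, mirroring the order of the diffusion sampling process.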
\ No newline at end of file diff --git a/data/2024/iclr/Dropout Enhanced Bilevel Training b/data/2024/iclr/Dropout Enhanced Bilevel Training new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation b/data/2024/iclr/Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation new file mode 100644 index 0000000000..8bf03cf24c --- /dev/null +++ b/data/2024/iclr/Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation @@ -0,0 +1 @@ +Predictive multiplicity refers to the phenomenon in which classification tasks may admit multiple competing models that achieve almost-equally-optimal performance, yet generate conflicting outputs for individual samples. This presents significant concerns, as it can potentially result in systemic exclusion, inexplicable discrimination, and unfairness in practical applications. Measuring and mitigating predictive multiplicity, however, is computationally challenging due to the need to explore all such almost-equally-optimal models, known as the Rashomon set, in potentially huge hypothesis spaces. To address this challenge, we propose a novel framework that utilizes dropout techniques for exploring models in the Rashomon set. We provide rigorous theoretical derivations to connect the dropout parameters to properties of the Rashomon set, and empirically evaluate our framework through extensive experimentation. Numerical results show that our technique consistently outperforms baselines in terms of the effectiveness of predictive multiplicity metric estimation, with runtime speedup up to $20\times \sim 5000\times$. With efficient Rashomon set exploration and metric estimation, mitigation of predictive multiplicity is then achieved through dropout ensemble and model selection. \ No newline at end of file diff --git a/data/2024/iclr/Dual Associated Encoder for Face Restoration b/data/2024/iclr/Dual Associated Encoder for Face Restoration new file mode 100644 index 0000000000..7098ee0b68 --- /dev/null +++ b/data/2024/iclr/Dual Associated Encoder for Face Restoration @@ -0,0 +1 @@ +Restoring facial details from low-quality (LQ) images has remained a challenging problem due to its ill-posedness induced by various degradations in the wild. The existing codebook prior mitigates the ill-posedness by leveraging an autoencoder and learned codebook of high-quality (HQ) features, achieving remarkable quality. However, existing approaches in this paradigm frequently depend on a single encoder pre-trained on HQ data for restoring HQ images, disregarding the domain gap between LQ and HQ images. As a result, the encoding of LQ inputs may be insufficient, resulting in suboptimal performance. To tackle this problem, we propose a novel dual-branch framework named DAEFR. Our method introduces an auxiliary LQ branch that extracts crucial information from the LQ inputs. Additionally, we incorporate association training to promote effective synergy between the two branches, enhancing code prediction and output quality. We evaluate the effectiveness of DAEFR on both synthetic and real-world datasets, demonstrating its superior performance in restoring facial details. 
Project page: https://liagm.github.io/DAEFR/ \ No newline at end of file diff --git a/data/2024/iclr/Dual RL: Unification and New Methods for Reinforcement and Imitation Learning b/data/2024/iclr/Dual RL: Unification and New Methods for Reinforcement and Imitation Learning new file mode 100644 index 0000000000..d31da54de0 --- /dev/null +++ b/data/2024/iclr/Dual RL: Unification and New Methods for Reinforcement and Imitation Learning @@ -0,0 +1 @@ +The goal of reinforcement learning (RL) is to find a policy that maximizes the expected cumulative return. It has been shown that this objective can be represented as an optimization problem over the state-action visitation distribution under linear constraints. The dual problem of this formulation, which we refer to as dual RL, is unconstrained and easier to optimize. In this work, we first cast several state-of-the-art offline RL and offline imitation learning (IL) algorithms as instances of dual RL approaches with shared structures. Such unification allows us to identify the root cause of the shortcomings of prior methods. For offline IL, our analysis shows that prior methods are based on a restrictive coverage assumption that greatly limits their performance in practice. To fix this limitation, we propose a new discriminator-free method, ReCOIL, that learns to imitate from arbitrary off-policy data to obtain near-expert performance. For offline RL, our analysis frames a recent offline RL method, XQL, in the dual framework, and we further propose a new method, f-DVL, that provides alternatives to the Gumbel regression loss and fixes the known training instability issue of XQL. The performance improvements from both of our proposed methods, ReCOIL and f-DVL, in IL and RL are validated on an extensive suite of simulated robot locomotion and manipulation tasks. Project code and details can be found at https://hari-sikchi.github.io/dual-rl. \ No newline at end of file diff --git a/data/2024/iclr/Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment b/data/2024/iclr/Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment new file mode 100644 index 0000000000..83e88f46fe --- /dev/null +++ b/data/2024/iclr/Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment @@ -0,0 +1 @@ +We introduce a novel task within the field of 3D dance generation, termed dance accompaniment, which necessitates the generation of responsive movements from a dance partner, the "follower", synchronized with the lead dancer's movements and the underlying musical rhythm. Unlike existing solo or group dance generation tasks, a duet dance scenario entails a heightened degree of interaction between the two participants, requiring delicate coordination in both pose and position. To support this task, we first build a large-scale and diverse duet interactive dance dataset, DD100, by recording about 117 minutes of professional dancers' performances. To address the challenges inherent in this task, we propose a GPT-based model, Duolando, which autoregressively predicts the subsequent tokenized motion conditioned on the coordinated information of the music and the leader's and follower's movements. To further enhance the GPT's capability to generate stable results under unseen conditions (music and leader motions), we devise an off-policy reinforcement learning strategy that allows the model to explore viable trajectories from out-of-distribution samples, guided by human-defined rewards.
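The linear-constraint formulation behind the Dual RL abstract above can be sketched with the standard regularized visitation problem and its Lagrangian dual; the f-divergence regularizer toward an offline distribution d^O and the notation below are assumptions in the spirit of this line of work, not necessarily the paper's exact statement.

% Primal: optimize the state-action visitation d under the Bellman flow constraints,
% regularized by an f-divergence toward an offline distribution d^O.
\begin{aligned}
\max_{d \ge 0}\quad & \mathbb{E}_{(s,a)\sim d}\big[r(s,a)\big] \;-\; \alpha\, D_f\big(d \,\|\, d^{O}\big) \\
\text{s.t.}\quad & \sum_{a} d(s,a) \;=\; (1-\gamma)\,\rho_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s .
\end{aligned}

% Dual: Lagrangian duality removes the constraints, leaving an unconstrained
% problem over a value function V, where f^* is the convex conjugate of f.
\min_{V}\quad (1-\gamma)\,\mathbb{E}_{s\sim\rho_0}\big[V(s)\big]
\;+\; \alpha\, \mathbb{E}_{(s,a)\sim d^{O}}\Big[ f^{*}\Big( \tfrac{r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}[V(s')] - V(s)}{\alpha} \Big) \Big]

As the regularization weight \alpha goes to zero this recovers the classic LP duality for MDPs, whose dual carries the Bellman inequalities as explicit constraints; the regularizer is what makes the dual unconstrained and amenable to stochastic optimization.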
Based on the collected dataset and proposed method, we establish a benchmark with several carefully designed metrics. \ No newline at end of file diff --git a/data/2024/iclr/DyST: Towards Dynamic Neural Scene Representations on Real-World Videos b/data/2024/iclr/DyST: Towards Dynamic Neural Scene Representations on Real-World Videos new file mode 100644 index 0000000000..041aefecfb --- /dev/null +++ b/data/2024/iclr/DyST: Towards Dynamic Neural Scene Representations on Real-World Videos @@ -0,0 +1 @@ +Visual understanding of the world goes beyond the semantics and flat structure of individual images. In this work, we aim to capture both the 3D structure and dynamics of real-world scenes from monocular real-world videos. Our Dynamic Scene Transformer (DyST) model leverages recent work in neural scene representation to learn a latent decomposition of monocular real-world videos into scene content, per-view scene dynamics, and camera pose. This separation is achieved through a novel co-training scheme on monocular videos and our new synthetic dataset DySO. DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene. \ No newline at end of file diff --git a/data/2024/iclr/DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks b/data/2024/iclr/DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks new file mode 100644 index 0000000000..95f0a85ff5 --- /dev/null +++ b/data/2024/iclr/DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks @@ -0,0 +1 @@ +Large language models (LLMs) have achieved remarkable performance on various evaluation benchmarks. However, concerns have been raised about potential data contamination in their vast training corpora. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a general and flexible protocol for dynamic evaluation of LLMs. Based on our framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse on DyVal-generated evaluation samples of different complexities, highlighting the significance of dynamic evaluation. We also analyze the failure cases and results of different prompting methods. Moreover, DyVal-generated samples are not only evaluation sets, but also helpful data for fine-tuning to improve the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on future evaluation research of LLMs. Code is available at https://github.com/microsoft/promptbench. \ No newline at end of file diff --git a/data/2024/iclr/DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization b/data/2024/iclr/DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization new file mode 100644 index 0000000000..c7ccc00f45 --- /dev/null +++ b/data/2024/iclr/DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization @@ -0,0 +1 @@ +Unsupervised learning of object-centric representations in dynamic visual scenes is challenging.
Unlike most previous approaches that learn to decompose 2D images, we present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning in a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers the probability distribution over objects at individual spatial locations. These voxel features evolve over time through a canonical-space deformation function, forming the basis for global representation learning via slot attention. The voxel features and global features are complementary and are both leveraged by a compositional NeRF decoder for volume rendering. DynaVol remarkably outperforms existing approaches for unsupervised dynamic scene decomposition. Once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve: it is possible to freely edit the geometric shapes or manipulate the motion trajectories of the objects. \ No newline at end of file diff --git a/data/2024/iclr/Dynamic Discounted Counterfactual Regret Minimization b/data/2024/iclr/Dynamic Discounted Counterfactual Regret Minimization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Dynamic Layer Tying for Parameter-Efficient Transformers b/data/2024/iclr/Dynamic Layer Tying for Parameter-Efficient Transformers new file mode 100644 index 0000000000..6e57220c4e --- /dev/null +++ b/data/2024/iclr/Dynamic Layer Tying for Parameter-Efficient Transformers @@ -0,0 +1 @@ +In the pursuit of reducing the number of trainable parameters in deep transformer networks, we employ Reinforcement Learning to dynamically select layers during training and tie them together. Every few iterations, the RL agent is asked whether to train each layer $i$ independently or to copy the weights of a previous layer $j